On Data Clustering Analysis: Scalability, Constraints, and Validation

Zaïane, Osmar R.; Foss, Andrew; Lee, Chi-Hoon; Wang, Weinan

doi:10.1007/3-540-47887-6_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Clustering is the problem of grouping data based on similarity. Although the problem has attracted researchers for many years, we are witnessing a resurgence of interest in new clustering techniques. In this paper, we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of clustering validation.
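
As a concrete illustration of "grouping data based on similarity", the following is a minimal k-means sketch. It is only one classic instance of the general problem and is not claimed to be an approach proposed in the paper; the function name `kmeans`, the sample points, and the choice of Euclidean distance as the similarity measure are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, iterations=10):
    """Group points around their nearest centroid (illustrative sketch only)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break  # assignments are stable, so stop early
        centroids = updated
    return centroids, clusters


# Hypothetical sample data: two visually separate groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Here the initial centroids are passed in explicitly so the run is deterministic. The result depends on that starting choice and on the number of clusters being fixed in advance.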

Author information

Authors and Affiliations

University of Alberta, Edmonton, Alberta, Canada
Osmar R. Zaïane, Andrew Foss, Chi-Hoon Lee & Weinan Wang

Editor information

Editors and Affiliations

EE Department, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC
Ming-Syan Chen
IBM Thomas J. Watson Research Center, 30 Sawmill River Road, Hawthorne, NY, 10532, USA
Philip S. Yu
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Bing Liu

About this paper

Cite this paper

Zaïane, O.R., Foss, A., Lee, CH., Wang, W. (2002). On Data Clustering Analysis: Scalability, Constraints, and Validation. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_4

DOI: https://doi.org/10.1007/3-540-47887-6_4
Published: 29 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive
