
On Data Clustering Analysis: Scalability, Constraints, and Validation

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Abstract

Clustering is the problem of grouping data based on similarity. While this problem has attracted the attention of many researchers for many years, we are witnessing a resurgence of interest in new clustering techniques. In this paper we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of clustering validation.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zaïane, O.R., Foss, A., Lee, CH., Wang, W. (2002). On Data Clustering Analysis: Scalability, Constraints, and Validation. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_4

  • DOI: https://doi.org/10.1007/3-540-47887-6_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43704-8

  • Online ISBN: 978-3-540-47887-4

  • eBook Packages: Springer Book Archive
