Clustering Methods

Rokach, Lior; Maimon, Oded

doi:10.1007/0-387-25465-X_15

Abstract

This chapter presents a tutorial overview of the main clustering methods used in Data Mining. The goal is to provide a self-contained review of the concepts and the mathematics underlying clustering techniques. The chapter begins with the measures and criteria used to determine whether two objects are similar or dissimilar. The clustering methods are then presented in six groups: hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Following the methods, the challenges of clustering large data sets are discussed. Finally, the chapter discusses how to determine the number of clusters.

Author information

Authors and Affiliations

Department of Industrial Engineering, Tel-Aviv University, Israel
Lior Rokach & Oded Maimon

Editor information

Editors and Affiliations

Dept. of Industrial Engineering, Tel-Aviv University, 69978, Ramat-Aviv, Israel
Oded Maimon & Lior Rokach

About this chapter

Cite this chapter

Rokach, L., Maimon, O. (2005). Clustering Methods. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_15

DOI: https://doi.org/10.1007/0-387-25465-X_15
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-24435-8
Online ISBN: 978-0-387-25465-4
eBook Packages: Computer Science, Computer Science (R0)
