Abstract
General-purpose and widely applicable clustering methods are required for knowledge discovery. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose algorithms that remain very efficient, generally applicable and multidimensional, but are more robust to noise and outliers. We achieve this by using medians rather than means as estimators of centers of clusters. Comparison with k-Means, EM and Gibbs sampling demonstrates the advantages of our algorithms.
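The effect of replacing means with medians can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a coordinate-wise median as the center estimator and the L1 (Manhattan) distance for assignment, and the names `update_center`, `assign` and `k_medians` are hypothetical, as is the toy data.

```python
from statistics import mean, median

def update_center(points, estimator):
    """Apply a one-dimensional estimator to each coordinate of a cluster."""
    return tuple(estimator(coords) for coords in zip(*points))

def assign(points, centers):
    """Label each point with the index of its nearest center (L1 distance)."""
    labels = []
    for p in points:
        dists = [sum(abs(a - b) for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

def k_medians(points, centers, iterations=10):
    """Alternate assignment and median updates, in the style of k-Means.

    Assumption: centers are coordinate-wise medians; an empty cluster
    keeps its previous center.
    """
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(update_center(members, median) if members else old)
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, assign(points, centers)

# Toy cluster: four points near (1, 1) plus one gross outlier.
cluster = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.0, 1.1), (50.0, 50.0)]
mean_center = update_center(cluster, mean)      # dragged toward the outlier
median_center = update_center(cluster, median)  # stays at (1.0, 1.1)

# Two groups plus an outlier at (100, 100).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
centers, labels = k_medians(data, [(0, 0), (10, 10)])
```

In the toy cluster the single outlier pulls the mean center past 10 in both coordinates, while the median center remains inside the bulk of the points. That is the kind of robustness to noise and outliers the abstract refers to, shown here only under the stated assumptions.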
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Estivill-Castro, V., Yang, J. (2000). Fast and Robust General Purpose Clustering Algorithms. In: Mizoguchi, R., Slaney, J. (eds) PRICAI 2000 Topics in Artificial Intelligence. PRICAI 2000. Lecture Notes in Computer Science, vol 1886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44533-1_24
DOI: https://doi.org/10.1007/3-540-44533-1_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67925-7
Online ISBN: 978-3-540-44533-3
eBook Packages: Springer Book Archive