Abstract
This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
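The general idea of fitting a mixture model on a sample and then classifying the full data set can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: it assumes one-dimensional data, exactly two Gaussian components, a single simple random sample, and plain EM with a crude starting point. All function names and parameters (`fit_two_component_mixture`, `cluster_by_subsample`, `sample_size`, and so on) are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(sample, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to `sample` with plain EM."""
    ordered = sorted(sample)
    half = len(ordered) // 2
    # Crude starting values (an assumption): split the sorted sample in half.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(ordered) / len(ordered)
    overall_var = sum((x - overall_mean) ** 2 for x in ordered) / len(ordered)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp0 = []
        for x in sample:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            resp0.append(p0 / (p0 + p1))
        # M-step: weighted updates of mixing proportions, means, and variances.
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            total = sum(r)
            weights[k] = total / len(sample)
            means[k] = sum(ri * x for ri, x in zip(r, sample)) / total
            var_k = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, sample)) / total
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
    return weights, means, variances


def classify(x, weights, means, variances):
    """Assign x to the component with the larger weighted density."""
    scores = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return 0 if scores[0] >= scores[1] else 1


def cluster_by_subsample(data, sample_size, seed=0):
    """Fit the mixture on a random sample, then label every point in `data`."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    params = fit_two_component_mixture(sample)
    labels = [classify(x, *params) for x in data]
    return params, labels


def make_demo_data(n_per_group, seed=1):
    """Synthetic data: two well-separated normal groups with known labels."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    data += [rng.gauss(6.0, 1.0) for _ in range(n_per_group)]
    truth = [0] * n_per_group + [1] * n_per_group
    return data, truth
```

The EM iterations touch only the sample, so the cost of fitting does not grow with the size of the full data set; the full data are visited once, for classification. Calling `cluster_by_subsample(data, 500)` on the output of `make_demo_data(5000)` fits on 500 points and labels all 10,000.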
Cite this article
Rocke, D.M., Dai, J. Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data. Data Mining and Knowledge Discovery 7, 215–232 (2003). https://doi.org/10.1023/A:1022497517599