Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

Published in: Data Mining and Knowledge Discovery

Abstract

This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
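
To make the general idea concrete, here is a minimal sketch of the sample-then-classify pattern the abstract describes: fit a Gaussian mixture model by maximum likelihood on a random sample of a large data set, then use the fitted model to classify every object. This is not the authors' implementation; the use of scikit-learn's GaussianMixture, the sample size, and the number of components are illustrative assumptions, and the sketch omits the subsampling refinements developed in the paper.

```python
# Minimal sketch (illustrative, not the paper's algorithm): fit a Gaussian
# mixture on a random sample, then label the full data set with the fit.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_sampling(data, n_clusters, sample_size=10_000, seed=0):
    """Fit a mixture model on a random sample, then classify all points."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Draw a manageable random sample from the full data set.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="full",
                          random_state=seed).fit(data[idx])
    # Assign each object (e.g. star/galaxy candidate) to its most likely component.
    return gmm.predict(data)

# Example with two well-separated synthetic clusters in four dimensions.
X = np.vstack([np.random.randn(50_000, 4), np.random.randn(50_000, 4) + 5.0])
labels = cluster_by_sampling(X, n_clusters=2)
```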

Cite this article

Rocke, D.M., Dai, J. Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data. Data Mining and Knowledge Discovery 7, 215–232 (2003). https://doi.org/10.1023/A:1022497517599
