Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

Published in: Data Mining and Knowledge Discovery

Abstract

This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
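
To make the general idea concrete, here is a minimal sketch of the sample-then-classify pattern the abstract describes: fit a Gaussian mixture model by maximum likelihood on a random sample of a large data set, then use the fitted model to classify every object. This is not the authors' implementation; the use of scikit-learn's GaussianMixture, the sample size, and the number of components are illustrative assumptions, and the sketch omits the subsampling refinements developed in the paper.

```python
# Minimal sketch (illustrative, not the paper's algorithm): fit a Gaussian
# mixture on a random sample, then label the full data set with the fit.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_sampling(data, n_clusters, sample_size=10_000, seed=0):
    """Fit a mixture model on a random sample, then classify all points."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Draw a manageable random sample from the full data set.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="full",
                          random_state=seed).fit(data[idx])
    # Assign each object (e.g. star/galaxy candidate) to its most likely component.
    return gmm.predict(data)

# Example with two well-separated synthetic clusters in four dimensions.
X = np.vstack([np.random.randn(50_000, 4), np.random.randn(50_000, 4) + 5.0])
labels = cluster_by_sampling(X, n_clusters=2)
```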

Cite this article

Rocke, D.M., Dai, J. Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data. Data Mining and Knowledge Discovery 7, 215–232 (2003). https://doi.org/10.1023/A:1022497517599
