Cluster analysis of massive datasets in astronomy

Statistics and Computing

Abstract

Clusters of galaxies are a useful proxy for tracing the distribution of mass in the universe. By measuring the mass of galaxy clusters on different scales, one can follow the evolution of the mass distribution (Martínez and Saar, Statistics of the Galaxy Distribution, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, Clustering Algorithms, 1975): the connected components of the level set S_c ≡ {f > c}, where f is a probability density function. Cuevas et al. (Can. J. Stat. 28, 367–382, 2000; Comput. Stat. Data Anal. 36, 441–459, 2001) proposed a nonparametric method that finds density contour clusters using the minimal spanning tree. While their algorithm is conceptually simple, it is computationally intensive for large datasets. We propose a more efficient clustering method that builds on their algorithm and uses the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering in large astronomical sky survey data.
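
To make the idea of density contour clusters concrete, the following is a minimal sketch and not the algorithm of the article, whose details are not available in this preview. It assumes two-dimensional positions, a Gaussian kernel, and a regular grid: the density f is estimated by binning the points and convolving with the kernel via the FFT, and the clusters are taken to be the 4-connected components of the grid cells where the estimate exceeds c. The function names (binned_kde_fft, level_set_components), the parameters, and the synthetic data are all hypothetical choices made for illustration.

```python
from collections import deque

import numpy as np


def binned_kde_fft(points, grid_size=64, bandwidth=0.05, extent=(0.0, 1.0)):
    """Gaussian kernel density estimate on a square grid, computed with the FFT.

    Illustrative only: points are binned to grid cells, then smoothed by a
    circular convolution on a zero-padded grid (padding avoids wrap-around).
    """
    lo, hi = extent
    edges = np.linspace(lo, hi, grid_size + 1)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
    cell = (hi - lo) / grid_size

    # Kernel sampled at circular grid offsets on a grid of twice the size.
    m = 2 * grid_size
    idx = np.arange(m)
    dist = np.minimum(idx, m - idx) * cell
    k1d = np.exp(-0.5 * (dist / bandwidth) ** 2)
    kernel = np.outer(k1d, k1d) / (2.0 * np.pi * bandwidth**2)

    smoothed = np.fft.irfft2(
        np.fft.rfft2(counts, s=(m, m)) * np.fft.rfft2(kernel), s=(m, m)
    )
    return smoothed[:grid_size, :grid_size] / len(points)


def level_set_components(density, c):
    """Label the 4-connected components of {density > c} by breadth-first search."""
    mask = density > c
    labels = np.zeros(mask.shape, dtype=int)
    n_clusters = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # cell already belongs to a labelled component
        n_clusters += 1
        labels[start] = n_clusters
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                inside = 0 <= a < mask.shape[0] and 0 <= b < mask.shape[1]
                if inside and mask[a, b] and not labels[a, b]:
                    labels[a, b] = n_clusters
                    queue.append((a, b))
    return labels, n_clusters


# Synthetic stand-in for a galaxy catalogue: two compact groups of points.
rng = np.random.default_rng(0)
mock = np.vstack([
    rng.normal((0.25, 0.25), 0.03, size=(500, 2)),
    rng.normal((0.75, 0.75), 0.03, size=(500, 2)),
])
density = binned_kde_fft(mock, bandwidth=0.03)
labels, n_clusters = level_set_components(density, c=5.0)
print("clusters above the threshold:", n_clusters)
```

In this sketch the smoothing step costs O(G² log G) for a G × G grid after a single pass over the data to bin it, whereas evaluating the kernel sum directly at every grid cell costs O(nG²) for n points. The bandwidth and the level c are arbitrary here and would need to be chosen for real survey data.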

References

  • Bahcall, N., et al.: The cluster mass function from early Sloan digital sky survey data: cosmological implications. Astrophys. J. 585, 182–190 (2003)

  • Báillo, A., Cuesta-Albertos, J.A., Cuevas, A.: Convergence rates in nonparametric estimation of level sets. Stat. Probab. Lett. 53, 27–35 (2001)

  • Chaudhuri, A.R., Basu, A., Bhandari, S., Chaudhuri, B.: An efficient approach to consistent set estimation. Sankhyā Ser. B 61, 496–513 (1999)

  • Cole, S., Hatton, S., Weinberg, D.H., Frenk, C.S.: Mock 2dF and SDSS galaxy redshift surveys. Mon. Not. Roy. Astron. Soc. 300, 945–966 (1998)

  • Cooray, A., Sheth, R.K.: Halo models of large scale structure. Phys. Rep. 372, 1–129 (2002)

  • Cuevas, A., Fraiman, R.: A plug-in approach to support estimation. Ann. Stat. 25, 2300–2312 (1997)

  • Cuevas, A., Rodriguez-Casal, A.: On boundary estimation. Adv. Appl. Probab. 36, 340–354 (2004)

  • Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367–382 (2000)

  • Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001)

  • Dekel, A., Lahav, O.: Stochastic non-linear galaxy biasing. Astrophys. J. 520, 24–34 (1999)

  • Devroye, L., Wise, G.: Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)

  • Dodelson, S.: Modern Cosmology. Academic, New York (2003)

  • Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)

  • Genovese, C.R., Miller, C.J., Nichol, R.C., Arjunwadkar, M., Wasserman, L.: Nonparametric inference for the cosmic microwave background. Stat. Sci. 19, 308–321 (2004)

  • Gray, A., Moore, A.: Rapid evaluation of multiple density models. Artif. Intell. Stat. (2003)

  • Hall, P., Wand, M.: On the accuracy of binned kernel density estimators. J. Multivar. Anal. 56, 165–184 (1996)

  • Hartigan, J.: Clustering Algorithms. Wiley, New York (1975)

  • Hartigan, J.: Statistical clustering. Tech. Rep., Department of Statistics, Yale University (2000)

  • Jang, W.: Nonparametric density estimation and galaxy clustering. In: Statistical Challenges in Astronomy, pp. 443–445. Springer, New York (2003)

  • Jang, W.: Nonparametric density estimation and clustering in astronomical sky surveys. Comput. Stat. Data Anal. 50, 760–774 (2006)

  • Jang, W., Genovese, C., Wasserman, L.: Nonparametric confidence sets for densities. Tech. Rep. 795, Department of Statistics, Carnegie Mellon University (2004)

  • Kaiser, N.: Clustering in real space and in redshift space. Mon. Not. Roy. Astron. Soc. 227, 1–21 (1987)

  • Korostelev, A., Tsybakov, A.: Minimax Theory of Image Reconstruction. Springer, New York (1993)

  • Martínez, V., Saar, E.: Statistics of the Galaxy Distribution. Chapman and Hall, London (2002)

  • McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  • Moore, A.: Very fast EM-based mixture model clustering using multiresolution kd-trees. In: Advances in Neural Information Processing Systems, pp. 543–549 (1999)

  • Narasimhan, G., Zhu, J., Zachariasen, M.: Experiments with computing geometric minimum spanning trees. In: Proceedings of ALENEX’00. Lecture Notes in Computer Science, pp. 183–196. Springer, New York (2000)

  • Nichol, R.: Private communication (2006)

  • Nichol, R.C., Connolly, A.J., Moore, A.W., Schneider, J., Genovese, C., Wasserman, L.: Computational AstroStatistics: fast algorithms and efficient statistics for density estimation in large astronomical datasets. In: Proceedings of Virtual Observatories of the Future. ASP Conference Series, vol. 225, p. 265. San Francisco (2001)

  • Parkinson, D., Mukherjee, P., Liddle, A.R.: Bayesian model selection analysis of WMAP3. Phys. Rev. D 73, 123523 (2006)

  • Press, W.H., Schechter, P.: Formation of galaxies and clusters of galaxies by self-similar gravitational condensation. Astrophys. J. 187, 425–438 (1974)

  • R Development Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, http://www.R-project.org (2006)

  • Reichart, D., Nichol, R., Castander, F., Burke, D., Romer, A.K., Holden, B., Collins, C., Ulmer, M.: A deficit of high-redshift, high-luminosity X-ray clusters: evidence for a high value of Ω_m? Astrophys. J. 518, 521–532 (1999)

  • Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice-Hall, Upper Saddle River (2002)

  • Scott, C.D., Nowak, R.D.: Learning minimum volume sets. J. Mach. Learn. Res. 7, 665–704 (2006)

  • Silverman, B.W.: Algorithm AS 176: Kernel density estimation using the fast Fourier transform. Appl. Stat. 31, 93–99 (1982)

  • Steinwart, I., Hush, D., Scovel, C.: A classification framework for anomaly detection. J. Mach. Learn. Res. 6, 211–232 (2005)

  • Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003)

  • Wand, M.: Fast computation of multivariate kernel estimators. J. Comput. Graph. Stat. 3, 433–445 (1994)

  • Wand, M., Jones, M.: Kernel Smoothing. Chapman and Hall, London (1995)

  • Willett, R., Nowak, R.: Minimax optimal level set estimation. Unpublished manuscript, http://www.ee.duke/edu/~willett/ (2006)

  • Wong, W.-K., Moore, A.: Efficient algorithms for non-parametric clustering with clutter. Comput. Sci. Stat. 34, 541–553 (2002)

  • Zhou, R., Hansen, E.A.: A breadth-first approach to memory-efficient graph search. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, MA (2006)

Author information

Correspondence to Woncheol Jang.

About this article

Jang, W., Hendry, M. Cluster analysis of massive datasets in astronomy. Stat Comput 17, 253–262 (2007). https://doi.org/10.1007/s11222-007-9027-x

  • DOI: https://doi.org/10.1007/s11222-007-9027-x
