Statistics and Computing, Volume 17, Issue 3, pp 253–262

Cluster analysis of massive datasets in astronomy



Clusters of galaxies are a useful proxy for tracing the distribution of mass in the universe. By measuring the mass of galaxy clusters on different scales, one can follow the evolution of the mass distribution (Martínez and Saar, Statistics of the Galaxy Distribution, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, Clustering Algorithms, 1975): connected components of the level set S_c ≡ {f > c}, where f is a probability density function. Cuevas et al. (Can. J. Stat. 28, 367–382, 2000; Comput. Stat. Data Anal. 36, 441–459, 2001) proposed a nonparametric method that finds density contour clusters using the minimal spanning tree. While their algorithm is conceptually simple, it is computationally intensive for large datasets. We propose a more efficient clustering method that combines their algorithm with the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering in large astronomical sky survey data.
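To make the idea of a density contour cluster concrete, the following is a minimal sketch and not the authors' algorithm. It assumes two-dimensional points in the unit square, estimates f with a binned Gaussian kernel estimate computed by circular FFT convolution, and then labels the connected components of the estimated level set {f > c} on the grid. The grid-based flood fill is a stand-in for the minimal spanning tree step of Cuevas et al.; the function names, the grid size m, the bandwidth h, the threshold c, and the toy data are all illustrative assumptions.

```python
import cmath
import math
from collections import deque


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fft2(grid, inverse=False):
    """2-D FFT: transform every row, then every column."""
    rows = [fft(r, inverse) for r in grid]
    cols = [fft(list(col), inverse) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def kde_fft(points, m=32, h=1.5):
    """Binned Gaussian KDE on an m x m grid over [0, 1)^2; h is in grid cells.

    Assumption: circular (periodic) convolution, so data should sit away
    from the boundary or the grid should be zero-padded.
    """
    n = len(points)
    counts = [[0.0] * m for _ in range(m)]
    for x, y in points:
        counts[int(x * m)][int(y * m)] += 1.0 / n
    kern = [[math.exp(-(min(i, m - i) ** 2 + min(j, m - j) ** 2) / (2 * h * h))
             for j in range(m)] for i in range(m)]
    total = sum(map(sum, kern))
    kern = [[v / total for v in row] for row in kern]
    fc, fk = fft2(counts), fft2(kern)
    prod = [[fc[i][j] * fk[i][j] for j in range(m)] for i in range(m)]
    conv = fft2(prod, inverse=True)
    cell_area = 1.0 / (m * m)
    # Normalise the inverse FFT by m*m, then turn cell mass into density.
    return [[(v.real / (m * m)) / cell_area for v in row] for row in conv]


def level_set_clusters(f_hat, c):
    """Label 4-connected components of the grid cells where f_hat > c."""
    m = len(f_hat)
    labels = [[-1] * m for _ in range(m)]
    k = 0
    for i in range(m):
        for j in range(m):
            if f_hat[i][j] > c and labels[i][j] < 0:
                labels[i][j] = k
                queue = deque([(i, j)])
                while queue:
                    a, b = queue.popleft()
                    for u, v in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                        if (0 <= u < m and 0 <= v < m
                                and f_hat[u][v] > c and labels[u][v] < 0):
                            labels[u][v] = k
                            queue.append((u, v))
                k += 1
    return labels, k


# Toy data (hypothetical): two tight clumps of nine points each.
offsets = (-0.02, 0.0, 0.02)
points = [(cx + dx, cy + dy)
          for cx, cy in ((0.25, 0.25), (0.75, 0.75))
          for dx in offsets for dy in offsets]
f_hat = kde_fft(points)
c = 0.2 * max(map(max, f_hat))  # illustrative threshold: 20% of the peak
labels, n_clusters = level_set_clusters(f_hat, c)
point_labels = [labels[int(x * 32)][int(y * 32)] for x, y in points]
print(n_clusters, point_labels)
```

Because the smoothing is done once on the grid, its cost depends on the grid size rather than on pairwise distances between points, which is the appeal of FFT-based density estimation for large surveys. Points whose cell lies outside the level set receive the label -1 in this sketch.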


Keywords: Density contour cluster · Level set · Clustering · Fast Fourier transform




  1. Bahcall, N., et al.: The cluster mass function from early Sloan digital sky survey data: cosmological implications. Astrophys. J. 585, 182–190 (2003)
  2. Báillo, A., Cuesta-Albertos, J.A., Cuevas, A.: Convergence rates in nonparametric estimation of level sets. Stat. Probab. Lett. 53, 27–35 (2001)
  3. Chaudhuri, A.R., Basu, A., Bhandari, S., Chaudhuri, B.: An efficient approach to consistent set estimation. Sankhyā Ser. B 61, 496–513 (1999)
  4. Cole, S., Hatton, S., Weinberg, D.H., Frenk, C.S.: Mock 2dF and SDSS galaxy redshift surveys. Mon. Not. Roy. Astron. Soc. 300, 945–966 (1998)
  5. Cooray, A., Sheth, R.K.: Halo models of large scale structure. Phys. Rep. 372, 1–129 (2002)
  6. Cuevas, A., Fraiman, R.: A plugin approach to support estimation. Ann. Stat. 25, 2300–2312 (1997)
  7. Cuevas, A., Rodriguez-Casal, A.: On boundary estimation. Adv. Appl. Probab. 36, 340–354 (2004)
  8. Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367–382 (2000)
  9. Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001)
  10. Dekel, A., Lahav, O.: Stochastic non-linear galaxy biasing. Astrophys. J. 520, 24–34 (1999)
  11. Devroye, L., Wise, G.: Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)
  12. Dodelson, S.: Modern Cosmology. Academic, New York (2003)
  13. Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
  14. Genovese, C.R., Miller, C.J., Nichol, R.C., Arjunwadkar, M., Wasserman, L.: Nonparametric inference for the cosmic microwave background. Stat. Sci. 19, 308–321 (2004)
  15. Gray, A., Moore, A.: Rapid evaluation of multiple density models. Artif. Intell. Stat. (2003)
  16. Hall, P., Wand, M.: On the accuracy of binned kernel density estimators. J. Multivar. Anal. 56, 165–184 (1996)
  17. Hartigan, J.: Clustering Algorithms. Wiley, New York (1975)
  18. Hartigan, J.: Statistical clustering. Tech. Rep., Department of Statistics, Yale University (2000)
  19. Jang, W.: Nonparametric density estimation and galaxy clustering. In: Statistical Challenges in Astronomy, pp. 443–445. Springer, New York (2003)
  20. Jang, W.: Nonparametric density estimation and clustering in astronomical sky surveys. Comput. Stat. Data Anal. 50, 760–774 (2006)
  21. Jang, W., Genovese, C., Wasserman, L.: Nonparametric confidence sets for densities. Tech. Rep. 795, Department of Statistics, Carnegie Mellon University (2004)
  22. Kaiser, N.: Clustering in real space and in redshift space. Mon. Not. Roy. Astron. Soc. 227, 1–21 (1987)
  23. Korostelev, A., Tsybakov, A.: Minimax Theory of Image Reconstruction. Springer, New York (1993)
  24. Martínez, V., Saar, E.: Statistics of the Galaxy Distribution. Chapman and Hall, London (2002)
  25. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
  26. Moore, A.: Very fast EM-based mixture model clustering using multiresolution kd-trees. In: Advances in Neural Information Processing Systems, pp. 543–549 (1999)
  27. Narasimhan, G., Zhu, J., Zachariasen, M.: Experiments with computing geometric minimum spanning trees. In: Proceedings of ALENEX’00. Lecture Notes in Computer Science, pp. 183–196. Springer, New York (2000)
  28. Nichol, R.: Private communication (2006)
  29. Nichol, R.C., Connolly, A.J., Moore, A.W., Schneider, J., Genovese, C., Wasserman, L.: Computational AstroStatistics: fast algorithms and efficient statistics for density estimation in large astronomical datasets. In: Proceedings of Virtual Observatories of the Future. ASP Conference Series, vol. 225, p. 265. San Francisco (2001)
  30. Parkinson, D., Mukherjee, P., Liddle, A.R.: Bayesian model selection analysis of WMAP3. Phys. Rev. D 73, 123523 (2006)
  31. Press, W.H., Schechter, P.: Formation of galaxies and clusters of galaxies by self-similar gravitational condensation. Astrophys. J. 187, 425–438 (1974)
  32. R Development Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing (2006)
  33. Reichart, D., Nichol, R., Castander, F., Burke, D., Romer, A.K., Holden, B., Collins, C., Ulmer, M.: A deficit of high-redshift, high-luminosity X-ray clusters: evidence for a high value of Ω_m? Astrophys. J. 518, 521–532 (1999)
  34. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice-Hall, Upper Saddle River (2002)
  35. Scott, C.D., Nowak, R.D.: Learning minimum volume sets. J. Mach. Learn. Res. 7, 665–704 (2006)
  36. Silverman, B.W.: Algorithm AS 176: Kernel density estimation using the fast Fourier transform. Appl. Stat. 31, 93–99 (1982)
  37. Steinwart, I., Hush, D., Scovel, C.: A classification framework for anomaly detection. J. Mach. Learn. Res. 6, 211–232 (2005)
  38. Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003)
  39. Wand, M.: Fast computation of multivariate kernel estimators. J. Comput. Graph. Stat. 3, 433–445 (1994)
  40. Wand, M., Jones, M.: Kernel Smoothing. Chapman and Hall, London (1995)
  41. Willett, R., Nowak, R.: Minimax optimal level set estimation. Unpublished manuscript (2006)
  42. Wong, W.-K., Moore, A.: Efficient algorithms for non-parametric clustering with clutter. Comput. Sci. Stat. 34, 541–553 (2002)
  43. Zhou, R., Hansen, E.A.: A breadth-first approach to memory-efficient graph search. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, MA (2006)

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Epidemiology and Biostatistics, University of Georgia, Athens, USA
  2. Department of Physics and Astronomy, University of Glasgow, Glasgow, UK
