Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster

  • Conference paper
Scientific and Statistical Database Management (SSDBM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6187)

Abstract

Scientists’ ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability, rather than limitations in the availability of data, have become the bottleneck to scientific discovery. MapReduce-style platforms promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprising 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kwon, Y., Nunley, D., Gardner, J.P., Balazinska, M., Howe, B., Loebman, S. (2010). Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_11

  • DOI: https://doi.org/10.1007/978-3-642-13818-8_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13817-1

  • Online ISBN: 978-3-642-13818-8

  • eBook Packages: Computer Science, Computer Science (R0)
