Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster

Kwon, YongChul; Nunley, Dylan; Gardner, Jeffrey P.; Balazinska, Magdalena; Howe, Bill; Loebman, Sarah

doi:10.1007/978-3-642-13818-8_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6187)

Included in the following conference series:

International Conference on Scientific and Statistical Database Management

Abstract

Scientists’ ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability, rather than limitations in the availability of data, have become the bottleneck to scientific discovery. MapReduce-style platforms promise to address this growing data analysis problem, but many scientific analyses are not easy to express in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset of 906 million points and show that, in an 8-node cluster, our algorithm processes even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
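
The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of the general pattern it alludes to: partition the points, cluster each partition independently, then merge clusters that span partition boundaries. It is not the authors' DryadLINQ implementation. The linking rule (two points belong to the same cluster when they lie within a distance `eps` of each other, directly or through a chain of such neighbours), the slab partitioning along the first coordinate, and every name in the code (`DisjointSet`, `map_partition`, `local_cluster`, `cluster`) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor


class DisjointSet:
    """Union-find whose set representative is always the smallest member id."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def map_partition(points, slab_width):
    """'Map' step: assign each point to a slab along the first coordinate."""
    slabs = defaultdict(list)
    for pid, p in enumerate(points):
        slabs[floor(p[0] / slab_width)].append((pid, p))
    return slabs


def local_cluster(members, eps):
    """Cluster one partition in isolation (brute force, for clarity only)."""
    ds = DisjointSet()
    for pid, _ in members:
        ds.find(pid)
    for (i, p), (j, q) in combinations(members, 2):
        if dist(p, q) <= eps:
            ds.union(i, j)
    return {pid: ds.find(pid) for pid, _ in members}


def cluster(points, eps, slab_width):
    """Partition, cluster locally, then merge clusters across slab boundaries."""
    if slab_width < eps:
        # With slab_width >= eps only adjacent slabs can contain linked points.
        raise ValueError("slab_width must be at least eps")
    slabs = map_partition(points, slab_width)
    local = {s: local_cluster(m, eps) for s, m in slabs.items()}

    merged = DisjointSet()
    for s, members in slabs.items():
        right = slabs.get(s + 1, [])
        boundary = (s + 1) * slab_width
        # Only points within eps of the shared boundary can link across it.
        left_edge = [(i, p) for i, p in members if boundary - p[0] <= eps]
        right_edge = [(j, q) for j, q in right if q[0] - boundary <= eps]
        for i, p in left_edge:
            for j, q in right_edge:
                if dist(p, q) <= eps:
                    merged.union(local[s][i], local[s + 1][j])

    # Final label of a point: merged representative of its local cluster.
    return [
        merged.find(local[floor(p[0] / slab_width)][pid])
        for pid, p in enumerate(points)
    ]


points = [(0.1, 0.0), (0.9, 0.0), (1.1, 0.0), (5.0, 0.0), (5.2, 0.0), (9.0, 9.0)]
labels = cluster(points, eps=0.5, slab_width=1.0)
print(labels)  # [0, 1, 1, 3, 3, 5]
```

In this toy run, points 1 and 2 fall in different slabs and are joined only during the merge step. The sketch also hints at why skew matters: a slab that receives most of the points dominates the running time, because the local step is performed per partition.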

Author information

Authors and Affiliations

University of Washington, Seattle, WA
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe & Sarah Loebman

Editor information

Editors and Affiliations

Institute of Computer Science, University of Heidelberg, 69120, Heidelberg, Germany
Michael Gertz
Dept. of Computer Science and Genome Center, University of California, One Shields Avenue, 95616, Davis, CA, USA
Bertram Ludäscher

About this paper

Cite this paper

Kwon, Y., Nunley, D., Gardner, J.P., Balazinska, M., Howe, B., Loebman, S. (2010). Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_11

DOI: https://doi.org/10.1007/978-3-642-13818-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13817-1
Online ISBN: 978-3-642-13818-8
eBook Packages: Computer Science, Computer Science (R0)
