Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM

Amer, Abdelhalim; Maruyama, Naoya; Pericàs, Miquel; Taura, Kenjiro; Yokota, Rio; Matsuoka, Satoshi

doi:10.1007/978-3-642-38750-0_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7905)

Included in the following conference series:

International Supercomputing Conference

Abstract

Extracting maximum performance from multi-core architectures is difficult, primarily because of the bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of the fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we take a highly optimized fork-join implementation of the fast multipole method (FMM) and extend it to a data-driven implementation using a distributed task scheduling approach. The study exposes limitations of the conventional fork-join implementation in terms of synchronization overheads. We find that these overheads are not negligible and that eliminating them with the data-driven method, combined with a careful data locality strategy, is beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures shows speed-ups of up to 22% for the data-driven approach over the original method. We demonstrate that data-driven execution of the FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations.
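
The synchronization cost the abstract refers to can be illustrated with a toy model. The sketch below is not the authors' implementation: the task graph, the durations, and the function names are all hypothetical, and it assumes an unlimited number of workers and zero scheduling cost. It compares the completion time of a phase-by-phase fork-join schedule, where every phase ends at a global barrier, with a data-driven schedule, where a task starts as soon as its own dependencies have finished.

```python
from functools import lru_cache

# Hypothetical task graph: name -> (duration, phase, dependencies).
# The names loosely echo FMM stages; the values are invented for illustration.
TASKS = {
    "P2M_a": (1.0, 0, []),
    "P2M_b": (3.0, 0, []),
    "M2L_a": (2.0, 1, ["P2M_a"]),
    "M2L_b": (1.0, 1, ["P2M_b"]),
    "L2P_a": (1.0, 2, ["M2L_a"]),
    "L2P_b": (1.0, 2, ["M2L_b"]),
}


def fork_join_makespan(tasks):
    """Each phase ends at a global barrier, so it costs as much as its slowest task."""
    slowest_per_phase = {}
    for duration, phase, _deps in tasks.values():
        slowest_per_phase[phase] = max(slowest_per_phase.get(phase, 0.0), duration)
    return sum(slowest_per_phase.values())


def data_driven_makespan(tasks):
    """A task starts once its own dependencies finish, so the total is the critical path."""

    @lru_cache(maxsize=None)
    def finish_time(name):
        duration, _phase, deps = tasks[name]
        return duration + max((finish_time(d) for d in deps), default=0.0)

    return max(finish_time(name) for name in tasks)


if __name__ == "__main__":
    print("fork-join  :", fork_join_makespan(TASKS))
    print("data-driven:", data_driven_makespan(TASKS))
```

In this toy graph the barrier forces the short chain to wait for the slowest task of every phase, while the data-driven schedule is bounded only by the longest dependency chain. The model says nothing about memory bandwidth or data locality, which the abstract identifies as equally important in practice.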

Author information

Authors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Abdelhalim Amer, Miquel Pericàs & Satoshi Matsuoka
RIKEN, Kobe, Japan
Naoya Maruyama
The University of Tokyo, Tokyo, Japan
Kenjiro Taura
KAUST, Saudi Arabia
Rio Yokota

Editor information

Editors and Affiliations

University of Hamburg, Department of Informatics, Bundesstraße 45a, 20146, Hamburg, Germany
Julian Martin Kunkel
Deutsches Klimarechenzentrum, Bundesstraße 45a, 20146, Hamburg, Germany
Thomas Ludwig
University of Mannheim, Germany and Prometeus GmbH, Fliederstraße 2, 74915, Waibstadt, Germany
Hans Werner Meuer

About this paper

Cite this paper

Amer, A., Maruyama, N., Pericàs, M., Taura, K., Yokota, R., Matsuoka, S. (2013). Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_19

DOI: https://doi.org/10.1007/978-3-642-38750-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38749-4
Online ISBN: 978-3-642-38750-0
eBook Packages: Computer Science, Computer Science (R0)
