Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM

  • Conference paper
Supercomputing (ISC 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7905)

Abstract

Extracting maximum performance from multi-core architectures is difficult, primarily because of the bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of the fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we take a highly optimized fork-join implementation of the fast multipole method (FMM) and extend it to a data-driven implementation using a distributed task scheduling approach. The study exposes limitations of the conventional fork-join implementation in terms of synchronization overheads. We find that these overheads are not negligible and that eliminating them with the data-driven method, combined with a careful data locality strategy, is beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures shows speed-ups of up to 22% for the data-driven approach over the original method. We demonstrate that data-driven execution of the FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations.
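
The difference between the two execution models can be illustrated with a minimal sketch. It is not the authors' implementation: the task graph, the task names, and the costs are hypothetical, and the model assumes unlimited workers while ignoring memory bandwidth, data locality, and scheduling overheads. It only shows why a global barrier after each phase can lengthen the critical path compared with starting each task as soon as its own inputs are ready.

```python
from collections import defaultdict

# Hypothetical toy task graph (an assumption for illustration only):
# name -> (phase, cost in time units, names of tasks it depends on).
TASKS = {
    "a0": (0, 1, []),
    "b0": (0, 3, []),
    "a1": (1, 3, ["a0"]),
    "b1": (1, 1, ["b0"]),
}


def fork_join_makespan(tasks):
    """Fork-join: a global barrier ends each phase, so every phase
    lasts as long as its slowest task."""
    slowest = defaultdict(int)
    for phase, cost, _deps in tasks.values():
        slowest[phase] = max(slowest[phase], cost)
    return sum(slowest.values())


def data_driven_makespan(tasks):
    """Data-driven: a task starts as soon as its own dependencies
    have finished, with no global barrier between phases."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            _phase, cost, deps = tasks[name]
            start = max((finish_time(d) for d in deps), default=0)
            finish[name] = start + cost
        return finish[name]

    return max(finish_time(name) for name in tasks)


if __name__ == "__main__":
    print("fork-join  :", fork_join_makespan(TASKS))
    print("data-driven:", data_driven_makespan(TASKS))
```

In this toy graph the fork-join schedule takes 6 time units, because each phase waits for its slowest task, while the data-driven schedule takes 4, because `a1` starts as soon as `a0` finishes. The numbers describe this sketch only and say nothing about the speed-ups measured in the paper.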

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amer, A., Maruyama, N., Pericàs, M., Taura, K., Yokota, R., Matsuoka, S. (2013). Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_19

  • DOI: https://doi.org/10.1007/978-3-642-38750-0_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38749-4

  • Online ISBN: 978-3-642-38750-0

  • eBook Packages: Computer Science, Computer Science (R0)
