Analysis of Data Reuse in Task-Parallel Runtimes

  • Miquel Pericàs (email author)
  • Abdelhalim Amer
  • Kenjiro Taura
  • Satoshi Matsuoka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8551)

Abstract

This paper proposes a methodology to study the data reuse quality of task-parallel runtimes. We introduce a coarse-grained version of the reuse distance method called Kernel Reuse Distance (KRD). The metric is a low-overhead alternative designed to analyze data reuse at the socket level while minimizing perturbation to the parallel schedule. Using the KRD metric, we show that reuse depends considerably on the system configuration (sockets, cores) and on the runtime scheduler. Furthermore, we correlate KRD with hardware metrics such as cache misses and work time inflation. Overall, we found that KRD can be used effectively to assess data reuse in parallel applications. The study also revealed that several current runtimes suffer from severe bottlenecks at scale, which often dominate performance.
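
As a rough illustration of the idea that KRD builds on, the classical reuse distance of an access is the number of distinct items touched since the previous access to the same item. The sketch below computes it over a hypothetical coarse-grained trace in which each entry names the data block touched by one kernel invocation. The function name `reuse_distances` and the trace are assumptions made for illustration; the exact KRD definition is not stated in this abstract, so this should not be read as the authors' implementation.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Classical (LRU stack) reuse distance for each access in `trace`.

    Returns a list with one entry per access: the number of distinct items
    touched since the previous access to the same item, or None when the
    item is accessed for the first time.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for item in trace:
        if item in stack:
            keys = list(stack)
            # Items positioned after `item` were touched since its last use.
            distances.append(len(keys) - 1 - keys.index(item))
            stack.move_to_end(item)
        else:
            distances.append(None)
            stack[item] = True
    return distances


# Hypothetical trace: one data-block label per kernel invocation.
trace = ["A", "B", "C", "A", "B", "B"]
print(reuse_distances(trace))  # [None, None, None, 2, 2, 0]
```

Small distances indicate that data is likely still cached when it is touched again; a histogram of these values is the usual way to summarize reuse. This linear-scan version is quadratic in the worst case and is meant only to show the definition.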

Keywords

Work Time · Fast Multipole Method · Intel Corporation · Data Reuse · Level Cache
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

This work has been supported by a JSPS postdoctoral fellowship (P-12044). We would like to thank the anonymous reviewers for their valuable feedback.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Miquel Pericàs (1), email author
  • Abdelhalim Amer (2)
  • Kenjiro Taura (3)
  • Satoshi Matsuoka (1, 2)

  1. Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan
  2. Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo, Japan
  3. Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
