Abstract
The amount of data that (parallel) applications must transfer can strongly affect performance because real systems have limited bandwidth. Within a node, data transfer means memory accesses, and performance analysis usually relies on cache miss counts. However, miss counts do not explicitly show how heavily the links and buses in the memory hierarchy are utilized. In this paper, we propose an explicit visualization of the bandwidth requirements of applications within memory hierarchies, based on a machine model without bandwidth limits. We show its usefulness for a simple iterative 2D stencil solver, with and without cache optimization applied.
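To make the idea of a bandwidth requirement over time concrete, the following Python sketch may help. It is not the tool or the exact model used in the paper, which the abstract does not detail. It assumes a single fully associative LRU cache, a 64-byte line size, a read-only 5-point stencil on one array, and a machine where every access takes one time unit regardless of misses (a stand-in for "no bandwidth limits"). All names (`LRUCache`, `stencil_trace`, `bandwidth_curve`) and all parameter values are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 8    # assumed element size: one double per grid point


class LRUCache:
    """Fully associative LRU cache that only tracks which lines are resident."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True if the access misses and a line must be fetched."""
        line = addr // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)      # mark as most recently used
            return False
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)    # evict least recently used line
        return True


def stencil_trace(n, sweeps, block=None):
    """Yield byte addresses read by a 5-point stencil on an n x n grid.

    If block is given, each sweep walks column strips of that width
    (a simple spatial blocking, assumed here as the cache optimization).
    """
    block = block or n
    for _ in range(sweeps):
        for j0 in range(1, n - 1, block):
            for i in range(1, n - 1):
                for j in range(j0, min(j0 + block, n - 1)):
                    for di, dj in ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)):
                        yield ((i + di) * n + (j + dj)) * ELEM_BYTES


def bandwidth_curve(trace, cache_lines, window):
    """Bytes fetched from the next memory level per window of accesses."""
    cache = LRUCache(cache_lines)
    curve, fetched, count = [], 0, 0
    for addr in trace:
        if cache.access(addr):
            fetched += LINE_BYTES
        count += 1
        if count == window:
            curve.append(fetched)
            fetched, count = 0, 0
    if count:                                 # keep the final partial window
        curve.append(fetched)
    return curve


# Cache deliberately too small to hold three full grid rows.
plain = bandwidth_curve(stencil_trace(64, 2), cache_lines=16, window=1000)
blocked = bandwidth_curve(stencil_trace(64, 2, block=16), cache_lines=16, window=1000)
print("bytes fetched, plain:  ", sum(plain))
print("bytes fetched, blocked:", sum(blocked))
```

Each list entry is the number of bytes that would have to cross the link below the cache during one window of accesses. Plotting `plain` and `blocked` against the window index gives a rough per-link requirement curve for this one cache size. Repeating the run for other values of `cache_lines` would give one curve per level of a hypothetical hierarchy.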
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weidendorfer, J. (2014). Data Transfer Requirement Analysis with Bandwidth Curves. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_59
DOI: https://doi.org/10.1007/978-3-642-54420-0_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science (R0)