Performance characterization of data-intensive kernels on AMD Fusion architectures
The cost of data movement over the PCI Express bus is one of the biggest performance bottlenecks for accelerating data-intensive applications on traditional discrete GPU architectures. To address this bottleneck, AMD Fusion introduces a fused architecture that tightly integrates the CPU and GPU onto the same die and connects them with a high-speed, on-chip memory controller. This novel architecture incorporates shared memory between the CPU and GPU, thus enabling several techniques for inter-device data transfer that are not available on discrete architectures. For instance, a kernel running on the GPU can now directly access a CPU-resident memory buffer, and vice versa.
In this paper, we seek to understand the implications of the fused architecture on CPU-GPU heterogeneous computing by systematically characterizing various memory-access techniques instantiated with diverse memory-bound kernels on the latest AMD Fusion system (i.e., Llano A8-3850). Our study reveals that the fused architecture is very promising for accelerating data-intensive applications on heterogeneous platforms in support of supercomputing.
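As a rough illustration of why direct access to a CPU-resident buffer matters for memory-bound kernels, the following is a minimal first-order cost model. It is a sketch only and is not taken from the paper: the function names, the bandwidth-only timing assumption (no latency, no overlap of copy and compute), and every numeric value are hypothetical placeholders rather than measurements of any AMD system.

```python
def copy_then_compute_time(n_bytes, bus_bw, device_bw):
    """Time in seconds when the input is first copied over a bus and the
    kernel then streams it once from device memory.

    Assumption (hypothetical model): both phases are purely bandwidth-bound
    and run back to back with no overlap.
    """
    copy_time = n_bytes / bus_bw        # host -> device transfer
    kernel_time = n_bytes / device_bw   # kernel streams the data once
    return copy_time + kernel_time


def zero_copy_time(n_bytes, shared_bw):
    """Time in seconds when the kernel reads the host-resident buffer
    directly, so no separate copy phase exists.

    Assumption (hypothetical model): the kernel is limited only by the
    bandwidth of the shared memory path.
    """
    return n_bytes / shared_bw


if __name__ == "__main__":
    # All values below are illustrative placeholders, not measured figures.
    n_bytes = 8e9      # size of the data set in bytes
    bus_bw = 8e9       # bus bandwidth in bytes/s
    device_bw = 100e9  # device memory bandwidth in bytes/s
    shared_bw = 16e9   # shared memory path bandwidth in bytes/s

    print("copy then compute:", copy_then_compute_time(n_bytes, bus_bw, device_bw))
    print("zero copy:        ", zero_copy_time(n_bytes, shared_bw))
```

Under this simplified model, the zero-copy path can win even when the shared memory path is much slower than dedicated device memory, because the copy term disappears entirely. Whether that holds on real hardware depends on the access technique and kernel, which is the question the characterization study addresses.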
Computer Science - Research and Development
Volume 28, Issue 2-3, pp 175-184
- AMD Fusion
- Memory transfer