Performance characterization of data-intensive kernels on AMD Fusion architectures

Special Issue Paper

Abstract

The cost of data movement over the PCI Express bus is one of the biggest performance bottlenecks in accelerating data-intensive applications on traditional discrete GPU architectures. To address this bottleneck, AMD Fusion introduces a fused architecture that tightly integrates the CPU and GPU on the same die and connects them through a high-speed, on-chip memory controller. This architecture incorporates memory shared between the CPU and GPU, enabling several techniques for inter-device data transfer that are not available on discrete architectures. For instance, a kernel running on the GPU can directly access a CPU-resident memory buffer, and vice versa.
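The trade-off between copying a buffer to the device and accessing it in place can be illustrated with a back-of-the-envelope cost model. The following is a minimal sketch, not the methodology of the paper: the `Link` class, the function names, and every bandwidth figure are hypothetical placeholders rather than measurements, and the kernel is assumed to be purely memory-bound and to stream the buffer exactly once.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A data path with a sustained bandwidth in GB/s (hypothetical value)."""
    name: str
    gb_per_s: float

    def seconds(self, n_bytes: int) -> float:
        # Time to move n_bytes over this path at its sustained bandwidth.
        return n_bytes / (self.gb_per_s * 1e9)


def copy_then_compute(n_bytes, host_to_dev, dev_mem, dev_to_host):
    """Discrete-style path: copy in, stream the buffer once, copy out."""
    return (host_to_dev.seconds(n_bytes)
            + dev_mem.seconds(n_bytes)
            + dev_to_host.seconds(n_bytes))


def zero_copy(n_bytes, shared_path):
    """Fused-style path: the kernel streams the CPU-resident buffer in place."""
    return shared_path.seconds(n_bytes)


def break_even_bandwidth(host_to_dev, dev_mem, dev_to_host):
    """Shared-path bandwidth (GB/s) at which both strategies take equal time."""
    return 1.0 / (1.0 / host_to_dev.gb_per_s
                  + 1.0 / dev_mem.gb_per_s
                  + 1.0 / dev_to_host.gb_per_s)


if __name__ == "__main__":
    # Placeholder bandwidths chosen only to exercise the model.
    h2d = Link("host-to-device copy", 4.0)
    d2h = Link("device-to-host copy", 4.0)
    dev = Link("device memory", 100.0)
    n = 1 << 30  # 1 GiB buffer

    print("copy then compute (s):", copy_then_compute(n, h2d, dev, d2h))
    print("break-even shared bandwidth (GB/s):",
          break_even_bandwidth(h2d, dev, d2h))
```

Under these assumptions, in-place access is faster whenever the shared path sustains more than the break-even bandwidth, which is the harmonic combination of the three stages of the copy-based path. Real kernels reuse data, overlap transfers with computation, and have non-streaming access patterns, none of which this sketch captures.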

In this paper, we seek to understand the implications of the fused architecture on CPU-GPU heterogeneous computing by systematically characterizing various memory-access techniques instantiated with diverse memory-bound kernels on the latest AMD Fusion system (i.e., Llano A8-3850). Our study reveals that the fused architecture is very promising for accelerating data-intensive applications on heterogeneous platforms in support of supercomputing.

Keywords

GPU, AMD Fusion, Memory transfer

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. Department of Computer Science, Virginia Tech, Blacksburg, USA
