
Journal of Computer Science and Technology, Volume 31, Issue 1, pp. 36–49

Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs

  • Yun Liang
  • Shuo Wang
Regular Paper

Abstract

The key to high performance on GPU architectures lies in their massive threading capability, which drives a large number of cores and enables execution overlapping among threads. In practice, however, the number of threads that can execute simultaneously is often limited by the size of the register file on GPUs. The traditional SRAM-based register file occupies such a large amount of chip area that it cannot scale to meet the increasing demands of GPU applications. Racetrack memory (RM) is a promising technology for designing large-capacity register files on GPUs due to its high data storage density. However, without careful deployment of the RM-based register file, the lengthy shift operations of RM may hurt performance. In this paper, we explore RM for designing a high-performance register file for GPU architectures. The high storage density of RM helps to improve thread-level parallelism (TLP), but if the bits of a register are not aligned to the access ports, shift operations are required to move them to the ports before they can be accessed, delaying the read/write operations. We develop an optimization framework for the RM-based register file on GPUs, which employs three optimization techniques at the application, compilation, and architecture levels, respectively: we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations determine the TLP without causing cache and register file resource contention and reduce the shift operation overhead. Experimental results on a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
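To make the shift overhead concrete, the following is a minimal Python sketch of the idea described above: registers are placed at bit offsets along a racetrack that has a few fixed access ports, an access to a register whose bits are not aligned with a port costs shifts equal to the distance to the nearest port, and a frequency-aware placement reduces the total shift count compared with a placement that ignores access frequency. The bank geometry, register names, access trace, and the greedy heuristic are illustrative assumptions, not the paper's actual register mapping algorithm or preshifting mechanism.

# Minimal sketch of the racetrack-memory shift-cost idea from the abstract.
# The geometry, trace, and greedy heuristic below are illustrative assumptions.

from collections import Counter

def shift_count(mapping, trace, ports):
    """Count shifts needed to serve an access trace.

    mapping: register name -> bit offset on the racetrack
    trace:   sequence of register names accessed over time
    ports:   bit offsets that hold a read/write port
    A register aligned with some port needs 0 shifts; otherwise the track is
    shifted by the distance to the nearest port before the access proceeds.
    """
    return sum(min(abs(mapping[r] - p) for p in ports) for r in trace)

def greedy_mapping(trace, ports, track_len):
    """Place the most frequently accessed registers closest to the ports."""
    freq = Counter(trace)
    # Candidate offsets ordered by distance to their nearest port.
    offsets = sorted(range(track_len),
                     key=lambda o: min(abs(o - p) for p in ports))
    return {reg: off for (reg, _), off in zip(freq.most_common(), offsets)}

if __name__ == "__main__":
    ports = [0, 16, 32, 48]          # assumed: one access port every 16 bits
    track_len = 64
    trace = ["r0", "r1", "r0", "r2", "r0", "r3", "r1"] * 10

    naive = {f"r{i}": 8 * i + 4 for i in range(4)}   # ignores access frequency
    tuned = greedy_mapping(trace, ports, track_len)

    print("naive mapping shifts :", shift_count(naive, trace, ports))
    print("greedy mapping shifts:", shift_count(tuned, trace, ports))

In this toy trace, aligning the hottest registers with the access ports removes all shift operations; the paper's compilation-level register mapping and architecture-level preshifting pursue the same goal of shift reduction under real GPU register file constraints.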

Keywords

register file; racetrack memory; GPU



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Center for Energy-Efficient Computing and Applications (CECA), School of Electrical Engineering and Computer Sciences, Peking University, Beijing, China
