
Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction

  • Amir Yazdanbakhsh
  • Gennady Pekhimenko
  • Hadi Esmaeilzadeh
  • Onur Mutlu
  • Todd C. Mowry
Chapter

Abstract

In this chapter, we introduce a novel solution to tackle two fundamental memory bottlenecks in accelerator-rich architectures: limited off-chip bandwidth (the bandwidth wall) and long access latency (the memory wall). Exploiting the inherent error resilience of a wide range of applications, we propose an approximation technique called rollback-free value prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP never checks for or recovers from load-value mispredictions, and thus avoids the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by allowing execution to continue without stalling on long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops a fraction of the load requests that miss in the cache after predicting their values, removing them from the system and thereby reducing memory bandwidth contention. The drop rate thus becomes a knob that controls the trade-off between performance/energy efficiency and output quality. Employing our technique in a modern GPU across a diverse set of applications delivers, on average, 40% speedup and 31% energy reduction, with an average quality loss of 8.8%. With a 10% loss in quality, the benefits reach a maximum of 2.4× speedup and 2.0× energy reduction.
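The decision RFVP makes on each cache miss of a safe-to-approximate load can be sketched in software. The following minimal C++ sketch is illustrative only: the last-value predictor indexed by load PC, the table organization, and the random dropping policy are assumptions for exposition, not the chapter's exact hardware design.

// A minimal software sketch of the RFVP decision flow described in the
// abstract: on a cache miss for a safe-to-approximate load, predict the
// value instead of stalling, and drop a fraction of the missing requests
// (controlled by drop_rate) instead of sending them to memory.
// The per-PC last-value predictor and random dropping policy below are
// illustrative assumptions, not the chapter's exact hardware design.
#include <cstdint>
#include <iostream>
#include <random>
#include <unordered_map>

struct RFVP {
    double drop_rate;                                  // knob: quality vs. bandwidth
    std::unordered_map<uint64_t, uint32_t> last_value; // per-PC last-value predictor
    std::mt19937 rng{42};
    std::uniform_real_distribution<double> coin{0.0, 1.0};

    // Called when a safe-to-approximate load misses in the cache.
    // Returns the predicted value; sets fetch=false if the request is dropped.
    uint32_t on_miss(uint64_t pc, bool &fetch) {
        uint32_t pred = 0;
        auto it = last_value.find(pc);
        if (it != last_value.end()) pred = it->second;
        fetch = (coin(rng) >= drop_rate);              // drop some fraction of misses
        return pred;                                   // no check, no rollback
    }

    // Predictor training: only fetched (non-dropped) misses return real data.
    void on_fill(uint64_t pc, uint32_t value) { last_value[pc] = value; }
};

int main() {
    RFVP rfvp{0.25};                                   // drop 25% of missing loads
    bool fetch;
    uint32_t v = rfvp.on_miss(0x400123, fetch);
    std::cout << "predicted=" << v
              << " sent_to_memory=" << std::boolalpha << fetch << "\n";
    rfvp.on_fill(0x400123, 7);                         // train when data arrives
    v = rfvp.on_miss(0x400123, fetch);
    std::cout << "predicted=" << v << "\n";
}

In the sketch, a dropped request never returns data and therefore never trains the predictor, mirroring RFVP's property that mispredictions are never detected or rolled back; the drop_rate parameter plays the role of the quality-control knob described in the abstract.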

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Amir Yazdanbakhsh (1)
  • Gennady Pekhimenko (2)
  • Hadi Esmaeilzadeh (3)
  • Onur Mutlu (4)
  • Todd C. Mowry (5)
  1. Georgia Institute of Technology, Atlanta, USA
  2. University of Toronto, Toronto, Canada
  3. University of California San Diego, San Diego, USA
  4. ETH Zurich, Zurich, Switzerland
  5. Carnegie Mellon University, Pittsburgh, USA
