Abstract
Data prefetching is a well-known technique for hiding memory latency at the last-level cache (LLC). Among the many prefetching methods proposed in recent years, the Global History Buffer (GHB) has proved efficient in terms of both cost and speedup. In this paper, we show that using a single fixed value for both the pattern-detection length and the prefetch degree causes GHB to (1) be conservative when there are further opportunities to generate new addresses and (2) generate wrong addresses in the presence of constant strides. To resolve these problems, we separate the pattern length from the prefetch degree. The result is an aggressive prefetcher that can generate more addresses for a given pattern length. Furthermore, with a variable pattern length mechanism, constant strides are grouped so that more accurate patterns are detected. Because this prefetcher is relatively aggressive, we further propose an efficient throttling procedure that uses a new measure of cache pollution to reduce the negative effects of wrong prefetches. This adaptive method is suitable for chip multiprocessors (CMPs) where the prefetcher resides in the shared LLC. Simulation results with a mixed suite of integer and floating-point benchmarks from SPEC CPU2006 show that, on a single-core processor, the aggressive and adaptive methods outperform existing prefetchers by 48 % and 28 %, respectively, while increasing memory traffic by 20 % and 14 %, respectively. On an 8-core CMP running a mix of multiprogrammed workloads, the adaptive method outperforms state-of-the-art throttling methods by 8 % in speedup while reducing memory traffic by 3 %.
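To make the distinction between pattern length and prefetch degree concrete, the following is a minimal software sketch of global delta correlation in which the two are independent parameters. It is not the authors' hardware design: the function name `gdc_prefetch`, the parameters `pattern_len` and `degree`, the use of a plain Python list in place of the GHB, and the choice to keep replaying predicted deltas once the recorded history runs out are all assumptions made for illustration.

```python
def gdc_prefetch(miss_addrs, pattern_len=2, degree=4):
    """Illustrative global delta correlation (hypothetical helper).

    miss_addrs  : recent miss addresses, oldest first (stands in for the GHB)
    pattern_len : number of trailing deltas used as the search key
    degree      : number of prefetch addresses to generate
    """
    # Deltas between consecutive miss addresses.
    deltas = [b - a for a, b in zip(miss_addrs, miss_addrs[1:])]
    if len(deltas) <= pattern_len:
        return []  # not enough history to find an earlier occurrence

    key = deltas[-pattern_len:]

    # Search backward for the most recent earlier occurrence of the key.
    for start in range(len(deltas) - pattern_len - 1, -1, -1):
        if deltas[start:start + pattern_len] == key:
            break
    else:
        return []  # no earlier occurrence, so no prediction

    # Replay the deltas that followed the match. Assumption: predicted
    # deltas are appended to a working copy so that the replay can run
    # past the end of the recorded history when degree is large.
    replay = list(deltas)
    pos = start + pattern_len
    addr = miss_addrs[-1]
    prefetches = []
    for _ in range(degree):
        d = replay[pos]
        addr += d
        prefetches.append(addr)
        replay.append(d)
        pos += 1
    return prefetches


if __name__ == "__main__":
    history = [100, 101, 103, 104, 106, 107]  # deltas: 1, 2, 1, 2, 1
    print(gdc_prefetch(history, pattern_len=2, degree=4))
```

In this sketch, raising `degree` while holding `pattern_len` fixed yields more addresses from the same detected pattern, which is the kind of decoupling the abstract describes. How the paper groups constant strides and throttles the prefetcher is not shown here.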
Notes
Throughout this paper, by GHB we mean a GHB with global delta correlation (G/DC) [11].
References
International technology roadmap for semiconductors (ITRS). http://www.itrs.net/links/2010itrs
Palacharla S, Jouppi NP, Smith JE (1997) Complexity-effective superscalar processors. In: Proceedings of international symposium on computer architecture, pp 206–218
Reinman G, Austin T, Calder B (1999) A scalable front-end architecture for fast instruction delivery. In: Proceedings of international symposium on computer architecture, pp 234–245
Camacho ON, Villa VLA, Espinosa SO (2007) High performance cache. In: Proceedings of the international conference on computer design, pp 181–187
Bellas NE, Hajj IN, Polychronopoulos CD (2000) Using dynamic cache management techniques to reduce energy in general purpose processors. IEEE Trans Very Large Scale Integr Syst 8:693–708
Ku JC, Ozdemir S, Ismail Y (2006) Power density minimization for highly-associative caches in embedded processors. In: Proceedings of the ACM Great Lakes symposium on VLSI, pp 100–104
Gove D (2007) CPU2006 working set size. ACM SIGARCH Comput Archit News 35:90–96
Prakash TK, Peng L (2008) Performance characterization of SPEC CPU2006 benchmarks on Intel Core 2 Duo processor. ISAST Trans Comput Softw Eng 2:36–41
Wang Z, Burger D, McKinley KS, Reinhardt SK, Weems CC (2003) Guided region prefetching: a cooperative hardware/software approach. In: Proceedings of international symposium on computer architecture, pp 388–398
Spracklen L, Chou Y, Abraham SG (2005) Effective instruction prefetching in chip multiprocessors for modern commercial applications. In: Proceedings of international symposium on high performance computer architecture, pp 225–236
Nesbit KJ, Smith JE (2004) Data cache prefetching using a global history buffer. In: Proceedings of international symposium on high performance computer architecture, pp 96–105
Sair S, Sherwood T, Calder B (2003) A decoupled predictor-directed stream prefetching architecture. IEEE Trans Comput 52:260–276
Liu G, Huang Z, Peir J-K, Shi X, Peng L (2011) Enhancements for accurate and timely streaming prefetcher. J Instr Level Parallelism 13
Smith AJ (1982) Cache memories. ACM Comput Surv 14:473–530
Chen T, Baer J (1995) Effective hardware-based data prefetching for high-performance processors. IEEE Trans Comput 44:609–623
Wang K, Franklin M (1997) Highly accurate data value prediction using hybrid predictors. In: Proceedings of international symposium on microarchitecture, pp 281–290
Charney M, Reeves A (1995) Generalized correlation based hardware prefetching. Technical Report EE-CEG-95-1 Cornell University
Joseph D, Grunwald D (1997) Prefetching using Markov predictors. In: Proceedings of international symposium on computer architecture, pp 252–263
Perez DG, Mouchard G, Temam O (2004) Microlib: a case for the quantitative comparison of micro-architecture mechanisms. In: Proceedings of the International Symposium on microarchitecture, pp 43–54
Srinath S, Mutlu O, Kim H, Patt YN (2007) Feedback directed prefetching: improving the performance and bandwidth-efficiency of hardware prefetchers. In: Proceedings of international symposium on high performance computer architecture, pp 63–74
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39:1–7
Standard Performance Evaluation Corporation (SPEC) CPU2006 benchmark suite. http://www.spec.org/cpu2006
Verma S, Koppelman DM, Peng L (2011) Efficient prefetching with hybrid schemes and use of program feedback to adjust prefetcher aggressiveness. J Instr Level Parallelism 13
Dahlgren F, Stenstrom P (1995) Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In: Proceedings of symposium on high-performance computer architecture, pp 68–77
Dimitrov M, Zhou H (2011) Combining local and global history for high performance data prefetching. J Instr Level Parallelism 13
Sharif A, Lee HS (2011) Data prefetching by exploiting global and local access patterns. J Instr Level Parallelism 13
Nesbit KJ, Smith JE (2004) AC/DC: an adaptive data cache prefetcher. In: Proceedings of international conference on parallel architecture and compilation techniques, pp 135–145
Diaz P, Cintra M (2009) Stream chaining: exploiting multiple levels of correlation in data prefetching. In: Proceedings of international symposium on computer architecture, pp 81–92
Grannaes M, Jahre M, Natvig L (2011) Storage efficient hardware prefetching using delta-correlating prediction tables. J Instr Level Parallelism 13
Somogyi S, Wenisch TF, Ailamaki A, Falsafi B, Moshovos A (2006) Spatial memory streaming. In: Proceedings of international symposium on computer architecture, pp 252–263
Ebrahimi E, Mutlu O, Lee CJ, Patt YN (2009) Coordinated control of multiple prefetchers in multi-core systems. In: Proceedings of the international symposium on microarchitecture, pp 316–326
Dang X, Wang X, Tong D, Lu J, Yi J, Wang K (2012) S/DC: a storage and energy efficient data prefetcher. In: Proceedings of the international conference on design, automation and test in Europe, pp 461–466
Cite this article
Naderan-Tahan, M., Sarbazi-Azad, H. Adaptive prefetching using global history buffer in multicore processors. J Supercomput 68, 1302–1320 (2014). https://doi.org/10.1007/s11227-014-1088-y
DOI: https://doi.org/10.1007/s11227-014-1088-y