Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

Garcia, Victor; Rico, Alejandro; Villavieja, Carlos; Carpenter, Paul; Navarro, Nacho; Ramirez, Alex

doi:10.1007/s10766-016-0431-8

Published: 29 April 2016

Volume 45, pages 530–550, (2017)

International Journal of Parallel Programming

Abstract

Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique for alleviating this problem. Prefetching can be performed by hardware, or initiated and controlled by software. Software-controlled prefetching spans a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and works in synergy with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32 % and an average speedup of 10 % compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction in energy-to-solution of up to 18 %, and of 3 % on average.
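
To put the reported speedups in terms of execution time, the short sketch below converts a speedup percentage into the corresponding reduction in run time. It is an illustration only and not taken from the paper: it assumes that speedup is defined as baseline time divided by new time, and it treats each figure as if it described a single run. The function name `time_reduction` is hypothetical.

```python
def time_reduction(speedup_percent: float) -> float:
    """Return the percentage reduction in execution time for a given speedup.

    Assumption (not stated in the abstract): a speedup of s % means
    T_baseline / T_new = 1 + s / 100.
    """
    ratio = 1.0 + speedup_percent / 100.0  # T_baseline / T_new
    return (1.0 - 1.0 / ratio) * 100.0


# The two speedup figures quoted in the abstract.
for label, speedup in (("maximum", 32.0), ("average", 10.0)):
    print(f"{label}: {speedup:.0f} % speedup -> "
          f"{time_reduction(speedup):.1f} % less execution time")
```

Under these assumptions, a 32 % speedup corresponds to roughly 24 % less execution time and a 10 % speedup to roughly 9 % less. An average over per-benchmark speedups does not translate exactly into an average time reduction, so the second number is only indicative.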

Acknowledgments

This work has been partially supported by an FPI-UPC Grant, the Consolider program of the Spanish Ministry of Economy and Competitiveness (TIN2012-34557), the Mont-Blanc Project (ICT-FP7-288777), and the European Network of Excellence HIPEAC-3 (ICT-287759).

Author information

Authors and Affiliations

Universitat Politecnica de Catalunya, Barcelona, Spain
Victor Garcia & Nacho Navarro
Barcelona Supercomputing Center, Barcelona, Spain
Victor Garcia, Alejandro Rico, Paul Carpenter & Nacho Navarro
Google Inc., New York, NY, USA
Carlos Villavieja
NVIDIA Corporation, Santa Clara, CA, USA
Alex Ramirez

Corresponding author

Correspondence to Victor Garcia.

About this article

Cite this article

Garcia, V., Rico, A., Villavieja, C. et al. Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors. Int J Parallel Prog 45, 530–550 (2017). https://doi.org/10.1007/s10766-016-0431-8

Received: 13 November 2014
Accepted: 20 April 2016
Published: 29 April 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10766-016-0431-8
