Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

  • Victor Garcia
  • Alejandro Rico
  • Carlos Villavieja
  • Paul Carpenter
  • Nacho Navarro
  • Alex Ramirez


Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique for alleviating this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Software-controlled prefetching spans a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and works in synergy with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32 % and an average speedup of 10 % compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction in energy-to-solution of up to 18 %, and of 3 % on average.


Cache memories, Prefetch, Task based programming models



This work has been partially supported by an FPI-UPC Grant, the Consolider program of the Spanish Ministry of Economy and Competitiveness (TIN2012-34557), the Mont-Blanc Project (ICT-FP7-288777), and the European Network of Excellence HIPEAC-3 (ICT-287759).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Victor Garcia (1, 2)
  • Alejandro Rico (2)
  • Carlos Villavieja (3)
  • Paul Carpenter (2)
  • Nacho Navarro (1, 2)
  • Alex Ramirez (4)

  1. Universitat Politecnica de Catalunya, Barcelona, Spain
  2. Barcelona Supercomputing Center, Barcelona, Spain
  3. Google Inc., New York, USA
  4. NVIDIA Corporation, Santa Clara, USA
