PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy

  • Conference paper
Euro-Par 2021: Parallel Processing (Euro-Par 2021)

Abstract

The ever-increasing gap between processor and main memory speeds requires careful utilization of the limited memory link, especially for memory-bound applications. Prioritizing memory requests in the memory controller is one approach to improving the performance of such codes. However, current designs do not consider high-level information about parallel applications. In this paper, we propose a holistic approach to this problem, in which runtime system-level knowledge is made available to the hardware. The processor exploits this information to better prioritize memory requests while introducing negligible hardware cost. Our design is based on the notion of the critical path in the execution of a parallel code. Critical tasks are accelerated by prioritizing their memory requests within the on-chip memory hierarchy. As a result, we shorten the critical path and improve overall performance by up to 1.19× compared to the baseline systems.
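The two ideas named in the abstract, a critical path through a parallel code and prioritization of requests issued by critical tasks, can be illustrated with a small software sketch. This is not the hardware design proposed in the paper. The task graph, the task durations, the function `critical_tasks`, and the class `PriorityRequestQueue` are all hypothetical names chosen for illustration, and the queue is a simple two-level priority model assumed here for clarity.

```python
import heapq
from collections import defaultdict


def critical_tasks(durations, deps):
    """Return the set of tasks on one longest (critical) path of a task DAG.

    durations: {task: cost}; deps: {task: [predecessor tasks]}.
    Both inputs are illustrative assumptions, not data from the paper.
    """
    succs = defaultdict(list)
    indeg = {t: 0 for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)
            indeg[task] += 1

    # Kahn's algorithm: the list grows while we iterate over it.
    order = [t for t in durations if indeg[t] == 0]
    for t in order:
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)

    # Earliest finish time of each task and the predecessor that determines it.
    finish, best_pred = {}, {}
    for t in order:
        p = max(deps.get(t, []), key=lambda x: finish[x], default=None)
        best_pred[t] = p
        finish[t] = durations[t] + (finish[p] if p is not None else 0)

    # Walk back from the task that finishes last.
    t = max(finish, key=finish.get)
    path = set()
    while t is not None:
        path.add(t)
        t = best_pred[t]
    return path


class PriorityRequestQueue:
    """Toy request queue: requests from critical tasks are served first,
    and requests of equal priority are served in arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # arrival counter used as a tie-breaker

    def push(self, addr, critical):
        heapq.heappush(self._heap, (0 if critical else 1, self._seq, addr))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    durations = {"A": 2, "B": 5, "C": 1, "D": 3}
    deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
    critical = critical_tasks(durations, deps)

    queue = PriorityRequestQueue()
    for addr, task in [(0x10, "C"), (0x20, "B"), (0x30, "C"), (0x40, "D")]:
        queue.push(addr, task in critical)
    print([hex(queue.pop()) for _ in range(4)])
```

In this example the longest path is A, B, D, so the requests tagged with tasks B and D leave the queue before the two requests from the non-critical task C.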


Notes

  1. Section 4 describes the experimental setup in detail.

Acknowledgements

This work has been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877) and by the RoMoL ERC Advanced Grant (GA 321253). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship numbers RYC-2016-21104 and RYC-2017-23269, respectively.

Author information

Corresponding author

Correspondence to Vladimir Dimić.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Dimić, V., Moretó, M., Casas, M., Valero, M. (2021). PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_37

  • DOI: https://doi.org/10.1007/978-3-030-85665-6_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85664-9

  • Online ISBN: 978-3-030-85665-6

  • eBook Packages: Computer Science, Computer Science (R0)
