PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy

Dimić, Vladimir; Moretó, Miquel; Casas, Marc; Valero, Mateo

doi:10.1007/978-3-030-85665-6_37

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12820)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

The ever-increasing gap between processor and main memory speeds requires careful utilization of the limited memory link, especially for memory-bound applications. Prioritizing memory requests in the memory controller is one approach to improving the performance of such codes. However, current designs do not consider high-level information about parallel applications. In this paper, we propose a holistic approach to this problem, in which runtime system-level knowledge is made available to the hardware. The processor exploits this information to better prioritize memory requests while introducing negligible hardware cost. Our design is based on the notion of the critical path in the execution of a parallel code. Critical tasks are accelerated by prioritizing their memory requests within the on-chip memory hierarchy. As a result, we shorten the critical path and improve overall performance by up to 1.19× compared to the baseline systems.
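
The prioritization idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design from the paper: it assumes each memory request carries a single criticality bit supplied by the runtime system, and that a request queue issues critical requests first while preserving arrival order within each class. The names `CriticalityAwareQueue` and `service_order` are hypothetical and chosen for illustration.

```python
import heapq
from itertools import count


class CriticalityAwareQueue:
    """Toy request queue: critical requests first, FIFO within each class.

    Assumption: one criticality bit per request, set by the runtime system.
    """

    def __init__(self):
        self._heap = []
        self._arrival = count()  # monotonically increasing arrival stamp

    def push(self, address, critical):
        # Smaller tuples pop first: (0, t, addr) for critical requests,
        # (1, t, addr) for non-critical ones; t keeps FIFO order per class.
        priority = 0 if critical else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), address))

    def pop(self):
        _, _, address = heapq.heappop(self._heap)
        return address

    def __len__(self):
        return len(self._heap)


def service_order(requests):
    """Return addresses in the order the toy queue would issue them.

    `requests` is an iterable of (address, critical) pairs in arrival order.
    """
    queue = CriticalityAwareQueue()
    for address, critical in requests:
        queue.push(address, critical)
    return [queue.pop() for _ in range(len(queue))]
```

For example, calling `service_order` on four requests that arrive as non-critical, critical, non-critical, critical issues the two critical requests first and then the two non-critical ones, each pair in arrival order. A real memory hierarchy would also have to bound how long non-critical requests can be delayed, which this sketch does not model.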

Notes

1. Section 4 describes the experimental setup in detail.

Acknowledgements

This work has been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877) and by the RoMoL ERC Advanced Grant (GA 321253). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship numbers RYC-2016-21104 and RYC-2017-23269, respectively.

Author information

Authors and Affiliations

Barcelona Supercomputing Center (BSC), Barcelona, Spain
Vladimir Dimić, Miquel Moretó, Marc Casas & Mateo Valero
Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Vladimir Dimić, Miquel Moretó, Marc Casas & Mateo Valero

Corresponding author

Correspondence to Vladimir Dimić.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisbon, Portugal
Leonel Sousa
Universidade de Lisboa, Lisbon, Portugal
Nuno Roma
Universidade de Lisboa, Lisbon, Portugal
Pedro Tomás

About this paper

Cite this paper

Dimić, V., Moretó, M., Casas, M., Valero, M. (2021). PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_37

DOI: https://doi.org/10.1007/978-3-030-85665-6_37
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)