Abstract
The ever-increasing gap between processor and main memory speeds requires careful utilization of the limited memory link, especially for memory-bound applications. Prioritizing memory requests in the memory controller is one approach to improving the performance of such codes. However, current designs do not consider high-level information about parallel applications. In this paper, we propose a holistic approach to this problem, in which runtime system-level knowledge is made available to the hardware. The processor exploits this information to better prioritize memory requests while introducing negligible hardware cost. Our design is based on the notion of the critical path in the execution of a parallel code. Critical tasks are accelerated by prioritizing their memory requests within the on-chip memory hierarchy. As a result, we shorten the critical path and improve overall performance by up to 1.19\(\times\) compared to the baseline systems.
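To make the idea of criticality-driven prioritization concrete, the following is a minimal software sketch, not the hardware design proposed in the paper. It assumes each memory request carries a single bit saying whether it was issued by a critical task, and that requests of equal priority are served in arrival order. The names `CriticalityAwareQueue` and `service_order` are hypothetical and introduced only for illustration.

```python
import heapq
from itertools import count


class CriticalityAwareQueue:
    """Hypothetical request queue: requests from critical tasks are served
    first; requests of equal priority are served in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrivals = count()  # monotonically increasing arrival stamp

    def push(self, address, critical):
        # Smaller keys pop first: critical requests get 0, others get 1.
        # The arrival stamp breaks ties so equal-priority requests stay FIFO.
        key = (0 if critical else 1, next(self._arrivals))
        heapq.heappush(self._heap, (key, address))

    def pop(self):
        # Return the address of the next request to serve.
        _, address = heapq.heappop(self._heap)
        return address

    def __len__(self):
        return len(self._heap)


def service_order(requests):
    """Return addresses in the order the queue would serve them.

    `requests` is an iterable of (address, critical) pairs, listed in
    arrival order. This is an illustrative assumption, not the paper's
    interface.
    """
    queue = CriticalityAwareQueue()
    for address, critical in requests:
        queue.push(address, critical)
    return [queue.pop() for _ in range(len(queue))]


if __name__ == "__main__":
    arrivals = [(0x10, False), (0x20, True), (0x30, False), (0x40, True)]
    print([hex(a) for a in service_order(arrivals)])
```

In this sketch the two critical requests (`0x20`, `0x40`) overtake the earlier non-critical ones. A strict two-level priority like this can starve non-critical requests under sustained critical traffic, which the sketch does not address.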
Notes
- 1. Section 4 describes the experimental setup in detail.
Acknowledgements
This work has been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877) and by the RoMoL ERC Advanced Grant (GA 321253). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under the Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship numbers RYC-2016-21104 and RYC-2017-23269, respectively.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Dimić, V., Moretó, M., Casas, M., Valero, M. (2021). PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_37
DOI: https://doi.org/10.1007/978-3-030-85665-6_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)