PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling

Abstract

We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) that predicts the performance of parallel applications running on a multicore processor. PPT-Multicore builds upon our previous work on a multicore cache model. Using an architecture-independent, LLVM-based instrumentation tool, we extract a memory trace labeled with LLVM basic blocks; this extraction is needed only once in an application’s lifetime. The model takes as input the memory trace and other parameters obtained from an instrumented, sequentially executed binary. We use probabilistic and computationally efficient reuse profiles to predict the cache hit rates and runtimes of the parallel sections of OpenMP programs. We model Intel’s Broadwell and Haswell and AMD’s Zen2 architectures, and we validate our framework using applications from the PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore predicts cache hit rates with an overall average error rate of 1.23% and runtimes with an average error rate of 9.08%.
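
To make the idea of a reuse profile concrete, the following is a minimal sketch, not the PPT-Multicore model itself. It computes an exact reuse distance histogram from a toy address trace and derives the hit rate of a fully associative LRU cache from it. The function names (`reuse_profile`, `lru_hit_rate`), the toy trace, and the fully associative LRU assumption are illustrative choices that do not come from the paper, which uses a probabilistic formulation.

```python
from collections import Counter, OrderedDict


def reuse_profile(trace):
    """Histogram of reuse distances for a sequence of addresses.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address. First-time
    (cold) accesses are recorded under the key -1.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    profile = Counter()
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # Everything after addr in the LRU order was touched more recently.
            distance = len(keys) - 1 - keys.index(addr)
            profile[distance] += 1
            stack.move_to_end(addr)
        else:
            profile[-1] += 1
            stack[addr] = True
    return profile


def lru_hit_rate(profile, cache_lines):
    """Hit rate of a fully associative LRU cache with `cache_lines` lines.

    Under this (assumed) cache organization, an access hits exactly when
    its reuse distance is smaller than the number of lines.
    """
    total = sum(profile.values())
    hits = sum(count for dist, count in profile.items() if 0 <= dist < cache_lines)
    return hits / total if total else 0.0


if __name__ == "__main__":
    toy_trace = ["a", "b", "c", "a", "b", "d", "a"]  # hypothetical trace
    prof = reuse_profile(toy_trace)
    print(dict(prof))
    print(lru_hit_rate(prof, cache_lines=3))
```

The profile is computed once and can then be evaluated against any cache size, which is the property that makes reuse profiles attractive for modeling several architectures from a single trace. The linear scan over the stack keeps the sketch short; it is quadratic in the worst case and is not meant for real traces.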

Notes

  1. D and RD are used interchangeably

Acknowledgements

The authors would like to thank the reviewers for their valuable reviews, suggestions, and questions, which helped improve the clarity and readability of the manuscript. We also thank Dr. David Newsom for donating several machines to the PEARL laboratory at NMSU. Some of the experiments in this paper were run on the donated machines.

Funding

This work is partially supported by Triad National Security, LLC subcontract #581326.

Author information

Corresponding author

Correspondence to Atanu Barai.

Ethics declarations

Consent for publication

This paper has been approved for unlimited public distribution under LA-UR-21-22749. Any opinions, findings, and/or conclusions expressed in this paper do not necessarily represent the views of the DOE or the US Government.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Barai, A., Arafa, Y., Badawy, AH. et al. PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling. J Supercomput 78, 2354–2385 (2022). https://doi.org/10.1007/s11227-021-03949-4

  • DOI: https://doi.org/10.1007/s11227-021-03949-4

Keywords

  • Performance Modeling
  • Parallel Application
  • Shared Cache
  • Reuse Distance Analysis
  • LLVM Basic Block
  • Multicore Processor