Abstract
The ability to precisely predict how memory contention degrades performance when co-scheduling programs is critical for reaching high performance levels in cluster, grid and cloud environments. In this paper we present an overview of state-of-the-art characterization methods for memory-aware (co-)scheduling and compare their performance. We evaluate the prediction accuracy and co-scheduling performance of four methods: one slowdown-based, two cache-contention-based and one based on memory bandwidth usage. Both our regression analysis and our scheduling simulations find that the slowdown-based method, represented by Memgen, performs better than the other methods. The linear correlation coefficient \(R^2\) of Memgen’s prediction is 0.890, and Memgen’s preferred schedules reached 99.53 % of the obtainable performance on average. The memory-bandwidth-usage method performed almost as well as the slowdown-based method. Furthermore, while most prior work promotes characterization based on cache miss rate, we found that method to be highly unreliable and on par with random scheduling of programs.
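The two evaluation measures named above, a regression fit between predicted and measured slowdown and the quality of the schedule a method prefers, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the program names, all numbers, and the helpers `r_squared`, `pairings` and `schedule_cost` are hypothetical, and the sketch assumes that "share of obtainable performance" is the best measured schedule cost divided by the measured cost of the schedule the method picked.

```python
def r_squared(xs, ys):
    """R^2 of a simple linear fit y ~ a*x + b (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def pairings(items):
    """Yield every way to split an even-sized list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def schedule_cost(schedule, cost_per_pair):
    """Sum the per-pair cost over all co-scheduled pairs in a schedule."""
    return sum(cost_per_pair[frozenset(pair)] for pair in schedule)


# Hypothetical data: four programs co-scheduled two per node.
programs = ["A", "B", "C", "D"]

# Assumed measured cost of each pair (combined slowdown of both programs).
measured = {
    frozenset("AB"): 1.40, frozenset("AC"): 1.10, frozenset("AD"): 1.05,
    frozenset("BC"): 1.15, frozenset("BD"): 1.08, frozenset("CD"): 1.02,
}

# Assumed contention score that a characterization method assigns to each pair.
predicted = {
    frozenset("AB"): 0.9, frozenset("AC"): 0.4, frozenset("AD"): 0.2,
    frozenset("BC"): 0.6, frozenset("BD"): 0.3, frozenset("CD"): 0.1,
}

# Prediction accuracy: how well the scores track the measured costs.
pairs = list(measured)
r2 = r_squared([predicted[p] for p in pairs], [measured[p] for p in pairs])

# Co-scheduling performance: the schedule the method prefers versus the
# schedule that is actually best according to the measurements.
all_schedules = list(pairings(programs))
chosen = min(all_schedules, key=lambda s: schedule_cost(s, predicted))
best = min(all_schedules, key=lambda s: schedule_cost(s, measured))
share_of_obtainable = schedule_cost(best, measured) / schedule_cost(chosen, measured)

print("R^2:", round(r2, 3))
print("chosen schedule:", chosen)
print("share of obtainable performance:", share_of_obtainable)
```

Enumerating every pairing is only practical for a handful of programs; it is used here because it keeps the comparison between the preferred and the best schedule easy to follow.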
Cite this article
de Blanche, A., Lundqvist, T. Addressing characterization methods for memory contention aware co-scheduling. J Supercomput 71, 1451–1483 (2015). https://doi.org/10.1007/s11227-014-1374-8
DOI: https://doi.org/10.1007/s11227-014-1374-8