The Journal of Supercomputing, Volume 71, Issue 4, pp 1451–1483

Addressing characterization methods for memory contention aware co-scheduling

  • Andreas de Blanche
  • Thomas Lundqvist

Abstract

The ability to precisely predict how memory contention degrades performance when co-scheduling programs is critical for reaching high performance levels in cluster, grid and cloud environments. In this paper we present an overview of state-of-the-art characterization methods for memory-aware (co-)scheduling and compare their performance. We evaluate the prediction accuracy and co-scheduling performance of four methods: one slowdown-based, two cache-contention-based and one based on memory bandwidth usage. Both our regression analysis and scheduling simulations find that the slowdown-based method, represented by Memgen, performs better than the other methods. The linear correlation coefficient \(R^2\) of Memgen’s prediction is 0.890. Memgen’s preferred schedules reached 99.53 % of the obtainable performance on average. The memory bandwidth usage method performed almost as well as the slowdown-based method. Furthermore, while most prior work promotes characterization based on cache miss rate, we found it to be on par with random scheduling of programs and highly unreliable.
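
As an illustration of the kind of regression figure quoted above, the following is a minimal sketch of how an \(R^2\) value between predicted and measured co-scheduling slowdowns could be computed. It is not the authors' evaluation code. The function names `pearson_r` and `r_squared` and the sample values are hypothetical, and the sketch assumes that \(R^2\) is taken as the square of the Pearson correlation between the two series.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson linear correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def r_squared(predicted, measured):
    """Squared Pearson correlation of predicted vs. measured values."""
    return pearson_r(predicted, measured) ** 2


if __name__ == "__main__":
    # Hypothetical slowdown factors, made up purely for illustration;
    # they are NOT taken from the paper.
    predicted = [1.05, 1.20, 1.45, 1.10, 1.80]
    measured = [1.08, 1.15, 1.50, 1.12, 1.70]
    print(f"R^2 = {r_squared(predicted, measured):.3f}")
```

A value close to 1 indicates that the characterization method's predictions track the measured slowdowns almost linearly, whereas a value near 0 indicates that they carry little linear information about them.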

Keywords

Memory contention; Memory subsystem; Performance measurements; Co-scheduling; Slowdown-based scheduling

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Engineering Science, University West, Trollhättan, Sweden