Abstract
On today’s multiprocessor systems, simultaneously executing multi-threaded applications contend for cache space and CPU time. This contention can be managed by changing each application’s thread count. In this paper, we describe a technique to configure thread counts using utility models. A utility model predicts an application’s performance given its own thread count and the thread counts of the other applications in the workload. Built offline with linear regression, utility models are used online by a system policy to dynamically configure applications’ thread counts. We present a policy which uses the models to maximize throughput while maintaining QoS. Our approach improves system throughput by 6% and meets QoS 22% more often than the best evaluated traditional policy.
Notes
For brevity of nomenclature, the term “CPU-bound threads” refers to both memory-bound threads and CPU-bound threads. We recognize the distinction between the two terms.
This approach is also amenable to parallelization.
That is, multiple settings of appStep and otherStep can result in the same number of profile points being selected.
We limit models to a single area of linear growth, followed by an optional area of no-growth.
A performance plateau at \(\infty \) indicates expected continued scaling.
The constant large number should be greater than any possible expected system throughput.
An application QoS goal of \(Q\) means that the application should execute at least \(Q\) times as fast as its single-threaded performance. \(Q\) should be chosen by the user after considering minimum performance requirements and application scalability.
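The model shape and QoS goal described in the notes above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `UtilityModel`, `base`, `slope`, `plateau`, `meets_qos`, and `min_threads_for_qos` are hypothetical, the coefficients are made up, and the sketch ignores the thread counts of other applications, which the actual utility models take into account. It assumes a plateau of at least one thread.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UtilityModel:
    """Hypothetical model: one area of linear growth, then optional no-growth."""
    base: float                 # predicted performance with a single thread
    slope: float                # performance gained per additional thread
    plateau: float = math.inf   # thread count where growth stops (real-valued);
                                # infinity means continued scaling is expected

    def predict(self, threads: int) -> float:
        # Threads beyond the plateau contribute no further performance.
        effective = min(float(threads), self.plateau)
        return self.base + self.slope * (effective - 1.0)


def meets_qos(model: UtilityModel, threads: int, q: float) -> bool:
    """True if predicted performance is at least q times single-threaded."""
    return model.predict(threads) >= q * model.predict(1)


def min_threads_for_qos(model: UtilityModel, q: float,
                        max_threads: int) -> Optional[int]:
    """Smallest thread count predicted to meet the QoS goal, or None."""
    for threads in range(1, max_threads + 1):
        if meets_qos(model, threads, q):
            return threads
    return None


if __name__ == "__main__":
    model = UtilityModel(base=10.0, slope=5.0, plateau=4.5)
    print(model.predict(4), model.predict(8))
    print(min_threads_for_qos(model, q=2.0, max_threads=8))
    print(min_threads_for_qos(model, q=3.0, max_threads=8))
```

With the plateau at 4.5 threads, a goal of \(Q = 3\) is unreachable in this sketch no matter how many threads are added, which is why \(Q\) should be chosen with the application's scalability in mind.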
References
Moore RW, Childers BR (2012) Using utility prediction models to dynamically choose program thread counts. In: 2012 IEEE international symposium on performance analysis of systems and software (ISPASS). doi:10.1109/ISPASS.2012.6189220
Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K (1996) The case for a single-chip multiprocessor. In: ASPLOS VII: Proceedings of the seventh international conference on architectural support for programming languages and operating systems. ACM, New York, NY, USA, pp 2–11
Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the 17th international conference on parallel architectures and compilation techniques, PACT ’08. ACM, New York. doi:10.1145/1454115.1454128
Moore RW, Childers BR (2011) Inflation and deflation of self-adaptive applications. In: Proceedings of the 6th international symposium on software engineering for adaptive and self-managing systems, SEAMS ’11. ACM, New York. doi:10.1145/1988008.1988041
Yu C, Petrov P (2010) Adaptive multi-threading for dynamic workloads in embedded multiprocessors. In: Proceedings of the 23rd symposium on integrated circuits and system design, SBCCI ’10. ACM, New York. doi:10.1145/1854153.1854173
Raman A, Zaks A, Lee JW, August DI (2012) Parcae: a system for flexible parallel execution. In: Proceedings of the 33rd ACM SIGPLAN conference on programming language design and implementation, PLDI ’12. ACM, New York. doi:10.1145/2254064.2254082
Lee J, Wu H, Ravichandran M, Clark N (2010) Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. In: Proceedings of the 37th annual international symposium on computer architecture, ISCA ’10. ACM, New York. doi:10.1145/1815961.1815996
Bienia C, Li K (2010) Fidelity and scaling of the PARSEC benchmark inputs. In: 2010 IEEE international symposium on workload characterization (IISWC). doi:10.1109/IISWC.2010.5649519
Tian K, Jiang Y, Zhang EZ, Shen X (2010) An input-centric paradigm for program dynamic optimizations. In: Proceedings of the ACM international conference on object oriented programming systems languages and applications, OOPSLA ’10. ACM, New York. doi:10.1145/1869459.1869471
LuxRender Team (2012) LuxRender v0.8. http://www.luxrender.net
Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor, IEEE Micro 30 (2). doi:10.1109/MM.2010.31
Ahmad SB (2011) On improved processor allocation in 2D mesh-based multicomputers: controlled splitting of parallel requests. In: Proceedings of the 2011 international conference on communication computing and security, ICCCS ’11. ACM, New York. doi:10.1145/1947940.1947984
Leung LF, Tsui CY, Ki WH (2004) Minimizing energy consumption of multiple-processors-core systems with simultaneous task allocation, scheduling and voltage assignment. In: Proceedings of the 2004 Asia and South Pacific design automation conference, ASP-DAC ’04, IEEE Press, Piscataway. http://portal.acm.org/citation.cfm?id=1015090.1015267
Kandemir M, Muralidhara SP, Narayanan SHK, Zhang Y, Ozturk O (2009) Optimizing shared cache behavior of chip multiprocessors. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, MICRO 42. ACM, New York. doi:10.1145/1669112.1669176
Charles P, Grothoff C, Saraswat V, Donawa C, Kielstra A, Ebcioglu K, von Praun C, Sarkar V (2005) X10: an object-oriented approach to nonuniform cluster computing. In: OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications. ACM, New York, NY, USA, pp 519–538
Leiserson CE (2009) The Cilk++ concurrency platform. In: Proceedings of the 46th annual design automation conference, DAC ’09. ACM, New York. doi:10.1145/1629911.1630048
Architecture Review Board, OpenMP application program interface v3.0. http://www.openmp.org/mp-documents/spec30.pdf
Message Passing Interface Forum, MPI: a message-passing interface standard version 2.2. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
Calder B, Grunwald D, Jones M, Lindsay D, Martin J, Mozer M, Zorn B. Evidence-based static branch prediction using machine learning, ACM Trans. Program. Lang. Syst. 19 (1). doi:10.1145/239912.239923
Chen G, Kandemir M (2005) Optimizing embedded applications using programmer-inserted hints. In: Proceedings of the 2005 Asia and South Pacific design automation conference, ASP-DAC ’05. ACM, New York. doi:10.1145/1120725.1120794
Suganuma T, Yasue T, Kawahito M, Komatsu H, Nakatani T. Design and evaluation of dynamic optimizations for a Java just-in-time compiler, ACM Trans. Program. Lang. Syst. 27 (4). doi:10.1145/1075382.1075386
Maury MC, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: Proceedings of the 20th annual international conference on supercomputing, ICS ’06. ACM, New York. doi:10.1145/1183401.1183426
Itzkowitz M, Maruyama Y (2010) HPC profiling with the Sun Studio performance tools. In: Muller MS, Resch MM, Schulz A, Nagel WE (eds) Tools for high performance computing 2009. Springer, Berlin, p 6. doi:10.1007/978-3-642-11261-4_6
Pusukuri KK, Gupta R, Bhuyan LN. Thread tranquilizer: dynamically reducing performance variation, ACM Trans. Archit. Code Optim. 8 (4). doi:10.1145/2086696.2086725
Becchi M, Crowley P (2006) Dynamic thread assignment on heterogeneous multiprocessor architectures. In: Proceedings of the 3rd conference on computing frontiers, CF ’06. ACM, New York. doi:10.1145/1128022.1128029
Wang Z, O’Boyle MF (2009) Mapping parallelism to multi-cores: a machine learning based approach. In: Proceedings of the 14th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’09. ACM, New York. doi:10.1145/1504176.1504189
Martinez JF, Ipek E. Dynamic multicore resource management: a machine learning approach, IEEE Micro 29 (5). doi:10.1109/MM.2009.77
Barnes BJ, Rountree B, Lowenthal DK, Reeves J, de Supinski B, Schulz M (2008) A regression-based approach to scalability prediction. In: Proceedings of the 22nd annual international conference on supercomputing, ICS ’08. ACM, New York. doi:10.1145/1375527.1375580
Ipek E, de Supinski B, Schulz M, McKee S (2005) An approach to performance prediction for parallel applications. In: Euro-Par 2005 parallel processing, vol. 3648 of Lecture notes in computer science. Springer, Berlin. doi:10.1007/11549468_24
Lee BC, Brooks DM, de Supinski BR, Schulz M, Singh K, McKee SA (2007) Methods of inference and learning for performance modeling of parallel applications. In: Proceedings of the 12th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’07. ACM, New York. doi:10.1145/1229428.1229479
Ipek E, McKee SA, Caruana R, de Supinski BR, Schulz M (2006) Efficiently exploring architectural design spaces via predictive modeling. In: Proceedings of the 12th international conference on architectural support for programming languages and operating systems, ASPLOS-XII. ACM, New York. doi:10.1145/1168857.1168882
Duan R, Nadeem F, Wang J, Zhang Y, Prodan R, Fahringer T (2009) A hybrid intelligent method for performance modeling and prediction of workflow activities in grids. In: Proceedings of the 2009 9th IEEE/ACM international symposium on cluster computing and the grid, CCGRID ’09, IEEE Computer Society, Washington. doi:10.1109/CCGRID.2009.58
Zhai J, Chen W, Zheng W (2010) Phantom: predicting performance of parallel applications on large-scale parallel machines using a single node. In: Proceedings of the 15th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP ’10. ACM, New York. doi:10.1145/1693453.1693493
Suleman MA, Qureshi MK, Patt YN (2008) Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. In: Proceedings of the 13th international conference on architectural support for programming languages and operating systems, ASPLOS XIII. ACM, New York. doi:10.1145/1346281.1346317
Moseley T, Grunwald D, Kihm JL, Connors DA. Methods for modeling resource contention on simultaneous multithreading processors. In: International conference on computer design. doi:10.1109/ICCD.2005.74
Vengerov D (2005) Adaptive utility-based scheduling in resource-constrained systems. In: Zhang S, Jarvis R (eds) AI 2005: advances in artificial intelligence, vol. 3809 of Lecture notes in computer science. Springer, Berlin
Pusukuri K, Gupta R, Bhuyan L (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: 2011 IEEE international symposium on workload characterization (IISWC). doi:10.1109/IISWC.2011.6114208
Grewe D, Wang Z, O’Boyle MFP (2011) A workload-aware mapping approach for data-parallel programs. In: Proceedings of the 6th international conference on high performance and embedded architectures and compilers, HiPEAC ’11. ACM, New York. doi:10.1145/1944862.1944881
De P, Kothari R, Mann V (2007) Identifying sources of operating system jitter through fine-grained kernel instrumentation. In: 2007 IEEE international conference on cluster computing. doi:10.1109/CLUSTR.2007.4629247
De P, Mann V, Mittaly U (2009) Handling OS jitter on multicore multithreaded systems. In: IEEE international symposium on parallel distributed processing, 2009. IPDPS 2009. doi:10.1109/IPDPS.2009.5161046
Nataraj A, Morris A, Malony AD, Sottile M, Beckman P (2007) The ghost in the machine: observing the effects of kernel operation on parallel application performance. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07. ACM, New York. doi:10.1145/1362622.1362662
Shen K (2010) Request behavior variations. In: Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems. ASPLOS ’10. ACM, New York. doi:10.1145/1736020.1736034
Constantinou T, Sazeides Y, Michaud P, Fetis D, Seznec A. Performance implications of single thread migration on a chip multi-core, SIGARCH Comput. Archit. News. 33 (4). doi:10.1145/1105734.1105745
Teng Q, Sweeney P, Duesterwald E (2009) Understanding the cost of thread migration for multi-threaded Java applications running on a multicore platform. In: IEEE international symposium on performance analysis of systems and software, ISPASS 2009. doi:10.1109/ISPASS.2009.4919644
Additional information
This work is an extension of [1] and includes more than 30% new content: (a) new sampling policies, including a policy (stepwise sampling) that subsumes the previously published sampling policy, (b) model comparisons using several time budgets, (c) more flexible models through the use of real-number performance plateau settings, (d) the use of a new benchmark, luxrender, (e) more diverse quality of service settings, and (f) experiments involving all combinations of up to four applications executing concurrently, instead of all combinations of up to two applications. This research was supported in part by the National Science Foundation through awards CNS-1012070, CCF-0811295, and CCF-0811352.
About this article
Cite this article
Moore, R.W., Childers, B.R. Building and using application utility models to dynamically choose thread counts. J Supercomput 68, 1184–1213 (2014). https://doi.org/10.1007/s11227-014-1148-3
DOI: https://doi.org/10.1007/s11227-014-1148-3