Steal Locally, Share Globally

A Strategy for Multiprogramming in the Manycore Era

International Journal of Parallel Programming

Abstract

In a general-purpose computing system, several parallel applications run simultaneously on the same platform. Even if each application is highly tuned for that specific platform, additional performance issues arise in such a dynamic environment, where multiple applications compete for resources. Various scheduling and resource-management techniques have been proposed, at either the operating-system or the user level, to improve the performance of concurrent workloads. In this paper, we propose a task-based strategy called “Steal Locally, Share Globally”, implemented in the runtime of our parallel programming model GPRM (Glasgow Parallel Reduction Machine). We have chosen a state-of-the-art manycore parallel machine, the Intel Xeon Phi, to compare GPRM with the well-known parallel programming models OpenMP, Intel Cilk Plus, and Intel TBB, in both single-programming and multiprogramming scenarios. We show that GPRM not only performs well for single workloads, but also outperforms the other models for multiprogramming workloads. Three considerations apply to our task-based scheme: (i) it is implemented inside the parallel framework, not as a separate layer; (ii) it improves performance without the need to change the number of threads for each application; (iii) it can be further tuned and improved, not only for GPRM applications but also for other, equivalent parallel programming models.
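
The abstract describes the strategy only at a high level. As a rough, non-authoritative illustration of the general idea, stealing work only from nearby workers while sharing load information globally, consider the C++ sketch below; every name in it (Worker, Scheduler, push, try_steal, the neighbourhood parameter) is hypothetical and is not taken from the GPRM source.

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using Task = std::function<void()>;

// One worker per core/tile: a private task queue plus a load counter
// that is published globally, so other workers can choose victims well.
struct Worker {
    std::deque<Task> queue;
    std::mutex mtx;                 // protects queue
    std::atomic<int> load{0};       // globally shared load estimate
};

class Scheduler {
    std::vector<Worker> workers_;
    int neighbourhood_;             // how far "local" stealing reaches
public:
    Scheduler(int nworkers, int neighbourhood)
        : workers_(nworkers), neighbourhood_(neighbourhood) {}

    void push(int w, Task t) {
        std::lock_guard<std::mutex> lk(workers_[w].mtx);
        workers_[w].queue.push_back(std::move(t));
        workers_[w].load.fetch_add(1, std::memory_order_relaxed);  // share globally
    }

    // "Steal locally": an idle worker w scans only its neighbours.
    // "Share globally": it reads the victims' published load counters
    // to pick the busiest neighbour instead of probing blindly.
    bool try_steal(int w, Task& out) {
        const int n = static_cast<int>(workers_.size());
        int victim = -1, best = 0;
        for (int d = 1; d <= neighbourhood_; ++d) {
            for (int v : {(w + d) % n, (w - d + n) % n}) {
                int l = workers_[v].load.load(std::memory_order_relaxed);
                if (l > best) { best = l; victim = v; }
            }
        }
        if (victim < 0) return false;
        std::lock_guard<std::mutex> lk(workers_[victim].mtx);
        if (workers_[victim].queue.empty()) return false;
        out = std::move(workers_[victim].queue.front());  // take the oldest task
        workers_[victim].queue.pop_front();
        workers_[victim].load.fetch_sub(1, std::memory_order_relaxed);
        return true;
    }
};

An idle worker would call try_steal with its own index and run the returned task. Reading the load counters with relaxed atomics is deliberate in the sketch: a stale value only makes victim selection suboptimal, never incorrect. The sketch conveys no more than the division between purely local stealing traffic and globally published state.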


Notes

  1. The TCB is called the “subtask list” in previous GPRM papers.

  2. These features can be enabled via command-line switches when compiling the GPRM runtime system. In that sense, they can be considered runtime support features.

  3. We use the noun “steal” (OED “the act of stealing”) rather than “theft”.

  4. The “sequential tile” is responsible for the sequential tasks, but it also contributes to the parallel execution whenever required (see the sketch after these notes).

  5. For all experiments, the figures labelled (a) and (b) report results for the benchmark kernels only, whereas the results taken from VTune Amplifier account for everything from the start of the application, including its initial phase and the CPU time consumed by the shared libraries.

  6. The sequential phase of the MergeSort benchmark with the input size 80M is around 2 s, and the initial phase of the MatMul benchmark with the input size 4096×4096 is about half a second.
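
Note 4 describes a tile that is dedicated to sequential work yet helps with the parallel tasks when it would otherwise sit idle. A minimal sketch of that pattern, reusing the hypothetical Scheduler from the earlier sketch (again, not GPRM's actual code):

// Worker 0 plays the role of the "sequential tile": it runs the
// sequential tasks in order, but when none are pending it steals and
// executes parallel tasks instead of idling.
void sequential_tile_loop(Scheduler& sched,
                          std::deque<Task>& seq_tasks,
                          std::atomic<bool>& done) {
    while (!done.load(std::memory_order_acquire)) {
        if (!seq_tasks.empty()) {            // sequential work takes priority
            Task s = std::move(seq_tasks.front());
            seq_tasks.pop_front();
            s();
        } else {
            Task t;
            if (sched.try_steal(0, t)) t();  // contribute to parallel execution
        }
    }
}

Here done is assumed to be set by whatever component decides the program has finished; the essential point is only that the sequential tile never blocks waiting for sequential work while parallel tasks are available.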


Author information

Correspondence to Ashkan Tousimojarad.


Cite this article

Tousimojarad, A., Vanderbauwhede, W. Steal Locally, Share Globally. Int J Parallel Prog 43, 894–917 (2015). https://doi.org/10.1007/s10766-015-0350-0
