Abstract
We reconsider the familiar problem of executing a perfectly parallel workload consisting of \(N\) independent tasks on a parallel computer with \(P \ll N\) processors. We show that there are memory-bound problems for which the runtime can be reduced by the forced parallelization of individual tasks across a small number of cores. Specific examples include solving differential equations, performing sparse matrix–vector multiplications, and sorting integer keys.
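One way to see why spreading a single task over a few cores could help a memory-bound workload is a cache-capacity argument. The sketch below is an illustrative assumption, not the analysis from the paper: it supposes that all cores share one cache, that each task has a fixed working set, and that a task keeps the same working set when it is split over several cores. The function names and the sizes in the usage note are hypothetical.

```python
def concurrent_working_set(num_cores, cores_per_task, task_bytes):
    """Bytes in use at once when every task is spread over cores_per_task cores.

    Assumption: a task's working set does not grow when it is split.
    """
    concurrent_tasks = num_cores // cores_per_task
    return concurrent_tasks * task_bytes


def smallest_group_that_fits(num_cores, task_bytes, cache_bytes):
    """Smallest divisor q of num_cores such that the tasks running
    concurrently fit in the shared cache, or None if even one task does not fit.
    """
    for q in range(1, num_cores + 1):
        if num_cores % q != 0:
            continue  # keep groups of equal size
        if concurrent_working_set(num_cores, q, task_bytes) <= cache_bytes:
            return q
    return None
```

With 8 cores, a hypothetical 16 MiB shared cache, and 4 MiB per task, `smallest_group_that_fits(8, 4 * 2**20, 16 * 2**20)` returns 2: eight tasks at once would need 32 MiB, while four tasks on two cores each need 16 MiB. Whether this translates into a shorter runtime depends on how well each task parallelizes, which this toy model does not capture.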
Notes
1. The L3 cache on Abisko is a non-inclusive victim cache, hence the addition.
2. Note that the effective memory bandwidth is a tool used to illustrate the time measurements and does not reflect the memory bandwidth that is actually consumed at the hardware level.
Acknowledgements
Financial support was provided by the Swedish Research Council (grant VR A0581501) and by eSSENCE, a strategic collaborative eScience programme. This research was conducted using the resources of HPC2N and NSC.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Karlsson, L., Kjelgaard Mikkelsen, C., Kågström, B. (2014). Improving Perfect Parallelism. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2013. Lecture Notes in Computer Science(), vol 8384. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55224-3_8
DOI: https://doi.org/10.1007/978-3-642-55224-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-55223-6
Online ISBN: 978-3-642-55224-3
eBook Packages: Computer Science, Computer Science (R0)