Performance and energy effects on task-based parallelized applications

Caminal, Helena; Caballero, Diego; Cebrián, Juan M.; Ferrer, Roger; Casas, Marc; Moretó, Miquel; Martorell, Xavier; Valero, Mateo

doi:10.1007/s11227-018-2294-9

User-directed versus manual vectorization

Published: 13 March 2018

The Journal of Supercomputing, volume 74, pages 2627–2637 (2018)

Helena Caminal ORCID: orcid.org/0000-0002-2052-8107¹,
Diego Caballero²,
Juan M. Cebrián²,
Roger Ferrer²,
Marc Casas²,
Miquel Moretó²,³,
Xavier Martorell² &
Mateo Valero²

Abstract

Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to their ever-increasing complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while considerable effort has gone into easing parallelization and improving the usage of heterogeneous resources, vectorization has been considered a secondary objective. Meanwhile, vector architectures have spread rapidly across all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and impractical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes match the performance and energy efficiency of manually optimized code with minimal code modifications, favoring portability across different SIMD architectures.
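
As a rough illustration of what user-directed vectorization can look like inside a task-based code, the C sketch below uses the standard OpenMP `simd` and `task` directives. It is not taken from the article: the SAXPY kernel, the block size, and every function name are hypothetical, and the exact directive syntax accepted by the OmpSs toolchain may differ. The point is only that the programmer marks the loop as safe to vectorize and leaves the choice of SIMD instructions to the compiler, instead of writing intrinsics per architecture.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical block size: number of elements handled by one task. */
static const size_t BLOCK = 256;

/* User-directed vectorization: the directive asserts that the iterations
   are independent; the compiler picks the SIMD instructions for the
   target it is compiling for. */
static void saxpy_block(float a, const float *x, float *y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Task-based driver: one task per block, then wait for all of them.
   Without OpenMP support the pragmas are ignored and this is a plain
   serial loop over blocks. */
void saxpy_tasks(float a, const float *x, float *y, size_t n)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        #pragma omp task firstprivate(a, x, y, start, len)
        saxpy_block(a, x + start, y + start, len);
    }
    #pragma omp taskwait
}

/* Small self-check with exactly representable values, so the result does
   not depend on whether the compiler fuses the multiply and the add.
   Returns 1 on success, 0 on mismatch or allocation failure. */
int saxpy_self_check(size_t n)
{
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    int ok = (x != NULL && y != NULL);

    if (ok) {
        for (size_t i = 0; i < n; ++i) {
            x[i] = (float)i;
            y[i] = 1.0f;
        }
        saxpy_tasks(2.0f, x, y, n);
        for (size_t i = 0; i < n; ++i)
            if (y[i] != 2.0f * (float)i + 1.0f)
                ok = 0;
    }
    free(x);
    free(y);
    return ok;
}
```

The same source can be rebuilt for SSE, AVX2 or AVX-512 targets by changing compiler flags only, which is the portability argument the abstract makes. In a full program the tasks would normally be created inside a parallel region so that several threads can execute them.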

Author information

Authors and Affiliations

Computer Systems Laboratory, Cornell University, Ithaca, NY, 14853, USA
Helena Caminal
Barcelona Supercomputing Center (BSC), Barcelona, Spain
Diego Caballero, Juan M. Cebrián, Roger Ferrer, Marc Casas, Miquel Moretó, Xavier Martorell & Mateo Valero
Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Miquel Moretó

Corresponding author

Correspondence to Helena Caminal.

Additional information

This work was done at Barcelona Supercomputing Center (BSC).

About this article

Cite this article

Caminal, H., Caballero, D., Cebrián, J.M. et al. Performance and energy effects on task-based parallelized applications. J Supercomput 74, 2627–2637 (2018). https://doi.org/10.1007/s11227-018-2294-9

Issue Date: June 2018
