
Performance and energy effects on task-based parallelized applications

User-directed versus manual vectorization


Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures pose a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, vector architectures have spread rapidly across all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and impractical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (SSE, AVX2 and AVX-512). Results show that user-directed codes match the performance and energy efficiency of manually optimized code with minimal code modifications, favoring portability across different SIMD architectures.
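As a rough illustration of what "user-directed" means here, the sketch below annotates a loop with the standard OpenMP `simd` directive and leaves instruction selection to the compiler. It is not taken from the paper: the kernel and the names `saxpy_scalar` and `saxpy_simd` are hypothetical, and the OmpSs extension evaluated in this work may use different syntax and clauses.

```c
#include <stddef.h>

/* Hypothetical kernel used only for illustration: y[i] = a * x[i] + y[i]. */

/* Plain scalar version; the compiler may or may not auto-vectorize it. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* User-directed version: the programmer asserts that iterations are
 * independent and asks the compiler to emit SIMD code for the target
 * (for example SSE, AVX2 or AVX-512, depending on the build flags).
 * Standard OpenMP 4.0 syntax is assumed; without OpenMP support the
 * pragma is ignored and the loop stays scalar but still correct. */
void saxpy_simd(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The same source can be rebuilt for a different SIMD width by changing only the compiler target flags, whereas a manually vectorized version written with intrinsics would typically need one code path per instruction set.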





Author information


Corresponding author

Correspondence to Helena Caminal.

Additional information

This work was done at Barcelona Supercomputing Center (BSC).

About this article

Cite this article

Caminal, H., Caballero, D., Cebrián, J.M. et al. Performance and energy effects on task-based parallelized applications. J Supercomput 74, 2627–2637 (2018).


Keywords

  • Data-level parallelism
  • Task-level parallelism
  • Vectorization
  • Energy efficiency