New Generation Computing, Volume 31, Issue 3, pp 139–161

A Comparative Study and Evaluation of Parallel Programming Models for Shared-Memory Parallel Architectures

  • Luis Miguel Sanchez
  • Javier Fernandez
  • Rafael Sotomayor
  • Soledad Escolar
  • J. Daniel Garcia


Nowadays, shared-memory parallel architectures have evolved, and new programming frameworks have appeared to exploit them: OpenMP, TBB, Cilk Plus, ArBB, and OpenCL. This article focuses on the most widely used of these frameworks in commercial and scientific settings, presenting a comparative study and an evaluation. The study covers several capabilities, such as task deployment, scheduling techniques, and programming-language abstractions. The evaluation measures three dimensions: code development complexity, performance, and efficiency, measured as speedup per watt. For this evaluation, several parallel benchmarks have been implemented with each framework. These benchmarks were designed to cover specific scenarios, such as regular memory access or irregular computation. The conclusions highlight, among other findings, that some frameworks (OpenMP, Cilk Plus) are better suited to quickly parallelizing sequential code, others (TBB) have a small footprint that makes them ideal for small problems, and others (OpenCL) are suited to heterogeneous architectures but require a very complex development process. The conclusions also show that, for problems where the approach fits, vectorization support is more critical than multitasking for achieving efficiency.


Parallel Programming · Vector Instructions · Multithreading · Performance Analysis · Efficiency Analysis · Power Consumption





Copyright information

© Ohmsha and Springer Japan 2013

Authors and Affiliations

  • Luis Miguel Sanchez (1)
  • Javier Fernandez (1)
  • Rafael Sotomayor (1)
  • Soledad Escolar (1)
  • J. Daniel Garcia (1)

  1. Computer Architecture and Technology Area, Universidad Carlos III de Madrid, Colmenarejo, Madrid, Spain
