Skip to main content

Source-to-Source Parallelization Compilers for Scientific Shared-Memory Multi-core and Accelerated Multiprocessing: Analysis, Pitfalls, Enhancement and Potential


Parallelization schemes are essential in order to exploit the full benefits of multi-core architectures, which have become widespread in recent years, especially for scientific applications. In shared memory architectures, the most common parallelization API is OpenMP. However, the introduction of correct and optimal OpenMP parallelization to applications is not always a simple task, due to common parallel shared memory management pitfalls and architecture heterogeneity. To ease this process, many automatic parallelization compilers were created. In this paper we focus on three source-to-source compilers—AutoPar, Par4All and Cetus—which were found to be most suitable for the task, point out their strengths and weaknesses, analyze their performances, inspect their capabilities and suggest new paths for enhancement. We analyze and compare the compilers’ performances over several different exemplary test cases, with each test case pointing out different pitfalls, and suggest several new ways to overcome these pitfalls, while yielding excellent results in practice. Moreover, we note that all of those source-to-source parallelization compilers function in the limits of OpenMP 2.5—an outdated version of the API which is no longer in optimal accordance with nowadays complicated heterogeneous architectures. Therefore we suggest a path to exploit the new features of OpenMP 4.5, as it provides new directives to fully utilize heterogeneous architectures, specifically ones that have a strong collaboration between CPUs and GPGPUs, thus it outperforms previous results by an order of magnitude.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10


  1. Source code of relevant sections can be found at: \(\_NAS\)


  1. Geer, D.: Chip makers turn to multicore processors. Computer 38(5), 11–13 (2005)

    Article  Google Scholar 

  2. Blake, G., Dreslinski, R.G., Mudge, T.: A survey of multicore processors. IEEE Signal Process. Mag. 26(6), 26–37 (2009)

    Article  Google Scholar 

  3. Pacheco, P.: An Introduction to Parallel Programming. Elsevier, Amsterdam (2011)

    Google Scholar 

  4. Leopold, C.: Parallel and Distributed Computing: A Survey of Models, Paradigms and Approaches. Wiley, Hoboken (2001)

    Google Scholar 

  5. Dagum, L., Menon, R.: Openmp: an industry standard api for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)

    Article  Google Scholar 

  6. Gropp, W., Thakur, R., Lusk, E.: Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, Cambridge (1999)

    Book  Google Scholar 

  7. Snir, M., Otto, S., Huss-Lederman, S., Dongarra, J., Walker, D.: MPI-the Complete Reference: The MPI Core, vol. 1. MIT press, Cambridge (1998)

    Google Scholar 

  8. Boku, Taisuke, Sato, Mitsuhisa, Matsubara, Masazumi, Takahashi, Daisuke: Openmpi-openmp like tool for easy programming in mpi. In Sixth European Workshop on OpenMP, pages 83–88, (2004)

  9. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. In: ACM SIGGRAPH 2008 Classes, p. 16. ACM (2008)

  10. Oren, G., Ganan, Y., Malamud, G.: Automp: an automatic openmp parallization generator for variable-oriented high-performance scientific codes. Int. J. Comb. Optim. Probl. Inform. 9(1), 46–53 (2018)

    Google Scholar 

  11. Neamtiu, I., Foster, J.S., Hicks, M.: Understanding source code evolution using abstract syntax tree matching. ACM SIGSOFT Softw. Eng. Notes 30(4), 1–5 (2005)

    Article  Google Scholar 

  12. AutoPar documentations. Accessed 8 Aug 2019

  13. ROSE homepage. Accessed 8 Aug 2019

  14. Dever, M.: AutoPar: automating the parallelization of functional programs. PhD thesis, Dublin City University (2015)

  15. Par4All homepage. Accessed 8 Aug 2019

  16. PIPS homepage. Accessed 8 Aug 2019

  17. Ventroux, N., Sassolas, T., Guerre, A., Creusillet, B., Keryell, R.: SESAM/Par4all: a tool for joint exploration of MPSoC architectures and dynamic dataflow code generation. In: Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, pp. 9–16. ACM (2012)

  18. Cetus homepage. Accessed 8 Aug 2019

  19. Dave, C., Bae, H., Min, S.-J., Lee, S., Eigenmann, R., Midkiff, S.: Cetus: a source-to-source compiler infrastructure for multicores. Computer 42(12), 36–42 (2009)

    Article  Google Scholar 

  20. Hall, M.W., Anderson, J.M., Amarasinghe, S.P., Murphy, B.R., Liao, S.-W., Bugnion, E., Lam, M.S.: Maximizing multiprocessor performance with the suif compiler. Computer 29(12), 84–89 (1996)

    Article  Google Scholar 

  21. Pottenger, B., Eigenmann, R.: Idiom recognition in the Polaris parallelizing compiler. In: Proceedings of the 9th International Conference on Supercomputing, pp. 444–448. ACM (1995)

  22. Tian, X., Bik, A., Girkar, M., Grey, P., Saito, H., Su, E.: Intel® openmp c++/fortran compiler for hyper-threading technology: implementation and performance. Intel Technol. J. 6(1), 36–46 (2002)

    Google Scholar 

  23. Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: PLUTO: a practical and fully automatic polyhedral program optimization system. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 08), Tucson, AZ (June 2008). Citeseer (2008)

  24. Prema, S., Jehadeesan, R., Panigrahi, B.K.: Identifying pitfalls in automatic parallelization of NAS parallel benchmarks. In: 2017 National Conference on Parallel Computing Technologies (PARCOMPTECH), pp. 1–6. IEEE (2017)

  25. Arenaz, M., Hernandez, O., Pleiter, D.: The technological roadmap of parallware and its alignment with the openpower ecosystem. In: International Conference on High Performance Computing, pp. 237–253. Springer (2017)

  26. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., et al.: The nas parallel benchmarks. Int. J. Supercomput. Appl. 5(3), 63–73 (1991)

    Article  Google Scholar 

  27. Graham, S.L., Kessler, P.B., McKusick, M.K.: Gprof: a call graph execution profiler. ACM SIGPLAN Not. 39(4), 49–57 (2004)

    Article  Google Scholar 

  28. Prema, S., Jehadeesan, R.: Analysis of parallelization techniques and tools. Int. J. Inf. Comput. Technol. 3(5), 471–478 (2013)

    Google Scholar 

  29. Sohal, M., Kaur, R.: Automatic parallelization: a review. Int. J. Comput. Sci. Mob. Comput. 5(5), 17–21 (2016)

    Google Scholar 

  30. Quinlan, D.: ROSE: compiler support for object-oriented frameworks. Parallel Process. Lett. 10(02n03), 215–226 (2000)

    Article  Google Scholar 

  31. Amini, M., Creusillet, B., Even, S., Keryell, R., Goubier, O., Guelton, S., McMahon, J.O., Pasquier, F.-X., Péan, G., Villalon, P.: Par4all: from convex array regions to heterogeneous computing. In: IMPACT 2012: Second International Workshop on Polyhedral Compilation Techniques HiPEAC 2012 (2012)

  32. Lee, S.-I., Johnson, T.A., Eigenmann, R.: Cetus—an extensible compiler infrastructure for source-to-source transformation. In: International Workshop on Languages and Compilers for Parallel Computing, pp. 539–553. Springer (2003)

  33. Liang, X., Humos, A.A., Pei, T.: Vectorization and parallelization of loops in C/C++ code. In: Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS). The Steering Committee of The World Congress in Computer Science, Computer, pp. 203–206 (2017)

  34. Jubb, C.: Loop optimizations in modern c compilers (2014)

  35. Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. Newnes, Oxford (2013)

    Google Scholar 

  36. Lu, M., Zhang, L., Huynh, H.P., Ong, Z., Liang, Y., He, B., Goh, R.S.M., Huynh, R.: Optimizing the mapreduce framework on Intel Xeon Phi coprocessor. In 2013 IEEE International Conference on Big Data, pp. 125–130. IEEE (2013)

  37. Heinecke, A., Vaidyanathan, K., Smelyanskiy, M., Kobotov, A., Dubtsov, R., Henry, G., Shet, A.G., Chrysos, G., Dubey, P.: Design and implementation of the linpack benchmark for single and multi-node systems based on Intel® Xeon Phi coprocessor. In: 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), pp. 126–137. IEEE (2013)

  38. Bailey, D.H.: NAS parallel benchmarks. In: Padua, D. (ed.) Encyclopedia of Parallel Computing, pp. 1254–1259. Springer, Heidelberg (2011)

    Google Scholar 

  39. NPB in C homepage. Accessed 8 Aug 2019

  40. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The scalasca performance toolset architecture. Concurr. Comput. Pract. Exp. 22(6), 702–719 (2010)

    Google Scholar 

  41. Van der Pas, R., Stotzer, E., Terboven, C.: Using OpenMP The Next Step: Affinity, Accelerators, Tasking, and SIMD. MIT Press, Cambridge (2017)

    Google Scholar 

  42. Sui, Y., Fan, X.I., Zhou, H., Xue, J.: Loop-oriented array-and field-sensitive pointer analysis for automatic SIMD vectorization. In: ACM SIGPLAN Notices, vol. 51, pp. 41–51. ACM (2016)

  43. Zhou, H.: Compiler techniques for improving SIMD parallelism. PhD thesis, University of New South Wales, Sydney, Australia (2016)

  44. NegevHPC Project. Accessed 8 Aug 2019

Download references


This work was supported by the Lynn and William Frankel Center for Computer Science. Computational support was provided by the NegevHPC project [44].

Author information

Authors and Affiliations


Corresponding author

Correspondence to Gal Oren.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Harel, R., Mosseri, I., Levin, H. et al. Source-to-Source Parallelization Compilers for Scientific Shared-Memory Multi-core and Accelerated Multiprocessing: Analysis, Pitfalls, Enhancement and Potential. Int J Parallel Prog 48, 1–31 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI:


  • Parallel programming
  • Automatic parallelism
  • Cetus
  • AutoPar
  • Par4All