Extending OpenMP to Facilitate Loop Optimization

  • Ian Bertolacci
  • Michelle Mills Strout
  • Bronis R. de Supinski
  • Thomas R. W. Scogland
  • Eddie C. Davis
  • Catherine Olschanowsky
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11128)

Abstract

OpenMP provides several mechanisms to specify parallel source-code transformations. Unfortunately, many compilers perform these transformations early in the translation process, often before performing traditional sequential optimizations, which can limit the effectiveness of those optimizations. Further, in some cases OpenMP semantics preclude performing those sequential optimizations prior to the parallel transformations, which can limit overall application performance.

In this paper, we propose extensions to OpenMP that require the application of traditional sequential loop optimizations. These extensions can be specified to apply before, as well as after, other OpenMP loop transformations. We discuss limitations implied by existing OpenMP constructs as well as some previously proposed (parallel) extensions to OpenMP that could benefit from constructs that explicitly apply sequential loop optimizations. We present results that explore how these capabilities can lead to as much as a 20% improvement in parallel loop performance by applying common sequential loop optimizations.

Keywords

Loop optimization · Loop chain abstraction · Heterogeneous adaptive worksharing · Memory transfer pipelining

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Ian Bertolacci (1)
  • Michelle Mills Strout (1)
  • Bronis R. de Supinski (2)
  • Thomas R. W. Scogland (2)
  • Eddie C. Davis (3)
  • Catherine Olschanowsky (3)
  1. The University of Arizona, Tucson, USA
  2. Lawrence Livermore National Laboratory, Livermore, USA
  3. Boise State University, Boise, USA