Portable SIMD Performance with OpenMP* 4.x Compiler Directives

  • Florian Wende
  • Matthias Noack
  • Thomas Steinke
  • Michael Klemm
  • Chris J. Newburn
  • Georg Zitzlsberger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


Effective vectorization is becoming increasingly important for high performance and energy efficiency on processors with wide SIMD units. Compilers often require programmers to identify opportunities for vectorization, using directives to disprove data dependences. The OpenMP* 4.x SIMD directives strive to provide portability. We investigate the ability of current compilers (GNU, Clang, and Intel) to generate SIMD code for microbenchmarks that cover common patterns in scientific codes, and for two kernels from the VASP and MOM5/ERGOM applications. We explore coding strategies for improving SIMD performance across different compilers and platforms (Intel® Xeon® processor and Intel® Xeon Phi (co)processor). We compare OpenMP* 4.x SIMD vectorization, with and without vector data types, against SIMD intrinsics and C++ SIMD types. Our experiments show that in many cases portable performance can be achieved. All microbenchmarks are available as open source, serving as a reference for programmers and compiler experts who want to enhance SIMD code generation.
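As a rough illustration of the kind of directive the abstract refers to, the following C sketch applies the OpenMP 4.x `simd` construct to two simple loops. The function names `axpy_simd` and `dot_simd` are hypothetical and are not taken from the paper's microbenchmarks; they only show how a programmer might assert that loop iterations are independent, and how a reduction can be declared so that the compiler may vectorize it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the paper): y[i] += a * x[i].
   The simd directive tells the compiler that iterations are
   independent, so x and y are assumed not to overlap in a way
   that creates cross-iteration dependences. */
void axpy_simd(size_t n, float a, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Hypothetical kernel (not from the paper): dot product.
   The reduction clause lets each SIMD lane keep a private partial
   sum that is combined after the loop. */
float dot_simd(size_t n, const float *x, const float *y)
{
    float sum = 0.0f;
    assert(n == 0 || (x != NULL && y != NULL));
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

The directives take effect only when OpenMP SIMD support is enabled, for example with `-fopenmp-simd` for the GNU and Clang compilers or `-qopenmp-simd` for the Intel compiler; without such a flag the pragmas are ignored and the loops remain valid scalar C. Whether and how well a given compiler vectorizes these loops is exactly the kind of question the paper investigates.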


Keywords: Loop body, Optional clause, Loop kernel, Intel compiler, Math function



This work is supported by Intel within the IPCC activities at ZIB, and partially supported by the project SECOS, “The Service of Sediments in German Coastal Seas” (Subproject 3.2, grant BMBF 03F0666D). We would like to acknowledge G. Kresse and M. Marsman for collaboration on VASP tuning.

Intel, Xeon and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. * Other brands and names are the property of their respective owners. Performance tests are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Florian Wende (1)
  • Matthias Noack (1)
  • Thomas Steinke (1)
  • Michael Klemm (2)
  • Chris J. Newburn (3)
  • Georg Zitzlsberger (2)

  1. Zuse Institute Berlin, Berlin, Germany
  2. Intel Deutschland GmbH, Neubiberg, Germany
  3. Intel Corporation, Santa Clara, USA
