Concurrent Kernel Execution on Xeon Phi within Parallel Heterogeneous Workloads

  • Florian Wende
  • Thomas Steinke
  • Frank Cordes
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)


Computations with sufficient parallelism and workload size can take advantage of many-core coprocessors. Small-scale workloads, in contrast, usually utilize the coprocessor resources poorly. For parallel applications with many small computational kernels, concurrent processing on a shared coprocessor may be a viable solution. We evaluate the Xeon Phi offload models Intel LEO and OpenMP4 within multi-threaded and multi-process host applications with concurrent coprocessor offloading. Limitations of OpenMP4 regarding data persistence across function calls, e.g. when it is used within libraries, can slow down the application. We propose an offload-proxy approach for OpenMP4 to recover the performance in these cases. For concurrent kernel execution, we demonstrate the performance of the different offload models and of our offload-proxy using synthetic kernels and a parallel hybrid CPU/Xeon Phi molecular simulation application.
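As a rough illustration of concurrent offloading from a multi-threaded host, the following minimal sketch uses only standard OpenMP 4.0 directives. It is not the authors' benchmark code or their offload-proxy; the function names `small_kernel` and `run_concurrent_offloads` and the scale-and-sum kernel are hypothetical and chosen for illustration. Without an offload-capable compiler the target region simply runs on the host.

```c
#include <stddef.h>

/* Hypothetical small kernel (illustrative only): sum of scale * x[i].
 * The target region is offloaded if a device is available; otherwise it
 * falls back to the host. The map clauses are scoped to this region, so
 * x is mapped anew on every call. */
static double small_kernel(const double *x, size_t n, double scale)
{
    double sum = 0.0;
    #pragma omp target map(to: x[0:n]) map(tofrom: sum)
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += scale * x[i];
    return sum;
}

/* Each host thread issues its own offload, so several small kernels
 * may be in flight on the shared coprocessor at the same time. */
void run_concurrent_offloads(const double *x, size_t n,
                             double *results, int num_kernels)
{
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < num_kernels; ++k)
        results[k] = small_kernel(x, n, (double)(k + 1));
}
```

Because the input array is mapped inside the function on every call, this sketch also hints at the kind of repeated transfer cost that can arise when data cannot be kept resident on the coprocessor across function calls.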


Monte Carlo, Kernel Execution, Physical Core, Persistent Data, Many Integrate Core
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Florian Wende (1)
  • Thomas Steinke (1)
  • Frank Cordes (2)

  1. Zuse Institute Berlin, Berlin, Germany
  2. GETLIG&TAR GbR, Falkensee, Germany
