Effective Barrier Synchronization on Intel Xeon Phi Coprocessor

  • Andrey Rodchenko
  • Andy Nisbet
  • Antoniu Pop
  • Mikel Luján
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9233)


Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in their tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid barrier implementation that exploits the topology, the memory hierarchy and the streaming stores of the Xeon Phi architecture. It achieves a 3\(\times\) lower overhead than the Intel OpenMP barrier implementation (ICC 14.0.0) and, to the best of our knowledge, outperforms all other implementations. We evaluate it on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC barrier OpenMP microbenchmark. The optimized barriers presented in the paper are released as free software.
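For readers unfamiliar with the primitive, the following is a minimal sketch of a textbook centralized sense-reversing barrier. It is an illustration only and is not the hybrid implementation, nor necessarily one of the five algorithms, described in the paper. The names `SenseReversingBarrier` and `run_demo` are hypothetical, and Python threads are used purely for readability rather than performance.

```python
import threading
import time


class SenseReversingBarrier:
    """Illustrative centralized sense-reversing barrier (not the paper's code)."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads        # threads still to arrive in this episode
        self.sense = False              # shared sense, flipped once per episode
        self.lock = threading.Lock()    # protects count
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                # Last arriver resets the counter and releases the others.
                self.count = self.num_threads
                self.sense = my_sense
        if not last:
            # Spin until the shared sense matches this episode's sense.
            while self.sense != my_sense:
                time.sleep(0)  # yield to other threads


def run_demo(num_threads=4, rounds=5):
    """Each thread logs its round number, then waits at the barrier."""
    barrier = SenseReversingBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker():
        for r in range(rounds):
            with log_lock:
                log.append(r)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    log = run_demo()
    # No thread starts round r+1 before every thread has finished round r.
    print(log == sorted(log))
```

In this sketch every thread contends on a single counter and spins on a single flag, which is the kind of hot spot that the critical path, traffic and footprint tradeoffs mentioned above are concerned with.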


Barrier synchronization · Scalability · Algorithms · Many-core architectures · Intel Xeon Phi



This work is supported by EPSRC grants EP/M004880/1, DOME EP/J016330/1 and PAMELA EP/K008730/1. A. Rodchenko is funded by a Microsoft Research PhD Scholarship, A. Pop is funded by a Royal Academy of Engineering Research Fellowship and M. Luján is funded by a Royal Society University Research Fellowship. We also thank the anonymous reviewers for their constructive feedback.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Andrey Rodchenko (1)
  • Andy Nisbet (1)
  • Antoniu Pop (1)
  • Mikel Luján (1)
  1. School of Computer Science, The University of Manchester, Manchester, UK
