International Journal of Parallel Programming, Volume 45, Issue 2, pp 225–241

Using the Xeon Phi Platform to Run Speculatively-Parallelized Codes

  • Alvaro Estebanez
  • Diego R. Llanos
  • Arturo Gonzalez-Escribano

Abstract

Intel Xeon Phi accelerators are among the newest devices used in the field of parallel computing. However, there are comparatively few studies of their performance with most existing parallelization techniques. One of these techniques is thread-level speculation, which optimistically extracts parallelism from loops without requiring a compile-time analysis to guarantee that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when running well-known benchmarks parallelized with a state-of-the-art software thread-level speculation library. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the speculation library used. Our results show that, although the Xeon Phi scales relatively well compared with a shared-memory architecture, the relatively low computing power of its cores when vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures uncompetitive, in terms of absolute performance, with conventional multicore systems for the execution of speculatively parallelized code.
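As background to the technique evaluated here, the following C++ sketch illustrates the core idea of software thread-level speculation: loop iterations are grouped into chunks that execute optimistically in parallel, buffering their writes in private logs, and the chunks then commit in order; a chunk that consumed a value later found stale is squashed and re-executed. This is a deliberately simplified illustration, not the library evaluated in the paper; all names (SpecChunk, run_chunk, spec_read) are hypothetical.

// Minimal sketch of software thread-level speculation (TLS) over a loop.
// NOT the library evaluated in the paper; all names are illustrative.
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

constexpr int N = 1024;    // loop trip count
constexpr int CHUNK = 256; // iterations per speculative chunk

int data[N]; // shared array updated by the loop
int idx[N];  // indirection pattern, unknown until runtime

struct SpecChunk {
    std::vector<std::pair<int, int>> writes; // (location, value) log
    std::vector<int> reads;                  // shared locations read
};

// Read with intra-chunk forwarding: a location this chunk already wrote
// is served from its own log; only reads of shared state are recorded.
static int spec_read(SpecChunk& c, int loc) {
    for (auto it = c.writes.rbegin(); it != c.writes.rend(); ++it)
        if (it->first == loc) return it->second;
    c.reads.push_back(loc);
    return data[loc];
}

// Execute iterations [lo, hi) speculatively; writes go to the log, not
// to the shared array, so concurrent chunks never race on data[].
static void run_chunk(int lo, int hi, SpecChunk& c) {
    for (int i = lo; i < hi; ++i)
        c.writes.push_back({i, spec_read(c, idx[i]) + 1}); // data[i] = data[idx[i]] + 1
}

// RAW conflict: an earlier chunk wrote a location that a later chunk
// read from shared state, so the later chunk used a stale value.
static bool conflicts(const SpecChunk& earlier, const SpecChunk& later) {
    for (const auto& w : earlier.writes)
        for (int r : later.reads)
            if (w.first == r) return true;
    return false;
}

int main() {
    for (int i = 0; i < N; ++i) { data[i] = i; idx[i] = (i * 7) % N; }

    // Speculative phase: all chunks run in parallel against a read-only
    // view of data[], logging their reads and buffering their writes.
    std::vector<SpecChunk> chunks(N / CHUNK);
    std::vector<std::thread> pool;
    for (int c = 0; c < N / CHUNK; ++c)
        pool.emplace_back(run_chunk, c * CHUNK, (c + 1) * CHUNK,
                          std::ref(chunks[c]));
    for (auto& t : pool) t.join();

    // Commit phase, in chunk order: on conflict with any earlier chunk,
    // discard the log and re-execute (now against committed state).
    for (int c = 0; c < N / CHUNK; ++c) {
        bool squash = false;
        for (int e = 0; e < c && !squash; ++e)
            squash = conflicts(chunks[e], chunks[c]);
        if (squash) {
            chunks[c] = SpecChunk{};
            run_chunk(c * CHUNK, (c + 1) * CHUNK, chunks[c]);
        }
        for (const auto& w : chunks[c].writes) data[w.first] = w.second;
    }
    std::printf("data[0]=%d, data[N-1]=%d\n", data[0], data[N - 1]);
}

Note that a production TLS runtime detects conflicts and commits incrementally while threads run, rather than in a separate barrier-separated phase, and uses version structures far cheaper than these linear log scans; the sketch trades that efficiency for clarity.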

Keywords

Thread-level speculation · Speculative parallelization · Optimistic parallelization · Xeon Phi

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Alvaro Estebanez (1)
  • Diego R. Llanos (1)
  • Arturo Gonzalez-Escribano (1)

  1. Departamento de Informática, Universidad de Valladolid, Valladolid, Spain
