HYDRA: Extending Shared Address Programming for Accelerator Clusters

  • Putt Sakdhnagool
  • Amit Sabne
  • Rudolf Eigenmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9519)


This work extends shared-address programming to accelerator clusters through HYDRA, a simple programming model in which the programmer specifies only the parallel regions of the program. We present a fully automatic translation system that generates an MPI + accelerator program from a HYDRA program. Our mechanism ensures scalability of the generated program by optimizing data placement and transfers to and from the limited, discrete memories of accelerator devices. We also present a compiler design built on a high-level IR to support multiple accelerator architectures. Evaluation on five well-known benchmarks demonstrates the scalability of the translated programs: on average, HYDRA achieves a 24.54x speedup over single-accelerator performance on a 64-node Intel Xeon Phi cluster and a 27.56x speedup on a 64-node NVIDIA GPU cluster.





This work was supported, in part, by the National Science Foundation under grants No. 0916817-CCF and 1449258-ACI. This research used resources of the Keeneland Computing Facility at the Georgia Institute of Technology and the Extreme Science and Engineering Discovery Environment (XSEDE), which are supported by the National Science Foundation under awards OCI-0910735 and ACI-1053575, respectively.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Putt Sakdhnagool
  • Amit Sabne
  • Rudolf Eigenmann
  1. School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA
