Analysis of Task Offloading for Accelerators

  • Roger Ferrer
  • Vicenç Beltran
  • Marc Gonzàlez
  • Xavier Martorell
  • Eduard Ayguadé
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5952)


In response to forthcoming heterogeneous multicore and accelerator-based architectures, we have proposed syntactic extensions to C in the form of pragmas, based on OpenMP, that make it easier for programmers to offload parts of their applications to the auxiliary processors. Offloaded tasks can be made more profitable using a simple blocking strategy, and the runtime system is used to better support the overlap of computation and communication while data moves to and from the accelerators.
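The blocking and overlap ideas mentioned above can be illustrated with a minimal sketch in plain C. It is not the pragma syntax or runtime proposed in the paper: `BLOCK`, `dma_get`, `dma_put`, `kernel_scale` and `offload_scale_blocked` are hypothetical names, and the "DMA" calls are synchronous `memcpy` stand-ins. On a real accelerator the transfers would be asynchronous, so fetching the next block could proceed while the current block is being computed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 256 /* hypothetical block size; in practice tuned to local memory */

/* Stand-in for the offloaded kernel: touches only the local buffer. */
static void kernel_scale(float *local, size_t n, float q) {
    for (size_t i = 0; i < n; i++)
        local[i] *= q;
}

/* Stand-ins for DMA transfers (assumption: real ones would be asynchronous). */
static void dma_get(float *local, const float *global, size_t n) {
    memcpy(local, global, n * sizeof *local);
}
static void dma_put(float *global, const float *local, size_t n) {
    memcpy(global, local, n * sizeof *local);
}

/* Length of the block starting at index start, clipped at the array end. */
static size_t block_len(size_t n, size_t start) {
    return (n - start < BLOCK) ? n - start : BLOCK;
}

/* Scale data[0..n) by q, one block at a time, using two local buffers. */
void offload_scale_blocked(float *data, size_t n, float q) {
    float buf[2][BLOCK];
    int cur = 0;

    assert(data != NULL || n == 0);
    if (n == 0)
        return;

    dma_get(buf[cur], data, block_len(n, 0)); /* fetch the first block */
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = block_len(n, start);
        size_t next = start + BLOCK;

        /* Start fetching the next block into the other buffer. */
        if (next < n)
            dma_get(buf[1 - cur], data + next, block_len(n, next));

        kernel_scale(buf[cur], len, q);       /* compute on the current block */
        dma_put(data + start, buf[cur], len); /* write the results back */
        cur = 1 - cur;                        /* swap buffers */
    }
}
```

Two buffers are used so that one can be filled while the other is processed; with truly asynchronous transfers, a wait on the prefetch would be needed before the buffers are swapped.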

To demonstrate the feasibility and usefulness of our proposal, we have considered the IBM Cell architecture. The performance of the whole system has been evaluated using HPCC STREAM Triad and several NAS benchmarks. We present the evaluation results and a detailed performance breakdown at the level of parallel regions. We also classify the parallel regions according to their suitability for execution on accelerators. Overall, our results are better than those obtained with the IBM compiler for the Cell processor.
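For readers unfamiliar with the STREAM Triad benchmark named above, its kernel computes a[i] = b[i] + q * c[i] over large arrays. The sketch below shows it as an ordinary OpenMP parallel loop in C. The function name `stream_triad` is an assumption for illustration, and this is the generic host-side form, not the offloaded version evaluated in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM Triad kernel: a[i] = b[i] + q * c[i] for i in [0, n). */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    /* Standard OpenMP worksharing; the pragma is ignored if OpenMP is off. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + q * c[i];
}
```

Each iteration reads two values and writes one while doing only two floating-point operations, so the kernel is usually limited by memory bandwidth rather than by arithmetic.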


Keywords (added by machine, not by the authors): Memory Module, Runtime System, Parallel Region, Parallel Loop, Multicore Architecture





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Roger Ferrer (1)
  • Vicenç Beltran (1)
  • Marc Gonzàlez (1, 2)
  • Xavier Martorell (1, 2)
  • Eduard Ayguadé (1, 2)
  1. Barcelona Supercomputing Center
  2. Departament d’Arquitectura de Computadors, Univ. Politècnica de Catalunya, Barcelona, Spain
