Energy Efficient Stencil Computations on the Low-Power Manycore MPPA-256 Processor

  • Emmanuel Podestá Jr.
  • Bruno Marques do Nascimento
  • Márcio Castro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)


A new class of highly parallel, low-power manycore chips that cope with energy constraints has been unveiled. Sunway’s SW26010 and Kalray’s MPPA-256 are examples, each featuring more than two hundred cores in a single low-power chip. Although they may present better energy efficiency than general-purpose multicore processors, architectural characteristics such as their limited amount of distributed on-chip memory make the development of efficient scientific parallel applications a challenging task. In this paper, we propose and evaluate a new back-end of PSkel for the low-power manycore MPPA-256 processor. PSkel is a framework that provides a single high-level abstraction for stencil programming on CPUs and GPUs. The new back-end relieves programmers of the burden of explicitly dealing with communications and the hybrid underlying programming model of MPPA-256. Our results show that the energy consumption of stencil applications running on MPPA-256 is up to 7.34x and 4.71x lower than on an Intel Xeon E5 multicore and an NVIDIA Tesla K40 GPU, respectively.


MPPA-256 · Manycore · PSkel · Energy efficiency



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Graduate Program in Computer Science (PPGCC), Federal University of Santa Catarina (UFSC), Florianópolis, Brazil
