Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

  • Lee W. Howes
  • Anton Lokhmotov
  • Alastair F. Donaldson
  • Paul H. J. Kelly
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5409)


On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Hofstee, H.P.: Power efficient processor architecture and the Cell processor. In: Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA), pp. 258–262. IEEE Computer Society, Los Alamitos (2005)Google Scholar
  2. 2.
    ClearSpeed Technology: The CSX architecture,
  3. 3.
    Smith, J.E.: Decoupled access/execute computer architectures. ACM Trans. Comput. Syst. 2(4), 289–308 (1984)CrossRefGoogle Scholar
  4. 4.
    Watson, I., Rawsthorne, A.: Decoupled pre-fetching for distributed shared memory. In: Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS), Washington, DC, USA, pp. 252–261. IEEE Computer Society, Los Alamitos (1995)Google Scholar
  5. 5.
    Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC) (2008)Google Scholar
  6. 6.
    Topham, N., Rawsthorne, A., McLean, C., Mewissen, M., Bird, P.: Compiling and optimizing for decoupled architectures. In: Proceedings of Supercomputing (SC), p. 40 (1995)Google Scholar
  7. 7.
    Lau, D.L., Gonzalez, J.G.: The closest-to-mean filter: an edge preserving smoother for Gaussian environments. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2593–2596. IEEE Press, Los Alamitos (1997)Google Scholar
  8. 8.
    Warren, H.S.: Hacker’s Delight. Addison-Wesley, Boston (2002)Google Scholar
  9. 9.
    Carter, L., Gatlin, K.S.: Towards an optimal bit-reversal permutation program. In: Proceedings of Foundations of Computer Science (FOCS), pp. 544–555 (1998)Google Scholar
  10. 10.
    Wright, C.: IBM software development kit for multicore acceleration. Roadrunner tutorial LA-UR-08-2819 (2008),
  11. 11.
    Solar-Lezama, A., Arnold, G., Tancau, L., Bodik, R., Saraswat, V., Seshia, S.: Sketching stencils. In: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation (PLDI), pp. 167–178. ACM, New York (2007)CrossRefGoogle Scholar
  12. 12.
    Saltz, J.H., Mirchandaney, R., Crowley, K.: Run-time parallelization and scheduling of loops. IEEE Trans. Comput. (5), 603–612 (1991)CrossRefGoogle Scholar
  13. 13.
    Fatahalian, K., et al.: Sequoia: programming the memory hierarchy. In: Proceedings of Supercomputing (SC), pp. 83–92 (2006)Google Scholar
  14. 14.
    Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE architecture. In: Proceedings of Supercomputing (SC), pp. 86–96 (2006)Google Scholar
  15. 15.
    Lokhmotov, A., Mycroft, A., Richards, A.: Delayed side-effects ease multi-core programming. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 641–650. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  16. 16.
    Codeplay Software: Portable high-performance compilers,
  17. 17.
    Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco (2002)Google Scholar
  18. 18.
    Griebl, M.: Automatic Parallelization of Loop Programs for Distributed Memory Architectures. University of Passau, Habilitation Thesis (2004)Google Scholar
  19. 19.
    Gaster, B.R.: Streams: Emerging from a shared memory model. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 134–145. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  20. 20.
    Howes, L.W., Lokhmotov, A., Kelly, P.H., Field, A.J.: Optimising component composition using indexed dependence metadata. In: Proceedings of the 1st International Workshop on New Frontiers in High-performance and Hardware-aware Computing (2008)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Lee W. Howes
    • 1
  • Anton Lokhmotov
    • 1
  • Alastair F. Donaldson
    • 2
  • Paul H. J. Kelly
    • 1
  1. 1.Department of ComputingImperial College LondonLondonUK
  2. 2.Codeplay SoftwareEdinburghUK

Personalised recommendations