Architecture Exploration for Efficient Data Transfer and Storage in Data-Parallel Applications

  • Rosilde Corvino
  • Abdoulaye Gamatié
  • Pierre Boulet
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6271)


Due to the complexity of modern data-parallel applications, such as image processing, automatic approaches for inferring suitable and efficient hardware realizations are increasingly required. Typically, the optimization of the data transfer and storage micro-architecture plays a key role in exploiting data parallelism. In this paper, we propose a comprehensive method to explore the mapping of a high-level representation of an application onto a customizable hardware accelerator. The high-level representation is expressed in a language called Array-OL. The customizable architecture uses FIFO queues and a double-buffering mechanism to mask the latency of data transfers and external memory accesses. The mapping of the high-level representation onto the given architecture is performed by applying a set of loop transformations in Array-OL. A method based on integer partitions is used to reduce the space of explored solutions.
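
The abstract does not say how integer partitions are used to prune the design space, so the following is only a minimal Python sketch of the underlying combinatorial notion: enumerating all ways to write a positive integer as a sum of positive integers, ignoring order. The function name `integer_partitions` and its `max_part` parameter are illustrative assumptions and do not come from the paper.

```python
def integer_partitions(n, max_part=None):
    """Yield each partition of n as a non-increasing tuple of positive integers.

    Illustrative helper only (hypothetical name, not the paper's algorithm).
    """
    # No part may exceed the remaining total.
    if max_part is None or max_part > n:
        max_part = n
    # The empty sum is the single partition of 0.
    if n == 0:
        yield ()
        return
    # Choose the largest part first, then partition the remainder
    # using parts no larger than the one just chosen.
    for first in range(max_part, 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


if __name__ == "__main__":
    for p in integer_partitions(5):
        print(p)
```

Because each partition is generated once in non-increasing order, reorderings of the same parts are never revisited. This kind of symmetry removal is one plausible reason a partition-based enumeration can shrink a search space, though the paper's actual use may differ.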


Keywords: design space exploration; data-parallel applications; image processing; Array-OL; hardware architecture; data management





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Rosilde Corvino
    • 1
  • Abdoulaye Gamatié
    • 1
  • Pierre Boulet
    • 1
  1. LIFL - UMR CNRS/USTL 8022, Inria Lille Nord Europe, Parc Scientifique de la Haute Borne, Villeneuve d’Ascq, France
