Journal of Signal Processing Systems, Volume 59, Issue 3, pp 281–296

Decoupled Processors Architecture for Accelerating Data Intensive Applications using Scratch-Pad Memory Hierarchy

  • Athanasios Milidonis
  • Nikolaos Alachiotis
  • Vasileios Porpodas
  • Harris Michail
  • Georgios Panagiotakopoulos
  • Athanasios P. Kakarountas
  • Costas E. Goutis


We present an architecture of decoupled processors whose memory hierarchy consists only of scratch-pad memories and a main memory. The architecture exploits the more efficient prefetching of decoupled processors, which take advantage of the parallelism between address computation and application data processing that exists mainly in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system’s performance. The application code is split into two parallel programs. The first runs on the Access processor and computes the addresses of the data in the memory hierarchy. The second processes the application data and runs on the Execute processor, a processor with a limited address space: just the register file addresses. Every transfer of a block through the memory hierarchy, up to the Execute processor’s register file, is controlled by the Access processor and the DMA units. This strongly differentiates the architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with (a) uniprocessor architectures with scratch-pad memory hierarchies, (b) uniprocessor architectures with cache memory hierarchies, and (c) existing decoupled architectures, and it shows higher normalized performance. The gain comes from the efficient data transfers that the scratch-pad memory hierarchy provides, combined with the ability of the decoupled processors to eliminate memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance increases by up to almost 2 times compared to uniprocessor architectures with scratch-pad memories and by up to 3.7 times compared to those with caches. The proposed architecture achieves this performance without penalties in energy-delay product.
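The split into an Access program and an Execute program described above can be illustrated with a small software sketch. It is only an assumption-laden model, not the authors' implementation: the function names, the FIR-style kernel, and the single operand queue standing in for the scratch-pad hierarchy and DMA units are all hypothetical, and the two programs run one after the other here, whereas in the described hardware they would run in parallel.

```python
from collections import deque


def access_program(main_memory, n_taps, n_outputs, queue):
    """Hypothetical Access program: it only computes addresses and
    streams the addressed operands toward the Execute side."""
    for i in range(n_outputs):
        for k in range(n_taps):
            addr = i + k                      # address computation lives here
            queue.append(main_memory[addr])   # stands in for a DMA transfer


def execute_program(coeffs, n_outputs, queue):
    """Hypothetical Execute program: it consumes operands in arrival
    order and never computes a memory address itself."""
    outputs = []
    for _ in range(n_outputs):
        acc = 0
        for c in coeffs:
            acc += c * queue.popleft()
        outputs.append(acc)
    return outputs


def run_decoupled(signal, coeffs):
    """Run both halves in sequence (assumption: real hardware overlaps them)."""
    n_outputs = len(signal) - len(coeffs) + 1
    queue = deque()
    access_program(signal, len(coeffs), n_outputs, queue)
    return execute_program(coeffs, n_outputs, queue)


def run_reference(signal, coeffs):
    """Conventional single-program version of the same kernel, for comparison."""
    n = len(coeffs)
    return [sum(coeffs[k] * signal[i + k] for k in range(n))
            for i in range(len(signal) - n + 1)]


if __name__ == "__main__":
    print(run_decoupled([1, 2, 3, 4, 5], [1, 0, -1]))
```

The point of the sketch is the division of labor: all index arithmetic sits in `access_program`, so `execute_program` touches only values handed to it, loosely mirroring an Execute processor whose address space is limited to its register file.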


Keywords: Decoupled, Scratch pad



This work was supported by the project PENED 2003 No 03ΕD507, which is funded 75% by the European Union (European Social Fund) and 25% by the Greek State (Greek Secretariat for Research and Technology).



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Athanasios Milidonis (1)
  • Nikolaos Alachiotis (1)
  • Vasileios Porpodas (1)
  • Harris Michail (1)
  • Georgios Panagiotakopoulos (1)
  • Athanasios P. Kakarountas (1)
  • Costas E. Goutis (1)

  1. VLSI Design Lab., Electrical & Computer Engineering Department, University of Patras, Patras, Greece
