
Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays

  • Article
  • Science China Physics, Mechanics & Astronomy

Abstract

The computational capability of a coarse-grained reconfigurable array (CGRA) can be severely limited by data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to address the context bottleneck. The first loads the context into the CGRA at run time; it occupies very little on-chip memory but incurs very large latency, which leads to low computational efficiency. The second adopts a multi-context structure: the contexts are loaded into the on-chip context memory at the boot phase, and broadcasting a pointer to a set of contexts changes the hardware configuration on a cycle-by-cycle basis. However, the context memory imposes a large area overhead in multi-context structures, which severely restricts application complexity. This paper proposes a Predictable Context Cache (PCC) architecture that addresses these context issues by buffering the context inside the CGRA, into which context is transferred dynamically. Utilizing a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by the size of that memory. For the data bandwidth issue, data preloading is the most frequently used approach to hide input data latency and speed up data transmission. Rather than fundamentally reducing the amount of input data, it overlaps data transfer with computation. However, data preloading cannot work efficiently as the scale of the reconfigurable array increases, because data transmission becomes the critical path. This paper therefore also presents a Hierarchical Data Memory (HDM) architecture, which provides high internal bandwidth to buffer both reused input data and intermediate data.
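
The claim that preloading stops helping as the array grows can be illustrated with a simple timing model. The sketch below is not taken from the paper: the function names, the per-block transfer cost of 100 cycles, and the 1600 PE-cycles of work per block are all hypothetical values chosen for illustration. It assumes that the transfer of the next block fully overlaps the computation of the current one, and that computation time shrinks in proportion to the number of processing elements (PEs).

```python
def serial_cycles(num_blocks, transfer_cycles, compute_cycles):
    """Total cycles when each block is transferred, then computed, with no overlap."""
    return num_blocks * (transfer_cycles + compute_cycles)


def preloaded_cycles(num_blocks, transfer_cycles, compute_cycles):
    """Total cycles when block i+1 is preloaded while block i is being computed.

    The first transfer and the last computation cannot be overlapped; in
    between, each step costs whichever of the two activities is slower.
    """
    if num_blocks == 0:
        return 0
    steady_state = max(transfer_cycles, compute_cycles)
    return transfer_cycles + (num_blocks - 1) * steady_state + compute_cycles


if __name__ == "__main__":
    # Hypothetical workload: 10 blocks, 100 cycles to transfer each block,
    # 1600 PE-cycles of work per block spread evenly over the array.
    NUM_BLOCKS, TRANSFER, WORK = 10, 100, 1600
    for num_pes in (4, 16, 64):
        compute = WORK // num_pes
        print(num_pes,
              serial_cycles(NUM_BLOCKS, TRANSFER, compute),
              preloaded_cycles(NUM_BLOCKS, TRANSFER, compute))
```

Under these assumed numbers, the preloaded schedule takes 4100 cycles on 4 PEs and 1100 cycles on 16 PEs, but still 1025 cycles on 64 PEs: once computation per block is shorter than the transfer, the transfer alone sets the pace, which is the situation the HDM architecture is meant to relieve.
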
The HDM architecture relieves the external memory of the data transfer burden, so performance is significantly improved. With PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% with a reasonable memory size. As a result, 1080p@35.7fps H.264 high profile video decoding can be achieved on the PCC and HDM architecture at a working frequency of 200 MHz. Furthermore, complex applications were no longer restricted by the size of the on-chip context memory and executed efficiently on the PCC and HDM architecture.
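
To put the reported operating point in perspective, the short calculation below derives the average cycle budget per macroblock implied by 1080p@35.7fps at 200 MHz. It is a back-of-the-envelope sketch, not a figure from the paper: it assumes 16×16 H.264 macroblocks, a frame padded to a whole number of macroblocks, and an equal share of cycles for every macroblock. The function name `cycle_budget` is hypothetical.

```python
import math


def cycle_budget(clock_hz, fps, width, height, mb_size=16):
    """Return (macroblocks per frame, cycles per frame, average cycles per macroblock).

    Assumes the frame is padded up to a whole number of mb_size x mb_size
    macroblocks and that the cycle budget is shared equally among them.
    """
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    cycles_per_frame = clock_hz / fps
    return mbs_per_frame, cycles_per_frame, cycles_per_frame / mbs_per_frame


if __name__ == "__main__":
    mbs, per_frame, per_mb = cycle_budget(200e6, 35.7, 1920, 1080)
    print(f"{mbs} macroblocks/frame, {per_frame:.0f} cycles/frame, "
          f"{per_mb:.1f} cycles/macroblock")
```

Under these assumptions a 1080p frame contains 120 × 68 = 8160 macroblocks, so the array has roughly 5.6 million cycles per frame, or about 687 cycles on average to decode each macroblock.
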

Author information

Corresponding author

Correspondence to LeiBo Liu.

About this article

Cite this article

Yang, C., Liu, L., Yin, S. et al. Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays. Sci. China Phys. Mech. Astron. 57, 2214–2227 (2014). https://doi.org/10.1007/s11433-014-5610-2

  • DOI: https://doi.org/10.1007/s11433-014-5610-2
