Abstract
The computational capability of a coarse-grained reconfigurable array (CGRA) can be severely limited by data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to address the context bottleneck. The first loads the context into the CGRA at run time. It requires very little on-chip memory but incurs very large latency, which leads to low computational efficiency. The second adopts a multi-context structure, in which the context is loaded into the on-chip context memory at the boot phase and the hardware configuration is changed on a cycle-by-cycle basis by broadcasting a pointer to a set of contexts. However, the context memory imposes a large area overhead in multi-context structures, which severely restricts application complexity. This paper proposes a Predictable Context Cache (PCC) architecture that addresses these context issues by buffering the context inside the CGRA. In this architecture, context is dynamically transferred into the CGRA. Using a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by the size of the on-chip context memory.
For the data bandwidth issue, data preloading is the most frequently used approach to hide input data latency and speed up data transmission. Rather than fundamentally reducing the amount of input data, it overlaps data transfer with computation. However, data preloading cannot work efficiently because data transmission becomes the critical path as the scale of the reconfigurable array increases. This paper also presents a Hierarchical Data Memory (HDM) architecture as a solution to this efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. The HDM architecture relieves the external memory of the data transfer burden, so performance is significantly improved. With PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% with a reasonable memory size. As a result, 1080p@35.7fps H.264 high profile video decoding can be achieved on the PCC and HDM architecture at a working frequency of 200 MHz. Further, complex applications were no longer restricted by the size of the on-chip context memory and were executed efficiently on the PCC and HDM architecture.
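To put the reported 1080p@35.7fps at 200 MHz in perspective, the short calculation below estimates the average cycle budget per frame and per macroblock. It is a rough sketch, not a figure from the paper: it assumes that every clock cycle is available for decoding, that the frame is 1920×1080 pixels, and that the budget is spread evenly over the 16×16 macroblocks defined by H.264. The function name `cycle_budget` is illustrative only.

```python
import math


def cycle_budget(clock_hz, fps, width, height, mb_size=16):
    """Return (cycles per frame, macroblocks per frame, cycles per macroblock).

    Assumes all clock cycles are usable for decoding and that the budget
    is spread evenly over the macroblocks (an averaging assumption).
    """
    # H.264 codes whole macroblocks, so partial rows/columns are rounded up.
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    cycles_per_frame = clock_hz / fps
    return cycles_per_frame, mbs_per_frame, cycles_per_frame / mbs_per_frame


# Figures taken from the abstract: 200 MHz clock, 35.7 frames per second.
cycles_per_frame, mbs, cycles_per_mb = cycle_budget(200e6, 35.7, 1920, 1080)
print(f"{cycles_per_frame:,.0f} cycles/frame, {mbs} macroblocks/frame, "
      f"{cycles_per_mb:.1f} cycles/macroblock")
```

Under these assumptions the array has roughly 5.6 million cycles per frame, or about 690 cycles for each of the 8160 macroblocks. A budget that tight suggests why stalls for context or data transfer would directly limit the achievable frame rate.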
Cite this article
Yang, C., Liu, L., Yin, S. et al. Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays. Sci. China Phys. Mech. Astron. 57, 2214–2227 (2014). https://doi.org/10.1007/s11433-014-5610-2
DOI: https://doi.org/10.1007/s11433-014-5610-2