Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays

Yang, Chen; Liu, LeiBo; Yin, ShouYi; Wei, ShaoJun

doi:10.1007/s11433-014-5610-2

Published: 21 October 2014

Volume 57, pages 2214–2227, (2014)

Science China Physics, Mechanics & Astronomy

Abstract

The computational capability of a coarse-grained reconfigurable array (CGRA) can be severely limited by data and context memory bandwidth bottlenecks. Two methods have traditionally been used to address this problem. The first loads the context into the CGRA at run time; it occupies very little on-chip memory but incurs very large latency, which leads to low computational efficiency. The second adopts a multi-context structure: the context is loaded into the on-chip context memory at the boot phase, and broadcasting a pointer to a set of contexts changes the hardware configuration on a cycle-by-cycle basis. However, the context memory in a multi-context structure incurs a large area overhead, which severely restricts application complexity. This paper proposes a Predictable Context Cache (PCC) architecture that addresses these context issues by buffering the context inside the CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by the size of the on-chip context memory.

For the data bandwidth issue, data preloading is the most frequently used approach to hide input data latency and speed up data transmission. Rather than fundamentally reducing the amount of input data, it overlaps data transfer with computation. However, data preloading cannot work efficiently because data transmission becomes the critical path as the scale of the reconfigurable array increases. This paper also presents a Hierarchical Data Memory (HDM) architecture as a solution to this efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. The HDM architecture relieves the external memory of the data transfer burden, so performance is significantly improved. With PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% with a reasonable memory size. As a result, H.264 high profile video decoding at 1080p@35.7fps can be achieved on the PCC and HDM architecture at a working frequency of 200 MHz. Further, complex applications were no longer restricted by the size of the on-chip context memory and were efficiently executed on the PCC and HDM architecture.
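
To put the reported operating point in perspective, the following back-of-the-envelope sketch converts 1080p@35.7fps at 200 MHz into an average cycle budget per frame and per macroblock. The clock frequency and frame rate come from the abstract. The coded frame size of 1920x1088 and the 16x16 macroblock size are assumptions based on how H.264 typically codes 1080p video, and the function names are hypothetical. The result is only an average budget; it is not a breakdown reported by the authors.

```python
# Average cycle budget implied by the abstract's operating point.
# From the abstract: 200 MHz working frequency, 35.7 frames per second.
CLOCK_HZ = 200_000_000
FRAME_RATE = 35.7

# Assumption (not from the abstract): 1080p is coded as 1920x1088 luma
# samples so that the height is a whole number of 16x16 macroblocks.
CODED_WIDTH = 1920
CODED_HEIGHT = 1088
MB_SIZE = 16


def macroblocks_per_frame(width=CODED_WIDTH, height=CODED_HEIGHT, mb=MB_SIZE):
    """Number of macroblocks in one coded frame."""
    return (width // mb) * (height // mb)


def cycles_per_frame(clock_hz=CLOCK_HZ, fps=FRAME_RATE):
    """Average number of clock cycles available to decode one frame."""
    return clock_hz / fps


def cycles_per_macroblock(clock_hz=CLOCK_HZ, fps=FRAME_RATE):
    """Average number of clock cycles available per macroblock."""
    return cycles_per_frame(clock_hz, fps) / macroblocks_per_frame()


if __name__ == "__main__":
    print(f"macroblocks per frame: {macroblocks_per_frame()}")
    print(f"cycles per frame:      {cycles_per_frame():,.0f}")
    print(f"cycles per macroblock: {cycles_per_macroblock():.1f}")
```

Under these assumptions a frame contains 8160 macroblocks, which leaves about 5.6 million cycles per frame, or roughly 686 cycles per macroblock on average. Context loading, data transfer, and computation all have to fit within that budget, which is why hiding or removing memory traffic matters at this frame rate.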

Author information

Authors and Affiliations

Institute of Microelectronics, Tsinghua University, Beijing, 100084, China
Chen Yang, LeiBo Liu, ShouYi Yin & ShaoJun Wei

Corresponding author

Correspondence to LeiBo Liu.

About this article

Cite this article

Yang, C., Liu, L., Yin, S. et al. Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays. Sci. China Phys. Mech. Astron. 57, 2214–2227 (2014). https://doi.org/10.1007/s11433-014-5610-2

Received: 24 September 2014
Accepted: 28 September 2014
Published: 21 October 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s11433-014-5610-2
