A Structured Fast Algorithm for the VLSI Pipeline Implementation of Inverse Discrete Cosine Transform

Chiper, Doru Florin

doi:10.1007/s00034-021-01718-5

Published: 09 April 2021

Circuits, Systems, and Signal Processing, Volume 40, pages 5351–5366 (2021)

Doru Florin Chiper (ORCID: orcid.org/0000-0002-3322-4663)

Abstract

The forward and inverse discrete cosine transforms (DCT and IDCT) have many applications in digital signal processing, but their high arithmetic complexity makes efficient software or VLSI implementations necessary. Existing fast algorithms for the IDCT or DCT have a signal flow graph (SFG) that is not very regular or modular and, more importantly, an interconnection network whose topology is difficult to implement in VLSI because of the so-called retrograde indexing. As a result, there are few pipeline implementations of the IDCT and DCT, although pipelining is an efficient engineering solution that allows high-speed performance with reduced hardware complexity and power consumption. In this paper, we reformulate the IDCT algorithm with a focus on a modular and regular computation structure that can be easily implemented using a pipelined VLSI architecture. Using new restructuring input sequences that can be computed in parallel with the new SFG in a pipeline manner, we present a novel, efficient fast algorithm for computing the IDCT. The resulting SFG has the best structure that can be obtained for the IDCT: it avoids retrograde indexing and is highly regular and modular. Moreover, the SFG is scalable and is easy to extend to larger values of N that are powers of 2. It can also be used to obtain a generalization of a radix-2 algorithm for length \(N = p \cdot 2^{m}\), where p is a prime number. The algorithm is based on a recursive decomposition of the IDCT that requires a reduced number of arithmetic operations. Its main advantages are its simple, regular and modular computational structure and its high potential for pipelining, which make it well suited to an efficient pipelined VLSI implementation.
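
To make the idea of a recursive radix-2 decomposition of the IDCT concrete, the following is a minimal Python sketch. It is not the algorithm proposed in the paper, whose restructured input sequences and SFG are not given in this abstract. It uses a well-known even/odd index split and assumes an unnormalized IDCT convention, \(x[n] = \sum_{k=0}^{N-1} X[k] \cos\left(\frac{(2n+1)k\pi}{2N}\right)\). The function names `idct_direct` and `idct_radix2` are hypothetical, and falling back to the direct sum for odd lengths is only one simple way to handle lengths of the form \(N = p \cdot 2^{m}\).

```python
import math


def idct_direct(X):
    """O(N^2) reference: x[n] = sum_k X[k] * cos((2n+1) k pi / (2N)).

    Assumed convention: unnormalized, no special scaling of X[0].
    """
    N = len(X)
    return [
        sum(X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for k in range(N))
        for n in range(N)
    ]


def idct_radix2(X):
    """Recursive radix-2 split of the same IDCT (illustrative sketch only).

    Odd lengths (including N = 1) fall back to the direct sum, so any
    length of the form p * 2^m is handled.
    """
    N = len(X)
    if N % 2 == 1:
        return idct_direct(X)

    half = N // 2
    # Even-indexed coefficients form one half-length IDCT input.
    G = [X[2 * k] for k in range(half)]
    # Odd-indexed coefficients are combined pairwise, taking X[-1] as 0.
    H = [X[2 * k + 1] + (X[2 * k - 1] if k > 0 else 0.0) for k in range(half)]

    g = idct_radix2(G)
    h = idct_radix2(H)

    x = [0.0] * N
    for n in range(half):
        # The cosine is strictly positive here because its argument is below pi/2.
        t = h[n] / (2.0 * math.cos((2 * n + 1) * math.pi / (2 * N)))
        x[n] = g[n] + t          # output in the first half
        x[N - 1 - n] = g[n] - t  # mirrored output in the second half
    return x
```

Comparing `idct_radix2(X)` with `idct_direct(X)` on a short test vector, up to floating-point rounding, is a quick way to check the decomposition. Each recursion level costs one division and two additions per output pair, which is where the reduction in arithmetic relative to the direct sum comes from.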

Data availability

This manuscript has no associated data.

Acknowledgement

This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS – UEFISCDI, project number PN-III-P4-ID-PCE2020-0713, within PNCDI III.

Author information

Authors and Affiliations

Applied Electronics Department, Technical University “Gheorghe Asachi” Iasi, Iasi, Romania
Doru Florin Chiper

Corresponding author

Correspondence to Doru Florin Chiper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chiper, D.F. A Structured Fast Algorithm for the VLSI Pipeline Implementation of Inverse Discrete Cosine Transform. Circuits Syst Signal Process 40, 5351–5366 (2021). https://doi.org/10.1007/s00034-021-01718-5

Received: 27 May 2020
Revised: 21 March 2021
Accepted: 24 March 2021
Published: 09 April 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s00034-021-01718-5
