Abstract
The forward and inverse DCT have many applications in digital signal processing, but because of their high arithmetic complexity it is necessary to find efficient software implementations, or even VLSI implementations, for them. Existing fast algorithms for the IDCT or DCT have a signal-flow graph (SFG) that is not very regular or modular and, more importantly, the topology of its interconnection network is difficult to implement in VLSI because of the so-called retrograde indexing. As a result, there are few pipelined implementations of the IDCT and DCT, even though pipelining is an efficient engineering solution that delivers high-speed performance with reduced hardware complexity and power consumption. In this paper, we present an efficient way to reformulate the IDCT algorithm, with a focus on developing a modular and regular computational structure that can be easily implemented in a pipelined VLSI architecture. Using new restructured input sequences that can be computed in parallel with the new SFG in a pipelined manner, a novel fast algorithm for the computation of the inverse discrete cosine transform is presented. The resulting SFG has the best structure that can be obtained for the IDCT: it avoids retrograde indexing and is highly regular and modular. Moreover, the SFG is scalable, being easy to extend to larger transform lengths N that are powers of 2. It can also be used to obtain a generalization of the radix-2 algorithm to lengths \(N = p \cdot 2^{m}\), where \(p\) is a prime number. The algorithm is based on a recursive decomposition of the inverse DCT that requires a reduced number of arithmetic operations and has a simple, regular computational structure that can be easily implemented in VLSI in a pipelined manner.
Its main advantages are its simple, regular and modular computational structure and its high potential for pipelining, which allows an efficient pipelined VLSI implementation.
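The recursive radix-2 decomposition mentioned above can be illustrated with a generic textbook recursion in the style of B.G. Lee's 1984 fast DCT algorithm. This is not the restructured SFG proposed in this paper; it is only a sketch, assuming the scaling convention \(x[n] = X[0]/2 + \sum_{k=1}^{N-1} X[k]\cos\frac{\pi(2n+1)k}{2N}\) and N a power of 2. With \(Y[0] = X[0]/2\) and \(Y[k] = X[k]\) otherwise, the even-indexed inputs \(Y[2r]\) give an \(N/2\)-point IDCT \(g(n)\), the odd-indexed inputs folded as \(Y[2r+1] + Y[2r-1]\) give a second \(N/2\)-point IDCT that is divided by \(2\cos\frac{\pi(2n+1)}{2N}\) to yield \(h(n)\), and a butterfly produces \(x[n] = g(n) + h(n)\) and \(x[N-1-n] = g(n) - h(n)\) for \(0 \le n < N/2\). The function names (`dct_ii`, `idct_direct`, `idct_fast`) are hypothetical.

```python
import math
from typing import List


def dct_ii(x: List[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] cos(pi(2n+1)k/(2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]


def idct_direct(X: List[float]) -> List[float]:
    """O(N^2) reference IDCT: x[n] = X[0]/2 + sum_{k>=1} X[k] cos(pi(2n+1)k/(2N))."""
    N = len(X)
    return [X[0] / 2 + sum(X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for k in range(1, N))
            for n in range(N)]


def _cos_sum(Y: List[float]) -> List[float]:
    """Evaluate f(n) = sum_k Y[k] cos(pi(2n+1)k/(2N)) recursively (N a power of 2)."""
    N = len(Y)
    if N == 1:
        return [Y[0]]
    half = N // 2
    # Even-indexed inputs form an N/2-point problem of the same form.
    g = _cos_sum(Y[0::2])
    # Odd-indexed inputs folded pairwise: Z[r] = Y[2r+1] + Y[2r-1],
    # with Z[0] = Y[1] because there is no Y[-1] term.
    Z = [Y[1]] + [Y[2 * r + 1] + Y[2 * r - 1] for r in range(1, half)]
    hz = _cos_sum(Z)
    x = [0.0] * N
    for n in range(half):
        # Dividing by 2 cos(pi(2n+1)/(2N)) recovers the odd-part contribution h(n).
        h = hz[n] / (2 * math.cos(math.pi * (2 * n + 1) / (2 * N)))
        x[n] = g[n] + h          # butterfly, upper output
        x[N - 1 - n] = g[n] - h  # butterfly, mirrored output
    return x


def idct_fast(X: List[float]) -> List[float]:
    """Radix-2 recursive IDCT with the same scaling as idct_direct."""
    N = len(X)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of 2")
    # Absorb the 1/2 weight of the DC term so the recursion needs no special case.
    return _cos_sum([X[0] / 2] + list(X[1:]))
```

Under this scaling, `[2 / N * v for v in idct_fast(dct_ii(x))]` recovers `x`. The mirrored output writes \(x[N-1-n]\) and the pairwise folding of the odd inputs are the kind of irregular index patterns that make such SFGs awkward to map onto a regular pipeline; the sketch does not attempt the restructuring that the paper proposes to avoid them.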
Data availability
This manuscript has no associated data.
Acknowledgement
This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS – UEFISCDI, project number PN-III-P4-ID-PCE2020-0713, within PNCDI III.
Cite this article
Chiper, D.F. A Structured Fast Algorithm for the VLSI Pipeline Implementation of Inverse Discrete Cosine Transform. Circuits Syst Signal Process 40, 5351–5366 (2021). https://doi.org/10.1007/s00034-021-01718-5