## Abstract

Convolution of data with a long-tap filter is often implemented by the overlap-save algorithm (OSA) using the fast Fourier transform (FFT). However, the traditional OSA contains redundant computations, because the FFT is applied to the overlapped data (the concatenation of the previous block and the current block) even though the DFT computations are recursive. In this paper, we first analyze the redundancy by decomposing the OSA into two processes related to the previous and current blocks. We then eliminate the redundant computations by introducing a new transform that is applied only to the current data, not to the whole overlapped data. Hence the transform size is halved compared with the traditional OSA. The new transform has the form of a DFT and can be implemented by defining a new butterfly structure. In this paper, however, we implement it as a cascade of twiddle-factor multiplication and a conventional FFT, so that existing FFT libraries for PCs and DSPs can be used. The computational complexity of this implementation is analyzed and compared with that of existing methods. In the experiments, the proposed method is applied to several block convolutions and partitioned-block convolutions. The CPU time is reduced by more than the arithmetic analysis predicts, which implies that the reduced transform size gives an additional advantage in data manipulation.

## Acknowledgements

This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy (MKE).

## Appendices

### Appendix 1: Properties of QDFT

### 1.1 Reversibility of QDFT

### Proof

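Substituting the forward QDFT, \(X_k^q = \sum_{m=0}^{N-1} x_m W_N^{m(k+\frac{3}{4})}\), for \(X_k^q\) in the inverse QDFT of Eq. 17, we have

\[
\frac{1}{N}\sum_{k=0}^{N-1}\left(\sum_{m=0}^{N-1} x_m W_N^{m(k+\frac{3}{4})}\right) W_N^{-n(k+\frac{3}{4})},
\]

and several algebraic steps give

\[
\frac{1}{N}\sum_{m=0}^{N-1} x_m W_N^{\frac{3}{4}(m-n)} \sum_{k=0}^{N-1} W_N^{(m-n)k}. \tag{23}
\]

The summation \(\sum_{k=0}^{N-1} W_N^{(m-n)k}\) in Eq. 23 is zero for all values of *m* except when *m* − *n* = *pN* (*p* ∈ ℤ), in which case it equals *N*. It can therefore be replaced by *N* times an infinite sum of Kronecker delta functions with respect to *p*, and Eq. 23 is reduced to

\[
\sum_{p=-\infty}^{\infty} x_{n+pN} W_N^{\frac{3}{4}pN}. \tag{24}
\]

In Eq. 24, only the *p* = 0 term is non-zero, because \(x_n\) is defined as 0 outside [0, *N* − 1], and that term is \(x_n\). Hence the reversibility of QDFT is proved. □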
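The reversibility can also be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes \(W_N = e^{-j2\pi/N}\) and the forward definition \(X_k^q = \sum_n x_n W_N^{n(k+3/4)}\), and it evaluates the QDFT as a twiddle-factor multiplication followed by an ordinary DFT, which is the cascade mentioned in the abstract. The names `dft`, `qdft` and `iqdft` are hypothetical, and a naive O(N²) DFT stands in for an FFT library call.

```python
import cmath

def dft(x, sign=-1):
    """Naive O(N^2) DFT; sign=-1 is the forward kernel, sign=+1 the unnormalized inverse."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def qdft(x, s=0.75):
    """Forward transform: multiply by the twiddle W_N^(s*n), then apply an ordinary DFT."""
    N = len(x)
    twiddled = [x[n] * cmath.exp(-2j * cmath.pi * s * n / N) for n in range(N)]
    return dft(twiddled)

def iqdft(X, s=0.75):
    """Inverse transform: ordinary inverse DFT, then remove the twiddle with W_N^(-s*n)."""
    N = len(X)
    y = dft(X, sign=+1)
    return [y[n] / N * cmath.exp(2j * cmath.pi * s * n / N) for n in range(N)]

# Round trip on an arbitrary test block (values chosen only for illustration).
x = [1.0, -2.0, 3.5, 0.25, 4.0, -1.5, 0.0, 2.0]
x_rec = iqdft(qdft(x))
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_err < 1e-9)
```

Replacing `dft` with a library FFT changes only the cost, not the result, since the twiddle multiplication is applied outside the transform.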

### 1.2 Convolution Property of QDFT

**Property** Multiplication of two sequences in the QDFT domain, \(X_k^q G_k^q\), corresponds to \(\sum_{k=0}^n x_k g_{n-k} +j\sum_{k=n+1}^{N-1} x_k g_{n-k+N}\) in the time domain.

### Proof

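The *N*-point QDFT-based block convolution can be written as \(\frac{1}{N} \sum_{k=0}^{N-1}X_k^q G_k^q W_N^{-n(k+\frac{3}{4})}\), and several algebraic steps give

\[
\frac{1}{N}\sum_{l=0}^{N-1}\sum_{m=0}^{N-1} x_l g_m W_N^{\frac{3}{4}(l+m-n)} \sum_{k=0}^{N-1} W_N^{(l+m-n)k}. \tag{25}
\]

As in the proof of reversibility, the summation over the index *k* in Eq. 25 is zero for all values of *m* except when *l* + *m* − *n* = *pN* (*p* ∈ ℤ), in which case it equals *N*. It can therefore be replaced by *N* times an infinite sum of Kronecker delta functions with respect to *p*. We may also extend the limits of *m* to infinity, with the understanding that the **x** and **g** sequences are defined as 0 outside [0, *N* − 1]. Continuing with the derivation, we have

\[
\sum_{p=-\infty}^{\infty}\sum_{l=0}^{N-1} x_l\, g_{n-l+pN}\, W_N^{\frac{3}{4}pN},
\]

where \(g_{n-l+pN}\) has non-zero values only if *p* = 0 (for *l* ≤ *n*) or *p* = 1 (for *l* > *n*). Since \(W_N^{\frac{3}{4}N} = e^{-j3\pi/2} = j\), this can be rewritten as

\[
\sum_{l=0}^{n} x_l\, g_{n-l} + j\sum_{l=n+1}^{N-1} x_l\, g_{n-l+N},
\]

which is the stated expression with the summation index renamed, and the convolution property of QDFT is proved. □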
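The property can be verified numerically on a small block. The following sketch is illustrative only and assumes \(W_N = e^{-j2\pi/N}\) with the forward QDFT \(X_k^q = \sum_n x_n W_N^{n(k+3/4)}\); the function names and the test sequences are hypothetical, and the transforms are evaluated directly in O(N²) for clarity.

```python
import cmath

def qdft(x, s=0.75):
    """Direct evaluation of sum_n x_n W_N^(n(k+s)) with W_N = exp(-2j*pi/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * (k + s) / N) for n in range(N))
            for k in range(N)]

def iqdft(X, s=0.75):
    """Direct evaluation of (1/N) sum_k X_k W_N^(-n(k+s))."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * (k + s) / N) for k in range(N)) / N
            for n in range(N)]

def qdft_domain_product(x, g):
    """Multiply the two spectra element-wise and transform back."""
    return iqdft([a * b for a, b in zip(qdft(x), qdft(g))])

def time_domain_formula(x, g):
    """Linear part plus j times the wrapped-around part, as stated in the property."""
    N = len(x)
    out = []
    for n in range(N):
        linear = sum(x[k] * g[n - k] for k in range(n + 1))
        wrapped = sum(x[k] * g[n - k + N] for k in range(n + 1, N))
        out.append(linear + 1j * wrapped)
    return out

x = [1.0, 2.0, -1.0, 0.5]
g = [0.5, -1.0, 2.0, 3.0]
lhs = qdft_domain_product(x, g)
rhs = time_domain_formula(x, g)
max_diff = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_diff < 1e-9)
```

For real-valued **x** and **g**, the two sums land in the real and imaginary parts of the result respectively, so the linear and wrapped-around contributions can be read off separately.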

### Appendix 2: Direct radix-2 and radix-4 Implementation of QDFT

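QDFT can be implemented in a similar way to the conventional FFT. To generalize the discussion, we consider the *N*-point transform \(T_s^N\) defined as

\[
T_s^N(k) = \sum_{n=0}^{N-1} x_n W_N^{n(k+s)}, \quad 0 \le k < N, \tag{27}
\]

where *s* is the amount of shift of the frequency index. From Eq. 27, we obtain the DFT with *s* = 0, the ODFT with *s* = 1/2 and the QDFT with *s* = 3/4. Based on the definition of \(T_s^N\), applying the Cooley–Tukey decomposition to \(T_s^N\) yields two shorter transforms of size *N*/2 and thus the radix-2 structure

\[
\begin{aligned}
T_s^N(2k) &= \sum_{n=0}^{N/2-1}\left(x_n + W_N^{\frac{sN}{2}}\, x_{n+N/2}\right) W_{N/2}^{n\left(k+\frac{s}{2}\right)},\\
T_s^N(2k+1) &= \sum_{n=0}^{N/2-1}\left(x_n - W_N^{\frac{sN}{2}}\, x_{n+N/2}\right) W_{N/2}^{n\left(k+\frac{s+1}{2}\right)},
\end{aligned} \tag{28}
\]

where 0 ≤ *k* < *N*/2. That is, the *N*-point transform \(T_s^N\) is decomposed into two *N*/2-point transforms, \(T_{s/2}^{N/2}\) and \(T_{(s+1)/2}^{N/2}\). The butterfly structure corresponding to Eq. 28 is shown in Fig. 2, where a directed line means that the data is multiplied by −1.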
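The radix-2 decomposition can be sketched as a short recursion. This is a minimal illustration under stated assumptions, not the authors' butterfly code: it assumes \(T_s^N(k) = \sum_n x_n W_N^{n(k+s)}\) with \(W_N = e^{-j2\pi/N}\), uses a decimation-in-frequency split so that the even and odd outputs come from \(T_{s/2}^{N/2}\) and \(T_{(s+1)/2}^{N/2}\), and checks the result against direct evaluation. The names `t_direct` and `t_radix2` are hypothetical.

```python
import cmath

def t_direct(x, s):
    """Direct O(N^2) evaluation of T_s^N(k) = sum_n x_n W_N^(n(k+s))."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * (k + s) / N) for n in range(N))
            for k in range(N)]

def t_radix2(x, s):
    """Recursive radix-2 evaluation of T_s^N; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [x[0]]                      # a 1-point transform is the sample itself
    half = N // 2
    w = cmath.exp(-1j * cmath.pi * s)      # W_N^(s*N/2), identical for every butterfly
    upper = [x[n] + w * x[n + half] for n in range(half)]
    lower = [x[n] - w * x[n + half] for n in range(half)]   # the "-1" branch
    even = t_radix2(upper, s / 2)          # outputs 2k   come from T_{s/2}^{N/2}
    odd = t_radix2(lower, (s + 1) / 2)     # outputs 2k+1 come from T_{(s+1)/2}^{N/2}
    out = [0j] * N
    out[0::2] = even
    out[1::2] = odd
    return out

x = [1.0, -2.0, 3.5, 0.25, 4.0, -1.5, 0.0, 2.0]
errors = {}
for s in (0.0, 0.5, 0.75):                 # DFT, ODFT and QDFT shifts
    ref = t_direct(x, s)
    fast = t_radix2(x, s)
    errors[s] = max(abs(a - b) for a, b in zip(ref, fast))
print(all(e < 1e-9 for e in errors.values()))
```

Note that the shift changes at every level of the recursion, so the usual per-sample twiddle factors of a radix-2 FFT are absorbed into the shifted sub-transforms.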

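\(T_s^N\) can also be implemented by a radix-4 structure, where \(T_s^N\) is decomposed into four shorter transforms of size *N*/4, namely \(T_{s/4}^{N/4}\), \(T_{(s+1)/4}^{N/4}\), \(T_{(s+2)/4}^{N/4}\) and \(T_{(s+3)/4}^{N/4}\), as in Eq. 29:

\[
T_s^N(4k+i) = \sum_{n=0}^{N/4-1}\left(\sum_{q=0}^{3} (-j)^{iq}\, W_N^{\frac{sqN}{4}}\, x_{n+qN/4}\right) W_{N/4}^{n\left(k+\frac{s+i}{4}\right)}, \quad i = 0, 1, 2, 3, \quad 0 \le k < N/4. \tag{29}
\]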

The butterfly structure of Eq. 29 is shown in Fig. 3, where a directed line and a dotted line mean that the data is multiplied by −1 and *j*, respectively. A line that is both directed and dotted therefore means multiplication by −*j*.
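A single radix-4 stage can be checked in the same spirit. The sketch below is an assumption-laden illustration rather than the structure of Fig. 3: it assumes \(T_s^N(k) = \sum_n x_n W_N^{n(k+s)}\) with \(W_N = e^{-j2\pi/N}\), combines the four quarter-blocks with factors from {1, −1, *j*, −*j*} and a shift-dependent twiddle, and evaluates the four *N*/4-point sub-transforms directly. The names `t_direct` and `t_radix4_stage` are hypothetical.

```python
import cmath

def t_direct(x, s):
    """Direct O(N^2) evaluation of T_s^N(k) = sum_n x_n W_N^(n(k+s))."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * (k + s) / N) for n in range(N))
            for k in range(N)]

def t_radix4_stage(x, s):
    """One radix-4 decimation-in-frequency stage; len(x) must be a multiple of 4."""
    N = len(x)
    quarter = N // 4
    out = [0j] * N
    for i in range(4):
        # (-1j) ** (i * q) only takes the values 1, -1, j, -j;
        # exp(-0.5j * pi * s * q) is the shift-dependent twiddle W_N^(s*q*N/4).
        y = [sum((-1j) ** (i * q) * cmath.exp(-0.5j * cmath.pi * s * q) * x[n + q * quarter]
                 for q in range(4))
             for n in range(quarter)]
        # Outputs 4k+i come from the N/4-point transform with shift (s+i)/4.
        out[i::4] = t_direct(y, (s + i) / 4)
    return out

x = [0.5, 1.0, -1.0, 2.0, 3.0, -0.5, 0.0, 1.5,
     -2.0, 0.25, 1.0, 4.0, -3.0, 0.75, 2.5, -1.25]
ref = t_direct(x, 0.75)
stage = t_radix4_stage(x, 0.75)
max_err = max(abs(a - b) for a, b in zip(ref, stage))
print(max_err < 1e-9)
```

In a full implementation the direct sub-transforms would themselves be decomposed recursively; they are kept direct here so that only the first stage is under test.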

## About this article

### Cite this article

Kuk, J.G., Kim, S. & Cho, N.I. A New Overlap Save Algorithm for Fast Block Convolution and Its Implementation Using FFT.
*J Sign Process Syst* **63**, 143–152 (2011). https://doi.org/10.1007/s11265-010-0466-9

DOI: https://doi.org/10.1007/s11265-010-0466-9