Waveform based speech coding using nonlinear predictive techniques: a systematic review

International Journal of Speech Technology

Abstract

Speech coding compresses a speech signal into a compact digital representation that is easier to transmit or store while preserving the quality and intelligibility of the speech. This review aimed to identify and analyze the most effective waveform-based nonlinear predictive techniques for speech coding, including neural networks and polynomial filters. The study analyzed 29 publications from 2000 to 2023 and found that neural network-based models are widely used for speech compression, with RNN topologies favored for their ability to capture nonlinearity and nonstationarity. While nonlinear adaptive speech prediction techniques have been explored for speech coding, further research is needed to optimize the adaptive algorithms used in these models. The review also identified a need for future research on quality performance and computational cost, and suggested further exploration of RNN predictor models. The methodology follows a systematic review approach from computer science with three main phases: planning, conducting, and reporting. These phases comprise six stages: determining the research questions, defining the research approach, setting the study selection criteria, setting the quality measurement criteria, defining the data extraction strategy, and synthesizing the extracted data. Overall, this study highlights the need for continued research on the development and improvement of neural network-based speech compression models.
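
To make the idea of waveform-based nonlinear predictive coding concrete, the following is a minimal sketch and not a method taken from any of the reviewed studies. It assumes a closed-loop DPCM-style coder whose predictor is a second-order polynomial (Volterra-type) filter adapted with normalized LMS. All names (`VolterraNLMSPredictor`, `encode`, `decode`) and parameter values (`order`, `mu`, `step`) are hypothetical choices for illustration.

```python
import math


def volterra_features(history):
    """Linear terms plus all second-order products of the past samples."""
    feats = list(history)
    n = len(history)
    for i in range(n):
        for j in range(i, n):
            feats.append(history[i] * history[j])
    return feats


class VolterraNLMSPredictor:
    """Second-order polynomial predictor, adapted sample by sample (assumed design)."""

    def __init__(self, order=2, mu=0.5, eps=1e-6):
        self.mu = mu
        self.eps = eps
        self.history = [0.0] * order  # most recent reconstructed sample first
        self.feats = volterra_features(self.history)
        self.w = [0.0] * len(self.feats)

    def predict(self):
        self.feats = volterra_features(self.history)
        return sum(w * f for w, f in zip(self.w, self.feats))

    def update(self, reconstructed, prediction):
        # Normalized LMS step driven by the quantized prediction error.
        err = reconstructed - prediction
        norm = self.eps + sum(f * f for f in self.feats)
        self.w = [w + self.mu * err * f / norm for w, f in zip(self.w, self.feats)]
        self.history = [reconstructed] + self.history[:-1]


def encode(signal, step=0.05, order=2):
    """Quantize the prediction residual; only integer codes are transmitted."""
    pred = VolterraNLMSPredictor(order)
    codes = []
    for x in signal:
        p = pred.predict()
        q = int(round((x - p) / step))
        codes.append(q)
        # Adapt on the reconstructed sample so the decoder can mirror the state.
        pred.update(p + q * step, p)
    return codes


def decode(codes, step=0.05, order=2):
    """Rebuild the waveform by running the same predictor on decoded samples."""
    pred = VolterraNLMSPredictor(order)
    out = []
    for q in codes:
        p = pred.predict()
        y = p + q * step
        out.append(y)
        pred.update(y, p)
    return out


STEP = 0.05
signal = [math.sin(2 * math.pi * 0.05 * n) for n in range(200)]
codes = encode(signal, step=STEP)
decoded = decode(codes, step=STEP)
max_err = max(abs(x - y) for x, y in zip(signal, decoded))
print(f"max reconstruction error: {max_err:.4f} (step/2 = {STEP / 2})")
```

Because the encoder adapts its predictor on reconstructed samples rather than on the original input, the decoder can reproduce the predictor state exactly from the codes alone, and the per-sample reconstruction error stays within half the quantizer step. A neural predictor such as an RNN could take the place of the polynomial filter in the same loop.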

Data availability

The findings of this systematic literature review are based on previously published studies. All necessary references, including authors, titles, publication years, and sources, have been provided in the reference section. No primary data were generated during this study. Readers can access the data by referring to the original publications cited in this manuscript. Contact information for corresponding authors can be found within the respective publications. No additional datasets or supplementary materials were used. The methods, including study selection and data synthesis, are described in the Methods section. The search strategy employed and databases used are also provided. We acknowledge the authors of the included studies for their contributions to the existing literature.

Abbreviations

ADC:

Analog to digital converter

ADPCM:

Adaptive differential pulse code modulation

APCM:

Adaptive pulse code modulation

ATC:

Adaptive transform coding

BPTT:

Backpropagation through time

CELP:

Code excited linear prediction

CNN:

Convolutional neural network

CWT:

Continuous wavelet transform

DCT:

Discrete cosine transform

DM:

Delta modulation

DPCM:

Differential pulse code modulation

DWT:

Discrete wavelet transform

FFT:

Fast Fourier transform

GRUs:

Gated recurrent units

ITU-T:

International Telecommunication Union, Telecommunication Standardization Sector

LMS:

Least mean squares

LPC:

Linear predictive coding

LSTM:

Long short-term memory

MDCT:

Modified discrete cosine transform

MELP:

Mixed excitation linear prediction

MLP:

Multilayer perceptron

MOS:

Mean opinion score

MSE:

Mean squared error

IMA:

Interactive Multimedia Association

SNR:

Signal-to-noise ratio

SEGSNR:

Segmental signal-to-noise ratio

PCM:

Pulse-code modulation

POLQA:

Perceptual Objective Listening Quality Assessment

TIMIT:

Texas Instruments/Massachusetts Institute of Technology acoustic-phonetic continuous speech corpus

NN:

Neural network

RELP:

Residual excited linear prediction

RLS:

Recursive least squares

RNN:

Recurrent neural network

RTRL:

Real-time recurrent learning

SBC:

Sub-band coding

SELP:

Stochastic excitation linear prediction

SLR:

Systematic literature review

VFC:

Variance fractal compression

VoIP:

Voice over internet protocol

References

  • Alipoor, G. H., & Savoji, M. H. (2006). Speech coding using non-linear prediction based on Volterra series expansion. SPECOM

  • Alipoor, G., & Savoji, M. H. (2007). Nonlinear speech coding using backward adaptive variable-length quadratic filters. In ISPA 2007 - Proceeding of the 5th international symposium on image and signal processing and analysis, (pp. 185–189). https://doi.org/10.1109/ISPA.2007.4383687.

  • Alipoor, G., & Savoji, M. H. (2012). Wide-band speech coding using kernel methods and bandwidth extension based on parametric stereo. In 2012 Proceedings of the 20th European signal processing conference (EUSIPCO) (pp. 2767–2771). IEEE

  • Alqushaibi, A., Abdulkadir, S. J., Rais, H. M., & Al-Tashi, Q. (2020). A review of weight optimization techniques in recurrent neural networks. In 2020 international conference on computational intelligence (ICCI) (pp. 196–201). IEEE

  • Ashdown, I. (2006, September). Extended parallel pulse code modulation of LEDs. In Sixth international conference on solid state lighting (Vol. 6337, pp. 169–178). SPIE. https://doi.org/10.1117/12.679674.

  • Bellec, G., Scherr, F., Hajek, E., Salaj, D., Legenstein, R., & Maass, W. (2019). Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets. 1–37. [Online]. Available: http://arxiv.org/abs/1901.09049.

  • Berglund, K. (2004). Speech compression and tone detection in a real-time system

  • Besacier, L., Bergamini, C., Vaufreydaz, D., & Castelli, E. (2001, October). The effect of speech and audio compression on speech recognition performance. In 2001 IEEE fourth workshop on multimedia signal processing (Cat. No. 01TH8564) (pp. 301–306). IEEE.

  • Cernak, M., & Asaei, A. (2016). Cognitive speech coding (No. REP_WORK). Idiap

  • Chavan, K., Jawale, P., Pzatil, S., & Mumbai, N. (2016). SPEECH CODING. Vol. 40, no. 40, pp. 117–120.

  • Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2015). On the properties of neural machine translation: Encoder–decoder approaches (pp. 103–111). https://doi.org/10.3115/v1/w14-4012.

  • D'Alessandro, G., Zanuy, M. F., & Piazza, F. (2002, May). A new subband non linear prediction coding algorithm for narrowband speech signal: The nADPCMB⊥ MLT coding scheme. In 2002 IEEE international conference on acoustics, speech, and signal processing (Vol. 1, pp. I-1025). IEEE. https://doi.org/10.1109/icassp.2002.5743969.

  • Despotovic, V., Görtz, N., & Peric, Z. (2012, September). Low-order volterra long-term predictors. In Speech communication; 10. ITG symposium (pp. 1–4). VDE

  • Despotović, V., & Perić, Z. (2013, November). Design of nonlinear predictors for adaptive predictive coding of speech signals. In 2013 21st telecommunications forum Telfor (TELFOR) (pp. 490–497). IEEE. https://doi.org/10.1109/TELFOR.2013.6716274.

  • Despotović, V., Görtz, N., & Perić, Z. (2012). Improved non-linear long-term predictors based on Volterra filters. International Symposium Electronics in Marine, 2, 231–234.

  • Faundez-Zanuy, M. (2015). Nonlinear predictive models computation in ADPCM schemes. In European signal processing conference (Vol. 2015, pp. 6–9, 2000).

  • Faúndez-Zanuy, M. (2003). Wide band sub-band speech coding using non-linear prediction. In ICASSP, IEEE international conference on acoustic speech signal processing—Proceedings (Vol. 2, no. 1, pp. 181–184) https://doi.org/10.1109/icassp.2003.1202324.

  • Faundez-Zanuy, M. (2005). Nonlinear speech processing: Overview and possibilities in speech coding. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (Vol. 3445 LNAI, no. 4, pp. 15–42). https://doi.org/10.1007/11520153_2.

  • Faúndez-Zanuy, M. (2001). Nonlinear vectorial prediction with neural nets. In Lecture notes in Computer Science (including Subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics), (Vol. 2085 LNCS, no. PART 2, pp. 754–761) https://doi.org/10.1007/3-540-45723-2_91.

  • Faúndez-Zanuy, M. (2003, June). Non-linear speech coding with MLP, RBF and Elman based prediction. In International work-conference on artificial neural networks (pp. 671–678). Berlin, Heidelberg. Springer. https://doi.org/10.1007/3-540-44869-1_85.

  • Faundez-Zanuy, M. (2006). Speech coding through adaptive combined nonlinear prediction. Speech Communication, 48(7), 838–847. https://doi.org/10.1016/j.specom.2005.09.007

  • Franeese, M. F. (1998). Marcos Fatindez-Zanuy *, pp. 345–348, 1998.

  • Abou Haidar, G., Achkar, R., & Dourgham, H. (2016, November). A comparative simulation study of the real effect of PCM, DM and DPCM systems on audio and image modulation. In 2016 IEEE international multidisciplinary conference on engineering technology (IMCET) (pp. 144–149). IEEE

  • Haque, M., & Bhattacharyya, K. (2016). A review on speech filtering and its different techniques. Journal of Engineering Technology, 4(1), 196–200.

  • Izumi, T., & Iiguni, Y. (2006). Data compression of nonlinear time series using a hybrid linear/nonlinear predictor. Signal Processing, 86(9), 2439–2446. https://doi.org/10.1016/j.sigpro.2005.11.013

  • Jagtap, S. K., Mulye, M. S., & Uplane, M. D. (2015). Speech coding techniques. Procedia Computer Science, 49(1), 253–263. https://doi.org/10.1016/j.procs.2015.04.251

  • Jayasankar, U., Thirumal, V., & Ponnurangam, D. (2021). A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications. Journal of King Saud University-Computer and Information Sciences, 33(2), 119–140. https://doi.org/10.1016/j.jksuci.2018.05.006

  • Kaladharan, N. (2017). A review of different speech coding methods. International Journal of Electricals and Electronics Engineering Telecommunication, 6(2), 96–103.

  • Karpathy, A., Johnson, J., & Fei-Fei, L. (2015). Visualizing and understanding recurrent networks, pp. 1–12. http://arxiv.org/abs/1506.02078.

  • Keles, H. Y., Rozhon, J., Ilk, H. G., & Voznak, M. (2019). DeepVoCoder: A CNN model for compression and coding of narrow band speech. IEEE Access, 7, 75081–75089.

  • Kitchenham, B., & Charters, S. M. (2007). Guidelines for performing systematic literature reviews in software engineering, EBSE Technical Report EBSE-2007-01, Software Engineering Group School of Computer Science and Ma.

  • Kleijn, W. B., Lim, F. S., Luebs, A., Skoglund, J., Stimberg, F., Wang, Q., & Walters, T. C. (2018, April). Wavenet based low rate speech coding. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 676–680). IEEE. https://doi.org/10.1109/ICASSP.2018.8462529.

  • Kofod-Petersen, A. (2012). How to do a structured literature review in computer science. Ver. 0.1. October, 1

  • Laskov, L., Georgieva, V., & Dimitrov, K. (2020). Analysis of pulse code modulation in MATLAB/octave environment. In 2020 55th international science conference on information, communication energy system technology. (ICEST 2020-Proceeding) (pp. 77–80). https://doi.org/10.1109/ICEST49890.2020.9232755

  • Li, Z. N., Drew, M. S., & Liu, J. (2021). Basic audio compression techniques. Fundamentals of Multimedia, 479–504.

  • Ling, Z. H., Ai, Y., Gu, Y., & Dai, L. R. (2018). Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension. IEEE/ACM Transactions on Audio Speech and Language Processing, 26(5), 883–894. https://doi.org/10.1109/TASLP.2018.2798811

  • Lotfidereshgi, R., & Gournay, P. (2018, April). Speech prediction using an adaptive recurrent neural network with application to packet loss concealment. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5394–5398). IEEE.

  • Mansour, C., Achkar, R., & Haidar, G. A. (2012). Simulation of DPCM and ADM systems. In Proceedings—2012 14th international conference modelling and simulation, (UKSim 2012) (no. 4, pp. 416–421). https://doi.org/10.1109/UKSim.2012.64.

  • Mishra, S. (2016). A survey paper on different data compression techniques Saumya Mishra Shraddha Singh.

  • Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech recognition using deep neural networks: A systematic review. IEEE Access, 7, 19143–19165. https://doi.org/10.1109/ACCESS.2019.2896880

  • Nosouhian, S., Nosouhian, F., & Khoshouei, A. K. (2021). A review of recurrent neural network architecture for sequence learning: Comparison between LSTM and GRU. Preprint, no. July, pp. 1–7, https://doi.org/10.20944/preprints202107.0252.v1.

  • O’Shaughnessy, D. (2023). Review of methods for coding of speech signals. EURASIP Journal on Audio, Speech, and Music Processing, 2023(1). https://doi.org/10.1186/s13636-023-00274-x

  • Bäckström, T. (2017). Speech coding with code-excited linear prediction (pp. 37–41). Springer.

  • Pandey, S., & Banerjee, A. (2022). Optimal non-uniform sampling by branch-and-bound approach for speech coding. IEEE Access, 10, 2797–2812. https://doi.org/10.1109/ACCESS.2021.3138068

  • Pérez-Ortiz, J. A., Calera-Rubio, J., & Forcada, M. L. (2001, September). A comparison between recurrent neural architectures for real-time nonlinear prediction of speech signals. In Neural networks for signal processing XI: Proceedings of the 2001 IEEE signal processing society workshop (IEEE Cat. No. 01TH8584) (pp. 73–81). IEEE. https://doi.org/10.1109/nnsp.2001.943112.

  • Polynomial, A., Volterra, V., & Wiener, N. (1958) 10. Adaptive Volterra Filters.

  • Qu, L., Lyu, J., Li, W., Ma, D., & Fan, H. (2021). Features injected recurrent neural networks for short-term traffic speed prediction. Neurocomputing, 451, 290–304. https://doi.org/10.1016/j.neucom.2021.03.054

  • Raina, S. B., Raina, R., & Agarwal, V. (2014). Wireless speech coding : A systematic review.

  • Ray, M., Chandra, M., & Patil, B. P. (2015). Speech coding techniques for VoIP applications: A technical review. World Applied Sciences Journal. https://doi.org/10.5829/idosi.wasj.2015.33.05.148

  • Riera-Palou, F., Den Brinker, A. C., & Gerrits, A. J. (2004, November). A hybrid parametric-waveform approach to bit stream scalable audio coding. In Conference record of the thirty-eighth asilomar conference on signals, systems and computers, 2004. (Vol. 2, pp. 2250–2254). IEEE. https://doi.org/10.1109/acssc.2004.1399568.

  • Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. https://doi.org/10.1016/j.physd.2019.132306

  • Somers, H. (1999). An overview of digital. Structure. https://doi.org/10.1016/B978-0-12-373580-5.50038-7

  • Stachurski, J., & McCree, A. (2000, September). Combining parametric and waveform-matching coders for low bit-rate speech coding. In 2000 10th European signal processing conference (pp. 1–4). IEEE.

  • Tanaka, H., & Shimamura, T. (2004, September). Nonlinear predictive analysis of speech by iterative approach. In 2004 12th European signal processing conference (pp. 2055–2058). IEEE

  • Taware, D., & Handore, S. (2014). Speech compression techniques. 2(12), 1–7.

  • Townshend, B. (1991). Nonlinear prediction of speech. In Proceedings of ICASSP, IEEE international conference on acoustics speech and signal processing (Vol. 1, pp. 425–428). https://doi.org/10.1109/icassp.1991.150367

  • USNA. (2021). Lesson 20 : Analog to digital conversion. Ece, no. c, 2021, [Online]. Available: https://www.usna.edu/ECE/ec312/Lessons/wireless/EC312_Lesson_20_Analog_to_Digital_Course_Notes.pdf.

  • Varoglu, E., & Hacioglu, K. (2000). Recurrent neural network speech predictor based on dynamical systems approach. IEE Proceedings-Vision, Image and Signal Processing, 147(2), 149–156.

  • Wang, A., Sun, Z., & Zhang, X. (2002, June). A non-linear prediction speech coding system based on ANN. In Proceedings of the 4th world congress on intelligent control and automation (Cat. No. 02EX527) (Vol. 1, pp. 607–611). IEEE

  • Wang, G. (2006). Stability study of the SB-ADPCM coder. Signal Processing, 86(2), 319–330. https://doi.org/10.1016/j.sigpro.2005.05.011

  • Yan, W., Zhang, J., Zhang, S., & Wen, P. (2018). A novel pipelined neural IIR adaptive filter for speech prediction. Applied Acoustics, 141, 64–70. https://doi.org/10.1016/j.apacoust.2018.06.007

  • Yoshimura, T., Hashimoto, K., Oura, K., Nankaku, Y., & Tokuda, K. (2019, May). Speaker-dependent WaveNet-based delay-free ADPCM speech coding. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 7145–7149). IEEE.

  • Zacarias-Morales, N., Pancardo, P., Hernández-Nolasco, J. A., & Garcia-Constantino, M. (2021). Attention-inspired artificial neural networks for speech processing: A systematic review. Symmetry (Basel), 13(2), 1–43. https://doi.org/10.3390/sym13020214

  • Zhang, G. A., Gu, J. Y., Bao, Z. H., Xu, C., & Zhang, S. B. (2014). Joint routing and channel assignment algorithms in cognitive wireless mesh networks. Transactions on Emerging Telecommunications and Technology, 25(3), 294–307. https://doi.org/10.1002/ett

  • Zhao, Z., Liu, H., & Fingscheidt, T. (2018, September). Nonlinear prediction of speech by echo state networks. In 2018 26th European signal processing conference (EUSIPCO) (pp. 2085–2089). IEEE. https://doi.org/10.23919/EUSIPCO.2018.8553190.

  • Zhao, H., & Zhang, J. (2009). Pipelined Chebyshev functional link artificial recurrent neural network for nonlinear adaptive filter. IEEE Transactions on Systems, Man, and Cybernetics, Part B Cybernetics, 40(1), 162–172. https://doi.org/10.1109/TSMCB.2009.2024313

  • Zhen, K., et al. (2022). Scalable and efficient neural speech coding: A hybrid design. IEEE/ACM Transactions on Audio Speech and Language Processing, 30, 12–25. https://doi.org/10.1109/TASLP.2021.3129353

  • Zhen, K., Sung, J., Lee, M. S., Beack, S., & Kim, M. (2021). Scalable and efficient neural speech coding: A hybrid design. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 12–25.

Acknowledgements

I would like to express my gratitude to the German Academic Exchange Service (DAAD) for providing funding for my PhD studies, including support for tuition fees, research expenses, and a stipend for living expenses. I am also thankful to Jomo Kenyatta University of Agriculture and Technology (JKUAT) for hosting me as a PhD student and providing invaluable academic resources and support.

Funding

This research was supported by a scholarship from DAAD, which provided funding for tuition fees, research expenses, and a stipend during the author's PhD studies at Jomo Kenyatta University of Agriculture and Technology (JKUAT), Kenya.

Author information

Corresponding author

Correspondence to Gebremichael Kibret Sheferaw.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 9 and 10.

Table 9 List of publications and source information of selected papers
Table 10 Information extracted from the studies in relation to the research questions

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheferaw, G.K., Mwangi, W., Kimwele, M. et al. Waveform based speech coding using nonlinear predictive techniques: a systematic review. Int J Speech Technol 26, 1031–1059 (2023). https://doi.org/10.1007/s10772-023-10072-7

  • DOI: https://doi.org/10.1007/s10772-023-10072-7
