Abstract
Speech coding compresses speech signals into a smaller digital form that is easier to transmit or store, while maintaining the quality and intelligibility of the speech. This review aimed to identify and analyse the most effective waveform-based nonlinear prediction techniques for speech coding, including neural networks and polynomial filters. The study analyzed 29 publications from 2000 to 2023 and found that neural network-based models are widely used for speech compression, with RNN topologies favored for their ability to model the nonlinearity and nonstationarity of speech. Although nonlinear adaptive speech prediction techniques have been explored for speech coding, further research is needed to optimize the adaptive algorithms used in these models. The review also identified a need for future research on quality performance and computational cost, and suggested further exploration of RNN predictor models. The methodology followed a computer science approach to systematic reviews with three main phases: planning, conducting, and reporting. Six stages were followed: determining the research questions, defining the research approach, setting the study selection criteria, setting the quality measurement criteria, defining the data extraction strategy, and synthesizing the extracted data. Overall, this study highlights the need for continued research on the development and improvement of neural network-based speech compression models.
Data availability
The findings of this systematic literature review are based on previously published studies. All necessary references, including authors, titles, publication years, and sources, have been provided in the reference section. No primary data were generated during this study. Readers can access the data by referring to the original publications cited in this manuscript. Contact information for corresponding authors can be found within the respective publications. No additional datasets or supplementary materials were used. The methods, including study selection and data synthesis, are described in the Methods section. The search strategy employed and databases used are also provided. We acknowledge the authors of the included studies for their contributions to the existing literature.
Abbreviations
- ADC: Analog-to-digital converter
- ADPCM: Adaptive differential pulse code modulation
- APCM: Adaptive pulse code modulation
- ATC: Adaptive transform coding
- BPTT: Backpropagation through time
- CELP: Code-excited linear prediction
- CNN: Convolutional neural network
- CWT: Continuous wavelet transform
- DCT: Discrete cosine transform
- DM: Delta modulation
- DPCM: Differential pulse code modulation
- DWT: Discrete wavelet transform
- FFT: Fast Fourier transform
- GRUs: Gated recurrent units
- ITU-T: International Telecommunication Union, Telecommunication Standardization Sector
- LMS: Least mean squares
- LPC: Linear predictive coding
- LSTM: Long short-term memory
- MDCT: Modified discrete cosine transform
- MELP: Mixed excitation linear prediction
- MLP: Multilayer perceptron
- MOS: Mean opinion score
- MSE: Mean squared error
- IMA: Interactive Multimedia Association
- SNR: Signal-to-noise ratio
- SEGSNR: Segmental signal-to-noise ratio
- PCM: Pulse code modulation
- POLQA: Perceptual Objective Listening Quality Assessment
- TIMIT: Texas Instruments/Massachusetts Institute of Technology acoustic-phonetic continuous speech corpus
- NN: Neural network
- RELP: Residual excited linear prediction
- RLS: Recursive least squares
- RNN: Recurrent neural network
- RTRL: Real-time recurrent learning
- SBC: Sub-band coding
- SELP: Stochastic excitation linear prediction
- SLR: Systematic literature review
- VFC: Variance fractal compression
- VoIP: Voice over internet protocol
References
Alipoor, G. H., & Savoji, M. H. (2006). Speech coding using non-linear prediction based on Volterra series expansion. SPECOM
Alipoor, G., & Savoji, M. H. (2007). Nonlinear speech coding using backward adaptive variable-length quadratic filters. In ISPA 2007 - Proceeding of the 5th international symposium on image and signal processing and analysis, (pp. 185–189). https://doi.org/10.1109/ISPA.2007.4383687.
Alipoor, G., & Savoji, M. H. (2012). Wide-band speech coding using kernel methods and bandwidth extension based on parametric stereo. In 2012 Proceedings of the 20th European signal processing conference (EUSIPCO) (pp. 2767–2771). IEEE
Alqushaibi, A., Abdulkadir, S. J., Rais, H. M., & Al-Tashi, Q. (2020). A review of weight optimization techniques in recurrent neural networks. In 2020 international conference on computational intelligence (ICCI) (pp. 196–201). IEEE
Ashdown, I. (2006, September). Extended parallel pulse code modulation of LEDs. In Sixth international conference on solid state lighting (Vol. 6337, pp. 169–178). SPIE. https://doi.org/10.1117/12.679674.
Bellec, G., Scherr, F., Hajek, E., Salaj, D., Legenstein, R., & Maass, W. (2019). Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets, pp. 1–37. http://arxiv.org/abs/1901.09049.
Berglund, K. (2004). Speech compression and tone detection in a real-time system
Besacier, L., Bergamini, C., Vaufreydaz, D., & Castelli, E. (2001, October). The effect of speech and audio compression on speech recognition performance. In 2001 IEEE fourth workshop on multimedia signal processing (Cat. No. 01TH8564) (pp. 301–306). IEEE.
Cernak, M., & Asaei, A. (2016). Cognitive speech coding (No. REP_WORK). Idiap
Chavan, K., Jawale, P., Pzatil, S., & Mumbai, N. (2016). SPEECH CODING. Vol. 40, no. 40, pp. 117–120.
Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2015). On the properties of neural machine translation: Encoder–decoder approaches (pp. 103–111). https://doi.org/10.3115/v1/w14-4012.
D'Alessandro, G., Zanuy, M. F., & Piazza, F. (2002, May). A new subband non linear prediction coding algorithm for narrowband speech signal: The nADPCMB⊥ MLT coding scheme. In 2002 IEEE international conference on acoustics, speech, and signal processing (Vol. 1, pp. I-1025). IEEE. https://doi.org/10.1109/icassp.2002.5743969.
Despotovic, V., Görtz, N., & Peric, Z. (2012, September). Low-order volterra long-term predictors. In Speech communication; 10. ITG symposium (pp. 1–4). VDE
Despotović, V., & Perić, Z. (2013, November). Design of nonlinear predictors for adaptive predictive coding of speech signals. In 2013 21st telecommunications forum Telfor (TELFOR) (pp. 490–497). IEEE. https://doi.org/10.1109/TELFOR.2013.6716274.
Despotović, V., Görtz, N., & Perić, Z. (2012). Improved non-linear long-term predictors based on Volterra filters. International Symposium Electronics in Marine, 2, 231–234.
Faundez-Zanuy, M. (2015). Nonlinear predictive models computation in ADPCM schemes. In European signal processing conference (Vol. 2015, pp. 6–9, 2000).
Faúndez-Zanuy, M. (2003). Wide band sub-band speech coding using non-linear prediction. In ICASSP, IEEE international conference on acoustic speech signal processing—Proceedings (Vol. 2, no. 1, pp. 181–184) https://doi.org/10.1109/icassp.2003.1202324.
Faundez-Zanuy, M. (2005). Nonlinear speech processing: Overview and possibilities in speech coding. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (Vol. 3445 LNAI, no. 4, pp. 15–42). https://doi.org/10.1007/11520153_2.
Faúndez-Zanuy, M. (2001). Nonlinear vectorial prediction with neural nets. In Lecture notes in Computer Science (including Subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics), (Vol. 2085 LNCS, no. PART 2, pp. 754–761) https://doi.org/10.1007/3-540-45723-2_91.
Faúndez-Zanuy, M. (2003, June). Non-linear speech coding with MLP, RBF and Elman based prediction. In International work-conference on artificial neural networks (pp. 671–678). Berlin, Heidelberg. Springer. https://doi.org/10.1007/3-540-44869-1_85.
Faundez-Zanuy, M. (2006). Speech coding through adaptive combined nonlinear prediction. Speech Communication, 48(7), 838–847. https://doi.org/10.1016/j.specom.2005.09.007
Franeese, M. F. (1998). Marcos Fatindez-Zanuy *, pp. 345–348, 1998.
Abou Haidar, G., Achkar, R., & Dourgham, H. (2016, November). A comparative simulation study of the real effect of PCM, DM and DPCM systems on audio and image modulation. In 2016 IEEE international multidisciplinary conference on engineering technology (IMCET) (pp. 144–149). IEEE
Haque, M., & Bhattacharyya, K. (2016). A review on speech filtering and its different techniques. Journal of Engineering Technology, 4(1), 196–200.
Izumi, T., & Iiguni, Y. (2006). Data compression of nonlinear time series using a hybrid linear/nonlinear predictor. Signal Processing, 86(9), 2439–2446. https://doi.org/10.1016/j.sigpro.2005.11.013
Jagtap, S. K., Mulye, M. S., & Uplane, M. D. (2015). Speech coding techniques. Procedia Computer Science, 49(1), 253–263. https://doi.org/10.1016/j.procs.2015.04.251
Jayasankar, U., Thirumal, V., & Ponnurangam, D. (2021). A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications. Journal of King Saud University-Computer and Information Sciences, 33(2), 119–140. https://doi.org/10.1016/j.jksuci.2018.05.006
Kaladharan, N. (2017). A review of different speech coding methods. International Journal of Electricals and Electronics Engineering Telecommunication, 6(2), 96–103.
Karpathy, A., Johnson, J., & Fei-Fei, L. (2015). Visualizing and understanding recurrent networks, pp. 1–12. http://arxiv.org/abs/1506.02078.
Keles, H. Y., Rozhon, J., Ilk, H. G., & Voznak, M. (2019). DeepVoCoder: A CNN model for compression and coding of narrow band speech. IEEE Access, 7, 75081–75089.
Kitchenham, B., & Charters, S. M. (2007). Guidelines for performing systematic literature reviews in software engineering, EBSE Technical Report EBSE-2007-01, Software Engineering Group School of Computer Science and Ma.
Kleijn, W. B., Lim, F. S., Luebs, A., Skoglund, J., Stimberg, F., Wang, Q., & Walters, T. C. (2018, April). Wavenet based low rate speech coding. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 676–680). IEEE. https://doi.org/10.1109/ICASSP.2018.8462529.
Kofod-Petersen, A. (2012). How to do a structured literature review in computer science. Ver. 0.1. October, 1
Laskov, L., Georgieva, V., & Dimitrov, K. (2020). Analysis of pulse code modulation in MATLAB/octave environment. In 2020 55th international science conference on information, communication energy system technology. (ICEST 2020-Proceeding) (pp. 77–80). https://doi.org/10.1109/ICEST49890.2020.9232755
Li, Z. N., Drew, M. S., & Liu, J. (2021). Basic audio compression techniques. Fundamentals of Multimedia, 479–504
Ling, Z. H., Ai, Y., Gu, Y., & Dai, L. R. (2018). Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension. IEEE/ACM Transactions on Audio Speech and Language Processing, 26(5), 883–894. https://doi.org/10.1109/TASLP.2018.2798811
Lotfidereshgi, R., & Gournay, P. (2018, April). Speech prediction using an adaptive recurrent neural network with application to packet loss concealment. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5394–5398). IEEE.
Mansour, C., Achkar, R., & Haidar, G. A. (2012). Simulation of DPCM and ADM systems. In Proceedings—2012 14th international conference modelling and simulation, (UKSim 2012) (no. 4, pp. 416–421). https://doi.org/10.1109/UKSim.2012.64.
Mishra, S., & Singh, S. (2016). A survey paper on different data compression techniques.
Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech recognition using deep neural networks: A systematic review. IEEE Access, 7, 19143–19165. https://doi.org/10.1109/ACCESS.2019.2896880
Nosouhian, S., Nosouhian, F., & Khoshouei, A. K. (2021). A review of recurrent neural network architecture for sequence learning: Comparison between LSTM and GRU. Preprint, no. July, pp. 1–7, https://doi.org/10.20944/preprints202107.0252.v1.
O’Shaughnessy, D. (2023). Review of methods for coding of speech signals. EURASIP Journal of Audio, Speech, Music Processing, 1, 2023. https://doi.org/10.1186/s13636-023-00274-x
Bäckström, T. (2017). Speech coding with code-excited linear prediction (pp. 37–41). Springer.
Pandey, S., & Banerjee, A. (2022). Optimal non-uniform sampling by branch-and-bound approach for speech coding. IEEE Access, 10, 2797–2812. https://doi.org/10.1109/ACCESS.2021.3138068
Pérez-Ortiz, J. A., Calera-Rubio, J., & Forcada, M. L. (2001, September). A comparison between recurrent neural architectures for real-time nonlinear prediction of speech signals. In Neural networks for signal processing XI: Proceedings of the 2001 IEEE signal processing society workshop (IEEE Cat. No. 01TH8584) (pp. 73–81). IEEE. https://doi.org/10.1109/nnsp.2001.943112.
Polynomial, A., Volterra, V., & Wiener, N. (1958) 10. Adaptive Volterra Filters.
Qu, L., Lyu, J., Li, W., Ma, D., & Fan, H. (2021). Features injected recurrent neural networks for short-term traffic speed prediction. Neurocomputing, 451, 290–304. https://doi.org/10.1016/j.neucom.2021.03.054
Raina, S. B., Raina, R., & Agarwal, V. (2014). Wireless speech coding: A systematic review.
Ray, M., Chandra, M., & Patil, B. P. (2015). Speech coding techniques for VoIP applications: A technical review. World Applied Sciences Journal. https://doi.org/10.5829/idosi.wasj.2015.33.05.148
Riera-Palou, F., Den Brinker, A. C., & Gerrits, A. J. (2004, November). A hybrid parametric-waveform approach to bit stream scalable audio coding. In Conference record of the thirty-eighth asilomar conference on signals, systems and computers, 2004. (Vol. 2, pp. 2250–2254). IEEE. https://doi.org/10.1109/acssc.2004.1399568.
Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. https://doi.org/10.1016/j.physd.2019.132306
Somers, H. (1999). An overview of digital. Structure. https://doi.org/10.1016/B978-0-12-373580-5.50038-7
Stachurski, J., & McCree, A. (2000, September). Combining parametric and waveform-matching coders for low bit-rate speech coding. In 2000 10th European signal processing conference (pp. 1–4). IEEE.
Tanaka, H., & Shimamura, T. (2004, September). Nonlinear predictive analysis of speech by iterative approach. In 2004 12th European signal processing conference (pp. 2055–2058). IEEE
Taware, D., & Handore, S. (2014). Speech compression techniques. 2(12), 1–7.
Townshend, B. (1991). Nonlinear prediction of speech. In Proceedings of ICASSP, IEEE international conference on acoustics speech and signal processing (Vol. 1, pp. 425–428). https://doi.org/10.1109/icassp.1991.150367
USNA. (2021). Lesson 20: Analog to digital conversion. https://www.usna.edu/ECE/ec312/Lessons/wireless/EC312_Lesson_20_Analog_to_Digital_Course_Notes.pdf.
Varoglu, E., & Hacioglu, K. (2000). Recurrent neural network speech predictor based on dynamical systems approach. IEE Proceedings-Vision, Image and Signal Processing, 147(2), 149–156.
Wang, A., Sun, Z., & Zhang, X. (2002, June). A non-linear prediction speech coding system based on ANN. In Proceedings of the 4th world congress on intelligent control and automation (Cat. No. 02EX527) (Vol. 1, pp. 607–611). IEEE
Wang, G. (2006). Stability study of the SB-ADPCM coder. Signal Processing, 86(2), 319–330. https://doi.org/10.1016/j.sigpro.2005.05.011
Yan, W., Zhang, J., Zhang, S., & Wen, P. (2018). A novel pipelined neural IIR adaptive filter for speech prediction. Applied Acoustics, 141, 64–70. https://doi.org/10.1016/j.apacoust.2018.06.007
Yoshimura, T., Hashimoto, K., Oura, K., Nankaku, Y., & Tokuda, K. (2019, May). Speaker-dependent WaveNet-based delay-free ADPCM speech coding. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 7145–7149). IEEE.
Zacarias-Morales, N., Pancardo, P., Hernández-Nolasco, J. A., & Garcia-Constantino, M. (2021). Attention-inspired artificial neural networks for speech processing: A systematic review. Symmetry (Basel), 13(2), 1–43. https://doi.org/10.3390/sym13020214
Zhang, G. A., Gu, J. Y., Bao, Z. H., Xu, C., & Zhang, S. B. (2014). Joint routing and channel assignment algorithms in cognitive wireless mesh networks. Transactions on Emerging Telecommunications and Technology, 25(3), 294–307. https://doi.org/10.1002/ett
Zhao, Z., Liu, H., & Fingscheidt, T. (2018, September). Nonlinear prediction of speech by echo state networks. In 2018 26th European signal processing conference (EUSIPCO) (pp. 2085–2089). IEEE. https://doi.org/10.23919/EUSIPCO.2018.8553190.
Zhao, H., & Zhang, J. (2009). Pipelined Chebyshev functional link artificial recurrent neural network for nonlinear adaptive filter. IEEE Transactions on Systems, Man, and Cybernetics, Part B Cybernetics, 40(1), 162–172. https://doi.org/10.1109/TSMCB.2009.2024313
Zhen, K., et al. (2022). Scalable and efficient neural speech coding: A hybrid design. IEEE/ACM Transactions on Audio Speech and Language Processing, 30, 12–25. https://doi.org/10.1109/TASLP.2021.3129353
Zhen, K., Sung, J., Lee, M. S., Beack, S., & Kim, M. (2021). Scalable and efficient neural speech coding: A hybrid design. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 12–25.
Acknowledgements
I would like to express my gratitude to the German Academic Exchange Service (DAAD) for providing funding for my PhD studies, including support for tuition fees, research expenses, and a stipend for living expenses. I am also thankful to Jomo Kenyatta University of Agriculture and Technology (JKUAT) for hosting me as a PhD student and providing invaluable academic resources and support.
Funding
This research was supported by a scholarship from DAAD, which funded tuition fees, research expenses, and a living stipend during the author's PhD studies at Jomo Kenyatta University of Agriculture and Technology (JKUAT), Kenya.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sheferaw, G.K., Mwangi, W., Kimwele, M. et al. Waveform based speech coding using nonlinear predictive techniques: a systematic review. Int J Speech Technol 26, 1031–1059 (2023). https://doi.org/10.1007/s10772-023-10072-7
DOI: https://doi.org/10.1007/s10772-023-10072-7