Abstract
In recent years, deep-learning-based machine learning methods have achieved remarkable success across a wide range of learning tasks in multiple domains. They are well suited to complex classification and regression problems in applications such as computer vision, speech recognition, and other branches of pattern analysis. This article contributes a timely review and introduction of state-of-the-art discriminative deep learning techniques (DNNs, CNNs, and RNNs), covering the basic framework and algorithms, hardware implementations, applications in speech processing, and the overall benefits of deep learning.
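As a concrete illustration of the discriminative CNN pipeline surveyed in the article (convolution, nonlinearity, pooling, classification), the following minimal NumPy sketch runs a 1-D convolution over a toy feature sequence. The input values, the edge-detecting kernel, and the pooling scheme are hypothetical choices for illustration, not learned weights or results from the paper.

```python
import numpy as np

def conv1d_valid(x, w):
    """'Valid'-mode 1-D convolution (cross-correlation) of signal x with kernel w."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def relu(z):
    """Rectified linear unit: the standard nonlinearity in modern CNNs."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Convert raw scores into a probability distribution over classes."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy frame-level feature sequence, standing in for one row of a spectrogram.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])

# Hand-picked edge-detecting kernel (an illustrative assumption, not learned).
w = np.array([1.0, 0.0, -1.0])

h = relu(conv1d_valid(x, w))           # feature map after the nonlinearity
logits = np.array([h.sum(), h.max()])  # crude pooling into two class scores
p = softmax(logits)                    # posterior-like class probabilities
```

In a real speech system the kernel weights would be learned by backpropagation and stacked over many layers; this sketch only makes the data flow of a single convolutional stage concrete.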
Cite this article
Ogunfunmi, T., Ramachandran, R.P., Togneri, R. et al. A Primer on Deep Learning Architectures and Applications in Speech Processing. Circuits Syst Signal Process 38, 3406–3432 (2019). https://doi.org/10.1007/s00034-019-01157-3