Abstract
Vanishing or exploding gradients of the cost function and slow convergence are key problems in training deep neural networks (DNNs). In this paper, we investigate the forward and backward propagation processes of DNN training and explore the properties of the activation function and derivative function (ADF) employed. We propose that the output distribution of the ADF should have near-zero mean to mitigate gradient problems, and further propose keeping the energy of the data propagated during training constant to accelerate convergence. Based on wavelet frame theory, we derive a novel ADF, namely the tight frame wavelet activation function (TFWAF) and the tight frame wavelet derivative function (TFWDF) of the Mexican hat wavelet, to stabilize and accelerate DNN training. The nonlinearity of the wavelet functions strengthens the learning capacity of DNN models, while the sparsity of the derived wavelets reduces overfitting and enhances model robustness. Experiments demonstrate that the proposed method stabilizes the DNN training process and accelerates convergence.
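For intuition, the minimal sketch below implements the Mexican hat (Ricker) mother wavelet and an illustrative multi-scale activation built from dilated copies of it. The function names, scale values, and equal weighting are assumptions for illustration only; the paper's TFWAF/TFWDF choose dilations and weights so that the wavelet family satisfies the tight frame condition.

```python
import numpy as np

def mexican_hat(x, scale=1.0):
    # Mexican hat (Ricker) mother wavelet: psi(t) = (1 - t^2) * exp(-t^2 / 2),
    # evaluated at t = x / scale. Its zero mean and rapid decay yield
    # activation outputs centered near zero, as the abstract advocates.
    t = x / scale
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def tfwaf_like(x, scales=(1.0, 2.0, 4.0)):
    # Illustrative multi-scale activation: a normalized sum of dilated
    # Mexican hat wavelets. The actual TFWAF selects the dilation set so
    # the family forms a tight frame; these scales are placeholders.
    return sum(mexican_hat(x, s) / np.sqrt(s) for s in scales)

if __name__ == "__main__":
    z = np.linspace(-5.0, 5.0, 11)
    print(tfwaf_like(z))  # near-zero-mean, localized (sparse) response
```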
Data availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.
Funding
This study was funded by the Guangzhou Municipal Science and Technology Bureau of China (Research Grant No. 202002030133).
Ethics declarations
Conflict of interest
The authors certify that there is no actual or potential conflict of interest in relation to this article.
About this article
Cite this article
Cao, H. Fast deep learning with tight frame wavelets. Neural Comput & Applic 36, 4885–4905 (2024). https://doi.org/10.1007/s00521-023-09260-y