Abstract
With the increasing popularity of deep learning, deep learning architectures are widely utilized in speech recognition. Deep learning based speech recognition has become the state-of-the-art approach for speech recognition tasks due to its outstanding performance over other methods. Deep learning architectures are generally trained with a variant of gradient descent optimization. Mini-batch gradient descent is a variant of gradient descent that updates network parameters after traversing a fixed number of training instances. One limitation of mini-batch gradient descent is that mini-batch samples are selected randomly from the training set. This is undesirable in speech recognition, where training batches should cover the variations present in speech databases. In this study, hybrid mini-batch sample selection strategies are proposed to overcome this limitation. The proposed hybrid strategies combine the gender and accent features of speech databases to select mini-batch samples when training deep learning architectures. Experimental results show that using a hybrid of gender and accent features yields better speech recognition performance than using either feature alone. The proposed hybrid mini-batch sample selection strategies could also benefit other application areas that have metadata information, including image recognition and machine vision.
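The core idea of the abstract — replacing fully random mini-batch selection with selection balanced over gender and accent metadata — can be illustrated with a minimal sketch. This is not the authors' implementation; the sample dictionaries, their `gender`/`accent` keys, and the round-robin grouping scheme are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def hybrid_minibatches(samples, batch_size, seed=0):
    """Yield mini-batches balanced across (gender, accent) groups
    instead of drawing samples fully at random.

    `samples` is a list of dicts carrying 'gender' and 'accent'
    metadata (a hypothetical format; a corpus such as VCTK provides
    comparable speaker information).
    """
    rng = random.Random(seed)
    # Bucket the training set by the hybrid (gender, accent) key.
    groups = defaultdict(list)
    for s in samples:
        groups[(s["gender"], s["accent"])].append(s)
    # Shuffle within each metadata group for randomness inside groups.
    for g in groups.values():
        rng.shuffle(g)
    # Round-robin over the groups so each batch mixes genders/accents.
    pools = list(groups.values())
    batch = []
    while any(pools):
        for pool in pools:
            if pool:
                batch.append(pool.pop())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```

With four (gender, accent) groups and a batch size of four, every full batch contains one sample per group, so each gradient update sees all metadata variations rather than a random, possibly skewed subset.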
Availability of data and materials
The datasets generated and/or analyzed during the current study are available in the VCTK Corpus repository: https://datashare.ed.ac.uk/handle/10283/3443.
Abbreviations
- RNN: Recurrent Neural Networks
- BiRNN: Bidirectional Recurrent Neural Networks
- CNN: Convolutional Neural Networks
- LSTM: Long Short-Term Memory
- BLSTM: Bidirectional Long Short-Term Memory
- CRNN: Convolutional Recurrent Neural Networks
- CLDNN: Convolutional, Long Short-Term Memory Deep Neural Network
- GD: Gradient Descent
- CTC: Connectionist Temporal Classification
- LER: Label Error Rate
- VCTK: Voice Cloning Toolkit
Funding
The authors did not receive any funding for this research.
Author information
Authors and Affiliations
Contributions
Yesim Dokuz: Conceptualization, Methodology, Software, Data Curation, Resources, Writing - Original Draft, Visualization. Zekeriya Tufekci: Investigation, Supervision, Validation, Writing- Reviewing and Editing.
Corresponding author
Ethics declarations
Ethics approval
Ethical approval is not necessary for this study.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Dokuz, Y., Tüfekci, Z. Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. Multimed Tools Appl 81, 9969–9988 (2022). https://doi.org/10.1007/s11042-022-12304-5