Abstract
With the increasing popularity of deep learning, deep learning architectures are widely utilized in speech recognition. Deep learning based speech recognition has become the state-of-the-art approach for speech recognition tasks due to its outstanding performance over other methods. Deep learning architectures are generally trained with a variant of gradient descent optimization. Mini-batch gradient descent is a variant of gradient descent that updates network parameters after traversing a fixed number of training instances. One limitation of mini-batch gradient descent is that mini-batch samples are selected randomly from the training set. This is undesirable in speech recognition, where training batches should cover the variations present in speech databases. In this study, hybrid mini-batch sample selection strategies are proposed to overcome this limitation. The proposed hybrid strategies combine the gender and accent features of speech databases to select mini-batch samples when training deep learning architectures. Experimental results show that using a hybrid of gender and accent features yields better speech recognition performance than using either feature alone. The proposed hybrid mini-batch sample selection strategies could also benefit other application areas that have metadata information, including image recognition and machine vision.
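The core idea of the abstract — replacing fully random mini-batch selection with selection balanced over gender and accent metadata — can be illustrated with a minimal sketch. This is not the authors' implementation; the sample dictionaries, their `gender`/`accent` keys, and the round-robin grouping scheme are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def hybrid_minibatches(samples, batch_size, seed=0):
    """Yield mini-batches balanced across (gender, accent) groups
    instead of drawing samples fully at random.

    `samples` is a list of dicts carrying 'gender' and 'accent'
    metadata (a hypothetical format; a corpus such as VCTK provides
    comparable speaker information).
    """
    rng = random.Random(seed)
    # Bucket the training set by the hybrid (gender, accent) key.
    groups = defaultdict(list)
    for s in samples:
        groups[(s["gender"], s["accent"])].append(s)
    # Shuffle within each metadata group for randomness inside groups.
    for g in groups.values():
        rng.shuffle(g)
    # Round-robin over the groups so each batch mixes genders/accents.
    pools = list(groups.values())
    batch = []
    while any(pools):
        for pool in pools:
            if pool:
                batch.append(pool.pop())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```

With four (gender, accent) groups and a batch size of four, every full batch contains one sample per group, so each gradient update sees all metadata variations rather than a random, possibly skewed subset.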
Availability of data and materials
The datasets generated and/or analyzed during the current study are available in the VCTK Corpus repository: https://datashare.ed.ac.uk/handle/10283/3443.
Abbreviations
- RNN: Recurrent Neural Networks
- BiRNN: Bidirectional Recurrent Neural Networks
- CNN: Convolutional Neural Networks
- LSTM: Long Short-Term Memory
- BLSTM: Bidirectional Long Short-Term Memory
- CRNN: Convolutional Recurrent Neural Networks
- CLDNN: Convolutional, Long Short-Term Memory Deep Neural Network
- GD: Gradient Descent
- CTC: Connectionist Temporal Classification
- LER: Label Error Rate
- VCTK: Voice Cloning Toolkit
Funding
The authors did not receive any funding for this research.
Author information
Authors and Affiliations
Contributions
Yesim Dokuz: Conceptualization, Methodology, Software, Data Curation, Resources, Writing - Original Draft, Visualization. Zekeriya Tufekci: Investigation, Supervision, Validation, Writing- Reviewing and Editing.
Corresponding author
Ethics declarations
Ethics approval
Ethical approval is not necessary for this study.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Dokuz, Y., Tüfekci, Z. Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. Multimed Tools Appl 81, 9969–9988 (2022). https://doi.org/10.1007/s11042-022-12304-5