
Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition

Published in Multimedia Tools and Applications

Abstract

With the increasing popularity of deep learning, deep learning architectures are widely utilized in speech recognition. Deep learning based speech recognition has become the state-of-the-art approach for speech recognition tasks due to its outstanding performance over other methods. Generally, deep learning architectures are trained with a variant of gradient descent optimization. Mini-batch gradient descent is a variant of gradient descent that updates network parameters after traversing a number of training instances. One limitation of mini-batch gradient descent is the random selection of mini-batch samples from the training set. Random selection is undesirable in speech recognition, which requires training batches to cover all possible variations in speech databases. In this study, hybrid mini-batch sample selection strategies are proposed to overcome this limitation. The proposed hybrid strategies use the gender and accent features of speech databases jointly to select mini-batch samples when training deep learning architectures. Experimental results confirm that using a hybrid of gender and accent features yields better speech recognition performance than using either feature alone. The proposed hybrid mini-batch sample selection strategies would also benefit other application areas that have metadata information, including image recognition and machine vision.
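The hybrid selection idea described above can be pictured as a stratified sampler that groups utterances by their (gender, accent) metadata and draws each mini-batch across all groups, so every parameter update sees the corpus variations. The paper's exact strategies differ in detail; this is only an illustrative sketch, and the `gender` and `accent` field names are assumptions:

```python
import random
from collections import defaultdict

def hybrid_minibatches(samples, batch_size, seed=0):
    """Group training samples by (gender, accent) metadata, then build
    mini-batches round-robin over the groups, so each update draws from
    every gender/accent combination present in the corpus."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[(s["gender"], s["accent"])].append(s)
    for g in groups.values():
        rng.shuffle(g)  # randomize order within each metadata group
    pools = list(groups.values())
    batch = []
    # Cycle over the (gender, accent) groups until all samples are consumed.
    while any(pools):
        for pool in pools:
            if pool:
                batch.append(pool.pop())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```

With a batch size that is a multiple of the number of metadata groups, each mini-batch covers every (gender, accent) combination, in contrast to purely random selection, which may leave some combinations out of a given batch.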


Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the VCTK Corpus repository, https://datashare.ed.ac.uk/handle/10283/3443

Abbreviations

RNN: Recurrent Neural Networks

BiRNN: Bidirectional Recurrent Neural Networks

CNN: Convolutional Neural Networks

LSTM: Long Short-Term Memory

BLSTM: Bidirectional Long Short-Term Memory

CRNN: Convolutional Recurrent Neural Networks

CLDNN: Convolutional, Long Short-Term Memory Deep Neural Network

GD: Gradient Descent

CTC: Connectionist Temporal Classification

LER: Label Error Rate

VCTK: Voice Cloning Toolkit
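The Label Error Rate (LER) listed above is conventionally computed as the Levenshtein (edit) distance between the decoded label sequence and the reference, normalized by the reference length. A minimal sketch (function names are illustrative):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences, computed row by row
    with dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def label_error_rate(ref, hyp):
    """LER: edit distance normalized by the reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

For example, `label_error_rate("abcd", "abed")` is 0.25: one substitution over four reference labels.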


Funding

The authors did not receive any funding for this research.

Author information


Contributions

Yesim Dokuz: Conceptualization, Methodology, Software, Data Curation, Resources, Writing - Original Draft, Visualization. Zekeriya Tufekci: Investigation, Supervision, Validation, Writing - Review and Editing.

Corresponding author

Correspondence to Yesim Dokuz.

Ethics declarations

Ethics approval

Ethical approval was not required for this study.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Dokuz, Y., Tüfekci, Z. Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. Multimed Tools Appl 81, 9969–9988 (2022). https://doi.org/10.1007/s11042-022-12304-5
