Persian speech recognition using deep learning

  • Published in: International Journal of Speech Technology

Abstract

To date, various methods have been used for Automatic Speech Recognition (ASR), among which the Hidden Markov Model (HMM) and Artificial Neural Networks (ANNs) are the most important. One remaining challenge is increasing the accuracy and efficiency of these systems, and one way to enhance their accuracy is to improve the acoustic model (AM). In this paper, for the first time, a combination of a deep belief network (DBN), for extracting features from speech signals, and a Deep Bidirectional Long Short-Term Memory (DBLSTM) network with a Connectionist Temporal Classification (CTC) output layer is used to create an AM on the Farsdat Persian speech data set. The results show that a deep neural network (DNN) improves on a shallow network, and that a bidirectional network is more accurate than a unidirectional one, for both deep and shallow networks. Comparison with the HMM and Kaldi-DNN baselines indicates that using a DBLSTM with features extracted by the DBN increases the accuracy of Persian phoneme recognition.
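The architecture sketched in the abstract can be illustrated with a minimal PyTorch model: stacked bidirectional LSTM layers over input features (standing in for the DBN-extracted features), a linear projection to phoneme classes plus a CTC blank, and a CTC loss over unsegmented label sequences. All layer sizes, the feature dimension, and the phoneme count below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class DBLSTMAcousticModel(nn.Module):
    """Sketch of a deep bidirectional LSTM acoustic model with a CTC output layer."""

    def __init__(self, feat_dim=40, hidden=128, num_layers=3, num_phonemes=30):
        super().__init__()
        # Stacked bidirectional LSTM over the per-frame feature sequence
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # 2 * hidden: forward and backward outputs are concatenated;
        # +1 output class for the CTC blank symbol
        self.proj = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, x):
        out, _ = self.blstm(x)   # (batch, time, 2 * hidden)
        return self.proj(out)    # (batch, time, num_phonemes + 1)

# Toy forward/loss pass on random stand-in "DBN features"
model = DBLSTMAcousticModel()
feats = torch.randn(2, 50, 40)                # batch=2, 50 frames, 40-dim features
log_probs = model(feats).log_softmax(-1).transpose(0, 1)  # CTC wants (time, batch, classes)
targets = torch.randint(1, 31, (2, 10))       # phoneme labels 1..30 (0 is blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 50),
                           target_lengths=torch.full((2,), 10))
```

CTC lets the network be trained directly on phoneme sequences without frame-level alignments, which is why it pairs naturally with a bidirectional recurrent model.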

Figs. 1–9 (full-article figures; not reproduced in this preview)
References

  • Almas Ganj, F. (2004). Shenava 2: Persian continuous speech recognition software. Paper presented at the first workshop on Persian language and computer, Iran.

  • Almas Ganj, F., Seyed Salehi, S. A., Bijankhan, M., Sameti, H., & Sheikh Zadegan, J. (2000). Shenava 1: Persian continuous speech recognition system. Paper presented at the 9th Iranian electrical engineering conference.

  • Andrew, G., & Bilmes, J. (2013). Backpropagation in sequential deep belief networks. In Neural information processing systems (NIPS).

  • Anusuya, M., & Katti, S. K. (2010). Speech recognition by machine, a review. arXiv preprint arXiv:1001.2267.

  • Azadi Yazdi, S. (2013). Deep learning for speech recognition. Sharif University of Technology.

  • BabaAli, B. (2016). A state-of-the-art and efficient framework for Persian speech recognition. Signal and Data Processing, 13(3), 51–62. https://doi.org/10.18869/acadpub.jsdp.13.3.51.

  • Babaali, B., & Sameti, H. (2004). The Sharif speaker-independent large vocabulary speech recognition system. In The 2nd workshop on information technology & its disciplines (WITID 2004) (pp. 24–26).

  • Bijankhan, M., & Sheikhzadegan, J. (2006). The spoken data of Persian language. Paper presented at the second workshop on Persian language and computer.

  • Bijankhan, M., Sheikhzadegan, J., & Roohani, M. (1994). FARSDAT: The speech database of Farsi spoken language. In Proceedings of the Australian conference on speech science and technology, 1994.

  • Brocki, Ł., & Marasek, K. (2015). Deep belief neural networks and bidirectional long-short term memory hybrid for speech recognition. Archives of Acoustics, 40(2), 191–195.

  • Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4688–4691). IEEE.

  • Daneshvar, M., & Veisi, H. (2016). Persian phoneme recognition using long short-term memory neural network. Paper presented at the eighth international conference on information and knowledge technology (IKT), Hamedan.

  • Deng, L., Hinton, G., & Kingsbury, B. (2013a). New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 8599–8603). IEEE.

  • Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., et al. (2013b). Recent advances in deep learning for speech research at Microsoft. In ICASSP (Vol. 26, p. 64).

  • Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A.-R., & Hinton, G. (2010). Binary coding of speech spectrograms using a deep auto-encoder. In Eleventh annual conference of the international speech communication association.

  • Deng, L., Tur, G., He, X., & Hakkani-Tur, D. (2012a). Use of kernel deep convex networks and end-to-end learning for spoken language understanding. In Spoken language technology workshop (SLT), 2012 IEEE (pp. 210–215). IEEE.

  • Deng, L., & Yu, D. (2011). Deep convex net: A scalable architecture for speech pattern classification. In Twelfth annual conference of the international speech communication association.

  • Deng, L., & Yu, D. (2014). Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4), 197–387.

  • Deng, L., Yu, D., & Hinton, G. (2009). Deep learning for speech recognition and related applications. In NIPS workshop.

  • Deng, L., Yu, D., & Platt, J. (2012b). Scalable stacking and learning for building deep architectures. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2133–2136). IEEE.

  • Fausett, L. (1994). Fundamentals of neural networks: Architectures, algorithms, and applications. Upper Saddle River: Prentice-Hall Inc.

  • Fischer, A., & Igel, C. (2012). An introduction to restricted Boltzmann machines. In Iberoamerican congress on pattern recognition (pp. 14–36). Berlin: Springer.

  • Gers, F. (2001). Long short-term memory in recurrent neural networks. Unpublished Ph.D. dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

  • Graves, A. (2012a). Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

  • Graves, A. (2012b). Supervised sequence labelling with recurrent neural networks (Vol. 385). Berlin: Springer.

  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). ACM.

  • Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional LSTM networks for improved phoneme classification and recognition. In Artificial neural networks: Formal models and their applications (ICANN 2005) (pp. 799–804). Berlin: Springer.

  • Graves, A., Jaitly, N., & Mohamed, A.-R. (2013a). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 273–278). IEEE.

  • Graves, A., Mohamed, A. -R., & Hinton, G. (2013b). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6645–6649). IEEE.

  • Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.

  • Haykin, S. (1994). Neural networks: A comprehensive foundation. Upper Saddle River: Prentice Hall PTR.

  • He, X., Deng, L., & Chou, W. (2008). Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5), 14–36.

  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Huang, X., Acero, A., Hon, H. W., & Reddy, R. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Upper Saddle River: Prentice Hall PTR.

  • Hutchinson, B., Deng, L., & Yu, D. (2012). A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4805–4808). IEEE.

  • Hutchinson, B., Deng, L., & Yu, D. (2013). Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1944–1957.

  • Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In IEEE international conference on acoustics, speech and signal processing, 2009. ICASSP 2009 (pp. 3761–3764). IEEE.

  • Kingsbury, B., Sainath, T. N., & Soltau, H. (2012). Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In Thirteenth annual conference of the international speech communication association.

  • Lang, K. J., Waibel, A. H., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1), 23–43.

  • Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. In Advances in neural information processing systems (pp. 873–880).

  • Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., et al. (2019). Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288.

  • Liu, Y., Zhou, S., & Chen, Q. (2011). Discriminative deep belief networks for visual data classification. Pattern Recognition, 44(10), 2287–2296.

  • Malekzadeh, S., Gholizadeh, M. H., & Razavi, S. N. (2018). Persian phonemes recognition using PPNet. arXiv preprint arXiv:1812.08600.

  • Mohamed, A.-R., Dahl, G., & Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS workshop on deep learning for speech recognition and related applications (Vol. 1, p. 39). Vancouver, Canada.

  • Mohamed, A.-R., Yu, D., & Deng, L. (2010). Investigation of full-sequence training of deep belief networks for speech recognition. In Eleventh annual conference of the international speech communication association.

  • Morgan, N., & Bourlard, H. (1990). Continuous speech recognition using multilayer perceptrons with hidden Markov models. In 1990 international conference on acoustics, speech, and signal processing, 1990. ICASSP-90 (pp. 413–416). IEEE.

  • Morgan, N., & Bourlard, H. A. (1995). Neural networks for statistical recognition of continuous speech. Proceedings of the IEEE, 83(5), 742–772.

  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.

  • Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. https://doi.org/10.1109/5.18626.

  • Sainath, T. N., Ramabhadran, B., & Picheny, M. (2009). An exploration of large vocabulary tools for small vocabulary phonetic recognition. In IEEE workshop on automatic speech recognition & understanding, 2009. ASRU 2009 (pp. 359–364). IEEE.

  • Sak, H., Senior, A., & Beaufays, F. (2014a). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.

  • Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., et al. (2014b). Sequence discriminative distributed training of long short-term memory recurrent neural networks. In Fifteenth annual conference of the international speech communication association.

  • Salakhutdinov, R. (2009). Learning deep generative models. Toronto: University of Toronto.

  • Sameti, H., Movasagh, H., Babaali, B., Bahrani, M., Hosseinzadeh, K., Fazel Dehkordi, A., et al. (2004). Persian continuous speech recognition system with large vocabulary. Paper presented at the first workshop on Persian language and computer, Tehran, Iran.

  • Sameti, H., Veisi, H., Bahrani, M., Babaali, B., & Hosseinzadeh, K. (2011). A large vocabulary continuous speech recognition system for Persian language. EURASIP Journal on Audio, Speech, and Music Processing, 2011(1), 1–12.

  • Saon, G., Kuo, H.-K. J., Rennie, S., & Picheny, M. (2015). The IBM 2015 English conversational telephone speech recognition system. arXiv preprint arXiv:1505.05899.

  • Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., et al. (2017). English conversational telephone speech recognition by humans and machines. arXiv preprint arXiv:1703.02136.

  • Saon, G., Sercu, T., Rennie, S., & Kuo, H.-K. J. (2016). The IBM 2016 English conversational telephone speech recognition system.

  • Seide, F., Li, G., & Yu, D. (2011). Conversational speech transcription using context-dependent deep neural networks. In Twelfth annual conference of the international speech communication association.

  • Veisi, H., & Haji Mani, A. (2015). Persian speech recognition using long short-term memory. Paper presented at the 21st national conference of the Computer Society of Iran, Iran.

  • Vinyals, O., Jia, Y., Deng, L., & Darrell, T. (2012). Learning with recursive perceptual representations. In Advances in neural information processing systems (pp. 2825–2833).

  • Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1990). Phoneme recognition using time-delay neural networks. In Readings in speech recognition (pp. 393–404). Amsterdam: Elsevier.

  • Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), 15–20 April 2018 (pp. 5934–5938). https://doi.org/10.1109/icassp.2018.8461870.

  • Young, S. J., & Young, S. (1993). The HTK hidden Markov model toolkit: Design and philosophy. Department of Engineering, Cambridge University, UK, Tech. Rep. TR, 153.

  • Yu, D., & Deng, L. (2015). Automatic speech recognition (Signals and Communication Technology). London: Springer.

  • Yu, D., Deng, L., & Dahl, G. (2010). Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In Proceedings of the NIPS workshop on deep learning and unsupervised feature learning.

  • Yu, D., Deng, L., He, X., & Acero, A. (2007). Large-margin minimum classification error training for large-scale speech recognition tasks. In IEEE international conference on acoustics, speech and signal processing, 2007. ICASSP 2007 (Vol. 4, pp. IV-1137–IV-1140). IEEE.

  • Yu, D., Deng, L., & Seide, F. (2012). Large vocabulary speech recognition using deep tensor neural networks. In Thirteenth annual conference of the international speech communication association.

  • Yu, D., Deng, L., & Seide, F. (2013). The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 21(2), 388–396.

Author information

Corresponding author

Correspondence to Hadi Veisi.

About this article

Cite this article

Veisi, H., Haji Mani, A. Persian speech recognition using deep learning. Int J Speech Technol 23, 893–905 (2020). https://doi.org/10.1007/s10772-020-09768-x
