Persian speech recognition using deep learning

  • Published in: International Journal of Speech Technology

Abstract

To date, various methods have been used for Automatic Speech Recognition (ASR), among which the Hidden Markov Model (HMM) and Artificial Neural Networks (ANNs) are the most important. One remaining challenge is increasing the accuracy and efficiency of these systems, and one way to enhance their accuracy is to improve the acoustic model (AM). In this paper, for the first time, a combination of a deep belief network (DBN), for extracting features from speech signals, and a Deep Bidirectional Long Short-Term Memory (DBLSTM) network with a Connectionist Temporal Classification (CTC) output layer is used to create an AM on the Farsdat Persian speech data set. The results show that a deep neural network (DNN) improves on a shallow network, and that a bidirectional network is more accurate than a unidirectional one, for both deep and shallow networks. Comparison with the HMM and Kaldi-DNN baselines indicates that using a DBLSTM with features extracted by the DBN increases the accuracy of Persian phoneme recognition.
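The architecture sketched in the abstract can be illustrated with a minimal PyTorch model: stacked bidirectional LSTM layers over input features (standing in for the DBN-extracted features), a linear projection to phoneme classes plus a CTC blank, and a CTC loss over unsegmented label sequences. All layer sizes, the feature dimension, and the phoneme count below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class DBLSTMAcousticModel(nn.Module):
    """Sketch of a deep bidirectional LSTM acoustic model with a CTC output layer."""

    def __init__(self, feat_dim=40, hidden=128, num_layers=3, num_phonemes=30):
        super().__init__()
        # Stacked bidirectional LSTM over the per-frame feature sequence
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # 2 * hidden: forward and backward outputs are concatenated;
        # +1 output class for the CTC blank symbol
        self.proj = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, x):
        out, _ = self.blstm(x)   # (batch, time, 2 * hidden)
        return self.proj(out)    # (batch, time, num_phonemes + 1)

# Toy forward/loss pass on random stand-in "DBN features"
model = DBLSTMAcousticModel()
feats = torch.randn(2, 50, 40)                # batch=2, 50 frames, 40-dim features
log_probs = model(feats).log_softmax(-1).transpose(0, 1)  # CTC wants (time, batch, classes)
targets = torch.randint(1, 31, (2, 10))       # phoneme labels 1..30 (0 is blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 50),
                           target_lengths=torch.full((2,), 10))
```

CTC lets the network be trained directly on phoneme sequences without frame-level alignments, which is why it pairs naturally with a bidirectional recurrent model.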

Figs. 1–9 (full-article figures; not reproduced in this preview)
References

  • Almas Ganj, F. (2004). Shenava 2: Persian continuous speech recognition software. Paper presented at the first workshop on Persian language and computer, Iran.

  • Almas Ganj, F., Seyed Salehi, S. A., Bijankhan, M., Sameti, H., & Sheikh Zadegan, J. (2000). Shenava 1: Persian continuous speech recognition system. Paper presented at the 9th Iranian electrical engineering conference.

  • Andrew, G., & Bilmes, J. (2013). Backpropagation in sequential deep belief networks. In Neural information processing systems (NIPS).

  • Anusuya, M., & Katti, S. K. (2010). Speech recognition by machine, a review. arXiv preprint arXiv:1001.2267.

  • Azadi Yazdi, S. (2013). Deep learning for speech recognition. Sharif University of Technology.

  • BabaAli, B. (2016). A state-of-the-art and efficient framework for Persian speech recognition. Signal and Data Processing, 13(3), 51–62. https://doi.org/10.18869/acadpub.jsdp.13.3.51.

  • Babaali, B., & Sameti, H. (2004). The Sharif speaker-independent large vocabulary speech recognition system. In The 2nd workshop on information technology & its disciplines (WITID 2004) (pp. 24–26).

  • Bijankhan, M., & Sheikhzadegan, J. (2006). The spoken data of Persian language. Paper presented at the second workshop on Persian language and computer.

  • Bijankhan, M., Sheikhzadegan, J., & Roohani, M. (1994). FARSDAT: The speech database of Farsi spoken language. In Proceedings of the Australian conference on speech science and technology, 1994.

  • Brocki, Ł., & Marasek, K. (2015). Deep belief neural networks and bidirectional long-short term memory hybrid for speech recognition. Archives of Acoustics, 40(2), 191–195.

  • Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4688–4691). IEEE.

  • Daneshvar, M., & Veisi, H. (2016). Persian phoneme recognition using long short-term memory neural network. Paper presented at the eighth international conference on information and knowledge technology (IKT), Hamedan.

  • Deng, L., Hinton, G., & Kingsbury, B. (2013a). New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 8599–8603). IEEE.

  • Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., et al. (2013b). Recent advances in deep learning for speech research at Microsoft. In ICASSP (Vol. 26, p. 64).

  • Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A.-R., & Hinton, G. (2010). Binary coding of speech spectrograms using a deep auto-encoder. In Eleventh annual conference of the international speech communication association.

  • Deng, L., Tur, G., He, X., & Hakkani-Tur, D. (2012a). Use of kernel deep convex networks and end-to-end learning for spoken language understanding. In Spoken language technology workshop (SLT), 2012 IEEE (pp. 210–215). IEEE.

  • Deng, L., & Yu, D. (2011). Deep convex net: A scalable architecture for speech pattern classification. In Twelfth annual conference of the international speech communication association.

  • Deng, L., & Yu, D. (2014). Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4), 197–387.

  • Deng, L., Yu, D., & Hinton, G. (2009). Deep learning for speech recognition and related applications. In NIPS workshop.

  • Deng, L., Yu, D., & Platt, J. (2012b). Scalable stacking and learning for building deep architectures. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2133–2136). IEEE.

  • Fausett, L. (1994). Fundamentals of neural networks: Architectures, algorithms, and applications. Upper Saddle River: Prentice-Hall Inc.

  • Fischer, A., & Igel, C. (2012). An introduction to restricted Boltzmann machines. In Iberoamerican congress on pattern recognition (pp. 14–36). Berlin: Springer.

  • Gers, F. (2001). Long short-term memory in recurrent neural networks. Unpublished Ph.D. dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

  • Graves, A. (2012a). Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

  • Graves, A. (2012b). Supervised sequence labelling with recurrent neural networks (Vol. 385). Berlin: Springer.

  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). ACM.

  • Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional LSTM networks for improved phoneme classification and recognition. In Artificial neural networks: Formal models and their applications (ICANN 2005) (pp. 799–804). Berlin: Springer.

  • Graves, A., Jaitly, N., & Mohamed, A.-R. (2013a). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 273–278). IEEE.

  • Graves, A., Mohamed, A. -R., & Hinton, G. (2013b). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6645–6649). IEEE.

  • Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.

  • Haykin, S. (1994). Neural networks: A comprehensive foundation. Upper Saddle River: Prentice Hall PTR.

  • He, X., Deng, L., & Chou, W. (2008). Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5), 14–36.

  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Huang, X., Acero, A., Hon, H. W., & Reddy, R. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Upper Saddle River: Prentice Hall PTR.

  • Hutchinson, B., Deng, L., & Yu, D. (2012). A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4805–4808). IEEE.

  • Hutchinson, B., Deng, L., & Yu, D. (2013). Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1944–1957.

  • Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In IEEE international conference on acoustics, speech and signal processing, 2009. ICASSP 2009 (pp. 3761–3764). IEEE.

  • Kingsbury, B., Sainath, T. N., & Soltau, H. (2012). Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In Thirteenth annual conference of the international speech communication association.

  • Lang, K. J., Waibel, A. H., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1), 23–43.

  • Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. In Advances in neural information processing systems (pp. 873–880).

  • Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., et al. (2019). Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288.

  • Liu, Y., Zhou, S., & Chen, Q. (2011). Discriminative deep belief networks for visual data classification. Pattern Recognition, 44(10), 2287–2296.

  • Malekzadeh, S., Gholizadeh, M. H., & Razavi, S. N. (2018). Persian phonemes recognition using PPNet. arXiv preprint arXiv:1812.08600.

  • Mohamed, A.-R., Dahl, G., & Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS workshop on deep learning for speech recognition and related applications (Vol. 1, p. 39). Vancouver, Canada.

  • Mohamed, A.-R., Yu, D., & Deng, L. (2010). Investigation of full-sequence training of deep belief networks for speech recognition. In Eleventh annual conference of the international speech communication association.

  • Morgan, N., & Bourlard, H. (1990). Continuous speech recognition using multilayer perceptrons with hidden Markov models. In 1990 international conference on acoustics, speech, and signal processing, 1990. ICASSP-90 (pp. 413–416). IEEE.

  • Morgan, N., & Bourlard, H. A. (1995). Neural networks for statistical recognition of continuous speech. Proceedings of the IEEE, 83(5), 742–772.

  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.

  • Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. https://doi.org/10.1109/5.18626.

  • Sainath, T. N., Ramabhadran, B., & Picheny, M. (2009). An exploration of large vocabulary tools for small vocabulary phonetic recognition. In IEEE workshop on automatic speech recognition & understanding, 2009. ASRU 2009 (pp. 359–364). IEEE.

  • Sak, H., Senior, A., & Beaufays, F. (2014a). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.

  • Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., et al. (2014b). Sequence discriminative distributed training of long short-term memory recurrent neural networks. In Fifteenth annual conference of the international speech communication association.

  • Salakhutdinov, R. (2009). Learning deep generative models. Toronto: University of Toronto.

  • Sameti, H., Movasagh, H., Babaali, B., Bahrani, M., Hosseinzadeh, K., Fazel Dehkordi, A., et al. (2004). Persian continuous speech recognition system with large vocabulary. Paper presented at the first workshop on Persian language and computer, Tehran, Iran.

  • Sameti, H., Veisi, H., Bahrani, M., Babaali, B., & Hosseinzadeh, K. (2011). A large vocabulary continuous speech recognition system for Persian language. EURASIP Journal on Audio, Speech, and Music Processing, 2011(1), 1–12.

  • Saon, G., Kuo, H.-K. J., Rennie, S., & Picheny, M. (2015). The IBM 2015 English conversational telephone speech recognition system. arXiv preprint arXiv:1505.05899.

  • Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., et al. (2017). English conversational telephone speech recognition by humans and machines. arXiv preprint arXiv:1703.02136.

  • Saon, G., Sercu, T., Rennie, S., & Kuo, H.-K. J. (2016). The IBM 2016 English conversational telephone speech recognition system.

  • Seide, F., Li, G., & Yu, D. (2011). Conversational speech transcription using context-dependent deep neural networks. In Twelfth annual conference of the international speech communication association.

  • Veisi, H., & Haji Mani, A. (2015). Persian speech recognition using long short-term memory. Paper presented at the 21st national conference of the Computer Society of Iran, Iran.

  • Vinyals, O., Jia, Y., Deng, L., & Darrell, T. (2012). Learning with recursive perceptual representations. In Advances in neural information processing systems (pp. 2825–2833).

  • Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1990). Phoneme recognition using time-delay neural networks. In Readings in speech recognition (pp. 393–404). Amsterdam: Elsevier.

  • Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), 15–20 April 2018 (pp. 5934–5938). https://doi.org/10.1109/icassp.2018.8461870.

  • Young, S. J., & Young, S. (1993). The HTK hidden Markov model toolkit: Design and philosophy. Department of Engineering, Cambridge University, UK, Tech. Rep. TR, 153.

  • Yu, D., & Deng, L. (2015). Automatic speech recognition (Signals and Communication Technology). London: Springer.

  • Yu, D., Deng, L., & Dahl, G. (2010). Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In Proceedings of the NIPS workshop on deep learning and unsupervised feature learning.

  • Yu, D., Deng, L., He, X., & Acero, A. (2007). Large-margin minimum classification error training for large-scale speech recognition tasks. In IEEE international conference on acoustics, speech and signal processing, 2007. ICASSP 2007 (Vol. 4, pp. IV-1137–IV-1140). IEEE.

  • Yu, D., Deng, L., & Seide, F. (2012). Large vocabulary speech recognition using deep tensor neural networks. In Thirteenth annual conference of the international speech communication association.

  • Yu, D., Deng, L., & Seide, F. (2013). The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 21(2), 388–396.

Author information

Corresponding author

Correspondence to Hadi Veisi.

About this article

Cite this article

Veisi, H., Haji Mani, A. Persian speech recognition using deep learning. Int J Speech Technol 23, 893–905 (2020). https://doi.org/10.1007/s10772-020-09768-x
