A comparative study of deep neural network based Punjabi-ASR system

  • Virender Kadyan
  • Archana Mantri
  • R. K. Aggarwal
  • Amitoj Singh

Abstract

For the last five decades, the hidden Markov model (HMM) has been the dominant approach to handling the temporal variability of an input speech signal when building automatic speech recognition (ASR) systems. The Gaussian mixture model (GMM) became an integral part of the HMM, scoring how well each state fits a short windowed frame of speech: the frame coefficients are evaluated against each state, and their posterior probabilities over the HMM states act as the output. In this paper, a deep neural network (DNN) is tested against the GMM; with many hidden layers and a large training dataset, the DNN can be trained effectively before overfitting begins to degrade its performance. Implementing the DNN together with a robust feature extraction approach yields a substantial performance margin for the Punjabi speech recognition system. For feature extraction, the baseline MFCC and GFCC approaches are integrated with cepstral mean and variance normalization (CMVN). Dimensionality reduction, decorrelation of the feature vectors, and speaker variability are then addressed with linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), speaker adaptive training (SAT), and maximum likelihood linear regression (fMLLR) adaptation models. Two hybrid classifiers, GMM–HMM and DNN–HMM, are evaluated on the resulting acoustic feature vectors over connected and continuous Punjabi speech corpora. Experiments show a notable improvement of 4–5% on the connected dataset and 1–3% on the continuous dataset.
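
To make the front end concrete, the following minimal Python sketch shows per-utterance cepstral mean and variance normalization (CMVN) applied to a frames-by-coefficients MFCC (or GFCC) matrix. This is an illustrative sketch, not the authors' implementation, and the 13-coefficient dimensionality is an assumption:

    import numpy as np

    def cmvn(features, eps=1e-10):
        # Normalize each cepstral coefficient to zero mean and unit
        # variance across the frames of a single utterance.
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        return (features - mean) / (std + eps)

    # Hypothetical usage: 300 frames of 13-dimensional MFCCs.
    mfcc = np.random.randn(300, 13) * 4.0 + 2.0
    normalized = cmvn(mfcc)

In a hybrid DNN–HMM classifier, the network predicts posterior probabilities over HMM states, and decoding uses them as scaled likelihoods after dividing by the state priors. The sketch below (reusing the NumPy import above) shows this standard conversion with made-up numbers; it is not taken from the paper:

    def posteriors_to_log_likelihoods(posteriors, priors, eps=1e-10):
        # Scaled emission scores: log p(x|s) = log p(s|x) - log p(s)
        # (up to a constant), used in place of GMM log-likelihoods.
        return np.log(posteriors + eps) - np.log(priors + eps)

    post = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])   # 2 frames, 3 HMM states
    priors = np.array([0.5, 0.3, 0.2])   # estimated from training alignments
    loglik = posteriors_to_log_likelihoods(post, priors)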

Keywords

Deep neural network (DNN) · Gaussian mixture model (GMM) · Hidden Markov model (HMM) · Maximum likelihood linear transformation (MLLT) · Cepstral mean and variance normalization (CMVN) · Feature-space maximum likelihood linear regression (fMLLR)

Acknowledgements

This work was partially tested on the sample Punjabi corpus collected for the Language Resources for Auditory Impaired Persons project from IEEE SIGHT. The views and results presented in this work reflect the perspective of the researchers. The authors would like to thank the Speech and Multimodal Laboratory members Mandeep, Sashi, and Nikhil at Chitkara University, Punjab. Special thanks to Dr. Syed, who provided valuable input in the formation of the baseline DNN system.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Virender Kadyan (1)
  • Archana Mantri (2)
  • R. K. Aggarwal (3)
  • Amitoj Singh (4) (corresponding author)

  1. Department of Computer Science & Engineering, Chitkara University Institute of Engineering & Technology, Chitkara University, Rajpura, India
  2. Department of Electronics & Communication Engineering, Chitkara University Institute of Engineering & Technology, Chitkara University, Rajpura, India
  3. Department of Computer Engineering, N.I.T. Kurukshetra, Kurukshetra, India
  4. Department of Computer Application, M.R.S. P.T.U., Bathinda, India
