
Continuous Punjabi speech recognition model based on Kaldi ASR toolkit

Published in: International Journal of Speech Technology

Abstract

In this paper, a continuous Punjabi speech recognition model built with the Kaldi toolkit is presented. Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features were extracted from continuous Punjabi speech samples. The performance of the automatic speech recognition (ASR) system is reported for both the monophone model and the triphone models (tri1, tri2, and tri3) using an N-gram language model, measured in terms of word error rate (WER). A significant reduction in WER was observed with the triphone models over the monophone model; in addition, the tri3 model outperformed the tri2 model, which in turn outperformed the tri1 model. Further, MFCC features were found to yield higher recognition accuracy than PLP features for continuous Punjabi speech.
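The word error rate reported in the abstract is the standard metric: the Levenshtein (edit) distance between the reference transcript and the recognizer's hypothesis, counted in words, divided by the number of reference words. A minimal sketch of that computation is below; the example sentences are illustrative only, not drawn from the paper's Punjabi corpus (Kaldi itself reports this via its `compute-wer` tool).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,              # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# Two deleted words out of a six-word reference -> WER = 2/6
print(wer("the cat sat on the mat", "the cat sat mat"))
```

The abstract's comparison (monophone vs. tri1/tri2/tri3, MFCC vs. PLP) amounts to computing this quantity over the test set for each model and feature type and reporting the lowest value.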




Author information


Correspondence to Jyoti Guglani.


About this article


Cite this article

Guglani, J., Mishra, A.N. Continuous Punjabi speech recognition model based on Kaldi ASR toolkit. Int J Speech Technol 21, 211–216 (2018). https://doi.org/10.1007/s10772-018-9497-6

