Improving the Performance of ASR System by Building Acoustic Models using Spectro-Temporal and Phase-Based Features

Circuits, Systems, and Signal Processing

Abstract

State-of-the-art spectral and temporal speech features do not provide adequate attributes for automatic speech recognition (ASR) systems in noisy environments. Recently, phase-based speech processing has gained importance in the speech community. Phase-based features are as important as magnitude-based features and, if incorporated suitably, can provide vital acoustic information. This work investigates whether phase features provide information complementary to spectro-temporal features and thereby enhance the performance of an ASR system. Different phase extraction approaches are analysed to identify which representation gives the best performance for the hybrid ASR system. The study further addresses the use of phase information along with spectro-temporal features in building an acoustic model to improve ASR performance. Gammatonegram-based Gabor filters are used to extract the spectro-temporal features from the speech utterances. The combined features appear to inherit better and more discriminative attributes. Experiments analyse the performance of the ASR system with the combined feature set on the Aurora2 database and on TIMIT utterances corrupted with different noise sources at various SNR values. On TIMIT, the results show average relative improvements of 18.2%, 20.1% and 4.7% over MFCC, RASTA-PLP and spectro-temporal features, respectively. On Aurora2, an average relative improvement of 6.2% is obtained with clean training and 6.1% with multi-condition training, compared to the baseline spectro-temporal features.
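
To make the idea of a phase-based feature concrete, the following is a minimal sketch of one widely used phase representation, the group delay function, computed per frame without phase unwrapping. The abstract does not state which phase representation performed best, so the choice of group delay here is an assumption for illustration only. The function names `group_delay` and `combine_features`, the FFT size and the frame-wise concatenation are likewise hypothetical and are not taken from the article.

```python
import numpy as np


def group_delay(frame, n_fft=512, eps=1e-8):
    """Group delay of one frame, computed without phase unwrapping.

    Uses the identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, where X is the
    DFT of x[n] and Y is the DFT of n*x[n]. The small constant eps (an
    assumption of this sketch) guards against division by zero.
    """
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)


def combine_features(spectro_temporal, phase):
    """Concatenate two (frames x dims) feature matrices frame by frame.

    Hypothetical helper: simple concatenation is only one possible way
    to combine feature streams before acoustic model training.
    """
    spectro_temporal = np.asarray(spectro_temporal)
    phase = np.asarray(phase)
    if spectro_temporal.shape[0] != phase.shape[0]:
        raise ValueError("both feature matrices need the same number of frames")
    return np.hstack([spectro_temporal, phase])
```

As a sanity check, a unit impulse delayed by `d` samples has a group delay of approximately `d` at every frequency bin, which is a convenient way to verify the sign convention of such an implementation.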

Data Availability Statement

The TIMIT data that support the findings of this study are available from the Linguistic Data Consortium, LDC Catalog No. LDC93S1, at https://catalog.ldc.upenn.edu/LDC93S1 [9]. The Aurora2 dataset used in this work is available from ELRA (European Language Resources Association) and can be obtained from http://aurora.hsnr.de/aurora-2.html [15].

References

  1. L.D. Alsteris, K.K. Paliwal, Short-time phase spectrum in speech processing: a review and some experimental results. Digital Signal Process. 17(3), 578–616 (2007)

  2. H. Banno, K. Takeda, F. Itakura, A study on perceptual distance measure for phase spectrum of stimuli, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 5, pp. 3297–3300. IEEE (2001)

  3. H. Boril, J.H. Hansen, Unsupervised equalization of lombard effect for speech recognition in noisy adverse environments. IEEE Trans. Audio Speech Lang. Process. 18(6), 1379–1393 (2009)

  4. B. Bozkurt, L. Couvreur, T. Dutoit, Chirp group delay analysis of speech signals. Speech Commun. 49(3), 159–176 (2007)

  5. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)

  6. A. Dutta, G. Ashishkumar, C.V.R. Rao, Designing of gabor filters for spectro-temporal feature extraction to improve the performance of asr system. Int. J. Speech Technol. 22(4), 1085–1097 (2019)

  7. J. Fahringer, T. Schrank, J. Stahl, P. Mowlaee, F. Pernkopf, Phase-aware signal processing for automatic speech recognition, in INTERSPEECH, pp. 3374–3378 (2016)

  8. S. Ganapathy, M. Omar, Auditory motivated front-end for noisy speech using spectro-temporal modulation filtering. J. Acoust. Soc. Am. 136(5), EL343–EL349 (2014)

  9. J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n 93, (1993)

  10. B.R. Glasberg, B.C. Moore, Derivation of auditory filter shapes from notched-noise data. Hear. Res. 47(1–2), 103–138 (1990)

  11. B. Gold, N. Morgan, D. Ellis, Speech and Audio Signal Processing: Processing and Perception of Speech and Music (Wiley, 2011)

  12. R.M. Hegde, H.A. Murthy, V.R.R. Gadde, Significance of the modified group delay feature in speech recognition. IEEE Trans. Audio Speech Lang. Process. 15(1), 190–202 (2007)

  13. H. Hermansky, N. Morgan, A. Bayya, P. Kohn, Rasta-plp speech analysis, in Proc. IEEE Intl Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 121–124 (1991)

  14. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)

  15. H.G. Hirsch, D. Pearce, The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in ASR2000-Automatic Speech Recognition: Challenges for the new Millenium ISCA Tutorial and Research Workshop (ITRW) (2000)

  16. H.K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, Role of linear, mel and inverse-mel filterbanks in automatic recognition of speech from high-pitched speakers. Circuits Syst. Signal Process. 38(10), 4667–4682 (2019)

  17. C. Kim, R.M. Stern, Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction, in Tenth Annual Conference of the International Speech Communication Association (2009)

  18. T. Kleinschmidt, S. Sridharan, M. Mason, The use of phase in complex spectrum subtraction for robust speech recognition. Comput. Speech Lang. 25(3), 585–600 (2011)

  19. L. Liu, J. He, G. Palm, Effects of phase on the perception of intervocalic stop consonants. Speech Commun. 22(4), 403–417 (1997)

  20. E. Loweimi, S.M. Ahadi, T. Drugman, A new phase-based feature representation for robust speech recognition, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7155–7159. IEEE (2013)

  21. C. Magi, J. Pohjalainen, T. Bäckström, P. Alku, Stabilised weighted linear prediction. Speech Commun. 51(5), 401–411 (2009)

  22. J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)

  23. K. Manjunath, K.S. Rao, Improvement of phone recognition accuracy using articulatory features. Circuits Syst. Signal Process. 37(2), 704–728 (2018)

  24. A.M.C. Martinez, S.H. Mallidi, B.T. Meyer, On the relevance of auditory-based gabor features for deep learning in robust speech recognition. Comput. Speech Lang. 45, 21–38 (2017)

  25. S.L. Mattys, M.H. Davis, A.R. Bradlow, S.K. Scott, Speech recognition in adverse conditions: a review. Lang. Cogn. Process. 27(7–8), 953–978 (2012)

  26. B.T. Meyer, B. Kollmeier, Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Commun. 53(5), 753–767 (2011)

  27. J.A. Morales-Cordovilla, V. Sánchez, A.M. Gómez, A.M. Peinado, On the use of asymmetric windows for robust speech recognition. Circuits Syst. Signal Process. 31(2), 727–736 (2012)

  28. P. Mowlaee, R. Saeidi, Y. Stylianou, Advances in phase-aware signal processing in speech communication. Speech Commun. 81, 1–29 (2016)

  29. H.A. Murthy, V. Gadde, The modified group delay function and its application to phoneme recognition, in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP03)., vol. 1, pp. I–68. IEEE (2003)

  30. H.A. Murthy, B. Yegnanarayana, Group delay functions and its applications in speech technology. Sadhana 36(5), 745–782 (2011)

  31. D. Norris, J.M. McQueen, A. Cutler, Prediction, bayesian inference and feedback in speech recognition. Lang. Cogn. Neurosci. 31(1), 4–18 (2016)

  32. A.V. Oppenheim, Discrete-Time Signal Processing (Pearson Education India, 1999)

  33. A.V. Oppenheim, J.S. Lim, The importance of phase in signals. Proc. IEEE 69(5), 529–541 (1981)

  34. K.K. Paliwal, L. Alsteris, Usefulness of phase spectrum in human speech perception, in Eighth European Conference on Speech Communication and Technology (2003)

  35. P. Pallavi, C.V.R. Rao, Phase-locked loop (pll) based phase estimation in single channel speech enhancement, in Interspeech, pp. 1161–1164 (2018)

  36. R.D. Patterson, The sound of a sinusoid: spectral models. J. Acoust. Soc. Am. 96(3), 1409–1418 (1994)

  37. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., The Kaldi Speech Recognition Toolkit (IEEE Signal Processing Society, Tech. Rep., 2011)

  38. S.O. Sadjadi, J.H. Hansen, Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions, in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5448–5451. IEEE (2011)

  39. M.R. Schädler, B.T. Meyer, B. Kollmeier, Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J. Acoust. Soc. Am. 131(5), 4134–4151 (2012)

  40. R. Schluter, I. Bezrukov, H. Wagner, H. Ney, Gammatone features and feature combination for large vocabulary speech recognition, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP07, vol. 4, pp. IV–649. IEEE (2007)

  41. R. Schluter, H. Ney, Using phase spectrum information for improved speech recognition performance, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 1, pp. 133–136. IEEE (2001)

  42. J. Sebastian, M. Kumar, H.A. Murthy, An analysis of the high resolution property of group delay function with applications to audio signal processing. Speech Commun. 81, 42–53 (2016)

  43. H.R. Seresht, S.M. Ahadi, S. Seyedin, Spectro-temporal power spectrum features for noise robust asr. Circuits Syst. Signal Process. 36(8), 3222–3242 (2017)

  44. G. Shi, M.M. Shanechi, P. Aarabi, On the importance of phase in human speech recognition. IEEE Trans. Audio Speech Lang. Process. 14(5), 1867–1874 (2006)

  45. M. Slaney et al., An efficient implementation of the patterson-holdsworth auditory filter bank. Apple Computer, Perception Group, Tech. Rep 35(8), (1993)

  46. N.S. Srinivas, N. Sugan, N. Kar, L. Kumar, M.K. Nath, A. Kanhe, Recognition of spoken languages from acoustic speech signals using fourier parameters. Circuits Syst. Signal Process. 38(11), 5018–5067 (2019)

  47. T. Thiruvaran, E. Ambikairajah, J. Epps, Extraction of fm components from speech signals using all-pole model. Electron. Lett. 44(6), 449–450 (2008)

  48. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: Ii: noisex-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)

  49. R. Venkatesan, A.B. Ganesh, Binaural classification-based speech segregation and robust speaker recognition system. Circuits Syst. Signal Process. 37(8), 3383–3411 (2018)

  50. B. Yegnanarayana, H.A. Murthy, Significance of group delay functions in spectrum estimation. IEEE Trans. Signal Process. 40(9), 2281–2289 (1992)

Acknowledgements

This work is an outcome of R&D undertaken in a project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, implemented by Digital India Corporation. We are thankful to the Electronics and Communication Engineering Department, National Institute of Technology Meghalaya, for giving us the opportunity to use the equipment required to conduct the research.

Author information

Corresponding author

Correspondence to Anirban Dutta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dutta, A., Ashishkumar, G. & Rao, C.V.R. Improving the Performance of ASR System by Building Acoustic Models using Spectro-Temporal and Phase-Based Features. Circuits Syst Signal Process 41, 1609–1632 (2022). https://doi.org/10.1007/s00034-021-01848-w

  • DOI: https://doi.org/10.1007/s00034-021-01848-w
