Skip to main content
Log in

Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers

  • Published:
Circuits, Systems, and Signal Processing Aims and scope Submit manuscript

Abstract

In the context of automatic speech recognition (ASR), the power spectrum is generally warped to the Mel-scale during front-end speech parameterization. This is motivated by the fact that human perception of sound is nonlinear. The Mel-filterbank provides better resolution for low-frequency contents, while a greater degree of averaging happens in the high-frequency range. The work presented in this paper aims at studying the role of linear, Mel and inverse-Mel-filterbanks in the context of ASR. When speech data are from high-pitched speakers like children, there is a significant amount of relevant information in the high-frequency region. Hence, down-sampling the information in that range through Mel-filterbank reduces the recognition performance. On the other hand, employing inverse-Mel or linear-filterbanks is expected to be more effective in such cases. The same has been experimentally validated in this work. For that purpose, an ASR system is developed on adults’ speech and tested using data from adult as well as child speakers. Significantly improved recognition rates are noted for children’s as well adult females’ speech when linear or inverse-Mel-filterbank is used. The use of linear filters results in a relative improvement of \(21\%\) over the baseline. To further boost the performance, vocal-tract length normalization, explicit pitch scaling and pitch-adaptive spectral estimation are also explored on top of linear filterbank.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  1. W. Ahmad, S. Shahnawazuddin, H.K. Kathania, G. Pradhan, A.B. Samaddar, Improving children’s speech recognition through explicit pitch scaling based on iterative spectrogram inversion. in Proceedings of INTERSPEECH (2017)

  2. A. Batliner, M. Blomberg, S.D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The \(\text{PF}\_\text{ STAR }\) children’s speech corpus. in Proceedings of INTERSPEECH, pp. 2761–2764 (2005)

  3. D. Byrd, S. Yildirim, S. Narayanan, S. Khurana, Acoustic analysis of preschool children’s speech. in Proceedings of 15th ICPhS Barcelona, pp. 949–952 (2003)

  4. S. Chakroborty, A. Roy, G. Saha, Improved closed set text-independent speaker identification by combining MFCC with evidence from flipped filter banks. Int. J. Electr. Comput. Energ. Electron. Commun. Eng. 2(11), 2554–2561 (2008)

    Google Scholar 

  5. G. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Speech Audio Process. 20(1), 30–42 (2012)

    Article  Google Scholar 

  6. S. D’Arcy, M. Russell, A comparison of human and computer recognition accuracy for children’s speech. in Proceedings of INTERSPEECH, pp. 2187–2200 (2005)

  7. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)

    Article  Google Scholar 

  8. G. Garau, S. Renals, Combining spectral representations for large-vocabulary continuous speech recognition. IEEE Trans. Speech Audio Process. 16(3), 508–518 (2008)

    Article  Google Scholar 

  9. M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR technologies for children’s speech. in Proceedings of Workshop on Child, Computer and Interaction, pp. 7:1–7:8 (2009)

  10. S. Ghai, Addressing pitch mismatch for children’s automatic speech recognition. Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati, India, 2011

  11. S. Ghai, R. Sinha, A study on the effect of pitch on LPCC and PLPC features for children’s ASR in comparison to MFCC. in Proceedings of INTERSPEECH, pp. 2589–2592 (2011)

  12. S. Ghai, R. Sinha, Analyzing pitch robustness of PMVDR and MFCC features for children’s speech recognition. in Proceedings of Signal Processing and Communications (SPCOM) (2010)

  13. S. Ghai, R. Sinha, Exploring the role of spectral smoothing in context of children’s speech recognition. in Proceedings of INTERSPEECH, pp. 1607–1610 (2009)

  14. G.E. Hinton, L. Deng, D. Yu, G. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition. Signal Process. Mag. 29(6), 82–97 (2012)

    Article  Google Scholar 

  15. H.K. Kathania, S. Shahnawazuddin, N. Adiga, W. Ahmad, Role of prosodic features on children’s speech recognition. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5519–5523 (2018)

  16. H.K. Kathania, W. Ahmad, S. Shahnawazuddin, A.B. Samaddar, Explicit pitch mapping for improved children’s speech recognition. Circuits Syst. Signal Process. 37(5), 2021–2044 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  17. H. Kawahara, M. Morise, Technical foundations of TANDEM-STRAIGHTs, a speech analysis, modification and synthesis framework. Sadhana 36(5), 713–727 (2011)

    Article  Google Scholar 

  18. H. Kawahara, I. Masuda-Katsuse, A. De Cheveigné, Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds. Speech Commun. 27(3), 187–207 (1999)

    Article  Google Scholar 

  19. R.D. Kent, Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies. JHSR 9, 421–447 (1976)

    Google Scholar 

  20. L. Lee, R. Rose, A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6(1), 49–60 (1998)

    Article  Google Scholar 

  21. H. Lei, E. Gonzalo, Mel, linear, and antimel frequency cepstral coefficients in broad phonetic regions for telephone speaker recognition. in Proceedings of INTERSPEECH, pp. 2323–2326 (2009)

  22. H. Liao, G. Pundak, O. Siohan, M.K. Carroll, N. Coccaro, Q. Jiang, T.N. Sainath, A.W. Senior, F. Beaufays, M. Bacchiani, Large vocabulary automatic speech recognition for children. in Proceedings of INTERSPEECH, pp. 1611–1615 (2015)

  23. A. Metallinou, J. Cheng, Using deep neural networks to improve proficiency assessment for children English language learners. in Proceedings of INTERSPEECH, pp. 1468–1472 (2014)

  24. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi speech recognition toolkit. in Proceedings of ASRU (2011)

  25. M.R. Qun Li, An analysis of the causes of increased error rates in children’s speech recognition. in Proceedings of ICSLP2002, Sept 2002

  26. S.P. Rath, D. Povey, K. Veselý, J. Černocký, Improved feature processing for deep neural networks. in Proceedings of INTERSPEECH (2013)

  27. T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition. in Proceedings of ICASSP, vol. 1, pp. 81–84 (1995)

  28. A. Roy, G. Saha, S. Majumdar, S. Chakroborty, Capturing complementary information via reversed filter bank and parallel implementation with MFCC for improved text-independent speaker identification. in Proceedings of International Conference on Computing: Theory and Applications(ICCTA), pp. 463–467 (2007)

  29. M. Russell, S. D’Arcy, Challenges for computer recognition of children’s speech. in Proceedings of Speech and Language Technologies in Education (SLaTE) (2007)

  30. M. Russell, S. D’Arcy, L. Qun, The effects of bandwidth reduction on human and computer recognition of children’s speech. IEEE Signal Process. Lett. 14(12), 1044–1046 (2007)

    Article  Google Scholar 

  31. R. Serizel, D. Giuliani, Vocal tract length normalisation approaches to DNN-based children’s and adults’ speech recognition. in Proceedings of Spoken Language Technology Workshop (SLT), pp. 135–140 (2014)

  32. R. Serizel, D. Giuliani, Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children. Nat. Lang. Eng. 23(3), 325–350 (2016)

    Article  Google Scholar 

  33. S. Shahnawazuddin, A. Dey, R. Sinha, Pitch-adaptive front-end features for robust children’s ASR. in Proceedings of INTERSPEECH (2016)

  34. S. Shahnawazuddin, H. Kathania, R. Sinha, Enhancing the recognition of children’s speech on acoustically mismatched ASR system. in Proceedings of TENCON (2015)

  35. S. Shahnawazuddin, N. Adiga, H.K. Kathania, G. Pradhan, R. Sinha, Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition. Digit. Signal Process. 79, 142–151 (2018)

    Article  MathSciNet  Google Scholar 

  36. S. Shahnawazuddin, H.K. Kathania, A. Dey, R. Sinha, Improving children’s mismatched asr using structured low-rank feature projection. Speech Commun. 105, 103–113 (2018)

    Article  Google Scholar 

  37. R. Sinha, S. Ghai, On the use of pitch normalization for improving children’s speech recognition. in Proceedings of INTERSPEECH, pp. 568–571 (2009)

  38. R. Sinha, S. Shahnawazuddin, Assessment of pitch-adaptive front-end signal processing for children’s speech recognition. Comput. Speech Lang. 48, 103–121 (2018)

    Article  Google Scholar 

  39. R. Vergin, D. O’Shaughnessy, A. Farhat, Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition. IEEE Trans. ASSP 7(5), 525–532 (1999)

    Google Scholar 

  40. X. Zhou, D. Garcia-Romero, R. Duraiswami, C. Espy-Wilson, S. Shamma, Linear versus Mel frequency cepstral coefficients for speaker recognition. in Proceedings of ASRU, pp. 559–564 (2011)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to S. Shahnawazuddin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kathania, H.K., Shahnawazuddin, S., Ahmad, W. et al. Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers. Circuits Syst Signal Process 38, 4667–4682 (2019). https://doi.org/10.1007/s00034-019-01072-7

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00034-019-01072-7

Keywords

Navigation