Bark scaled oversampled WPT based speech recognition enhancement in noisy environments

  • Navneet Upadhyay
  • Hamurabi Gamboa Rosales


The performance of a speech recognition system degrades significantly in real-world environments because of the acoustic mismatch between training and operating conditions. This paper presents a two-stage approach that makes a speech recognition system robust to additive, uncorrelated background noise. In the first stage, an oversampled wavelet packet transform (WPT) decomposes the entire noisy input speech into seventeen nonlinearly spaced frequency subbands that approximate the Bark scale of the human auditory system, and spectral subtraction based on adaptive noise estimation suppresses the noise in each subband signal. The oversampled WPT is linear and, because it omits the decimation that normally follows filtering at each decomposition level, it avoids the shift variance of the critically sampled transform. In the second stage, a nonparametric approach extracts features from the filtered speech, and these features are compared with those extracted from speech signals stored in a template to recognize the utterance. A series of experiments is carried out to evaluate the performance of the proposed two-stage system in a variety of real environments, with and without the first stage. Recognition accuracy is evaluated at the word level over a wide range of SNRs for various types of noisy environments. The experimental results show a significant improvement in recognition performance at low SNR with the proposed system.
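The shift-invariance property mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes a Haar filter pair, circular boundary handling, and a uniform (full) packet tree rather than the seventeen Bark-like subbands, and the function names are hypothetical. It shows that when decimation is omitted, every subband keeps the input length, a circular shift of the input shifts every subband by the same amount, and the signal can be recovered by a linear recombination.

```python
import numpy as np


def haar_split_undecimated(x, level=0):
    """One undecimated Haar split (assumed filter; circular boundaries).

    The filter taps are spaced 2**level samples apart instead of
    decimating the outputs, so both subbands keep the input length.
    """
    step = 2 ** level
    delayed = np.roll(x, step)
    low = (x + delayed) / np.sqrt(2.0)
    high = (x - delayed) / np.sqrt(2.0)
    return low, high


def haar_split_decimated(x):
    """Critically sampled Haar split: filter, then keep every second sample."""
    low, high = haar_split_undecimated(x, level=0)
    return low[1::2], high[1::2]


def undecimated_wp(x, depth):
    """Full undecimated wavelet packet tree: returns 2**depth subbands."""
    bands = [np.asarray(x, dtype=float)]
    for level in range(depth):
        bands = [b for band in bands
                 for b in haar_split_undecimated(band, level)]
    return bands


def reconstruct(bands):
    """Invert undecimated_wp by recombining sibling subbands pairwise."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [(bands[i] + bands[i + 1]) / np.sqrt(2.0)
                 for i in range(0, len(bands), 2)]
    return bands[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)
    bands = undecimated_wp(x, depth=3)
    shifted_bands = undecimated_wp(np.roll(x, 1), depth=3)
    # Undecimated: each subband of the shifted input is the shifted subband.
    print(all(np.allclose(s, np.roll(b, 1))
              for s, b in zip(shifted_bands, bands)))
    # Decimated: the same one-sample shift changes the coefficients.
    print(np.allclose(haar_split_decimated(np.roll(x, 1))[0],
                      haar_split_decimated(x)[0]))
    # The undecimated decomposition is invertible by a linear recombination.
    print(np.allclose(reconstruct(bands), x))
```

In a setup like the one the abstract describes, a per-subband noise suppression step would operate on the entries of `bands` before `reconstruct` is called; a Bark-like layout would split only selected nodes of the tree instead of all of them.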


Speech enhancement; Oversampled WPT; Bark and Mel frequency scale; Hidden Markov model; Speech recognition




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Signal Processing and Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, Zacatecas, Mexico
  2. Department of Electronics and Communication Engineering, The LNM Institute of Information Technology, Jaipur, India
