Design of Gabor filters for spectro-temporal feature extraction to improve the performance of ASR systems
Existing automatic speech recognition (ASR) systems use the spectral or temporal features of speech. Their performance remains poor compared with human auditory perception, especially in noisy environments. This paper concentrates on the extraction of spectro-temporal features based on physiologically and psychoacoustically inspired approaches. Two-dimensional Gabor filters are used to estimate spectro-temporal features from a time–frequency representation of the uttered speech signal. The Gabor filters are designed using the concept of a constant Q factor: the human auditory system maintains an approximately constant Q in the frequency responses along its filter-bank chain. Constant Q analysis ensures that the Gabor filters occupy a set of geometrically spaced spectral and temporal bins. The time–frequency representation of the speech signal is a key ingredient of the Gabor-based feature extraction method; for this mapping, the gammatonegram is adopted instead of the conventional spectrogram. The performance of the ASR system with the proposed feature set is experimentally validated on the AURORA2 noisy digit database. Under clean training, the proposed features obtain a relative improvement of about 50% in word error rate (WER) over Mel frequency cepstral coefficient (MFCC) features, and a relative improvement of 23% in WER over existing spectro-temporal feature extraction methods. Further analysis is carried out on TIMIT corrupted with noise samples taken from the NOISEX-92 database. These experiments confirm that the proposed features yield a robust acoustic model for the ASR system.
Keywords: Spectro-temporal feature · Constant Q factor · Deep neural network · Gabor filter · Speech recognition
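The core of the described method, a bank of 2D Gabor filters with geometrically spaced (constant-Q) modulation frequencies applied to a time–frequency representation, can be sketched in a few lines. The Python sketch below is illustrative only: the envelope parameterization via `nu`, the modulation-frequency grids, and the function names are assumptions, not the paper's tuned design, and the input is assumed to be a log-compressed gammatonegram of shape channels x frames.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_filter_2d(omega_s, omega_t, nu=3.5):
    """Complex 2D Gabor filter: Gaussian envelope times a plane-wave carrier.

    omega_s : spectral modulation frequency (cycles per channel)
    omega_t : temporal modulation frequency (cycles per frame)
    nu      : carrier cycles under the envelope; holding nu fixed for every
              filter ties bandwidth to centre frequency, i.e. constant Q
              (a hypothetical parameterization, not the paper's).
    """
    # Envelope widths shrink as modulation frequency grows -> constant Q.
    sigma_s = nu / (2.0 * omega_s)
    sigma_t = nu / (2.0 * omega_t)
    k = np.arange(-int(np.ceil(2 * sigma_s)), int(np.ceil(2 * sigma_s)) + 1)
    n = np.arange(-int(np.ceil(2 * sigma_t)), int(np.ceil(2 * sigma_t)) + 1)
    K, N = np.meshgrid(k, n, indexing="ij")
    envelope = np.exp(-K**2 / (2 * sigma_s**2) - N**2 / (2 * sigma_t**2))
    carrier = np.exp(1j * 2 * np.pi * (omega_s * K + omega_t * N))
    g = envelope * carrier
    return g - g.mean()  # subtract the mean to suppress the DC response

# Geometrically spaced modulation frequencies: constant-Q spectral and
# temporal bins (the grids below are illustrative, not tuned values).
spectral_freqs = 0.06 * 2.0 ** np.arange(4)  # cycles per channel
temporal_freqs = 0.03 * 2.0 ** np.arange(4)  # cycles per frame

def gabor_features(tf_rep):
    """Convolve a time-frequency representation (e.g. a log gammatonegram,
    shape: channels x frames) with every filter in the bank and return the
    magnitude responses stacked along a new leading axis."""
    feats = []
    for w_s in spectral_freqs:
        for w_t in temporal_freqs:
            g = gabor_filter_2d(w_s, w_t)
            feats.append(np.abs(fftconvolve(tf_rep, g, mode="same")))
    return np.stack(feats)  # shape: (num_filters, channels, frames)
```

In a full pipeline, these magnitude maps would be subsampled and fed to the DNN acoustic model; the paper's actual filter count, frequency grids, and post-processing will differ from this sketch.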
This work is an outcome of the R&D project undertaken under the Visvesvaraya PhD Scheme of the Ministry of Electronics and Information Technology, Government of India, implemented by Digital India Corporation. We thank the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, for providing the equipment required to conduct this research.