Speech Enhancement Using Source Information for Phoneme Recognition of Speech with Background Music
This work explores the significance of source information for speech enhancement, with the goal of improving phoneme recognition of speech that contains background music. The standard procedure for enhancing speech in noisy conditions applies temporal, spectral and perceptual methods sequentially. This work follows the same sequential processing, but additionally studies the effect of the excitation source, particularly in the temporal and perceptual enhancement stages, for speech with background music. The source information is characterized by epoch locations and epoch strengths, obtained by passing the sum of the mean and standard deviation of the component envelopes, computed across frequencies using the single frequency filter (SFF), through a zero frequency filter (ZFF). This method of obtaining epoch locations and strengths is termed SFF-ZFF in this work. The enhanced segments are passed through phoneme recognizers built using Gaussian mixture model-hidden Markov model (GMM-HMM), subspace Gaussian mixture model-hidden Markov model (SGMM-HMM) and deep neural network-hidden Markov model (DNN-HMM) systems, with all models trained on clean speech. The enhanced audio files yield a lower phone error rate than the degraded audio files, indicating that enhancement based on source information is beneficial for speech regions with background music.
Keywords: Source information · Single frequency filter · Zero frequency filter · Temporal enhancement · Spectral enhancement
This work is part of the project titled Multi-modal Broadcast Analytics: Structured Evidence Visualization for Events of Security Concern funded by the e-Security division of the Department of Electronics and Information Technology (DeitY), Govt. of India.
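The SFF-ZFF epoch extraction summarized in the abstract can be sketched roughly as follows. This is a minimal illustrative implementation, not the authors' code: the pole radius, the frequency grid, the trend-removal window length and the number of trend-removal passes are all assumed values, and a synthetic impulse-train signal stands in for real speech.

```python
import numpy as np

def sff_mean_std_envelope(x, fs, freqs, r=0.995):
    """Sum of mean and standard deviation of SFF component envelopes.

    For each frequency f, the signal is modulated so that f moves to
    fs/2 and is then filtered by H(z) = 1/(1 + r z^-1), whose pole lies
    near z = -1; the component envelope is the magnitude of the complex
    output.  The pole radius r and the frequency grid are illustrative.
    """
    n = np.arange(len(x))
    envs = []
    for f in freqs:
        w = np.pi - 2.0 * np.pi * f / fs        # shift f to fs/2
        s = x * np.exp(1j * w * n)
        y = np.empty(len(x), dtype=complex)
        acc = 0.0 + 0.0j
        for k in range(len(x)):                 # y[k] = s[k] - r * y[k-1]
            acc = s[k] - r * acc
            y[k] = acc
        envs.append(np.abs(y))
    envs = np.array(envs)
    return envs.mean(axis=0) + envs.std(axis=0)

def zff_epochs(sig, fs, win_ms=10.0, passes=3):
    """Epoch locations and strengths via zero frequency filtering.

    Four cascaded integrators (two 0-Hz resonators) followed by repeated
    local-mean subtraction to remove the polynomial trend; epochs are the
    negative-to-positive zero crossings of the trend-removed output, and
    the epoch strength is the slope at each crossing.  The window length
    and number of trend-removal passes are assumptions.
    """
    y = sig - sig.mean()
    for _ in range(4):                          # two 2nd-order 0-Hz resonators
        y = np.cumsum(y)
    half = int(fs * win_ms / 1000.0)
    win = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(passes):                     # trend removal
        y = y - np.convolve(y, win, mode="same")
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    strength = np.abs(y[zc + 1] - y[zc])        # local slope at the crossing
    return zc, strength

# Synthetic test signal: an 80 Hz impulse train (period = 100 samples at
# fs = 8 kHz) exciting a damped 500 Hz resonance, as a crude stand-in
# for voiced speech.
fs, period = 8000, 100
exc = np.zeros(fs)
exc[::period] = 1.0
t = np.arange(200)
x = np.convolve(exc, np.exp(-t / 30.0) * np.sin(2 * np.pi * 500 * t / fs))[:fs]

env = sff_mean_std_envelope(x, fs, freqs=[300, 800, 1500, 2500])
locs, strengths = zff_epochs(env, fs)
```

On this synthetic signal the interior epoch locations should be spaced roughly one pitch period (100 samples) apart; since the ZFF stage heavily emphasizes the lowest frequencies, its output is dominated by the fundamental of the envelope, which is what makes the zero crossings line up with the excitation instants.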