Abstract
This paper presents a conversational speech recognition system designed to operate in non-stationary reverberated environments. The system consists of a dereverberation front-end that exploits multiple distant microphones, followed by a speech recognition engine. The front-end first identifies the room impulse responses through a blind channel identification stage based on the Unconstrained Normalized Multi-Channel Frequency Domain Least Mean Square algorithm. The dereverberation stage, which builds on adaptive inverse filter theory, then uses the identified responses to compute a set of inverse filters, which are in turn used to estimate the clean speech. The speech recognizer is based on tied-state cross-word triphone models and decodes features computed from the dereverberated speech signal. Experiments on the Buckeye corpus of conversational speech show a relative word accuracy improvement of 17.48% in the stationary case and 11.16% in the non-stationary case.
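The inverse filtering idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the adaptive algorithm used in the paper: it assumes the impulse responses are already known exactly (in the paper they come from the blind identification stage), that all channels have the same length, and that the inverse filters are obtained in closed form by least squares, in the spirit of the multiple-input/output inverse theorem. The function names `convolution_matrix`, `mint_inverse_filters` and `dereverberate`, as well as the random toy responses, are hypothetical and chosen only for illustration.

```python
import numpy as np


def convolution_matrix(h, n_cols):
    """Build H so that H @ g equals np.convolve(h, g) for any g of length n_cols."""
    h = np.asarray(h, dtype=float)
    H = np.zeros((len(h) + n_cols - 1, n_cols))
    for k in range(n_cols):
        H[k:k + len(h), k] = h  # column k holds h delayed by k samples
    return H


def mint_inverse_filters(channels, filter_len, delay=0):
    """Least-squares filters g_m with sum_m (h_m * g_m) close to a delayed unit impulse.

    Assumes every impulse response in `channels` has the same length.
    """
    H = np.hstack([convolution_matrix(h, filter_len) for h in channels])
    target = np.zeros(H.shape[0])
    target[delay] = 1.0
    g, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.split(g, len(channels))


def dereverberate(observations, inverse_filters):
    """Filter each microphone signal with its inverse filter and sum the results."""
    return sum(np.convolve(x, g) for x, g in zip(observations, inverse_filters))


# Toy two-microphone setup (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
L = 8                                    # impulse response length
decay = np.exp(-0.3 * np.arange(L))      # crude exponential decay envelope
h1 = rng.standard_normal(L) * decay
h2 = rng.standard_normal(L) * decay

s = rng.standard_normal(200)             # stand-in for the clean speech signal
x1 = np.convolve(s, h1)                  # reverberated microphone signals
x2 = np.convolve(s, h2)

# With two channels, filters of length L - 1 make the system square.
g1, g2 = mint_inverse_filters([h1, h2], filter_len=L - 1)
s_hat = dereverberate([x1, x2], [g1, g2])

print("max reconstruction error:", np.max(np.abs(s_hat[:len(s)] - s)))
```

In this noiseless setting the reconstruction error is expected to be small as long as the two responses share no common zeros. With estimated responses, noise, or a moving source, an exact inverse is generally not available, which is presumably why the paper relies on an adaptive formulation.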
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rotili, R., Principi, E., Wöllmer, M., Squartini, S., Schuller, B. (2012). Conversational Speech Recognition in Non-stationary Reverberated Environments. In: Esposito, A., Esposito, A.M., Vinciarelli, A., Hoffmann, R., Müller, V.C. (eds) Cognitive Behavioural Systems. Lecture Notes in Computer Science, vol 7403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34584-5_4
DOI: https://doi.org/10.1007/978-3-642-34584-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34583-8
Online ISBN: 978-3-642-34584-5
eBook Packages: Computer Science (R0)