Abstract
In this paper, we investigate the effect of temporal context for speech/ non-speech detection (SND). It is shown that even a simple feature such as full-band energy, when employed with a large-enough context, shows promise for further investigation. Experimental evaluations on the test data set, with a state-of-the-art multi-layer perceptron based SND system and a simple energy threshold based SND method, using the F-measure, show an absolute performance gain of 4.4% and 5.4% respectively. The optimal contextual length was found to be 1000 ms. Further numerical optimizations yield an improvement (3.37% absolute), resulting in an absolute gain of 7.77% and 8.77% over the MLP based and energy based methods respectively. ROC based performance evaluation also reveals promising performance for the proposed method, particularly in low SNR conditions.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Atal, B.S., Rabiner, L.R.: A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Trans. on Acoust., Speech and Signal Process (1976)
Dines, J., Vepa, J., Hain, T.: The segmentation of multi-channel meeting recordings for automatic speech recognition. In: Int. Conf. on Spoken Language Processing (Interspeech ICSLP), Pittsburgh, USA, pp. 1213–1216 (2006)
Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London (1990)
Garofolo, J.S., Laprun, C.D., Michel, M., Stanford, V.M., Tabassi, E.: The NIST meeting room pilot corpus (2004)
Maganti, H.K., Motlicek, P., Perez, D.G.: Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms. In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) (2007)
Mesgarani, N., Slaney, M., Shamma, S.A.: Discrimination of speech from nonspeech based on multiscale spectro-temporal Modulations. IEEE Transactions on Audio, Speech and Language Processing 14, 920–930 (2006)
Varga, A.P., Steeneken, H.J.M., Tomlinson, M., Jones, D.: The noisex-92 study on the effect of additive noise on automatic speech recognition. Tech. Report DRA Speech Research Unit (1992)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Krishnan Parthasarathi, S.H., Motlíček, P., Hermansky, H. (2008). Exploiting Contextual Information for Speech/Non-Speech Detection. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science(), vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_58
Download citation
DOI: https://doi.org/10.1007/978-3-540-87391-4_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87390-7
Online ISBN: 978-3-540-87391-4
eBook Packages: Computer ScienceComputer Science (R0)