Summary
We first consider typical noise sources and channel distortions, and then focus on the effect of additive noise on the speech signal. To better understand the gap between machine and human performance, we review early studies and recent results on how human listeners perceive distorted speech. Finally, we address two important issues that are often neglected when building automatic speech recognition (ASR) systems: endpoint detection and the Lombard reflex.
Copyright information
© 1996 Kluwer Academic Publishers
About this chapter
Cite this chapter
Junqua, JC., Haton, JP. (1996). Dealing with Noisy Speech and Channel Distortions. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_5
DOI: https://doi.org/10.1007/978-1-4613-1297-0_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8555-7
Online ISBN: 978-1-4613-1297-0
eBook Packages: Springer Book Archive