Dealing with Noisy Speech and Channel Distortions

  • Chapter
Robustness in Automatic Speech Recognition

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 341)

Summary

We first consider typical noise sources and channel distortions and then focus on the effect of additive noise on the speech signal. To better understand the gap between machine and human performance, we review early studies and recent results on how human listeners perceive distorted speech. Finally, we focus on two important issues often neglected when building ASR systems: endpoint detection and the Lombard reflex.

Copyright information

© 1996 Kluwer Academic Publishers

About this chapter

Cite this chapter

Junqua, JC., Haton, JP. (1996). Dealing with Noisy Speech and Channel Distortions. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_5

  • DOI: https://doi.org/10.1007/978-1-4613-1297-0_5

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4612-8555-7

  • Online ISBN: 978-1-4613-1297-0

  • eBook Packages: Springer Book Archive
