Dealing with Noisy Speech and Channel Distortions

Junqua, Jean-Claude; Haton, Jean-Paul

doi:10.1007/978-1-4613-1297-0_5

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 341)

Summary

We first consider typical noise sources and channel distortions, and then focus on the effect of additive noise on the speech signal. To better understand the gap between machine and human performance, we review early studies and recent results on how human listeners perceive distorted speech. Finally, we address two important issues often neglected when building ASR systems: endpoint detection and the Lombard reflex.

Author information

Authors and Affiliations

Speech Technology Laboratory, USA
Jean-Claude Junqua
CRIN - INRIA, France
Jean-Paul Haton

About this chapter

Cite this chapter

Junqua, JC., Haton, JP. (1996). Dealing with Noisy Speech and Channel Distortions. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_5

DOI: https://doi.org/10.1007/978-1-4613-1297-0_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4612-8555-7
Online ISBN: 978-1-4613-1297-0
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics