
Phoneme Intelligibility of Four Text-to-Speech Products to Nonnative Speakers of English in Noise

International Journal of Speech Technology

Abstract

The study investigated the segmental intelligibility of four text-to-speech (TTS) products at signal-to-noise ratios of 0 dB and 5 dB for groups of native and nonnative speakers of English. Each product (AT&T Next-Gen™, Festival version 1.4.2, FlexVoice™ 2, and IBM ViaVoice™ Version 5.1) uses a different algorithm for generating speech from text. The results, which benefit developers of TTS technology as well as developers of products that use TTS, showed that (1) all TTS products were less intelligible to nonnative speakers of English than to native speakers; (2) the "hybrid" TTS product, which combined concatenative and formant synthesis methods, was the least intelligible of the four products investigated; (3) the remaining three products, which used formant, concatenative diphone-based LPC, and concatenative waveform synthesis, respectively, were equally intelligible to nonnative speakers; (4) none of the four TTS products resisted noise-induced intelligibility loss better than the others; and (5) listening to currently available unrestricted TTS under high-noise conditions probably demands greater cognitive resources from both native and nonnative speakers of English and may be difficult while other demanding activities are performed concurrently.
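The abstract reports results at 0 dB and 5 dB SNR but does not say how the noise was generated or mixed. As a generic illustration of what those figures mean, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target SNR and then adds it to the speech. It assumes SNR is computed from mean power over the whole utterance; the function names (`mean_power`, `snr_db`, `mix_at_snr`) are hypothetical and this is not the study's procedure. At 0 dB the scaled noise has the same mean power as the speech; at 5 dB the speech power is about 3.16 times the noise power.

```python
import math
from typing import List, Sequence, Tuple


def mean_power(x: Sequence[float]) -> float:
    """Average power of a signal: the mean of its squared samples."""
    if len(x) == 0:
        raise ValueError("signal is empty")
    return sum(v * v for v in x) / len(x)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def mix_at_snr(
    speech: Sequence[float], noise: Sequence[float], target_snr_db: float
) -> Tuple[List[float], List[float]]:
    """Scale `noise` so that speech/noise power equals `target_snr_db`, then add.

    Assumption (hypothetical): SNR is defined over whole-utterance mean power.
    Returns (mixed signal, scaled noise).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_s = mean_power(speech)
    p_n = mean_power(noise)
    if p_n == 0.0:
        raise ValueError("noise has zero power")
    # Solve P_s / (g^2 * P_n) = 10^(SNR/10) for the amplitude gain g.
    gain = math.sqrt(p_s / (p_n * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * v for v in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled
```

For example, `mix_at_snr(speech, noise, 0.0)` yields noise as loud (in mean power) as the speech, the harder of the two listening conditions described in the abstract.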



Author information


Correspondence to H. S. Venkatagiri.


Cite this article

Venkatagiri, H.S. Phoneme Intelligibility of Four Text-to-Speech Products to Nonnative Speakers of English in Noise. Int J Speech Technol 8, 313–321 (2005). https://doi.org/10.1007/s10772-006-0449-1


  • DOI: https://doi.org/10.1007/s10772-006-0449-1
