Abstract
This study investigated the segmental intelligibility of four text-to-speech (TTS) products at signal-to-noise ratios of 0 dB and 5 dB among native and nonnative speakers of English. The four products (AT&T Next-Gen™, Festival version 1.4.2, FlexVoice™ 2, and IBM ViaVoice™ Version 5.1) each use a different algorithm for generating speech from text. The results, which benefit developers of TTS technology as well as developers of products that use TTS, showed that (1) all TTS products were less intelligible to nonnative speakers of English than to native speakers, (2) the “hybrid” TTS product, which combined concatenative and formant synthesis methods, was the least intelligible of the four products investigated, (3) the remaining three products, which used formant, concatenative diphone-based LPC, and concatenative waveform synthesis methods, respectively, were equally intelligible to nonnative speakers, (4) none of the four TTS products resisted intelligibility loss due to noise better than the others, and (5) listening to currently available unrestricted TTS under high-noise conditions would probably demand more cognitive resources from both native and nonnative speakers of English and may be difficult when other demanding activities are performed concurrently.
Cite this article
Venkatagiri, H.S. Phoneme Intelligibility of Four Text-to-Speech Products to Nonnative Speakers of English in Noise. Int J Speech Technol 8, 313–321 (2005). https://doi.org/10.1007/s10772-006-0449-1