Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise
This paper explores how well different speech synthesizers can be understood in noise conditions that resemble radio noise. The work is motivated by the need for intelligible speech in noisy settings such as emergency response and disaster notification. We discuss prior work on listening tasks and on speech in noise. We analyze three speech synthesizers in three noise settings and quantitatively measure the intelligibility of each synthesizer in each setting, based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we show how word-level and sentence-level input choices can increase or decrease listener error rates for synthesized speech.
Keywords: Speech, Synthesized, Noise, Radio, Intelligibility
We would like to acknowledge several people for their help and support on this work, particularly Carolyn Penstein, Rajat Kulshreshtha, Abhilasha Ravichander, and the officers of CMU EMS, as well as those who helped edit this work, especially Elise Romberger. Finally, we thank the reviewers for reading and examining our experiments, methodology, and submission.