Abstract
This paper presents exploratory work on automatically inserting disfluencies in text-to-speech (TTS) systems. The objective is to make TTS more spontaneous and expressive. To achieve this, we propose to focus on the linguistic level of speech through the insertion of pauses, repetitions and revisions. We formalize the problem as a theoretical process in which transformations are iteratively composed. This is a novel contribution, since most previous work either focuses on the detection or cleaning of linguistic disfluencies in speech transcripts, or concentrates solely on acoustic phenomena in TTS, especially pauses. We present a first implementation of the proposed process using conditional random fields and language models. The objective and perceptual evaluations conducted on an English corpus of spontaneous speech show that our proposal is effective at generating disfluencies and highlight perspectives for future improvements.
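To make the idea of iteratively composed transformations concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the paper: the function names (`insert_pause`, `insert_repetition`, `insert_revision`, `compose`), the example sentence, and the order of application are all hypothetical. Insertion positions are given by hand here, whereas the paper's implementation relies on conditional random fields and language models.

```python
from functools import reduce
from typing import Callable, List

Tokens = List[str]
Transform = Callable[[Tokens], Tokens]


def insert_pause(position: int, filler: str = "uh") -> Transform:
    """Hypothetical transformation: insert a filled pause before `position`."""
    def apply(tokens: Tokens) -> Tokens:
        return tokens[:position] + [filler] + tokens[position:]
    return apply


def insert_repetition(position: int, length: int = 1) -> Transform:
    """Hypothetical transformation: repeat `length` tokens starting at `position`."""
    def apply(tokens: Tokens) -> Tokens:
        segment = tokens[position:position + length]
        return tokens[:position] + segment + tokens[position:]
    return apply


def insert_revision(position: int, abandoned: Tokens) -> Transform:
    """Hypothetical transformation: insert abandoned wording before the
    token(s) at `position` that replace it."""
    def apply(tokens: Tokens) -> Tokens:
        return tokens[:position] + list(abandoned) + tokens[position:]
    return apply


def compose(transforms: List[Transform]) -> Transform:
    """Apply the transformations left to right, each on the previous output."""
    def apply(tokens: Tokens) -> Tokens:
        # Copy the input so the fluent sequence is never modified in place.
        return reduce(lambda acc, t: t(acc), transforms, list(tokens))
    return apply


fluent = "i want a ticket to boston".split()

# Positions refer to the sequence as it stands when each step is applied.
# In this sketch they are chosen by hand; the order is an arbitrary choice.
pipeline = compose([
    insert_revision(5, ["denver"]),  # abandoned word before "boston"
    insert_repetition(0),            # repeat the first word
    insert_pause(7, "uh"),           # filled pause before "boston"
])

disfluent = pipeline(fluent)
print(" ".join(disfluent))
```

Because each transformation maps a token sequence to a token sequence, any number of them can be chained, and the fluent input is left untouched.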
This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Qader, R., Lecorvé, G., Lolive, D., Sébillot, P. (2018). Disfluency Insertion for Spontaneous TTS: Formalization and Proof of Concept. In: Dutoit, T., Martín-Vide, C., Pironkov, G. (eds) Statistical Language and Speech Processing. SLSP 2018. Lecture Notes in Computer Science, vol 11171. Springer, Cham. https://doi.org/10.1007/978-3-030-00810-9_4
DOI: https://doi.org/10.1007/978-3-030-00810-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00809-3
Online ISBN: 978-3-030-00810-9
eBook Packages: Computer Science, Computer Science (R0)