Abstract
This paper addresses the issue of producing synthetic speech for an interpreted dialogue where the attitudinal colouring of an original utterance is to be preserved; it describes differences in speaking style between read and spontaneous speech from the viewpoint of synthesis research and discusses the design of a synthesis system incorporating labels to encode the prosodic and segmental variation simultaneously. Spontaneous speech confronts us with phenomena that were not encountered in corpora of prepared or read speech, and to account for these we have to identify increasingly higher-level factors of discourse structure and speaker involvement. The paper makes three specific claims: (a) that it is better to label the distinctive characteristics of speech through higher-level context dependencies, and to select units for synthesis from appropriate contexts, rather than attempt to predict and modify fine phonetic detail; (b) that the labelling of segmental and prosodic characteristics can be done adequately for speech synthesis using automatic techniques, leaving the human labeller free to identify higher-level discourse-related aspects of the speech; and (c) that instead of minimizing the size of the source database of speech units, we should rather be concerned to maximize its variety and to efficiently select from it the units that most closely express the characteristics of the target speech. The CHATR resynthesis toolkit performs many of these tasks.
Keywords
Speech Synthesis, Spontaneous Speech, Natural Speech, Prosodic Characteristic, Synthetic Speech