Synthesizing Spontaneous Speech

  • W. N. Campbell

Abstract

This paper addresses the issue of producing synthetic speech for an interpreted dialogue where the attitudinal colouring of an original utterance is to be preserved; it describes differences in speaking style between read and spontaneous speech from the viewpoint of synthesis research and discusses the design of a synthesis system incorporating labels to encode the prosodic and segmental variation simultaneously. Spontaneous speech confronts us with phenomena that were not encountered in corpora of prepared or read speech, and to account for these we have to identify increasingly higher-level factors of discourse structure and speaker involvement. The paper makes three specific claims: (a) that it is better to label the distinctive characteristics of speech through higher-level context dependencies, and to select units for synthesis from appropriate contexts, rather than attempt to predict and modify fine phonetic detail; (b) that the labelling of segmental and prosodic characteristics can be done adequately for speech synthesis using automatic techniques, leaving the human labeller free to identify higher-level discourse-related aspects of the speech; and (c) that instead of minimizing the size of the source database of speech units, we should rather be concerned to maximize its variety and to efficiently select from it the units that most closely express the characteristics of the target speech. The CHATR resynthesis toolkit performs many of these tasks.
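Claim (c) can be illustrated with a minimal sketch of context-based unit selection. This is not the CHATR implementation: the label set (`stress`, `boundary`, `style`), the weights, and the function names are all hypothetical, chosen only to show how a unit might be picked from a varied database by matching higher-level labels rather than by modifying phonetic detail.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Unit:
    """A labelled speech unit. All label names here are illustrative assumptions."""
    phone: str      # segmental label
    stress: bool    # prosodic label: is the unit in a stressed syllable?
    boundary: str   # prosodic label: "none", "minor", or "major"
    style: str      # higher-level label: "read" or "spontaneous"


# Hypothetical weights: a mismatch in speaking style is penalised
# more heavily than a mismatch in a single prosodic label.
WEIGHTS = {"stress": 1.0, "boundary": 1.0, "style": 2.0}


def mismatch_cost(target: Unit, candidate: Unit) -> float:
    """Sum the weights of every context label on which the two units differ."""
    return sum(
        weight
        for name, weight in WEIGHTS.items()
        if getattr(target, name) != getattr(candidate, name)
    )


def select_unit(target: Unit, database: Sequence[Unit]) -> Optional[Unit]:
    """Return the database unit of the same phone whose labels best match the target.

    Returns None when the database holds no unit of the requested phone.
    """
    candidates = [unit for unit in database if unit.phone == target.phone]
    if not candidates:
        return None
    return min(candidates, key=lambda unit: mismatch_cost(target, unit))
```

Under these assumptions, a larger and more varied database raises the chance that some candidate matches the target on every label, so that no signal modification is needed.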

Keywords

Speech Synthesis, Spontaneous Speech, Natural Speech, Prosodic Characteristic, Synthetic Speech

Copyright information

© Springer-Verlag New York, Inc. 1997
