Synthesizing Spontaneous Speech

Computing Prosody

Abstract

This paper addresses the issue of producing synthetic speech for an interpreted dialogue where the attitudinal colouring of an original utterance is to be preserved; it describes differences in speaking style between read and spontaneous speech from the viewpoint of synthesis research and discusses the design of a synthesis system incorporating labels to encode the prosodic and segmental variation simultaneously. Spontaneous speech confronts us with phenomena that were not encountered in corpora of prepared or read speech, and to account for these we have to identify increasingly higher-level factors of discourse structure and speaker involvement. The paper makes three specific claims: (a) that it is better to label the distinctive characteristics of speech through higher-level context dependencies, and to select units for synthesis from appropriate contexts, rather than attempt to predict and modify fine phonetic detail; (b) that the labelling of segmental and prosodic characteristics can be done adequately for speech synthesis using automatic techniques, leaving the human labeller free to identify higher-level discourse-related aspects of the speech; and (c) that instead of minimizing the size of the source database of speech units, we should rather be concerned to maximize its variety and to efficiently select from it the units that most closely express the characteristics of the target speech. The CHATR resynthesis toolkit performs many of these tasks.
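Claim (c) can be illustrated with a minimal sketch of selecting units from a varied database. This is not the CHATR implementation: the `Unit` fields, the cost functions, their weights, and the toy data are all assumptions made for illustration. The sketch assumes that each stored unit carries a segmental label, two prosodic measurements, and a higher-level style label, and that selection minimises the sum of a target cost (mismatch with the desired characteristics) and a join cost (pitch discontinuity between neighbours) using a Viterbi search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str   # segmental label
    f0: float    # mean pitch in Hz
    dur: float   # duration in ms
    style: str   # higher-level label (hypothetical), e.g. "read" or "spontaneous"


def target_cost(target: Unit, cand: Unit) -> float:
    """Mismatch between the desired unit and a candidate (weights are arbitrary)."""
    cost = abs(target.f0 - cand.f0) / 100.0 + abs(target.dur - cand.dur) / 100.0
    if target.style != cand.style:
        cost += 1.0  # penalise units taken from the wrong speaking style
    return cost


def join_cost(prev: Unit, cand: Unit) -> float:
    """Pitch discontinuity between two adjacent units (a deliberately crude proxy)."""
    return abs(prev.f0 - cand.f0) / 100.0


def select_units(targets, database):
    """Return the candidate sequence with the lowest total cost (Viterbi search)."""
    if not targets:
        return []
    # Candidates per target position: database units with the same phone label.
    lattice = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not cands for cands in lattice):
        raise ValueError("database lacks a unit for some target phone")

    # costs[i][j]: best cumulative cost of a path ending in candidate j at position i.
    costs = [[target_cost(targets[0], c) for c in lattice[0]]]
    back = [[-1] * len(lattice[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in lattice[i]:
            prev_total = [costs[i - 1][k] + join_cost(p, cand)
                          for k, p in enumerate(lattice[i - 1])]
            k_best = min(range(len(prev_total)), key=prev_total.__getitem__)
            row.append(prev_total[k_best] + target_cost(targets[i], cand))
            ptr.append(k_best)
        costs.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(lattice[i][j])
        j = back[i][j]
    path.reverse()
    return path


# Toy data (invented): two renditions of each phone, one per speaking style.
DATABASE = [
    Unit("h", 120.0, 60.0, "read"),
    Unit("h", 200.0, 50.0, "spontaneous"),
    Unit("ai", 125.0, 150.0, "read"),
    Unit("ai", 210.0, 110.0, "spontaneous"),
]
TARGETS = [
    Unit("h", 190.0, 55.0, "spontaneous"),
    Unit("ai", 205.0, 120.0, "spontaneous"),
]

if __name__ == "__main__":
    for unit in select_units(TARGETS, DATABASE):
        print(unit)
```

The point of the sketch is that no signal modification is attempted: a larger and more varied database simply gives the search more candidates that already lie close to the target, which is the trade-off the abstract argues for.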

Copyright information

© 1997 Springer-Verlag New York, Inc.

About this chapter

Cite this chapter

Campbell, W.N. (1997). Synthesizing Spontaneous Speech. In: Sagisaka, Y., Campbell, N., Higuchi, N. (eds) Computing Prosody. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2258-3_12

  • DOI: https://doi.org/10.1007/978-1-4612-2258-3_12

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4612-7476-6

  • Online ISBN: 978-1-4612-2258-3

  • eBook Packages: Springer Book Archive
