Modeling of Vocal Styles Using Portable Features and Placement Rules

International Journal of Speech Technology

Abstract

This paper presents a mathematical description of style in speech and singing. A style is represented as a set of portable prosodic features together with a set of rules that choose where those features are applied. Speakers and singers make creative choices to express their personal style, which may involve specific phrase curves, accent shapes, or, analogously, musical embellishments. A quantitative model of style therefore needs to support unconstrained descriptions of accents and phrase curves, and to resolve the conflicts that can arise from this freedom. Our current implementation modifies two acoustic parameters: f0 and amplitude. We use an articulator-based model, Stem-ML, to resolve conflicts between intended accents or embellishments and their environment. We present several examples that illustrate the modeling of accents and phrase curves, the usefulness of separating style from content, and the similarity between speech and music.

Cite this article

Shih, C., Kochanski, G. Modeling of Vocal Styles Using Portable Features and Placement Rules. International Journal of Speech Technology 6, 393–408 (2003). https://doi.org/10.1023/A:1025765101903

  • DOI: https://doi.org/10.1023/A:1025765101903
