Modeling of Vocal Styles Using Portable Features and Placement Rules

Shih, Chilin; Kochanski, Greg

doi:10.1023/A:1025765101903


Published: October 2003

Volume 6, pages 393–408 (2003)

International Journal of Speech Technology



Abstract

This paper presents a mathematical description of style in speech and singing. A style is represented as a set of portable prosodic features together with a set of rules that choose where the features are applied. Speakers and singers make creative choices to express their personal style, which may involve a specific phrase curve, accent shape, or, similarly, musical embellishment. A quantitative model of style therefore needs to support unconstrained descriptions of accents and phrase curves, and to resolve the conflicts that can arise from this freedom. Our current implementation modifies two acoustic parameters: f₀ and amplitude. We use an articulator-based model, Stem-ML, to resolve conflicts between intended accents or embellishments and their environment. We present several examples to illustrate the modeling of accents and phrase curves, the usefulness of style/content separation, and the similarity between speech and music.
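
As a rough illustration of the separation described in the abstract, the following Python sketch treats a style as a table of named features plus a placement rule, and renders f₀ and amplitude tracks for a given word sequence. All names (`Feature`, `Style`, `render`, `accent_capitals`), the frame-based representation, and the units are assumptions made for this sketch and do not come from the paper. In particular, overlapping features are merged here with a plain strength-weighted average, used only as a stand-in; it is not Stem-ML.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Feature:
    """A portable prosodic feature (hypothetical representation)."""
    f0_offsets: List[float]   # per-frame f0 offsets, assumed to be in Hz
    amp_offsets: List[float]  # per-frame amplitude offsets, assumed to be in dB
    strength: float           # weight used when features overlap


@dataclass
class Style:
    """A style = named features + a rule that decides where they go."""
    features: Dict[str, Feature]
    place: Callable[[List[str], int], Optional[str]]


def render(style: Style, words: List[str], frames_per_word: int,
           base_f0: float, base_amp: float) -> Tuple[List[float], List[float]]:
    """Apply a style to content and return (f0 track, amplitude track)."""
    n = len(words) * frames_per_word
    f0_sum = [0.0] * n
    amp_sum = [0.0] * n
    weight = [0.0] * n
    for i in range(len(words)):
        name = style.place(words, i)
        if name is None:
            continue  # the rule places no feature on this word
        feat = style.features[name]
        start = i * frames_per_word
        for k, (df, da) in enumerate(zip(feat.f0_offsets, feat.amp_offsets)):
            t = start + k
            if t >= n:
                break  # feature runs past the end of the utterance
            f0_sum[t] += feat.strength * df
            amp_sum[t] += feat.strength * da
            weight[t] += feat.strength
    # Where features overlap, take a strength-weighted average (stand-in only).
    f0 = [base_f0 + (f0_sum[t] / weight[t] if weight[t] else 0.0)
          for t in range(n)]
    amp = [base_amp + (amp_sum[t] / weight[t] if weight[t] else 0.0)
           for t in range(n)]
    return f0, amp


def accent_capitals(words: List[str], i: int) -> Optional[str]:
    """Hypothetical placement rule: all-caps words get a rise, others a fall."""
    return "rise" if words[i].isupper() else "fall"


# Each feature spans three frames but each word only two, so the rise on the
# first word spills into the second word and must be reconciled with its fall.
style_a = Style(
    features={
        "rise": Feature([10.0, 20.0, 30.0], [1.0, 1.0, 1.0], strength=1.0),
        "fall": Feature([-10.0, -20.0, -30.0], [2.0, 2.0, 2.0], strength=3.0),
    },
    place=accent_capitals,
)

f0_track, amp_track = render(style_a, ["A", "b"], frames_per_word=2,
                             base_f0=100.0, base_amp=60.0)
```

Rendering the same word list with a second `Style` object changes the tracks without touching the content, which is the kind of style/content separation the abstract refers to.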




Author information

Authors and Affiliations

Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
Chilin Shih & Greg Kochanski



About this article

Cite this article

Shih, C., Kochanski, G. Modeling of Vocal Styles Using Portable Features and Placement Rules. International Journal of Speech Technology 6, 393–408 (2003). https://doi.org/10.1023/A:1025765101903


Issue Date: October 2003
DOI: https://doi.org/10.1023/A:1025765101903
