Abstract
This paper presents a mathematical description of style in speech and singing. A style is represented as a set of portable prosodic features together with a set of rules that choose where the features are applied. Speakers and singers make creative choices to express their personal style, which may involve a specific phrase curve, an accent shape, or, similarly, a musical embellishment. A quantitative model of style therefore needs to support unconstrained description of accents and phrase curves, and to resolve the conflicts that can arise from this freedom. Our current implementation modifies two acoustic parameters: f0 and amplitude. We use an articulator-based model, Stem-ML, to resolve conflicts between intended accents or embellishments and their environment. We present several examples that illustrate the modeling of accents and phrase curves, the usefulness of separating style from content, and the similarity between speech and music.
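As a rough illustration of the idea of portable features plus placement rules, the following Python sketch treats a style as a dictionary of accent shapes and a rule that maps a word tag to an accent. All names here (`Accent`, `Style`, `render`, the tags, and the numeric values) are hypothetical and chosen only for illustration. The abstract does not spell out how Stem-ML resolves conflicts, so the strength-weighted average used where accents overlap is an assumed stand-in, not the Stem-ML algorithm, and only f0 is modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Accent:
    """A portable feature: an f0 shape that can be placed anywhere (hypothetical)."""
    shape: List[float]  # f0 offsets in Hz relative to the phrase curve, one per frame
    strength: float     # how strongly the accent insists on its own shape


@dataclass
class Style:
    """A style = a set of accents plus a placement rule (hypothetical structure)."""
    accents: Dict[str, Accent]
    place: Callable[[str], Optional[str]]  # word tag -> accent name, or None


def render(style: Style, content: List[Tuple[int, str]], phrase: List[float]) -> List[float]:
    """Apply a style to content given as (start_frame, tag) pairs.

    Where accents overlap, their offsets are combined by a strength-weighted
    average. This is an assumed simplification for illustration only.
    """
    weighted_sum = [0.0] * len(phrase)
    total_strength = [0.0] * len(phrase)
    for start, tag in content:
        name = style.place(tag)
        if name is None:
            continue  # the rule places no accent on this word
        accent = style.accents[name]
        for i, offset in enumerate(accent.shape):
            t = start + i
            if 0 <= t < len(phrase):
                weighted_sum[t] += accent.strength * offset
                total_strength[t] += accent.strength
    return [
        base + (s / w if w > 0 else 0.0)
        for base, s, w in zip(phrase, weighted_sum, total_strength)
    ]


# Same content, two styles: only the features and the rule change.
content = [(0, "stressed"), (2, "final"), (4, "unstressed")]
phrase = [100.0] * 6  # flat phrase curve in Hz

lively = Style(
    accents={"rise": Accent([0.0, 10.0, 20.0], 1.0), "fall": Accent([-10.0, -20.0], 3.0)},
    place=lambda tag: {"stressed": "rise", "final": "fall"}.get(tag),
)
flat = Style(accents={}, place=lambda tag: None)

print(render(lively, content, phrase))  # [100.0, 110.0, 97.5, 80.0, 100.0, 100.0]
print(render(flat, content, phrase))    # [100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
```

At frame 2 the rising accent and the stronger falling accent overlap, and the result is pulled toward the falling shape. Swapping `lively` for `flat` changes the contour without touching `content`, which is the style/content separation the abstract describes.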
Cite this article
Shih, C., Kochanski, G. Modeling of Vocal Styles Using Portable Features and Placement Rules. International Journal of Speech Technology 6, 393–408 (2003). https://doi.org/10.1023/A:1025765101903
DOI: https://doi.org/10.1023/A:1025765101903