Abstract
We use Stem-ML to build an automatic learning system for Mandarin prosody that allows us to make quantitative measurements of prosodic strengths. Stem-ML is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds. Because Stem-ML describes the interactions between nearby tones or accents, we were able to use a highly constrained model with only one accent template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of the speech's fundamental frequency, f0. The result reveals strong alternating metrical patterns in words, and suggests that the speaker uses word strength to mark a hierarchy of sentence, clause, phrase, and word boundaries.
Cite this article
Kochanski, G., Shih, C. & Jing, H. Hierarchical Structure and Word Strength Prediction of Mandarin Prosody. International Journal of Speech Technology 6, 33–43 (2003). https://doi.org/10.1023/A:1021095805490