Abstract
Tone modeling is very important for Mandarin speech recognition. In this paper, a Mixture Stochastic Polynomial Tone Model (MSPTM) is proposed for tone modeling in continuous Mandarin speech. In this model the pitch contour, the main representative of the tone pattern, is described as a mixed stochastic trajectory. The mean trajectory is represented by a polynomial function of normalized time, while the variance is time varying. Effective training and tone recognition algorithms were developed. Experiments with the proposed MSPTM showed a 40.7% reduction in tone recognition error rate relative to the traditional Hidden Markov Model (HMM) tone model. We also present a decision tree based approach to learning the tone pattern variation in continuous speech. The phonetic and linguistic factors that may affect the tone patterns were taken into consideration while constructing the tree. After the tree was established, 28 different tone patterns were obtained. We found that, in addition to the tone of the neighboring syllable, the consonant/vowel type of the syllable and the position of the syllable in the utterance also made important contributions to tone pattern variations in continuous speech. Finally, a new approach to integrating tone information into the search process at word level is discussed. Experiments on continuous Mandarin speech recognition showed that the new tone model and the tone information integration method were effective, achieving a 16.2% relative reduction in character error rate.
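As a rough illustration of the trajectory idea in the abstract, the sketch below scores a pitch contour against a polynomial mean over normalized time with a time-varying variance, then picks the best-scoring tone. It is not the authors' MSPTM: it uses a single component per tone rather than a mixture, assumes frames are independent Gaussians around the mean trajectory, and skips training entirely. The tone names, coefficients, variance functions, and sample contour are all hypothetical.

```python
import math


def normalized_times(n_frames):
    """Map frame indices 0..n_frames-1 onto normalized time in [0, 1]."""
    if n_frames == 1:
        return [0.0]
    return [i / (n_frames - 1) for i in range(n_frames)]


def poly_mean(coeffs, t):
    """Evaluate the polynomial mean trajectory at normalized time t.

    coeffs are ordered from the constant term upward (Horner's rule).
    """
    value = 0.0
    for c in reversed(coeffs):
        value = value * t + c
    return value


def log_likelihood(contour, coeffs, variance_at):
    """Sum of per-frame Gaussian log densities around the mean trajectory.

    Assumption: frames are treated as independent given the trajectory.
    """
    total = 0.0
    for t, f0 in zip(normalized_times(len(contour)), contour):
        var = variance_at(t)  # time-varying variance
        diff = f0 - poly_mean(coeffs, t)
        total += -0.5 * (math.log(2.0 * math.pi * var) + diff * diff / var)
    return total


# Hypothetical toy models: first-order polynomial means, variance growing with time.
TONE_MODELS = {
    "rising": {"coeffs": [-1.0, 2.0], "variance_at": lambda t: 0.2 + 0.3 * t},
    "falling": {"coeffs": [1.0, -2.0], "variance_at": lambda t: 0.2 + 0.3 * t},
}


def classify(contour):
    """Return the best-scoring tone label and the per-tone log-likelihoods."""
    scores = {
        name: log_likelihood(contour, m["coeffs"], m["variance_at"])
        for name, m in TONE_MODELS.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical normalized pitch contour that climbs over the syllable.
label, scores = classify([-0.9, -0.4, 0.1, 0.5, 1.1])
print(label, scores)
```

Normalizing time to [0, 1] lets syllables of different durations share one set of polynomial coefficients, which is why the score depends only on the shape of the contour and not on its frame count.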
Cite this article
Cao, Y., Zhang, S., Huang, T. et al. Tone Modeling for Continuous Mandarin Speech Recognition. International Journal of Speech Technology 7, 115–128 (2004). https://doi.org/10.1023/B:IJST.0000017012.11970.6a
DOI: https://doi.org/10.1023/B:IJST.0000017012.11970.6a