Abstract
This paper presents an innovative method for prosody modeling in Chinese speech recognition. The method first evaluates the reliability of the prosodic information and then uses that reliability to dynamically tune the balance between the spectral scores and the prosodic scores. The basic idea is to use prosodic knowledge according to its reliability: the higher the reliability, the more the prosodic information contributes to recognition. Thus, the method incorporates more knowledge into the recognition system without introducing extra errors. In experiments, the method achieved relative word error rate reductions of up to 52.9% and 46.0% on Mandarin and Cantonese digit string recognition tasks, respectively. When tone information was incorporated into Cantonese Large Vocabulary Continuous Speech Recognition (LVCSR) via the proposed method, a 20.16% relative character error rate reduction was obtained.
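The abstract does not give the exact combination formula, so the following is only a minimal sketch of the general idea, assuming log-domain scores and a simple linear weighting. The function names, the `max_weight` parameter, and all score values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import Dict


def combine_scores(
    spectral: Dict[str, float],
    prosodic: Dict[str, float],
    reliability: float,
    max_weight: float = 0.5,
) -> Dict[str, float]:
    """Add prosodic log-scores to spectral log-scores, scaled by reliability.

    Assumption (not from the paper): the prosodic weight grows linearly
    with a reliability index in [0, 1], up to max_weight.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    weight = max_weight * reliability
    # With reliability 0 the result equals the spectral scores, so an
    # unreliable prosodic estimate cannot change the recognizer's decision.
    return {
        hyp: score + weight * prosodic.get(hyp, 0.0)
        for hyp, score in spectral.items()
    }


def best_hypothesis(scores: Dict[str, float]) -> str:
    """Return the hypothesis with the highest combined score."""
    return max(scores, key=scores.get)


# Illustrative values only: two confusable Mandarin digits, "si4" (four)
# and "shi2" (ten), with invented log-scores.
spectral_scores = {"si4": -10.0, "shi2": -10.5}
prosodic_scores = {"si4": -4.0, "shi2": -1.0}

low = combine_scores(spectral_scores, prosodic_scores, reliability=0.0)
high = combine_scores(spectral_scores, prosodic_scores, reliability=1.0)
print(best_hypothesis(low), best_hypothesis(high))
```

In this sketch, a reliability of 0 leaves the spectral ranking untouched, while a high reliability lets the tone evidence decide between hypotheses that the spectral scores barely separate. A real system would also need a way to estimate the reliability index itself, which this sketch takes as given.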
Cite this article
Peng, G., Wang, W.S.-Y. An Innovative Prosody Modeling Method for Chinese Speech Recognition. International Journal of Speech Technology 7, 129–140 (2004). https://doi.org/10.1023/B:IJST.0000017013.70486.51