An Innovative Prosody Modeling Method for Chinese Speech Recognition

International Journal of Speech Technology

Abstract

This paper presents an innovative method for prosody modeling in Chinese speech recognition. The method first evaluates the reliability of the prosodic information; the recognition system then uses that reliability to dynamically tune the balance between the spectral scores and the prosodic scores. The basic idea is to use prosodic knowledge according to its reliability: the higher the reliability, the more the prosodic information contributes to recognition. In this way, the method incorporates more knowledge into the recognition system without introducing extra errors. Experimental results showed relative word error rate reductions of as much as 52.9% and 46.0% for Mandarin and Cantonese digit string recognition tasks, respectively. When tone information was incorporated into Cantonese Large Vocabulary Continuous Speech Recognition (LVCSR) via the proposed method, a 20.16% relative character error rate reduction was obtained.
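The abstract does not give the formula used to balance the two score streams, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes log-domain scores, a reliability index already normalized to [0, 1], and a simple linear weighting; the names `combine_scores`, `rescore`, and `max_weight` are hypothetical and introduced here for illustration.

```python
def combine_scores(spectral_score, prosodic_score, reliability, max_weight=1.0):
    """Add a prosodic log-score to a spectral log-score, scaled by reliability.

    Assumption (not from the paper): reliability lies in [0, 1] and the
    prosodic weight grows linearly with it, up to max_weight.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return spectral_score + max_weight * reliability * prosodic_score


def rescore(hypotheses, max_weight=1.0):
    """Return the label of the best-scoring hypothesis.

    Each hypothesis is a tuple (label, spectral_score, prosodic_score,
    reliability); this tuple layout is an illustrative choice.
    """
    best = max(
        hypotheses,
        key=lambda h: combine_scores(h[1], h[2], h[3], max_weight),
    )
    return best[0]


# Two competing hypotheses whose spectral scores are close.
unreliable = [("A", -10.0, -5.0, 0.0), ("B", -10.5, -1.0, 0.0)]
reliable = [("A", -10.0, -5.0, 0.9), ("B", -10.5, -1.0, 0.9)]

print(rescore(unreliable))  # prosody ignored, spectral score decides
print(rescore(reliable))    # prosody trusted, it can change the ranking
```

With a reliability of zero the prosodic term vanishes and the decision falls back to the spectral scores alone, which is one way to read the claim that unreliable prosodic information should not add errors.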




Cite this article

Peng, G., Wang, W.S.-Y. An Innovative Prosody Modeling Method for Chinese Speech Recognition. International Journal of Speech Technology 7, 129–140 (2004). https://doi.org/10.1023/B:IJST.0000017013.70486.51

  • DOI: https://doi.org/10.1023/B:IJST.0000017013.70486.51