Journal of Computer Science and Technology, Volume 17, Issue 3, pp 249–263

Mandarin pronunciation modeling based on CASS corpus

  • Zheng Fang
  • Song Zhanjiang
  • Fung Pascale
  • William Byrne
Regular Papers

Abstract

Pronunciation variability is an important issue that must be addressed when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of the Chinese language and developing the Bayesian equation, the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), GIF modeling and IF-GIF modeling, as well as context-dependent pronunciation weighting, are proposed based on a seed database with careful phonetic transcription. Using these methods, the Chinese syllable error rate (SER) is reduced by 6.3% and 4.2% compared with GIF modeling and IF modeling respectively when no language model, such as a syllable or word N-gram, is used. The effectiveness of these methods is also demonstrated when more data without phonetic transcription are used to refine the acoustic model with the proposed iterative forced-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.
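
The abstract reports its results as syllable error rate (SER) but does not define the metric. As a minimal sketch, assuming the usual edit-distance definition (substitutions, deletions and insertions divided by the number of reference syllables), SER could be computed as below. The function names and the toneless pinyin syllables are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two syllable sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis syllables.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(ref, hyp):
    """SER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example data, for illustration only.
reference = ["wo", "men", "qu", "bei", "jing"]
hypothesis = ["wo", "men", "qi", "bei"]
ser = syllable_error_rate(reference, hypothesis)
```

Here the hypothesis contains one substitution and one deletion against five reference syllables. Whether the reductions quoted in the abstract are absolute or relative differences in SER is not stated there, so the sketch covers only the metric itself.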

Key words

pronunciation modeling; generalized initial and final; generalized syllable; refined acoustic modeling; context-dependent weighting; iterative forced-alignment based transcribing

Copyright information

© Science Press, Beijing China and Allerton Press Inc. 2002

Authors and Affiliations

  • Zheng Fang (1)
  • Song Zhanjiang (1)
  • Fung Pascale (2)
  • William Byrne (3)
  1. Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
  2. Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Hong Kong, P.R. China
  3. Center for Language and Speech Processing, The Johns Hopkins University, USA
