International Journal of Speech Technology, Volume 10, Issue 1, pp 45–55

Automatic conversion from lexical words to prosodic words for Mandarin text-to-speech system

  • Yanqiu Shao
  • Jiqing Han
  • Ting Liu
  • Yongzhen Zhao


In real speech, prosodic words (PWs), rather than lexical words (LWs), are the basic rhythmic units. The naturalness of a Text-to-Speech (TTS) system is directly influenced by how PWs are segmented. Most PWs are combinations of several LWs. In this paper, three Lexical Combination Models are proposed for combining LWs into PWs: a Directed Acyclic Graph Model, a Segmentation Model, and a Markov Model (MM). To handle long LWs that should be segmented into two or more PWs, a Lexical Split Model (LSM) is applied to them. Experimental results show that the MM yields relatively stable results across different sets of training data. The Transformation-Based Error-Driven Learning (TBED) algorithm, which performs well on individual properties, is combined with the MM to improve the precision of PW segmentation. Experiments show that, among the proposed models, the MM combined with TBED and LSM performs best, achieving a precision of 93.00% and a recall of 93.23%. A perception test indicates that speech synthesized with PWs as the lowest prosodic units sounds more natural and acceptable than speech synthesized with LWs.
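The abstract reports precision and recall for PW segmentation but does not state how they are scored. As a minimal sketch, assuming a predicted PW counts as correct only when its start and end positions both match a reference PW, the two measures could be computed as follows. The function names `spans` and `precision_recall` and the placeholder tokens are hypothetical and do not come from the paper.

```python
def spans(words):
    """Map a segmentation (list of word strings) to a set of (start, end) offsets."""
    result = set()
    pos = 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(predicted, reference):
    """Score a predicted segmentation against a reference segmentation.

    Assumption (not from the paper): a predicted unit is correct only if
    both of its boundaries match a unit in the reference.
    """
    pred = spans(predicted)
    ref = spans(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Placeholder tokens stand in for syllable strings.
p, r = precision_recall(["ab", "c", "de"], ["ab", "cde"])
```

Here only the first predicted unit matches the reference, so one of three predicted units and one of two reference units are counted as correct.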


Keywords: Text-to-speech, Prosodic word, Lexical word, Prosodic structure





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Yanqiu Shao (1, corresponding author)
  • Jiqing Han (2)
  • Ting Liu (2)
  • Yongzhen Zhao (2)

  1. Institute of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
  2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
