
International Journal of Speech Technology, Volume 10, Issue 1, pp 31–44

Integrating coding techniques into LP-based Mandarin text-to-speech synthesis

  • Hwai-Tsu Hu
  • Hsin-Min Wang

Abstract

In this paper, speech coding techniques are integrated into a Mandarin text-to-speech system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing the modeling parameters needed for speech synthesis. As a result, the developed TTS system requires only 36 Kbytes to store all of the syllabic templates.
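As a rough illustration of this storage budget, the sketch below checks that a few hundred syllabic templates of quantized LP parameters fit within a few tens of Kbytes. The field names, frame count, and bit allocations are assumptions chosen only for the arithmetic; they are not the coding scheme used in the paper.

```python
# Hypothetical per-syllable template layout; the bit allocations and
# frame count below are assumptions for illustration, not the authors'
# actual coding scheme.
BITS_LSP_PER_FRAME = 24     # e.g. split-VQ line spectral pair indices (assumed)
BITS_GAIN_PER_FRAME = 5     # quantized frame energy (assumed)
BITS_VOICING_PER_FRAME = 1  # voiced/unvoiced flag (assumed)
FRAMES_PER_SYLLABLE = 20    # ~0.2-s syllable at a 10-ms frame rate (assumed)
NUM_SYLLABLES = 408         # base syllabic templates reported in the abstract

bits_per_template = FRAMES_PER_SYLLABLE * (
    BITS_LSP_PER_FRAME + BITS_GAIN_PER_FRAME + BITS_VOICING_PER_FRAME)
total_kbytes = NUM_SYLLABLES * bits_per_template / 8 / 1024

print(f"{bits_per_template / 8:.0f} bytes per template, "
      f"{total_kbytes:.1f} Kbytes for all {NUM_SYLLABLES} templates")
# -> 75 bytes per template, about 29.9 Kbytes in total, which is
#    consistent with the reported 36-Kbyte storage requirement.
```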

In the synthesis stage, the modeling parameters retrieved from the templates are modified according to the prosody estimated by a hierarchically layered model. To give an overall view of the performance of this TTS system, we conduct listening tests, which yield 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. This FPGA realization leads us to believe that such a TTS synthesizer can easily be incorporated into other portable devices as a voice interface.
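The following is a minimal sketch of prosody-driven LP synthesis in the spirit of this stage, not the system described in the paper: the function names, the simple impulse-train excitation, and the pitch/duration scaling factors standing in for the output of the hierarchical prosody model are all illustrative assumptions.

```python
# Minimal sketch: retrieve per-frame LP parameters, apply pitch/duration
# scaling (standing in for the prosody model's output), and run a
# gain-scaled all-pole synthesis filter G/A(z).
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc, gain, pitch_period, frame_len, state=None):
    """Drive the all-pole filter G/A(z) with an impulse train at the given pitch."""
    excitation = np.zeros(frame_len)
    excitation[::max(int(pitch_period), 1)] = 1.0      # crude voiced excitation
    if state is None:
        state = np.zeros(len(lpc))                     # filter memory (LP order)
    out, state = lfilter([gain], np.concatenate(([1.0], lpc)), excitation, zi=state)
    return out, state

def synthesize_syllable(frames, pitch_scale=1.0, dur_scale=1.0, fs=8000):
    """Modify pitch and duration before LP synthesis, frame by frame."""
    speech, state = [], None
    for lpc, gain, pitch_hz in frames:
        period = fs / (pitch_hz * pitch_scale)          # raise or lower F0
        frame_len = int(0.02 * fs * dur_scale)          # stretch or compress duration
        out, state = synthesize_frame(lpc, gain, period, frame_len, state)
        speech.append(out)
    return np.concatenate(speech)

# Example: one synthetic frame whose LP filter resonates near 500 Hz,
# synthesized with a 10% higher pitch and a 10% shorter duration.
lpc = np.array([-1.8 * np.cos(2 * np.pi * 500 / 8000), 0.81])
wave = synthesize_syllable([(lpc, 1.0, 150.0)], pitch_scale=1.1, dur_scale=0.9)
```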

Keywords

Text-to-speech; Speech coding; Linear prediction synthesizer

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. National Ilan University, Ilan, Taiwan
  2. Academia Sinica, Taipei, Taiwan
