Integrating coding techniques into LP-based Mandarin text-to-speech synthesis

International Journal of Speech Technology

Abstract

In this paper, speech coding techniques are integrated into a Mandarin text-to-speech system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing modeling parameters for speech synthesis. As a result, the developed TTS system demands merely 36 Kbytes to store all syllabic templates.
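The storage figure implies a small average budget per template. The arithmetic can be sketched as follows, assuming 1 Kbyte = 1024 bytes and an even split across templates; neither assumption is stated in the abstract, and the function name is illustrative only.

```python
# Rough storage budget per syllabic template, from the figures in the abstract.
# Assumptions (not stated in the abstract): 1 Kbyte = 1024 bytes, and the
# 36 Kbytes are spread evenly across all 408 templates.
NUM_SYLLABLES = 408
TOTAL_KBYTES = 36


def avg_template_bytes(total_kbytes, num_templates, bytes_per_kbyte=1024):
    """Average number of bytes available to each template."""
    return total_kbytes * bytes_per_kbyte / num_templates


avg_bytes = avg_template_bytes(TOTAL_KBYTES, NUM_SYLLABLES)
avg_bits = avg_bytes * 8
print(f"{avg_bytes:.1f} bytes ({avg_bits:.0f} bits) per template on average")
```

Under these assumptions each syllable gets roughly 90 bytes, which is why compact coded parameters rather than raw waveforms are needed.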

In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To give a general view of the performance of this TTS system, we conduct listening tests and obtain 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The FPGA realization suggests that such a TTS synthesizer can be easily incorporated into portable devices as a voicing interface.
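As a rough illustration of the synthesis stage, the sketch below shows generic linear-prediction (LP) synthesis: stored parameters are read from a template, the pitch period is rescaled to a prosody target, and an all-pole filter shapes the excitation. It is not the authors' method. The template fields, the impulse-train excitation, the first-order filter, and the function names are all assumptions made for the example.

```python
# Generic LP synthesis sketch. All names and values are hypothetical.

def impulse_train(num_samples, pitch_period):
    """Unit impulses every pitch_period samples (a crude voiced excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def lp_synthesize(excitation, lp_coeffs, gain=1.0):
    """All-pole filter: s[n] = gain * e[n] - sum_k a[k] * s[n - k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def scale_pitch(base_period, f0_ratio):
    """Pitch period for a target F0 equal to f0_ratio times the stored F0."""
    return max(1, round(base_period / f0_ratio))


# Hypothetical template: one LP coefficient, a gain, and a stored pitch period.
template = {"lp_coeffs": [-0.9], "gain": 1.0, "pitch_period": 80}

# Raise F0 by 25%, as a prosody model might request; the period shortens.
period = scale_pitch(template["pitch_period"], 1.25)
excitation = impulse_train(160, period)
speech = lp_synthesize(excitation, template["lp_coeffs"], template["gain"])
```

Changing only the excitation period leaves the filter, and hence the spectral envelope, untouched, which is what makes parametric templates convenient for prosody modification.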

Author information

Corresponding author

Correspondence to Hwai-Tsu Hu.

About this article

Cite this article

Hu, HT., Wang, HM. Integrating coding techniques into LP-based Mandarin text-to-speech synthesis. Int J Speech Technol 10, 31–44 (2007). https://doi.org/10.1007/s10772-008-9015-3

  • DOI: https://doi.org/10.1007/s10772-008-9015-3
