Integrating coding techniques into LP-based Mandarin text-to-speech synthesis

Hu, Hwai-Tsu; Wang, Hsin-Min

doi:10.1007/s10772-008-9015-3

Published: 06 January 2009

Volume 10, pages 31–44 (2007)

International Journal of Speech Technology

Hwai-Tsu Hu¹ &
Hsin-Min Wang²

Abstract

In this paper, speech coding techniques are integrated into a Mandarin text-to-speech (TTS) system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing modeling parameters for speech synthesis. As a result, the developed TTS system requires only 36 Kbytes to store all syllabic templates.
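To put the storage figure in perspective, the following is a rough back-of-the-envelope sketch of the average budget per syllable template, using only the two numbers quoted above (408 templates, 36 Kbytes). It assumes 1 Kbyte = 1024 bytes, which the abstract does not specify, and the function name is purely illustrative; the actual templates need not be equal in size.

```python
# Average storage per syllable template, from the figures in the abstract.
# Assumption: 1 Kbyte = 1024 bytes (the abstract does not state the convention).

NUM_TEMPLATES = 408      # syllabic utterances encoded as templates
TOTAL_KBYTES = 36        # total storage reported for all templates
BYTES_PER_KBYTE = 1024   # assumed; pass 1000 to try the decimal convention


def bytes_per_template(total_kbytes=TOTAL_KBYTES,
                       num_templates=NUM_TEMPLATES,
                       bytes_per_kbyte=BYTES_PER_KBYTE):
    """Return the mean number of bytes available per template."""
    return total_kbytes * bytes_per_kbyte / num_templates


avg_bytes = bytes_per_template()
print(f"{avg_bytes:.1f} bytes (about {avg_bytes * 8:.0f} bits) per template on average")
```

Under either convention the average comes out at roughly 90 bytes per syllable, which illustrates how compactly each template must represent its modeling parameters.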

In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To give an overall picture of the system's performance, we conduct listening tests, which yield 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The FPGA realization leads us to believe that such a TTS synthesizer can easily be incorporated into portable devices as a voice interface.

Author information

Authors and Affiliations

National Ilan University, Ilan, 260, Taiwan
Hwai-Tsu Hu
Academia Sinica, Taipei, 115, Taiwan
Hsin-Min Wang

Corresponding author

Correspondence to Hwai-Tsu Hu.

About this article

Cite this article

Hu, HT., Wang, HM. Integrating coding techniques into LP-based Mandarin text-to-speech synthesis. Int J Speech Technol 10, 31–44 (2007). https://doi.org/10.1007/s10772-008-9015-3

Received: 02 August 2006
Accepted: 26 November 2008
Published: 06 January 2009
Issue Date: March 2007
DOI: https://doi.org/10.1007/s10772-008-9015-3
