
Approaches to Improve Automatic Speech Synthesis

  • Douglas O’Shaughnessy
Part of The Springer International Series in Engineering and Computer Science book series (SECS, volume 327)

Abstract

For several years now, automatic text-to-speech systems have existed for several languages, yielding intelligible but unnatural synthetic speech. Quality inferior to that of human speech is usually due to inadequate modeling of human speech production in coarticulation, intonation, and vocal-tract excitation. We examine the current approaches in these areas, discuss the compromises that are often made, and suggest ways to improve them.

Keywords

Vocal Tract, Speech Synthesis, Speech Quality, Natural Speech, Synthetic Speech
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media New York 1995

Authors and Affiliations

  • Douglas O’Shaughnessy (1)
  1. INRS-Telecommunications, University of Quebec, Canada
