Abstract
Speech synthesis enables voice output by machines or devices. Text-to-speech (TTS) synthesis does so by using text as input. Ever since the talking machine by von Kempelen in 1791 [19.1], researchers and technologists have endeavored to make machines talk. The first electronic synthesizer, Homer Dudleyʼs Voder (Voice Operating Demonstrator), was demonstrated at the 1939 Worldʼs Fair in New York City [19.2]. Today, TTS systems enjoy wide use in assistive technologies, telecommunications, entertainment, and education. This chapter reviews the basic principles of the technology and serves as an introduction to later chapters, which provide more detailed information.
Abbreviations
- ATR: advanced telecommunications research
- CELP: code-excited linear prediction
- DTW: dynamic time warping
- GMM: Gaussian mixture model
- HMM: hidden Markov models
- HNM: harmonic-plus-noise model
- ITU: International Telecommunication Union
- LPC: linear predictive coding
- PSRELP: pitch-synchronous residual excited linear prediction
- SGML: standard generalized markup language
- SSML: speech synthesis markup language
- TD-PSOLA: time-domain pitch-synchronous overlap-add
- TTS: text-to-speech
- ToBI: tone and break indices
- UCD: unit concatenative distortion
- USD: unit segmental distortion
References
J.L. Flanagan: Speech Analysis, Synthesis and Perception (Springer, Berlin, Heidelberg 1972) pp. 204-210, http://www.haskins.yale.edu/featured/heads/SIMULACRA/kempelen.html
H. Dudley, R.R. Riesz, S.A. Watkins: A synthetic speaker, J. Franklin Inst. 227, 739-764 (1939), http://www.bell-labs.com/org/1133/Heritage/Vocoder/
W3C Standard Generalized Markup Language: http://www.w3.org/MarkUp/SGML/
W3C Speech Synthesis Markup Language Version 1.0: http://www.w3.org/TR/2003/CR-speech-synthesis-20031218/ (http://www.xml.com/pub/a/2004/10/20/ssml.html)
J.P. van Santen: Timing. In: Multilingual Text-to-Speech Synthesis - The Bell Labs Approach, ed. by R. Sproat (Springer, New York 1998) pp. 115-139
D.H. Klatt: Linguistic use of segmental duration in English: Acoustic and perceptual evidence, J. Acoust. Soc. Am. 59, 1208-1221 (1976)
A. Schweitzer, B. Moebius: Exemplar-based production of prosody: Evidence from segment and syllable durations, Proc. Speech Prosody 2004 (Nara), ed. by B. Bel, I. Marlien (ISCA, Grenoble 2004)
K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, J. Hirschberg: ToBI: A standard for labeling English prosody, Proc. ICSLPʼ92 Banff (1992) pp. 867-870
H. Fujisaki: The role of quantitative modeling in the study of intonation, Proc. Int. Symp. Japanese Prosody (1992) pp. 163-174
G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin, J. Flanagan, S. Levinson, D. Davis, S. Slimon: Numerical simulations of fluid flow in the vocal tract, Proc. of Eurospeech Madrid (1995) pp. 18-21
M. Stone: A three-dimensional model of tongue movement based on ultrasound and x-ray microbeam data, J. Acoust. Soc. Am. 87, 2207-2217 (1990)
T. Baer, J.C. Gore, L.C. Gracco, P.W. Nye: Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels, J. Acoust. Soc. Am. 90, 799-828 (1991)
C.H. Coker: A model of articulatory dynamics and control, Proc. IEEE 64, 452-459 (1976)
E.L. Saltzman, K.G. Munhall: A dynamical approach to gestural patterning in speech production, Ecol. Psychol. 1(4), 333-382 (1989)
J. Schroeter, M.M. Sondhi: Speech coding based on physiological models of speech production. In: Advances in Speech Signal Processing, ed. by S. Furui, M.M. Sondhi (Marcel Dekker, New York 1991) pp. 231-268
J.D. Markel, A.H. Gray: Linear Prediction of Speech (Springer, New York 1976)
D.H. Klatt: Software for a cascade/parallel formant synthesizer, J. Acoust. Soc. Am. 67, 971-995 (1980)
H.M. Hanson, K.N. Stevens: A quasiarticulatory approach to controlling acoustic source parameters in a Klatt-type formant synthesizer using HLsyn, J. Acoust. Soc. Am. 112, 1158-1182 (2002)
J.P.H. van Santen: Combinatorial issues in text-to-speech synthesis, EuroSpeech ʼ97 5th European Conference on Speech Communication and Technology 5, 2511-2514 (1997)
J.P. Olive: Rule synthesis of speech from diadic units, Proc. ICASSP 77, 568-570 (1977)
O. Fujimura, J. Lovins: Syllables as concatenative phonetic elements. In: Syllables and Segments, ed. by A. Bell, J.B. Hooper (North-Holland, New York 1978) pp. 107-120
J. Olive, J. van Santen, B. Möbius, C. Shih: Synthesis. In: Multilingual Text-to-Speech Synthesis - The Bell Labs Approach, ed. by R. Sproat (Kluwer Academic, Dordrecht 1998), Chap. 7
Y. Sagisaka: Speech synthesis by rule using an optimal selection of non-uniform synthesis units, Proc. ICASSP 88, 679-682 (1988)
A. Hunt, A.W. Black: Unit selection in a concatenative speech synthesis system using a large speech database, Proc. ICASSP 96, 373-376 (1996)
A.W. Black, N. Campbell: Optimising selection of units from speech databases for concatenative synthesis, ESCA Eurospeech 95, 581-584 (1995)
L. Rabiner, B.H. Juang: Fundamentals of Speech Recognition (Prentice-Hall, Englewood Cliffs 1993) pp. 339-341
J. Vepa, S. King: Join cost for unit selection speech synthesis. In: Text-to-Speech Synthesis - New Paradigms and Advances, Professional Technical Reference, ed. by S. Narayanan, A. Alwan (Prentice-Hall, Upper Saddle River 2004) pp. 35-62, Chap. 3
E. Eide, R. Bakis, W. Hamza, J.F. Pitrelli: Toward expressive synthetic speech. In: Text-to-Speech Synthesis - New Paradigms and Advances, Professional Technical Reference, ed. by S. Narayanan, A. Alwan (Prentice-Hall, Upper Saddle River 2004) pp. 219-248, Chap. 11
A.W. Black, P. Taylor: Automatically clustering similar units for unit selection in speech synthesis, Proc. Eurospeech 97, 601-604 (1997)
K. Tokuda, H. Zen, A.W. Black: An HMM-based approach to multilingual speech synthesis. In: Text-to-Speech Synthesis - New Paradigms and Advances, Professional Technical Reference, ed. by S. Narayanan, A. Alwan (Prentice-Hall, Upper Saddle River 2004) pp. 135-153, Chap. 7
T. Dutoit: An Introduction to Text-to-Speech Synthesis (Kluwer Academic, Dordrecht 1997)
E. Moulines, F. Charpentier: Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Commun. 9(5-6), 453-467 (1990)
M. Macchi, M.J. Altom, D. Kahn, S. Singhal, M. Spiegel: Intelligibility as a function of speech coding method for template-based speech synthesis, Proc. Eurospeech 93, 893-896 (1993)
W. Kleijn, K. Paliwal (Eds.): Speech Coding and Synthesis (Elsevier, Amsterdam 1995)
T.F. Quatieri, R.J. McAulay: Shape invariant time-scale and pitch modification of speech, IEEE Trans. Signal Process. 40(3), 497-510 (1992)
Y. Stylianou: Applying the harmonic plus noise model in concatenative speech synthesis, IEEE Trans. Speech Audio Process. 9(1), 21-29 (2001)
H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Commun. 27(3-4), 187-207 (1999)
M. Abe, S. Nakamura, K. Shikano, H. Kuwabara: Voice conversion through vector quantization, Proc. IEEE ICASSP 88, 655-658 (1988), S14.1
I. Stylianou: Modèles Harmoniques plus Bruit combinés avec des Méthodes Statistiques, pour la Modification de la Parole et du Locuteur, Doctoral Thesis (Ecole Nationale Supérieure des Télécommunications, Paris 1996), in French
A. Kain, M. Macon: Spectral voice conversion for text-to-speech synthesis, Proc. IEEE ICASSP 98, 285-288 (1998)
M.F. Spiegel, M.J. Altom, M.J. Macchi: Comprehensive assessment of the telephone intelligibility of synthesized and natural speech, Speech Commun. 9, 279-291 (1990)
A. Syrdal: Development of a standard for the evaluation of intelligibility of text-to-speech synthesis systems by ANSI Accredited Standards Committee S3, Bioacoustics, working group S3/WG 91, Text-to-Speech Synthesis Systems, Personal communication (2007)
ITU-T: A Method for Subjective Performance Assessment of the Quality of Speech Output Devices (International Telecommunication Union, Geneva 1994), Recommendation P.85
Y.V. Alvarez, M. Huckvale: The reliability of the ITU-T P.85 standard for the evaluation of text-to-speech systems, Proc. ICSLP 2002, 329-332 (2002)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Schroeter, J. (2008). Basic Principles of Speech Synthesis. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_19
DOI: https://doi.org/10.1007/978-3-540-49127-9_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9
eBook Packages: Engineering, Engineering (R0)