Abstract
Close shadowing experiments involving natural and synthetic stimuli are described. Preliminary results show that speakers are able to follow natural stimuli with an average delay of 70 ms, whereas this delay typically exceeds 100 ms for stimuli produced by text-to-speech systems. A complementary experiment shows that this contrast is mainly due to the inappropriate or impoverished prosody generated by current text-to-speech systems.
Cite this article
Bailly, G. Close Shadowing Natural Versus Synthetic Speech. International Journal of Speech Technology 6, 11–19 (2003). https://doi.org/10.1023/A:1021091720511