Speech exhibits nested linguistic properties: Clauses contain phrases, which contain words, which are composed of syllables, which comprise phonemic segments. The attributes at each scale are readily recognized, yet classic perceptual analyses of the information conveyed by speech have focused on the rapid rate of the production and perception of consonants and vowels, the elementary linguistic constituents that compose utterances (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Miller & Licklider, 1950). In ordinary circumstances, this rate might exceed a dozen segments per second. An acknowledgment of the rapidity of production underlies a foundational argument in cognitive science (Lashley, 1951), that utterances are planned and expressed, rather than triggered in chains of stimulus and response.

The projection of a fading auditory trace into durable linguistic form occurs with some urgency, according to classic estimates of auditory sensory decay: A trace fades in either 50 ms (Baddeley, 1986; Haggard, 1985; Lashley, 1951; Liberman et al., 1967), or <100 ms (Cudahy & Leshowitz, 1974; Elliott, 1967; Huggins, 1964; Miller & Licklider, 1950; Pisoni, 1973). Some more recent estimates converge on these measures: 50–100 ms (Fu & Galvin, 2001; Remez et al., 2010; Remez, Ferro, Wissig, & Landau, 2008). Nonetheless, a challenge to this estimate of fine-grained temporal sensitivity is posed by reports of sensory resolution at the far slower pace of syllables, between 3 and 8 Hz, or roughly 120 to 333 ms per syllable (Cherry, 1953; Drullman, Festen, & Plomp, 1994; Greenberg & Arai, 1998; Greenberg, Arai, & Grant, 2006; Saberi & Perrott, 1999).

One influential report noted perceptual sparing, despite temporal distortion approaching syllable duration (Saberi & Perrott, 1999; see also Steffen & Werani, 1994). In this method, a sample of speech was divided into equal intervals, each of which was reflected temporally, and the time-reversed segments were then sequenced in the original order, composing an utterance of veridically ordered time-reversed excerpts. The performance measures of the tolerance of temporal distortion revealed that a reversed segment duration as great as 135 ms reduced the judged intelligibility merely by half. This was offered as evidence that neither a detailed sensory representation nor a perceptual analysis of the fine structure of the auditory stream is required for the recognition of linguistic properties. Yet, the acoustic technique used to estimate the effects of temporal distortion disrupts the acoustic modulation of speech, but not its short-term spectra. With time-reversed excerpts retaining the auditory quality of every vowel and of nasal, aspirate, and fricative consonants, this method presumably contaminates the assessment of time-critical modulation sensitivity with perceptual effects of timbre, which is unaltered by temporal distortion (see, e.g., Clarke, Becker, & Nixon, 1966; Van Lancker, Kreiman, & Emmorey, 1985). Accordingly, this confounding was likely to yield a falsely long estimate of the span of temporal integration. Moreover, the reliance on judged intelligibility with repeated exposure to test items, instead of a direct measure of intelligibility, was also likely to overestimate the tolerance for temporal distortion. A fairer test of modulation sensitivity might rely on a contingent task—that is, reports of the linguistic properties of unfamiliar utterances, not the subjective prominence of expected words—and would distinguish the effects of modulation from effects of the carrier spectrum.

Is auditory modulation sensitivity coupled to the rate of spoken syllables, 3–8 Hz? In a test of this claim, we report intelligibility measures of sentences exhibiting temporal distortion ranging from brief to moderate time spans. The findings corroborated both classic and recent reports that sensitivity to modulation in speech approximates the linguistic constituent of the phonetic segment, far briefer than the syllable. In extending the precedent, an assay was created to compare natural and sine-wave speech (Remez, 2008; Remez, Rubin, Pisoni, & Carrell, 1981), in order to estimate modulation sensitivity (Elliott & Theunissen, 2009; Greenberg & Arai, 2001) exclusive of the perceptual effects of short-term spectra. This empirical practice allows a test to use transcription accuracy, a contingent task that is simple for participants, and to measure modulation sensitivity independent of short-term auditory quality, a property unaffected by temporal reversal. Differing in short-term spectra only, the modulations of natural and sine-wave speech are matched. Indeed, the intelligibility difference reported here between natural speech and the sine-wave conditions is arguably due to the perceptual effect of short-term timbre, independent of modulation, and exposes the likely contribution of vocal quality in the precedent.

With no evident correspondence in the pace of syllables and the temporal grain of auditory integration, these new findings show that the syllable derives its cognitive importance from its linguistic function (see Peelle, Gross, & Davis, 2012), which weakens the claim (Ahissar et al., 2001; Kerlin, Shahin, & Miller, 2010; Luo & Poeppel, 2007) that cycles of brain activity at the approximate periodicity of syllables reflect a specifically sensory integrative function, or that a cortical cycle of this periodicity entrained a fundamental sensory function during primate evolution (Ghazanfar, Chandrasekaran, & Morrill, 2010; cf. MacNeilage, 1998).

Experiment

The method of the present project used the acoustic technique of Saberi and Perrott (1999), imposing temporal distortion on a speech waveform, but we sharpened the perceptual measures in two ways. First, a variety of sentences was used, in two acoustic forms, as natural samples and as sine-wave replicas. In addition to diversifying the variety of spoken items presented to listeners—the empirical precedent (Saberi & Perrott, 1999) had used a single natural utterance, and an extension had used nine (Kiss, Cristescu, Fink, & Wittmann, 2008)—these new tests also aimed to distinguish modulation sensitivity from the perceptual effects of short-term natural vocal timbre (e.g., Terasawa, Slaney, & Berger, 2005). Because some consonants and vowels briefly approximate stationary spectra, these impressions of familiar timbre are conserved despite temporal reversal, and arguably may retain their perceptual function whether a sample is temporally veridical or reversed. Sine-wave speech lacks the short-term spectral details of natural vocalization, and without familiar timbre the recognition of linguistic attributes rests largely on sensitivity to modulation, despite an unspeechlike subjective quality (Remez, 2008; Remez et al., 1981).

A second aspect of the procedure also improved the sensitivity of the test. Transcription accuracy was used here as a direct measure of intelligibility, in contrast to prior methods. Saberi and Perrott (1999) relied on indirect reports that a known sentence was spared subjective disruption by temporal distortion. An intelligibility measure was combined with the method of limits by Kiss et al. (2008), using ascending runs decreasing in the duration of reversed segments. In the present test, a listener was assigned to a single condition only, preventing a trial in one condition from influencing performance in another. As a control and replication, some listeners in the present study reported the extent to which a printed sentence shown during a trial remained intelligible, despite the imposition of temporal distortion on a natural sample. Adopting this method along with direct performance measures of intelligibility permitted a comparison of the present methods with the empirical precedent.

Method

Acoustic test materials

Twelve sentences (see the Appendix) spoken by one of the authors (K.R.D., an adult female) were sampled to disk at 44.1 kHz. The average syllable duration of these items was 277 ms (SD = 128 ms) excluding the final stressed syllables (average duration = 496 ms, SD = 124 ms), which lies within the range of 120–333 ms designated by the hypothetical syllable pace. Temporally distorted versions were created by reversing small portions of the waveform of each sample and assembling the reversed portions in veridical order (see Figs. 1 and 2). The reversal spans applied to the natural sentences were 0, 50, 75, 100, and 150 ms.
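The reversal procedure described above can be sketched in a few lines. This is a minimal illustration and not the software used in the study: the function name `locally_reverse`, the rounding of the span to whole samples, and the treatment of a final partial span (reversed as a shorter span) are assumptions, and no smoothing is applied at span boundaries.

```python
def locally_reverse(samples, sample_rate_hz, span_ms):
    """Time-reverse each consecutive span of `span_ms`, keeping the spans in order.

    `samples` is a sequence of waveform samples. A span of 0 ms returns an
    unchanged copy, corresponding to the undistorted condition.
    """
    if span_ms <= 0:
        return list(samples)
    # Assumption: the span is rounded to a whole number of samples.
    span_len = max(1, round(sample_rate_hz * span_ms / 1000))
    distorted = []
    for start in range(0, len(samples), span_len):
        # Assumption: a shorter final span is reversed like any other span.
        distorted.extend(reversed(samples[start:start + span_len]))
    return distorted


# Toy waveform: 10 samples at 1 kHz with 4-ms spans (4 samples per span).
demo = locally_reverse(list(range(10)), sample_rate_hz=1000, span_ms=4)
print(demo)  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

Each span is reversed internally while the order of the spans is preserved, so the coarse temporal order of the utterance survives while the local modulation within each span is inverted.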

Fig. 1

Effect of temporal reversal of brief segments of a natural speech sample. (Top) A temporally veridical representation of the phrase “the winding,” excerpted from the test item “Take the winding path to reach the lake.” (Bottom) The waveform created by reversing 75-ms segments and recomposing the wave

Fig. 2

Spectrographic comparison of natural speech (top), its sine-wave replica (middle), and the composite of temporally reversed 75-ms natural samples (bottom). The phrase is “the winding,” excerpted from the test item “Take the winding path to reach the lake”

Unaltered natural samples of the 12 sentences were used as models for the creation of sine-wave speech. Estimates of the center frequency and amplitude of vocal resonances were created by hand and used as synthesis parameters for four time-varying sinusoids (see Remez et al., 2011). Temporally distorted versions were created by reversing a brief span of a waveform and composing a new waveform of reversed samples, preserving the original order. The reversal spans applied to sine-wave sentences were 0, 25, 50, and 75 ms.
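The general idea of driving a small set of sinusoids with frequency and amplitude tracks can be sketched as follows. This is an illustrative sketch under stated assumptions, not the synthesis procedure of the study: the function name `synthesize_sinewave`, the frame-based parameter format, and the piecewise-constant parameters within a frame are all hypothetical (a practical synthesizer would typically interpolate between frames).

```python
import math


def synthesize_sinewave(tracks, frame_rate_hz, sample_rate_hz):
    """Sum time-varying sinusoids defined by per-frame (frequency, amplitude) pairs.

    `tracks` holds one list per tone; each list has one (freq_hz, amplitude)
    pair per analysis frame. All tones are assumed to have the same length.
    """
    samples_per_frame = sample_rate_hz // frame_rate_hz
    n_frames = len(tracks[0])
    waveform = [0.0] * (n_frames * samples_per_frame)
    for tone in tracks:
        phase = 0.0  # accumulated phase keeps the tone continuous across frames
        for frame_idx, (freq_hz, amplitude) in enumerate(tone):
            for k in range(samples_per_frame):
                i = frame_idx * samples_per_frame + k
                waveform[i] += amplitude * math.sin(phase)
                phase += 2 * math.pi * freq_hz / sample_rate_hz
    return waveform


# Hypothetical parameter values for two tones over two frames (not study data).
toy_tracks = [
    [(500.0, 1.0), (600.0, 1.0)],
    [(1500.0, 0.5), (1400.0, 0.5)],
]
wave = synthesize_sinewave(toy_tracks, frame_rate_hz=100, sample_rate_hz=8000)
```

Accumulating phase, rather than recomputing it from the frame start, avoids discontinuities when a tone changes frequency. The same function accepts four tracks, the number of sinusoids named in the text.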

Procedure

In each test session, we used 12 sentences of the same acoustic type, natural or sine-wave, the same temporal reversal, and the same response measure, transcription or the magnitude of subjective intelligibility. The design included 13 conditions overall, in three main tests: natural intelligibility, with reversal segments of 0, 50, 75, 100, and 150 ms; sine-wave intelligibility, with reversal segments of 0, 25, 50, and 75 ms; and judged subjective intelligibility, with reversal segments of 50, 75, 100, and 150 ms. These conditions were chosen to track performance in each of the three tests. In sessions testing intelligibility, a listener was instructed to transcribe each test sentence in a specially prepared booklet. In sessions replicating the method of reported subjective intelligibility, a listener read a printed version of a sentence before hearing it and indicated the apparent intelligibility by designating a magnitude ranging from 5 (all words intelligible) to 1 (no words intelligible). Each sentence was presented five times in succession, and all items presented within a test session shared the same temporal reversal.

Participants

A total of 104 listeners were each assigned randomly to a test condition. Each participant was right-handed and reported no history of speech or hearing difficulty.
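The between-subjects design can be enumerated and a random assignment sketched as below. This is a hypothetical illustration rather than the authors' procedure: the names `CONDITIONS` and `assign_listeners`, the fixed seed, and the shuffled-slot scheme are assumptions. The equal group size of eight follows from 104 listeners across 13 conditions and agrees with the caption of Fig. 3.

```python
import random
from collections import Counter

NATURAL_SPANS_MS = [0, 50, 75, 100, 150]
SINEWAVE_SPANS_MS = [0, 25, 50, 75]
JUDGED_SPANS_MS = [50, 75, 100, 150]

# Each condition: (acoustic type, response measure, reversal span in ms).
CONDITIONS = (
    [("natural", "transcription", ms) for ms in NATURAL_SPANS_MS]
    + [("sine-wave", "transcription", ms) for ms in SINEWAVE_SPANS_MS]
    + [("natural", "judged", ms) for ms in JUDGED_SPANS_MS]
)


def assign_listeners(n_listeners, conditions, seed=0):
    """Return one condition per listener, with equal-sized groups in random order."""
    if n_listeners % len(conditions) != 0:
        raise ValueError("listeners must divide evenly among conditions")
    per_condition = n_listeners // len(conditions)
    slots = [c for c in conditions for _ in range(per_condition)]
    random.Random(seed).shuffle(slots)  # assumption: seeded shuffle of slots
    return slots


assignment = assign_listeners(104, CONDITIONS)
group_sizes = Counter(assignment)
print(len(CONDITIONS), set(group_sizes.values()))  # 13 {8}
```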

Results and discussion

Intelligibility performance was analyzed statistically using a one-way analysis of variance on the intelligibility parameter for the natural and sine-wave sentences, and on the judged-intelligibility parameter for reports about natural sentences; each degree of temporal reversal that was tested was a treatment in the analysis. Performance differed significantly as a function of the duration of temporal reversal [natural judged, F(3, 28) = 35.756, p = .0004; natural intelligibility, F(4, 35) = 333.197, p = .0004; sine-wave intelligibility, F(3, 28) = 37.289, p = .0004]. The group performance is shown in Fig. 3; error bars portray 95 % confidence intervals. Significant differences between the individual treatment means may be seen directly in the figure.
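For readers who want to see how the F ratio and its degrees of freedom arise in a one-way analysis of variance, a minimal sketch follows. The function name `one_way_anova` is hypothetical, the toy numbers are placeholders and not data from this study, and the p value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variability of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of individual scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Placeholder scores for three treatments (illustration only, not study data).
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_between, df_within = one_way_anova(toy_groups)
```

With k treatments and N listeners in total, the degrees of freedom are k − 1 and N − k. Five treatments of eight listeners give (4, 35), and four treatments of eight listeners give (3, 28), matching the degrees of freedom reported above.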

Fig. 3

Group performance on three perceptual tasks under differing grains of temporal reversal. Each dot represents the average performance of eight listeners and 12 sentences. Error bars show the 95 % confidence intervals estimated in the analyses of variance. Judged intelligibility was converted to an estimated percentage (scale on right) and plotted on the same frame with the direct measures of intelligibility

The results of the three tests that we performed showed a clear pattern, with the first roughly replicating the finding of Saberi and Perrott (1999): When listeners knew the words composing the utterance in advance of the presentation and judgments of subjective intelligibility were used to estimate distortion tolerance, judged intelligibility declined by half when the reversal segment was 100 ms. If the sentences were not known in advance, transcription accuracy declined by half at a reversal segment of 75 ms, and at a reversal segment of 100 ms, the sentences were unintelligible. This difference as a consequence of the task is most likely due to the overestimation of distortion tolerance caused by the use of an indirect and subjective measure. Relying on transcription accuracy to estimate intelligibility, Kiss et al. (2008) used natural sentences, with each trial presenting a slightly less distorted sentence to the same listener. Only ascending runs occurred in this variant of the method of limits, and because of the cumulative effects of uncertainty across trials, it was likely to produce an underestimate of distortion tolerance. Indeed, they reported that intelligibility fell by half at a reversal segment of 50 ms, and sentences were unintelligible at 74 ms. Nonetheless, the present estimates and those reported by Kiss et al. are briefer than the hypothetical syllable range of 120–333 ms and are counterevidence to the claim that auditory integration of speech is intrinsically keyed to a syllabic rate.

The results of the sine-wave tests show that the intelligibility of intact sentences was good overall, but poorer than the natural items on which the synthesis was modeled. In the conditions with time-reversed segments, the intelligibility of sine-wave sentences was lost at a reversal segment as brief as 50 ms, which is evidence that sensitivity to modulation, independent of timbre, simply develops far more quickly than the syllable, arguably at the pace of the phonetic segment. These findings are approximate to independent measures of perceptual integration of speech signals with sparse acoustic spectra (Fu & Galvin, 2001; Remez et al., 2008; Silipo, Greenberg, & Arai, 1999) and provide a discriminating test of modulation sensitivity.

Psychoacoustically, a brief estimate of modulation sensitivity is plausible, although the performance-level difference between the natural and sine-wave conditions admittedly warrants caution. Performance was 25 % poorer with undistorted sine-wave items than with natural items, and the estimate of tolerable temporal reversal with sine-wave items was 50 % briefer than the estimate using natural items. One interpretation is that these differences reflect two consequences of the contribution to speech perception of short-term spectra and spectrotemporal modulation. When they combine, both aspects of performance are enhanced. When only modulation remains, intelligibility suffers a bit, and cognitive compensation for temporal distortion is hampered. Nonetheless, expressed in these measures might be a general relation between intelligibility and tolerance of temporal distortion. It must be conceded, however, that the technical literature has no precedent for this speculation. Moreover, it would be difficult to assess this conjecture parametrically—for instance, by titrating intelligibility in order to observe changes in distortion tolerance. Although some studies have used filtered, masked, reversed, or vocoded speech in order to preserve some acoustic properties of speech while reducing intelligibility, each of these manipulations disrupts the modulation characteristic of speech, and none is well suited for a direct investigation of modulation sensitivity. Because sine-wave synthesis and fine-grain acoustic chimeras (Smith, Delgutte, & Oxenham, 2002) retain the modulation characteristics of speech at fine frequency detail across 5 kHz, these methods are more appropriate. New tests that vary the distribution of phone classes systematically—fricatives, nasals, and liquids, for example—will also permit a parametric study of the independent effects of short-term timbre and sensitivity to modulation (see Remez et al., 2011).

In this conceptualization, the origin of modulation sensitivity is sensory, understood as an intrinsic function of an auditory system. Could this aspect of perceptual organization vary with the characteristics of an acoustic wave? Although it is customary to distinguish aspects of sensory function that are fixed from those that are altered by attention, one recent study using a method similar to that of Saberi and Perrott (1999) and of the present project reported effects contingent on syllable rate (Stilp, Kiefte, Alexander, & Kluender, 2010). The method of their project imposed temporal reversals on brief segments of a waveform, in a test of the robustness of perception despite temporal distortion. Instead of natural speech, they used speech synthesized automatically from text, constructed to exhibit three different speech rates: slow, normal, and rapid. With intelligibility as the measure, the effect of temporal distortion appeared to vary with speech rate, and was very nearly a constant function of syllable rate, independent of absolute temporal characteristics. Stilp et al. concluded that the differences in distortion tolerance due to speech rate were attributable to a match between speech rate and the reversal segment that disrupted the syllables: Fast speech was relatively more harmed by brief reversal segments, slow speech by long reversal segments, and modal speech by reversal segments of intermediate duration. Although this report warrants caution in interpreting the present measures as the result of a fixed sensory function, the actual implications of the findings reported by Stilp et al. are less certain. To explain, synthetic speech was a surrogate for sampled speech in their test items, to make it feasible for them to vary speech rate with control. But, in assembling continuous speech from discrete segment-size samples, synthetic speech produced from text by unit selection (Hunt & Black, 1996) compromises the natural dynamics of speech acoustics, interpolating segments by algorithm rather than by the natural dynamics of coarticulation. Sine-wave speech is a form of copy synthesis that preserves the dynamics of the evolving utterances exactly. Moreover, in speech synthesized by unit selection, the compromise in the dynamics is great when the synthesis rate departs significantly from the original range of articulation rates at which the segmental templates were sampled. In the method of Stilp et al., the range of syllabic rates varied in the extreme, from 2.5 to 10 Hz (100–400 ms per syllable), incidentally exceeding the hypothetical range of modulation sensitivity proposed in this literature. No tests were reported of speech rates that varied in the natural range close to the modal rate. It will be useful to evaluate the effects of speech rate in new measures with realistic test materials. For now, to consider the condition in their report closest to the natural speech condition of the present test, the estimates of distortion tolerance coincide, despite a small difference in speech rates between the natural talker in this project and the synthetic talker in theirs.

Conclusion

Because the production of speech and its acoustic effects are structured in syllables, it has seemed reasonable at times for theorists to propose a reciprocal perceptual function exhibiting a grain of organization at the level of syllables, to a first approximation. Certainly, a widely influential view of the perception of speech is that perceptual integration occurs at the grain of syllables (Mehler, Dommergues, Frauenfelder, & Segui, 1981). One variant of this claim (Poeppel, 2003) describes the periodicity of cortical networks at roughly the same cycle rate that syllables are produced, and an extrapolation from this premise proposes that the phylogenetic age of this cortical pattern antedates speech and language (Ghazanfar et al., 2010). Syllables occur at 3–8 Hz, in this view, in order to coincide with the natural characteristics of a primate vocalization system exapted for speech (though see Fox & Cohen, 1977, for an equivalent in canid vocalization). However, direct performance estimates of the persistence of auditory sensory traces do not support the premise that the integration of sensory elements occurs at the slow pace of the syllable. It is far likelier that sensory samples are rapidly bound and resolved linguistically into aggregates approximate to syllables, a conceptualization consistent with measures that distinguish sensory and cognitive effects in the cortical accompaniment to speech (Peelle et al., 2012). Although a durable phonetic encoding persists after an auditory trace has decayed (e.g., Baddeley, 1986), tests with tones (Cudahy & Leshowitz, 1974; Elliott, 1967) and with speech (Pisoni, 1973) alike have noted the short span of an auditory trace, which fades so rapidly that very little remains after a tenth of a second. The findings reported here corroborate those psychoacoustic measures and can inform theory and speculation about fundamental functions in the perceptual neuroscience of speech.