Accounting for rate-dependent category boundary shifts in speech perception

Bosker, Hans Rutger

doi:10.3758/s13414-016-1206-4

Open access
Published: 14 September 2016

Volume 79, pages 333–343, (2017)

Attention, Perception, & Psychophysics

Abstract

The perception of temporal contrasts in speech is known to be influenced by the speech rate in the surrounding context. This rate-dependent perception is suggested to involve general auditory processes because it is also elicited by nonspeech contexts, such as pure tone sequences. Two general auditory mechanisms have been proposed to underlie rate-dependent perception: durational contrast and neural entrainment. This study compares the predictions of these two accounts of rate-dependent speech perception by means of four experiments, in which participants heard tone sequences followed by Dutch target words ambiguous between /ɑs/ “ash” and /a:s/ “bait”. Tone sequences varied in the duration of tones (short vs. long) and in the presentation rate of the tones (fast vs. slow). Results show that the duration of preceding tones did not influence target perception in any of the experiments, thus challenging durational contrast as the explanatory mechanism behind rate-dependent perception. Instead, the presentation rate consistently elicited a category boundary shift, with faster presentation rates inducing more /a:s/ responses, but only if the tone sequence was isochronous. Therefore, this study proposes an alternative, neurobiologically plausible account of rate-dependent perception involving neural entrainment of endogenous oscillations to the rate of a rhythmic stimulus.

Speech can be produced at different rates. The speed at which people speak is known to vary between languages (Pellegrino, Coupé, & Marsico, 2011), between individuals (Quené, 2008), within individuals (Quené, 2013), and even within a single sentence (Miller, Grosjean, & Lomanto, 1984). At the same time, the perception of speech relies heavily on the temporal characteristics of the signal. Many phonological contrasts in the languages of the world involve temporal cues that distinguish the different phonemic categories, such as consonant voicing (voice onset time; VOT), manner of articulation (formant transition duration), gemination, and vowel length (Miller, 1981). As such, variation in the rate at which speech is produced poses a serious challenge to the perceptual system of the listener.

Considering this large-scale variation in speech rate production, listeners are known to interpret speech categories relative to the temporal properties of the surrounding context (henceforth, rate-dependent category boundary shifts). An example of the influence of proximal context (i.e., local, typically adjacent, segments) is the finding that the perception of the stop voicing contrast in English (e.g., in /ba/-/pa/; mainly cued by VOT) may be shifted toward /pa/ (longer VOT) when the duration of the following vowel is reduced (Diehl & Walsh, 1989; Kidd, 1989; Miller & Liberman, 1979; Summerfield, 1975). An example of the influence of distal context (further removed, typically nonadjacent speech: e.g., the sentence in which a target word is embedded) is the finding that the perception of an /ɑ/-/a:/ continuum in Dutch may be biased toward /a:/ by presenting the target continuum in a fast precursor sentence (Bosker & Reinisch, 2015; Reinisch, 2016a). Similar distal context effects have been found for other temporal contrasts, such as phonological voicing (VOT; Gordon, 1988), manner of articulation (Wade & Holt, 2005), lexical stress (Reinisch, Jesse, & McQueen, 2011a), and word segmentation (Reinisch, Jesse, & McQueen, 2011b).

The literature seems to suggest that rate-dependent category boundary shifts are to be explained by a general auditory mechanism. This claim is supported by evidence that rate-dependent perception is found in young infants (Eimas & Miller, 1980) and in nonhuman species (Welch, Sawusch, & Dent, 2009). Moreover, rate-dependent effects occur very early in perceptual processing (Reinisch, 2016b; Reinisch & Sjerps, 2013) and are not modulated by cognitive load (Bosker, Reinisch, & Sjerps, 2016). Finally, rate-dependent effects are sensitive to the rate of speech produced by nontarget talkers (Bosker, 2016; Newman & Sawusch, 2009) and even to the rate of nonspeech precursors (e.g., fast vs. slow pure tone sequences; Gordon, 1988; Wade & Holt, 2005; but see Pitt, Szostak, & Dilley, 2016).

Two possible general auditory mechanisms have been proposed to account for rate-dependent category boundary shifts. The first account is the principle of durational contrast, which was introduced by Diehl and Walsh (1989). They stipulated that the “perceived length of a given acoustic segment is affected contrastively by the duration of adjacent segments” (p. 2154). That is, a target phonetic duration will be perceived as longer in the context of shorter segments than in the context of longer segments.

The principle of durational contrast was originally formulated to explain proximal context effects (of adjacent segments), but has also been suggested to explain distal context effects (of sentential rate). For instance, Wade and Holt (2005) presented participants with a /ba/-/wa/ continuum (varying formant transition duration) preceded by two particular tone sequences: a fast tone sequence (short tones presented at a fast rate) or a slow tone sequence (long tones presented at a slow rate). Across two experiments (with various amplitude manipulations), the authors observed that the fast-tone sequence biased the perception of target words toward /wa/. The authors took this result as evidence for durational contrast, with the duration of the tones exerting a contrastive influence on the perception of the following ambiguous initial consonant.

Another general auditory account of rate-dependent perception involves neural entrainment to the syllabic rhythm of speech. The neurocognitive literature indicates that the brain tracks incoming speech by phase-locking intrinsic oscillators to the syllabic rhythm of the speech signal (Giraud & Poeppel, 2012). The (approximately syllabic) amplitude fluctuations present in speech are thought to elicit a phase reset of cortical oscillations, which thereafter track the speech envelope (Doelling, Arnal, Ghitza, & Poeppel, 2014; Luo & Poeppel, 2007; Peelle, Gross, & Davis, 2013). Thus, neuronal excitability is temporally aligned with the temporal structure of the acoustic input, serving as a parsing mechanism for the initial neural coding of the speech signal (Arnal, Giraud, & Poeppel, 2015; Gross et al., 2013).

Recent studies (e.g., Dilley & Pitt, 2010; Peelle & Davis, 2012; Pitt et al., 2016) have alluded to neural entrainment as a potential explanatory mechanism behind rate-dependent perception, particularly distal context effects of sentential rate (although empirical evidence is currently lacking). For instance, Peelle and Davis (2012) have suggested that there is a consistent phase relationship between the onset of phonetic segments and ongoing (entrained) cortical oscillations, guiding speech perception. To exemplify, the segmental onset for /p/ may hypothetically occur consistently in the low-excitability phase of entrained oscillations, whereas the segmental onset for /b/ may consistently occur in the high-excitability phase (cf. Figure 6 in Peelle & Davis, 2012). Neural entrainment to a fast sentential context induces shifts in the relationship between segmental onsets and oscillatory phase. Thus, the segmental onset of an ambiguous bilabial stop, following a fast precursor, may fall in a more low-excitability phase of entrained oscillations (vs. in a more high-excitability phase following a slow precursor), biasing perception toward /p/ after fast speech.

This study aims to contribute to our understanding of rate-dependent category boundary shifts in speech perception by comparing predictions from the two general auditory accounts introduced above: durational contrast and neural entrainment. This comparison primarily concerns distal context effects of sentential rate. Crucially, the two accounts differ with respect to the cue in the acoustic context that is thought to elicit these distal context effects: the duration of surrounding acoustic units (i.e., in ms) or their presentation rate (i.e., number of units per second, in Hz). Of course, in natural speech, syllabic durations and speech rate covary: faster speech typically contains shorter syllables. Nevertheless, in a lab experiment, duration and rate can easily be separated by manipulating intervening silent intervals, which allows for discrimination of predictions from the two accounts of rate-dependent perception.

Specifically, this study adopted the experimental design from Wade and Holt (2005): Participants were presented with pure tone sequences (precursors) followed by a vowel continuum between Dutch /ɑ/ and /a:/ (targets). In Experiments 1–3 (using various sample sizes and various vowel continua), four precursor conditions were used—namely, precursors containing tones with either short or long durations, presented at either a fast or a slow presentation rate. Using this full factorial design, the independent (and potentially combined) contributions of tone duration and presentation rate may be disentangled.

If we follow the principle of durational contrast, then modulating the duration of the tones in the tone precursor should elicit rate-dependent category boundary shifts, independent of the presentation rate of the tones (specifically, shorter tones would bias target perception toward /a:/). In contrast, oscillation-based models of speech perception state that intrinsic oscillators phase-lock specifically to the rate of a particular acoustic precursor (Doelling et al., 2014; Ghitza, 2014). Therefore, if we follow proposals about neural entrainment, then rate-dependent perception should be elicited by modulating the presentation rate of the tones in the precursor, independent of the duration of those tones (specifically, faster rates would bias target perception toward /a:/). Note, however, that the two accounts are not mutually exclusive; in fact, they may operate in tandem, with both duration modulations and rate modulations affecting speech perception.

Finally, proposals about neural entrainment maintain a central role for the rhythmic nature of speech in rate-dependent perception. If Experiments 1–3 find that modulating the precursors’ presentation rate elicits rate-dependent perception, then removing the regular timing of a tone sequence may eliminate the effect of different presentation rates. If, however, durational contrast induces rate-dependent perception, then the (regular or irregular) timing of a tone sequence should not influence perception. Experiment 4 compared the effect of isochronous tone precursors (as used in Experiments 1–3) to the effect that anisochronous tone precursors (i.e., with jittered interonset intervals) might have on target perception.

Experiment 1

Method

The experimental design of this study resembles the design introduced in Wade and Holt (2005). However, here, the Dutch vowel contrast between /ɑ/ and /a:/ was investigated instead of the English /ba/-/wa/ contrast in Wade and Holt (2005).

Participants

Sample sizes similar to those used in Wade and Holt (2005) were adopted. Native Dutch participants (N = 14, two males, 12 females, M_age = 33 years) with normal hearing were recruited from the Max Planck Institute (MPI) participant pool, with informed consent as approved by the Ethics Committee of the Social Sciences Department of Radboud University (Project Code: ECSW2014-1003-196).

Design and materials

The stimuli in the experiment consisted of tone precursors followed by target words (see Fig. 1). Four different precursors, each with a total duration of 4 seconds, were created in Praat (Boersma & Weenink, 2012) by crossing two different tone durations (71 vs. 125 ms) with two presentation rates (4 vs. 7 Hz):

A. SLOW, LONG: tones of 125 ms presented at a rate of 4 Hz
B. SLOW, SHORT: tones of 71 ms presented at a rate of 4 Hz
C. FAST, LONG: tones of 125 ms presented at a rate of 7 Hz
D. FAST, SHORT: tones of 71 ms presented at a rate of 7 Hz

The presentation rates were selected to fall within the range of typical speech rates. The tone durations (including a 20-ms rise-and-decay time) were selected to fall within the range of typical durations of the vowels /ɑ/ and /a:/ in Dutch, and were derived from the selected rates using the formula: 1/(2 × rate). The fundamental frequency of all pure tones was fixed at 440 Hz, thus avoiding spectral masking of the target vowels’ F0, F1, and F2. Because the phase relationship between target word onset and an acoustic periodic precursor may influence perception (ten Oever & Sack, 2015), target word onset was kept at a constant phase (0 degrees) across the different precursors.
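The relation between presentation rate, tone duration, and the intervening silence can be worked through numerically. The sketch below is illustrative only and is not the author's Praat script: the function and variable names are hypothetical, and it assumes that the interonset interval equals 1/rate and that a 4-second precursor holds rate × 4 tones.

```python
from itertools import product

PRECURSOR_S = 4.0    # total precursor duration stated in the text
RATES_HZ = (4, 7)    # slow and fast presentation rates


def tone_duration_ms(rate_hz):
    """Formula from the text, 1 / (2 x rate), rounded to whole milliseconds."""
    return round(1000 / (2 * rate_hz))


# The two tone durations follow from the two rates: 125 ms and 71 ms.
DURATIONS_MS = sorted({tone_duration_ms(r) for r in RATES_HZ}, reverse=True)


def precursor_timing(rate_hz, tone_ms):
    """Timing of one precursor condition (assumption: onset-to-onset = 1 / rate)."""
    ioi_ms = 1000 / rate_hz
    return {
        "rate_hz": rate_hz,
        "tone_ms": tone_ms,
        "ioi_ms": ioi_ms,
        "silence_ms": ioi_ms - tone_ms,           # silent gap after each tone
        "n_tones": int(PRECURSOR_S * rate_hz),    # assumption: one tone per cycle
    }


# Full factorial crossing of rate and tone duration (conditions A-D).
conditions = [precursor_timing(r, d) for r, d in product(RATES_HZ, DURATIONS_MS)]

for c in conditions:
    print(c)
```

Under these assumptions, each tone duration occurs at both rates, and only the silent interval differs between, for example, long tones at 4 Hz and long tones at 7 Hz. This is what allows duration and rate to be separated.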

For the target words, a female native speaker of Dutch was recorded, producing the Dutch minimal word pair “as” /ɑs/ (“ash”) and “aas” /a:s/ (“bait”). From these recordings, one long vowel /a:/ was selected for manipulation. Because the Dutch /ɑ/-/a:/ contrast is cued by both spectral and temporal characteristics, a two-dimensional continuum was created from this one vowel token, comprising five duration values and five F2 values, all falling within the speaker’s natural range. Spectral manipulations were based on Burg’s LPC method (implemented in Praat), with the source and filter models estimated automatically from the selected vowel. The formant values in the filter models were inspected and adjusted to result in a constant F1 value (810 Hz, ambiguous between /ɑ/ and /a:/) and one of five desired F2 values (1350–1550 Hz in steps of 50 Hz). Then, the source and filter models were recombined and the new vowels were adjusted to have the same overall amplitude as the original vowel. Based on these spectrally manipulated vowels, duration continua (120–160 ms in steps of 10 ms) were created using PSOLA. Finally, the vowel tokens were combined with one single /s/ token (set to a constant duration of 200 ms) to form 25 manipulated target words.

These target words were presented in isolation (i.e., without any precursor) to 11 native Dutch listeners in a categorization pretest (two-alternative forced choice; none of these participants took part in any of the other experiments). Listeners indicated whether they heard the word “as” or “aas.” Based on this pretest, four vowel tokens with different F2 values but identical duration (120 ms) were selected, each sampling a different point from the categorization curve: Token 1, F2 = 1400 Hz, 14 % /a:/-categorization; Token 2, F2 = 1450 Hz, 25 % /a:/-categorization; Token 3, F2 = 1500 Hz, 45 % /a:/-categorization; and Token 4, F2 = 1550 Hz, 77 % /a:/-categorization. Target words with only these four vowel tokens were used in the following experiment. Finally, the target words were combined with the four different precursors. Each stimulus was presented 10 times per session (total number of trials: 160).
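The stimulus counts reported above (25 pretest items, 160 experimental trials) follow directly from the design. The sketch below enumerates them; variable names are hypothetical, and the trial list is left unrandomized because the actual fixed random order is not given in the text.

```python
from itertools import product

# Two-dimensional pretest continuum: five durations x five F2 values.
DURATION_STEPS_MS = list(range(120, 161, 10))   # 120, 130, ..., 160 ms
F2_STEPS_HZ = list(range(1350, 1551, 50))       # 1350, 1400, ..., 1550 Hz
pretest_items = list(product(DURATION_STEPS_MS, F2_STEPS_HZ))

# Tokens selected after the pretest: fixed 120-ms duration, four F2 values.
SELECTED_F2_HZ = [1400, 1450, 1500, 1550]

# Four precursor conditions (rate x tone duration).
PRECURSORS = list(product(("slow", "fast"), ("long", "short")))

REPETITIONS = 10

# One entry per trial: (F2 of target vowel, precursor rate, precursor tone duration).
trials = [
    (f2, rate, tone)
    for _ in range(REPETITIONS)
    for f2 in SELECTED_F2_HZ
    for rate, tone in PRECURSORS
]

print(len(pretest_items), "pretest items")
print(len(trials), "experimental trials")
```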

Procedure

Stimulus presentation was controlled by Presentation software (Version 16.5; Neurobehavioral Systems, Albany, CA, USA). Stimuli were presented to half of the participants in a fixed random order, with the reversed order presented to the other half, and participants were allowed to take a short break halfway through the experiment.

Each trial started with a fixation cross appearing in the middle of the screen. After 330 ms, the auditory stimulus was presented. At target offset, the fixation cross was replaced by the two response options “as” and “aas” on the left and right side of the screen (position counterbalanced across participants), and participants were instructed to indicate by button press which target word they had heard (“1” for the left word and “0” for the right word). If participants did not respond within 4 seconds, a missing response was recorded, and the next trial was presented.

Results

Categorization data, calculated as the percentage of /a:/ responses (% /a:/), of Experiment 1 are represented in Fig. 2.

A generalized linear mixed model with a logistic link function, as implemented in the lme4 package in R (Bates, Maechler, & Bolker, 2015), tested the binomial responses (/a:/ = 1; /ɑ/ = 0) for fixed effects of Vowel F2 (continuous predictor, scaled around the mean), Tone Duration (categorical predictor, with the short duration of 71 ms mapped onto the intercept), Presentation Rate (categorical predictor, with the slow rate of 4 Hz mapped onto the intercept), and all their interactions, with random effects of Participants. By-participant random slopes for all fixed effects and all interactions were included in the model.
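The predictor coding described above can be made concrete with a small sketch of the fixed-effects part of such a model. This is not the lme4 analysis itself: random effects are omitted, all names are hypothetical, and "scaled around the mean" is assumed here to mean centering and dividing by the sample standard deviation.

```python
from math import exp
from statistics import mean, stdev

F2_VALUES_HZ = [1400, 1450, 1500, 1550]
F2_MEAN, F2_SD = mean(F2_VALUES_HZ), stdev(F2_VALUES_HZ)

COLUMNS = ["intercept", "f2", "dur", "rate",
           "f2:dur", "f2:rate", "dur:rate", "f2:dur:rate"]


def design_row(f2_hz, tone, rate):
    """One fixed-effects design row with treatment coding.

    Reference levels (coded 0): short tones (71 ms) and slow rate (4 Hz).
    """
    f2 = (f2_hz - F2_MEAN) / F2_SD       # assumption: z-scored F2
    d = 1 if tone == "long" else 0
    r = 1 if rate == "fast" else 0
    return [1, f2, d, r, f2 * d, f2 * r, d * r, f2 * d * r]


def p_long_vowel(row, betas):
    """Inverse logit of the linear predictor: probability of an /a:/ response."""
    eta = sum(x * b for x, b in zip(row, betas))
    return 1 / (1 + exp(-eta))


for name, value in zip(COLUMNS, design_row(1500, "long", "fast")):
    print(f"{name:>12}: {value:.3f}")
```

With this coding, the intercept corresponds to short tones at the slow rate with F2 at its mean, and the coefficients for tone duration and presentation rate are simple effects at those reference levels.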

This model revealed significant effects of vowel F2 (the higher the vowel’s F2, the higher the percentage of /a:/ responses; β = 1.013, z = 4.067, p < .001) and of presentation rate (a higher percentage of /a:/ responses for trials with precursors with a presentation rate of 7 Hz; β = 0.490, z = 2.440, p = .015). No effect of tone duration could be established (p > .2), nor was any interaction between any of the predictors observed.
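Because the coefficients are on the log-odds scale, it can help to convert them to odds ratios. The sketch below does this for the two reported estimates; the helper names are hypothetical, and the 50 % baseline in the usage line is an arbitrary illustrative value, not a figure from the study.

```python
from math import exp

BETA_F2 = 1.013     # reported estimate for (scaled) vowel F2
BETA_RATE = 0.490   # reported estimate for fast (7 Hz) vs. slow (4 Hz) rate


def odds_ratio(beta):
    """Multiplicative change in the odds of an /a:/ response per unit of the predictor."""
    return exp(beta)


def shifted_probability(p_baseline, beta):
    """Probability after adding beta to the log-odds of a baseline probability."""
    odds = p_baseline / (1 - p_baseline) * exp(beta)
    return odds / (1 + odds)


print(f"Odds ratio, fast vs. slow rate: {odds_ratio(BETA_RATE):.2f}")
print(f"Odds ratio, per unit of scaled F2: {odds_ratio(BETA_F2):.2f}")

# Hypothetical baseline of 50 % /a:/ responses after a slow precursor.
print(f"Shifted probability: {shifted_probability(0.5, BETA_RATE):.2f}")
```

Given the treatment coding of the model, the rate coefficient describes the fast-versus-slow difference at the reference level of tone duration and at the mean F2, so these conversions apply to that reference condition.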

Discussion

In Experiment 1, participants listened to target words ambiguous between /ɑs/ and /a:s/, preceded by tone sequences with either short or long tones, presented at either a fast or a slow presentation rate. The use of a full factorial design, crossing the factors Tone Duration and Presentation Rate, allowed for distinguishing whether the precursors’ tone duration and/or the precursors’ presentation rate induces a shift in the phonetic category boundary between /ɑ/ and /a:/.

The results of Experiment 1 are inconsistent with a durational contrast account of rate-dependent perception because no effects of varying tone durations were found. Instead, faster presentation rates biased listeners’ categorization responses toward /a:/, corroborating claims from proposals about neural entrainment.

Nevertheless, even though no significant effect of tone duration could be found, Fig. 2 suggests a numerical difference between precursors with short and long tones. Note, however, that this difference is in the opposite direction from what the durational contrast account would predict. Moreover, the statistically significant difference between precursors with a fast versus a slow presentation rate appears to vary across the vowel continuum.

Motivated by this apparent variability in the data of Experiment 1 and the present drive for replicability in psychological science (Open Science Collaboration, 2015), a second experiment was designed. Experiment 2 adopts the procedure of Experiment 1, with larger sample sizes to increase statistical power, thus increasing the generalizability of results and aiding the interpretation of potential variability in the data.