Introduction

Speech is a highly variable acoustic signal that listeners have to map onto their language system (e.g., words) in order to understand what is being said. Moreover, the listening environment is hardly ever quiet; rather, the to-be-decoded speech signal may be heard in background noise or distorted by room acoustics. Nevertheless, intuitively, speech perception does not seem like a major challenge to most listeners. This is because the human brain has a number of processes at its disposal that help listeners deal with the variability in the signal. The present study focuses on one of these processes, namely rate-dependent speech perception, whereby listeners use earlier temporal information (i.e., speech rate) in a preceding context sentence to recognize upcoming words. Specifically, we test how robustly the temporal information in a preceding context sentence is encoded in challenging listening conditions, such as background noise and reverberation.

Rate-dependent perception is typically demonstrated in languages that use duration as a cue to segmental contrasts, such as vowel length distinctions. German, for instance, distinguishes minimal word pairs differing in the vowel contrast /a/–/a:/, where words like bannen, “to banish,” contain a short /a/, and words like bahnen, “to channel,” contain a long /a:/ (without any major spectral differences; e.g., Reinisch, 2016a, 2016b). Critically, the perception of this vowel length contrast has been shown to depend on the speech rate of the preceding context. Listeners are more likely to interpret a vowel midway between /a/ and /a:/ as the long vowel /a:/ if it follows a context spoken at a fast rate, but as short /a/ if it follows a slow context (Reinisch, 2016a, 2016b). In other words, in rate-dependent perception a given duration is interpreted as contrasting with the preceding context. The effect of speech rate, then, is the difference in the likelihood that a given sound (here: vowel) is perceived as long when following a fast versus a slow context. Rate-dependent perception has been shown in many different languages, affecting a wide range of temporal contrasts, including vowel length distinctions in other languages (Gabay et al., 2019; Reinisch & Sjerps, 2013), voice onset time (VOT) of stop consonants (Kidd, 1989; Newman & Sawusch, 2009; Toscano & McMurray, 2015), formant transition duration (Wade & Holt, 2005), singletons versus geminates (Mitterer, 2018), and even the presence or absence of syllables or words (the “lexical rate effect,” as compared with rate-dependent perception of phoneme contrasts; Bosker et al., 2020a; Brown et al., 2012; Dilley & Pitt, 2010; Kaufeld et al., 2020). Note that we name but a few recent examples of studies on rate-dependent perception and refer readers to Stilp (2020) for a comprehensive review.

Rate-dependent perception has been divided into effects of proximal, distal, and global context, where proximal refers to the immediate context within approximately 250–300 ms around the target, distal refers to sentence-length context, and global to the experimental setting or general knowledge about a speaker (see, e.g., Maslowski et al., 2020, for definitions and discussion). The present study is concerned with sentence-length context, however, without distinguishing between proximal (immediately adjacent) and distal (longer, further removed) parts of the context sentences (for separate manipulations, see, e.g., Newman & Sawusch, 1996; Reinisch et al., 2011; Sawusch & Newman, 2000; Summerfield, 1981). Moreover, the present study is concerned with the rate-dependent perception of a durationally cued phoneme contrast (i.e., /a/–/a:/ in German; Reinisch, 2016a, 2016b), which some have argued to be qualitatively distinct from speech rate effects on lexical perception (i.e., dis/appearing function words in the lexical rate effect; Baese-Berk et al., 2019; Pitt et al., 2016).

Importantly, many experiments investigating rate-dependent perception used listening conditions that do not reflect what listeners typically experience in “real” life. Laboratory experiments tend to present an ideal (i.e., quiet) listening environment, which serves as a starting point for understanding the workings of a given perceptual process. Still, an increasing body of literature is concerned with the need to understand speech perception in everyday communication involving possible listening adversities (for an overview, see, e.g., Mattys et al., 2012). Critically, it has been shown that speech perception does not always operate similarly in quiet compared with when listeners are confronted with challenging listening situations. Listeners flexibly adapt to different listening conditions and reweight their reliance on different types of information accordingly (e.g., up- or down-weighting the use of acoustic, phonotactic, and lexical information; Derawi et al., 2022; Mattys, 2004; Mattys et al., 2009; Reese & Reinisch, 2022; Strauss et al., 2022; or the extent to which alternative lexical candidates are considered; Brouwer & Bradlow, 2016; McQueen & Huettig, 2012). Therefore, in order to explain the workings of speech perception in general, and of specific processes such as rate-dependent perception in particular, an assessment of its operation under different listening conditions is critical.

As for quiet listening conditions, the literature has shown that rate-dependent perception of phoneme contrasts is a low-level process that operates during early stages of speech perception. This is supported by findings that non-speech contexts, such as pure tones or sine-wave speech, can also trigger the effect (Bosker, 2017; Diehl & Walsh, 1989; Gordon, 1988; Wade & Holt, 2005), that the effect occurs very rapidly (Maslowski et al., 2020; Reinisch & Sjerps, 2013; Toscano & McMurray, 2015), and that it appears to operate prior to other early perceptual processes, such as stream segregation (Newman & Sawusch, 2009). In fact, speech rate information from competing speakers (e.g., in a cocktail party setting) cannot be ignored (Bosker et al., 2020a). This early use of rate information and its relative independence of the context being (intelligible) speech are some of the factors that have been claimed to differentiate rate-dependent perception of phoneme contrasts from the lexical rate effect (Bosker, 2017; Pitt et al., 2016). That is, the lexical rate effect tends to occur considerably later during processing (Brown et al., 2012; Brown et al., 2021; Maslowski et al., 2020) and critically depends on the context’s intelligibility (Pitt et al., 2016).

As for challenging listening conditions, rate-dependent perception has already been shown to be robust when listening to a speaker with a foreign accent (Bosker & Reinisch, 2015), when listening in a second language (Bosker & Reinisch, 2017), and even when simultaneously performing a secondary task (Bosker et al., 2017). That is, under all these conditions listeners continue to use the speech rate of a context sentence to interpret upcoming temporal cues to speech sounds, and, importantly, the speech rate effect is not reduced relative to the respective control conditions (i.e., native speech; low cognitive demands). However, how rate-dependent perception operates under energetic masking of the signal, for instance in noise, or under other types of distortion, such as reverberant environments, remains unknown.

One repeated finding of studies on rate-dependent perception in adverse conditions was that when processing resources were taxed, either by listening in a second language (Bosker & Reinisch, 2017) or by performing a concurrent visual search task while listening to the context (Bosker et al., 2017), listeners responded to the target sounds as if the context was faster than without cognitive load. Since listeners typically also give higher speech rate estimates in explicit judgment tasks under cognitive load (Bosker & Reinisch, 2017), this was interpreted with regard to the mechanism by which the speech rate of the context is calculated. Specifically, two previous accounts of perceptual encoding in adverse listening conditions were tested: what we term “noisy encoding” (Mattys & Wiget, 2011), a general reduction in the robustness of processing of the speech signal, and “shrinkage of time” (Casini et al., 2009; Chiu et al., 2019), impaired temporal sampling of the sensory input. Results suggested that with reduced cognitive resources available for speech perception, listeners appear to miss temporal pulses and thus underestimate durations. In other words, cognitive load makes speech sound fast (see Bosker et al., 2017, for a discussion). This finding, together with the lack of reduction of the rate effect, was interpreted as evidence for the “shrinkage of time” account.

Mechanistically, the temporal sampling that underlies rate-dependent perception may involve entrainment of neural oscillations. The listening brain has been shown to “track” the syllabic rate of speech by phase-locking endogenous theta oscillations (i.e., 3–9 Hz) to the amplitude envelope of speech (Doelling et al., 2014; Peelle & Davis, 2012). These rate-dependent neural oscillations have been suggested to support speech intelligibility when “in sync” with the speech amplitude fluctuations (van Bree et al., 2021), in line with earlier demonstrations of the critical contribution of slow amplitude modulations in speech to intelligibility (Drullman et al., 1994a, 1994b; Fogerty & Humes, 2012). Specifically, ongoing oscillations are proposed to build temporal predictions about upcoming sensory input. In fact, experiments using magnetoencephalography (MEG) and transcranial alternating current stimulation (tACS) point towards a causal role of speech-tracking oscillations in the theta range in rate-dependent perception: participants who show greater evidence for neural entrainment to a context speech rate in MEG also demonstrate larger rate effects in behavior (Kösem et al., 2018). There are even indications that tACS can serve as an external “pacemaker,” guiding the phase and frequency of endogenous oscillations and in turn influencing behavioral speech perception (Kösem et al., 2020; Riecke et al., 2018; Zoefel et al., 2018). In line with these neurobiological findings, behavioral rate-dependent effects are observed only for speech rates in the 3–9-Hz range—that is, when the speech rate can be encoded by ongoing theta oscillations (Bosker & Ghitza, 2018). Further behavioral support comes from the observation that special populations known to demonstrate neural entrainment impairments, such as individuals with developmental dyslexia (Goswami, 2011; Goswami et al., 2002), also show a reduced rate effect relative to typically developed listeners (Gabay et al., 2019).

The neural tracking of speech is clearly susceptible to influences from the listening conditions: it is strongly reduced when listening in noise, in competing speech, and in real-world acoustic scenes, relative to listening in quiet (Fuglsang et al., 2017; Rimmele et al., 2015; Zion Golumbic et al., 2012). This reduction in neural tracking reflects the behavioral listening challenges posed by, for instance, background noise and reverberant room acoustics (e.g., Fogerty et al., 2020; Helfer, 1994; Nábělek, 1988). Nevertheless, except under the most extreme circumstances, human speech comprehension typically does not break down entirely in challenging listening conditions. For instance, in noisy or multitalker situations, theta oscillations are often still successful at tracking the dynamics of attended speech (Ding & Simon, 2012; Mesgarani & Chang, 2012; Zion Golumbic et al., 2012). Similarly, the human brain is capable of compensating for reverberation, with speech envelopes reconstructed from EEG responses to reverberant speech resembling the original “clean” speech more than the reverberant stimulus (Fuglsang et al., 2017). This raises the question of how robust the neural oscillatory mechanism that underlies rate-dependent perception is against noise and reverberation: Is rate-dependent perception modulated by challenging listening conditions?

Therefore, the present study investigates how listeners encode the temporal information of speech when the signal is degraded by noise or reverberation. This question is tested using a rate-dependent perception paradigm with a phoneme contrast as target: German listeners were presented with three sentences played at either a fast or a slow speech rate, followed by target words sampled from an /a/–/a:/ vowel duration continuum (e.g., bannen vs. bahnen). We predicted that a fast context sentence should increase the probability of participants reporting bahnen with long /a:/, while the same target word should be more likely to be perceived as bannen with short /a/ if embedded in a slow context (a typical rate effect). Critically, we applied two types of non-linguistic signal degradation: white noise mixed with the speech signal at an SNR of 0 dB, and reverberation simulating a “big room” (see Method for details). Note that we used relatively “moderate” degrees of signal degradation, challenging listening while maintaining intelligibility, as corroborated by ceiling performance on a separate intelligibility test (for details, see the documents on OSF [https://osf.io/4fgkz/]). Consequently, we could in principle predict that the rate effect will not be affected by our two types of “moderate” signal degradation. This prediction would be supported by earlier claims that rate-dependent perception “is driven by a timing mechanism that requires hearing input as intelligible speech” (Pitt et al., 2016, p. 343). Note, however, that this claim contradicts evidence for rate effects induced by nonspeech, such as fast versus slow tones (Bosker, 2017). Still, we could speculate that as long as the signal degradation does not impact intelligibility, the rate effect should remain stable.

Alternatively, the signal degradations could have effects on rate-dependent perception similar to those of increased cognitive load. According to the “shrinkage of time” account, the same speech is perceived as faster under high versus low cognitive load, which may likewise apply to forms of perceptual load, such as signal degradation. This would predict an overall increase in long /a:/ responses in conditions of signal degradation compared with in quiet. This prediction is supported by the claim that “energetic masking not only critically impairs lexical access, it also decreases the size of the time window over which information is integrated” (Mattys et al., 2009, p. 233), hence speeding up the perceived tempo. Note that this prediction applies to stimuli in which only the context is degraded but not the target word (as in Bosker et al., 2017). In contrast, if signal degradation were applied to the entire stimulus (context and target), the speeding up of the perceived tempo would presumably apply to both contexts and targets, removing the perceptual tempo difference between context and target.

Finally, the signal degradation could also induce “noisy,” less precise temporal encoding of the speech rate, triggering a reduction of the rate effect in degraded speech versus quiet. Note that the two types of signal degradation—noise and reverberation—were chosen to compare their specific characteristics with regard to the way they distort the signal. White noise, with its uniform spectrum and lack of amplitude modulation, was taken as a baseline for overall energetic masking of the signal. Since humans do not perceive all frequencies equally, white noise masks all frequencies to the same physical degree while not interfering with these natural perceptual nonlinearities. Its masking of the spectral information should reduce the overall clarity of the speech signal. Poorer access to spectral information might consequently disrupt the encoding of temporal information needed to calculate speech rate and in this way lead to reduced rate effects on the categorization of a target word. Reverberation, in contrast, involves reflections of sound from the room’s walls and surfaces that mix with the direct sound source, specifically inducing changes in the signal’s temporal envelope (Houtgast & Steeneken, 1973). This could more directly impair the encoding of the temporal dimension of speech, possibly in the form of reduced entrainment of neural oscillations, and hence reduce the rate effect.

However, listeners have also been shown to rapidly adapt to signal degradation, learning to overcome the listening challenge posed by persistent noise or reverberation after some exposure. This is evidenced, for instance, by intelligibility improvements over the course of speech-in-noise exposure, asymptoting after as few as 15 sentences (Cainer et al., 2008). This is also in line with neurobiological evidence that not only nonprimary but also primary auditory cortex shows invariance to stable background noise (Kell & McDermott, 2019; Mesgarani et al., 2014). Human neural responses to abrupt changes in background noise show rapid and selective suppression of the acoustic characteristics of the speech-masking noise in as little as 1 second after noise onset (Khalighinejad et al., 2019). Considering this rapid adaptation to background noise, perhaps listeners are capable of quickly compensating for the masking noise in the present rate-dependent perception experiments, much like how humans learn to adjust their rate perception to atypical noise-vocoded input (Jaekel et al., 2017; Shannon et al., 1995), hence predicting similar rate effects in noise compared with in quiet.

Similarly, listeners can also adapt to reverberant environments (Beeston et al., 2014; Srinivasan & Zahorik, 2013; Stilp et al., 2016; Watkins, 2005; Watkins et al., 2011; Watkins & Makin, 2007). For instance, Watkins (2005) tested the perception of an English sir–stir continuum, which is mainly cued by the closure duration of the /t/ in stir (i.e., a longer closure suggests the presence of a /t/). He showed that adding reverberation to the target word continuum shifts the categorization boundary towards more sir responses. This suggests that listeners perceptually incorporate the reverberation with the sound such that the added “tail” from the reverberation is fused with the actual sound, obscuring the (closure of) /t/. Critically, this effect was reduced if the target word was embedded in a reverberant sentence context, suggesting that information from the context could be used to compensate for the masking effect on the target. This compensation for context even held across “changes in room” (i.e., specific characteristics of the reverberation; Watkins, 2005) and has been shown to depend on the temporal envelope rather than the temporal fine structure of the context (Watkins et al., 2011). For the present question about rate-dependent perception when confronted with distorted speech, this ability to compensate for the consequences of reverberation, and specifically its connection to the temporal envelope, might interact with the predicted reduction of the rate effect due to distortion of the context. Therefore, how reverberation and background noise affect rate-dependent perception remains an intriguing question that lies at the intersection of listener normalization for prosodic variability (here: speech rate) and listener adaptation to challenging listening conditions.

In sum, in the present study we investigated rate-dependent perception in degraded listening conditions, specifically under two types of signal degradation, white noise and reverberation, compared with a “clear” condition forming the baseline without signal degradation. In order to compare the impact of different types of contexts on the same target stimuli, in Experiment 1 only the context sentences, but not the target words, were subjected to signal degradation. This design matches previous studies on rate-dependent perception in which responses to identical targets were compared across conditions (i.e., most studies on the phonemic rate effect discussed above; e.g., Bosker et al., 2017; Reinisch, 2016a, 2016b). Since, however, such an abrupt change from a noisy or reverberant context to a clear target may seem unnatural, Experiment 2 compared rate effects across conditions where contexts and targets were coherent in terms of signal quality. This allows for detecting potential effects of adaptation to degraded listening conditions on rate-dependent perception.

Experiment 1

Method

Participants

Participants were recruited via the web platform Prolific (www.prolific.co) in February 2021 and were paid for their participation. In order to be eligible for the study, they were required to be native speakers of German living in Germany, be between 18 and 50 years of age, use a desktop computer rather than a cellphone or tablet, and wear headphones. Based on the number of participants in comparable previous studies (e.g., Bosker et al., 2017), 50 participants were recruited (27 female, 23 male), though data from one participant were excluded from analyses since this person reported in a post-experiment questionnaire that they had stopped doing the task properly at some point during the experiment. Participants’ mean age was 28.5 years (SD = 7.5). They all confirmed that they were native speakers of German and had no history of hearing impairment or dyslexia. Nineteen reported using over-ear headphones, nine on-ear, and 22 in-ear headphones. All participants gave informed consent to participate. The study was carried out in accordance with the research guidelines of the funding organization (German Research Council) and the requirements for good practice of the online platform (www.prolific.co) that was used for recruitment.

Materials

Stimuli were taken from a previous study (Reinisch, 2016b). Three German minimal word pairs differing in the /a/–/a:/ vowel duration contrast were selected as targets (bannen–bahnen, “to banish”–“to channel”; rammen–Rahmen, “to ram”–“frame”; Ratte–Rate, “rat”–“installment”). Each target pair had been recorded in a different carrier sentence that did not contain any tokens of the two critical vowels. These unique context-target pairings were kept for the present experiments. However, targets and sentences were manipulated separately before being spliced back together. In addition to the three context-target pairings, materials from two speakers were used. Both speakers were young female adults and native speakers of Standard German. Both voices had already been used in the previous study (Reinisch, 2016b), where the procedure of stimulus selection, the manipulation of the duration continuum and the speech rate of the context, as well as pretests are reported in detail.

In short, the /a/–/a:/ vowel duration continua were created by starting from the two speakers’ average duration of the long vowel for each word pair and subsequently creating 16 shorter continuum steps using the duration tier in PRAAT (Boersma & Weenink, 2009) and PSOLA (pitch-synchronous overlap-add) resynthesis. The short endpoints were at the average duration of the speakers’ short vowels. All other segments in the words were set to durations averaged over the long- and short-vowel words and across the two speakers. The sentences were also manipulated using PSOLA to create two rate conditions. For the fast rate condition, the entire sentences (though without the targets) were compressed on an individual basis to be 15% faster than the original recordings (resulting in a rate of approximately six syllables per second); for the slow condition, sentences were expanded to be 10% slower than the original (approximately 4.6 syllables per second). Two pretests then determined which part of the vowel duration continuum in the targets was suitable to yield responses ranging from predominantly “long vowel” to predominantly “short vowel” without including steps at which listeners would perform at ceiling. Based on the pretests reported in Reinisch (2016b), five continuum steps were selected per word pair for the present study. These were also the five middle steps used in the previous study and ranged from 107 to 149 ms for bannen–bahnen and rammen–Rahmen, and from 95 to 129 ms for Ratte–Rate. Note that the different values and ranges result from differences in the phonological context in the words (i.e., vowel followed by a nasal vs. a stop) and from how natural a given manipulation sounded. These values were identical for the two speakers. The pretests also determined that the rate manipulation of the context sentences was sufficiently strong to shift the perception of the vowel duration depending on the context rate. For the minimal word pairs and continuum steps selected for the present study, the difference in “long vowel” responses following the fast versus slow contexts was 15%.
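
To make the reported values concrete, the following R sketch reproduces the arithmetic. Equal spacing of the five selected steps and a base rate of roughly 5.2 syllables per second are inferences from the reported numbers, not values stated in the original study:

```r
# Five continuum steps per word pair, assuming equal spacing (ms):
seq(107, 149, length.out = 5)  # bannen-bahnen, rammen-Rahmen: 107.0 117.5 128.0 138.5 149.0
seq(95, 129, length.out = 5)   # Ratte-Rate:                    95.0 103.5 112.0 120.5 129.0

# Rate manipulation: 15% faster and 10% slower than the original rate.
base_rate <- 5.2    # assumed original rate in syllables/s
base_rate * 1.15    # fast condition: ~6.0 syllables/s (as reported)
base_rate * 0.90    # slow condition: ~4.7 syllables/s (reported: ~4.6)
```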

For the present study, these baseline stimuli formed the “clear” condition. The clear condition was further manipulated to create the noise and reverberation conditions. First, the complete sentences, including the targets at the different vowel duration steps, were manipulated. Note that this resulted in degraded context sentences including degraded targets. However, the goal of Experiment 1 was to test the effect of signal degradation on the context sentences only. Therefore, the manipulated targets were spliced off and replaced by the targets from the baseline condition (i.e., without signal degradation). No silent interval was left between carrier sentence and target. Figure 1 shows the spectrograms of the three conditions in Experiment 1.

Fig. 1

Broadband spectrograms of the three conditions for Speaker 0 and the context sentence “Im Kreuzworträtsel suchten sie den Begriff ...” (In the crossword puzzle they were looking for the term ...) at the fast rate. The target is the middle step of the bannen–bahnen vowel duration continuum. Target onset is at 1.7 s and is indicated by the vertical band of low energy, which corresponds to the closure of the /b/. The top panel shows the clear context, the middle panel the noise context, and the bottom panel the context with reverberation. The x-axis shows the time from sentence onset in seconds, the y-axis shows the frequency range from 0 to 8000 Hz, and shading shows the energy at a given point in time in a given frequency band (color online). Note that in Experiment 1 only the context sentences were degraded while the target was “clear” (as shown here), whereas in Experiment 2 the signal degradation was applied to the entire stimulus

For the noise manipulation, an existing PRAAT script was used and further adapted by the first author such that it mixed all speech sound files with the same predefined sound file containing white noise at an SNR of 0 dB. White noise was chosen to physically mask all frequencies equally while leaving the natural differences in perceiving different frequencies intact. The SNR was chosen such that the noise was clearly audible and potentially interfering but the sentences were still intelligible. For the reverb manipulation, the Vocal Toolkit plugin (Corretge, 2012–2021) in PRAAT was called via a script written by the first author. The plugin adds reverberation to each sound file by convolving it with an impulse response file provided by the toolkit. The option “Room Big” was selected at a mix of 50%. This resulted in clearly audible reverberation while keeping a reasonable level of intelligibility. Finally, all stimuli were normalized for RMS amplitude. Example stimuli for all conditions can be found on the Open Science Framework (OSF; https://osf.io/4fgkz/). Note that the levels of noise and reverberation were chosen specifically to challenge listening while maintaining intelligibility. A separate intelligibility test with new German-speaking participants confirmed ceiling performance with these signal distortions (i.e., 99% correct in quiet, 98% in noise, and 99% in reverberation; see OSF [https://osf.io/4fgkz/] for details). Thus, the present study serves as a starting point for exploring potential effects of signal degradation on rate-dependent perception.
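
For illustration, the three operations (0-dB SNR mixing, convolution reverb at a 50% wet/dry mix, and RMS normalization) can be sketched in R as follows. This is a minimal reimplementation under stated assumptions, not the original Praat scripts; `speech` and `ir` stand for hypothetical vectors of audio samples (e.g., read with tuneR::readWave), and the 50/50 interpretation of the plugin’s “mix” parameter is an assumption:

```r
rms <- function(x) sqrt(mean(x^2))

# Noise condition: scale Gaussian white noise so that
# 20 * log10(rms(speech) / rms(noise)) equals the target SNR (here 0 dB).
add_noise <- function(speech, snr_db = 0) {
  noise <- rnorm(length(speech))
  noise <- noise * rms(speech) / (rms(noise) * 10^(snr_db / 20))
  speech + noise
}

# Reverberation condition: convolve with a room impulse response and
# blend the reverberant ("wet") signal with the dry original.
add_reverb <- function(speech, ir, mix = 0.5) {
  wet <- convolve(speech, rev(ir), type = "open")  # standard convolution
  wet <- wet[seq_along(speech)]                    # trim the decaying tail
  (1 - mix) * speech + mix * wet
}

# Final step reported above: normalize all stimuli to a common RMS level.
normalize_rms <- function(x, target = 0.05) x * target / rms(x)
```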

Design and procedure

Stimuli were presented blocked by context condition, with the order of blocks roughly counterbalanced across participants. In the end, 7–9 participants completed each of the six possible orders. The slight imbalance was caused by the automatic assignment of block orders, which did not account for participants who did not complete the experiment and were hence not included in the present dataset. Within each block, all stimuli were presented in fully random order (speakers, sentences/targets, rates, continuum steps) twice, with the restriction that all stimuli had to be presented once before being repeated. The experiment was implemented in the Gorilla Experiment Builder (www.gorilla.sc), an online platform supporting web experiments.
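
In R terms, the within-block randomization amounts to the following sketch (hypothetical names; 2 speakers × 3 word pairs × 2 rates × 5 steps = 60 stimuli per block, presented twice per block, yielding 120 trials per block and 360 trials across the three blocks):

```r
# Full stimulus set for one block; each pass is shuffled separately,
# so every stimulus occurs once before any stimulus is repeated.
stimuli <- expand.grid(speaker = 1:2, pair = 1:3,
                       rate = c("fast", "slow"), step = 1:5)
block <- rbind(stimuli[sample(nrow(stimuli)), ],
               stimuli[sample(nrow(stimuli)), ])
```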

Participants were instructed by means of written text that on each trial they would be presented auditorily with a sentence ending in a word that might sound ambiguous between two options. Their task was to indicate by button press which of the two possible options they heard. For each sentence, the two possible target words (of the minimal pair) were presented visually on the screen with the letters “f” and “j” written underneath the words. These were the buttons that participants were asked to press on their computer keyboard to indicate their choice: “f” if they thought they had heard the word on the left, and “j” if they thought they had heard the word on the right. The word with the long vowel was always presented on the right, so any potential bias was the same across conditions and experiments. On each trial, the text appeared at the same time as the audio started playing and stayed on the screen until the response was logged by button press. The next trial started automatically after 1,000 ms.

Participants were informed up front that some of the stimuli might sound “noisy” but that they should ignore this noise. They received three practice trials, randomly sampled from the main experiment but identical for all participants, one in each condition, in the order clear context, noise, reverberation. After these practice trials, participants were asked to adjust the sound level of their computer to a comfortable level such that they would not need to change it anymore during the experiment. After another three (randomly selected) practice trials, they were informed that from this point onwards they were not supposed to change the volume for the rest of the experiment. The experiment was started by pressing the space bar. Between blocks, as well as once within each block, participants were allowed to take a self-paced break. The experiment consisted of a total of 360 trials and took approximately 30 minutes to complete.

Results

Statistical analyses were conducted using generalized linear mixed-effects models as implemented in the lme4 package (Bates et al., 2015) in R (Version 4.0.3; R Core Team, 2020) with a logistic link function (Jaeger, 2008) to account for the binomial nature of our dependent variable, Response, with the long vowel /a:/ coded as 1 and the short vowel /a/ coded as 0. Fixed effects were Continuum Step, Speech Rate, Condition, and all interactions. In addition, Speaker was modeled as a covariate, since an exploratory model-fitting procedure using log-likelihood ratio tests suggested a significant improvement in model fit when Speaker (contrast coded as −0.5 and 0.5) was included. Note that the inclusion of the covariate does not affect the interpretation of our main factors of interest (i.e., Continuum Step, Speech Rate, Condition), since those are modeled with regard to the mean of the levels of the covariate. The additional inclusion of trial number within each block (centered and rescaled) as a covariate did not improve the model fit. The same held for the inclusion of block order (six levels), which additionally led to convergence issues. Hence, neither Trial Number nor Block Order was included as a covariate in the final model.

Of the fixed factors of interest, Continuum Step was entered as a continuous variable centered on zero (i.e., the mean was subtracted), and Speech Rate was contrast coded with fast rate as 0.5 and slow rate as −0.5. Condition was factor coded with the level clear context mapped onto the intercept (as it serves as a baseline), and contrasts are reported for clear versus noise and clear versus reverberation.

The random-effects structure included a random intercept for participants. Random slopes were then added one at a time and kept in the model if they significantly improved the model fit, as determined by model comparisons using log-likelihood ratio tests. We report the best-fitting model that converged and did not result in a singular fit. Unless noted otherwise, random slopes for Continuum Step, Speaker, Condition, and Speech Rate were included. Note that a random intercept over items was not included, since any single factor contributing to variability in items had too few levels to be meaningful as a random factor.
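
For concreteness, the final model structure corresponds to the following lme4 call. This is a sketch with hypothetical data and variable names; the actual analysis code and data are available on OSF:

```r
library(lme4)

# Hypothetical predictor coding, mirroring the description above:
d$Step_c    <- d$Step - mean(d$Step)                 # centered continuum step
d$Rate_c    <- ifelse(d$Rate == "fast", 0.5, -0.5)   # contrast-coded speech rate
d$Speaker_c <- ifelse(d$Speaker == "S1", -0.5, 0.5)  # contrast-coded covariate
d$Condition <- relevel(factor(d$Condition), ref = "clear")  # clear = intercept

m <- glmer(
  Response ~ Step_c * Rate_c * Condition + Speaker_c +
    (1 + Step_c + Speaker_c + Condition + Rate_c | Participant),
  data = d, family = binomial
)
isSingular(m)  # reported models were required not to be singular fits
summary(m)
```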

The results of the final model are listed in Table 1, and the rate effects across conditions are illustrated in Fig. 2. The factors that are mapped onto the intercept hold for the clear Condition and show an effect of Speech Rate, with more long-vowel responses following a fast versus a slow context (the typical rate effect), and an effect of Continuum, with more long-vowel responses for longer vowels. Critically, a number of interactions were found. Specifically, the interaction between ConditionNoise and Rate as well as the interaction between ConditionReverb and Rate demonstrated that the rate effect was smaller in the two degraded speech conditions, as indicated by the negative estimates. Additionally, the effect of Continuum differed between the clear and the reverberation context such that the effect of Continuum was smaller, that is, the categorization slope was shallower, following the reverberation than the clear context.

Table 1 Results of the fixed effects of the statistical model for Experiment 1
Fig. 2

Visualization of the results of Experiment 1. The left panel shows the categorization functions (lines) based on the raw data (dots), representing the proportion of long-vowel responses over continuum steps (the higher the step, the longer the vowel). Responses following a fast context are represented as solid lines; responses following a slow context are shown as dashed lines. The colors (online) represent the conditions, with black = clear, red = noise, turquoise = reverberation. The right panel shows the rate effect in the different context conditions as measured by the difference in long-vowel responses following fast versus slow contexts. Color coding (online only) is the same as for the left panel. The error bars in both panels show one standard error, taking into account the within-participant design of context condition
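
One common way to compute such within-participant error bars is a Cousineau-style normalization; the R sketch below illustrates this approach (an assumption for illustration only; the exact computation underlying the figures may differ, see the code on OSF):

```r
# `d` holds one rate effect (fast minus slow long-vowel proportion) per
# participant and condition. Remove between-participant variability by
# centering each participant on the grand mean, then compute per-condition SEs.
d$norm <- d$effect - ave(d$effect, d$participant) + mean(d$effect)
se <- tapply(d$norm, d$condition, function(x) sd(x) / sqrt(length(x)))
```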

Additional analyses using statistically equivalent models with the factor Condition coded such that the levels noise and reverberation were mapped onto the intercept showed that, despite the reduction of the rate effect relative to the clear condition found in the main model, effects of rate were found for each of these conditions—reference is noise: b(Rate) = 0.70, SE = 0.09, z = 8.17, p < .001; reference is reverberation: b(Rate) = 0.57, SE = 0.08, z = 6.80, p < .001. Furthermore, even though we did not set out to match and compare the two degraded context conditions directly, the additional models suggest that the magnitude of the decrease in the effect of rate did not differ between the noise and reverberation conditions relative to the clear condition—reference is noise: b(Rate:conditionReverb) = −0.13, SE = 0.10, z = −1.40, p = .162. This leaves the issue for future studies to address in more detail how rate-dependent perception changes not only under different types but also under different degrees of signal degradation. The dataset, code, and results for all models can be found on OSF (https://osf.io/4fgkz/).
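
In lme4 terms, these statistically equivalent models simply refit the same formula after releveling Condition (continuing the hypothetical sketch from above):

```r
# Map the noise condition onto the intercept; the Rate_c coefficient of the
# refitted model then gives the simple effect of speech rate in noise.
d$Condition <- relevel(d$Condition, ref = "noise")
m_noise <- update(m)  # identical formula, new reference level
```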

Discussion

Experiment 1 tested the effect of signal degradation on rate-dependent perception of a German vowel duration contrast. We found that, relative to a clear context (without degradation), the rate effect, as indexed by the difference in the proportion of long-vowel responses following a time-compressed fast versus a time-expanded slow context sentence, was smaller when the context sentence was masked by white noise or degraded by reverberation. Note that both types of signal degradation were applied to the context sentences only; the target words were always presented without degradation (see Fig. 1). Hence, the reduced rate effect in noisy and reverberant contexts suggests that the signal degradation hindered the uptake of information relevant to the calculation of speech rate. The implications with regard to accounts of speech perception in degraded listening conditions will be discussed in the General Discussion.

With regard to the perception of the vowel duration continuum, results showed differences between the clear and the reverberation context conditions. A flatter categorization curve of the continuum was found following the context with reverberation. Since reverberation tends to smear spectral information over time, it likely reduces the possibility of extracting precise temporal cues. This could have impacted the reliance on the actual vowel duration during target categorization, lowering perceptual precision.

However, the main goal of Experiment 1 was to assess the magnitude of the rate effect on identical targets following different types of contexts. To achieve this goal, we varied the signal degradation in the context while keeping the target words constant (i.e., always clear). As a result, the coherence of the signal between context and target differed across conditions. While one could imagine a loud noise stopping abruptly during speech, or a listener moving out of a reverberant environment, it is evident that the clear context condition was the most natural one with regard to coherence between context and target. This raises the question of how the outcomes of Experiment 1 generalize to more naturalistic listening conditions, where signal degradations are typically relatively stable.

In order to address this issue of acoustic coherence between context and target, Experiment 2 was designed to test rate-dependent perception when not only the context sentences but also the targets were degraded by noise or reverberation. As discussed in the introduction, listeners have been shown to compensate in perception for degraded listening conditions involving noise (Cainer et al., 2008; Kell & McDermott, 2019; Khalighinejad et al., 2019; Mesgarani et al., 2014) or reverberation (Beeston et al., 2014; Srinivasan & Zahorik, 2013; Stilp et al., 2016; Watkins, 2005; Watkins et al., 2011; Watkins & Makin, 2007). If acoustic coherence between context and target allows listeners to compensate for the degradation, then the rate effect may not be reduced relative to a clear context in Experiment 2.

Experiment 2

Method

Participants

Participants were recruited via the web platform Prolific (www.prolific.co) in February 2021 according to the same criteria as those for Experiment 1, but with the additional requirement of not having participated in Experiment 1. They were paid for their participation. Again, informed consent was obtained, and the experiment was carried out in accordance with the research guidelines of the funding organization (German Research Council) and the requirements for good practice of the online platform (www.prolific.co) that was used to recruit participants. Forty-eight participants (21 female, 26 male, one undisclosed) took part, roughly matching the overall sample size of Experiment 1. Participants’ mean age was 30.4 years (SD = 6.9). In a post-experiment questionnaire, they all confirmed that they were native speakers of German and had no history of hearing impairment or dyslexia. Twenty-one reported using over-ear headphones, 10 on-ear, and 17 in-ear headphones.

Materials, design, and procedure

Materials were nearly identical to those of Experiment 1, with the only difference being that after the addition of noise or reverberation for the degraded context conditions, the targets were not spliced off and hence not replaced by the targets from the clear condition. Instead, the fully manipulated sentences were kept such that the manipulation of condition was coherent between context and target. The design and procedure of the experiment were identical to Experiment 1. Context conditions were blocked with possible orders roughly counterbalanced across participants, again resulting in 7–9 participants per block order, according to the same sampling algorithm as discussed above.

Results

Data were analyzed using the same generalized linear mixed-effects model as described for Experiment 1, except that the random slope for Speech Rate over participants had to be dropped because it resulted in a singular fit. Data and analyses are available on OSF (https://osf.io/4fgkz/). The covariates trial number within each block and block order also did not improve the model fit for Experiment 2 and were hence not included. Results are shown in Table 2, and rate effects across conditions are illustrated in Fig. 3. As in Experiment 1, for the clear Condition that was mapped onto the intercept, significant effects were found for Speech Rate (more long-vowel responses following a fast than a slow rate) and Continuum (more long-vowel responses the longer the vowel). The effect of ConditionNoise suggests that more long-vowel responses were given overall in the noise than in the clear condition. No such difference was found between the reverberation and clear conditions. Importantly, as in Experiment 1, interactions indicate that effects found for the clear condition differed in the other two conditions. This was the case for the effect of Continuum, where the negative regression weights for ConditionNoise:Continuum and ConditionReverb:Continuum suggest that the effect of Continuum was smaller, that is, categorization curves were shallower, in these two conditions relative to the clear condition. Critically, however, in contrast to Experiment 1, the effect of Speech Rate in Experiment 2 did not differ between the clear condition and the two other context conditions. In a direct comparison, running an omnibus model on the data from both Experiments 1 and 2 (with Experiment 2 mapped onto the intercept), we observed a three-way ConditionReverb:Rate:Experiment interaction. It suggests that the difference in the rate effect between the clear and reverberation context conditions was larger in Experiment 1 than in Experiment 2, likely explaining the null result for the Rate:Condition interaction in Experiment 2.
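
The omnibus comparison corresponds to a model along the following lines (again a sketch with hypothetical names; the random-effects structure shown is illustrative, not the reported one):

```r
# Combine both datasets and map Experiment 2 onto the intercept, so the
# three-way interaction tests whether the Condition-by-Rate pattern
# differs in Experiment 1.
both <- rbind(transform(d_exp1, Experiment = "Exp1"),
              transform(d_exp2, Experiment = "Exp2"))
both$Experiment <- relevel(factor(both$Experiment), ref = "Exp2")
m_omni <- glmer(
  Response ~ Step_c * Rate_c * Condition * Experiment + Speaker_c +
    (1 + Step_c + Speaker_c + Condition | Participant),
  data = both, family = binomial
)
```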

Table 2 Results of the fixed effects of the statistical model for Experiment 2
Fig. 3

Visualization of the results of Experiment 2. The left panel shows the categorization functions (lines) based on the raw data (dots), representing the proportion of long-vowel responses over continuum steps (the higher the step, the longer the vowel). Responses following a fast context are represented as solid lines; responses following a slow context are shown as dashed lines. The colors (online) represent the conditions, with black = clear, red = noise, turquoise = reverberation. The right panel shows the rate effect in the different context conditions as measured by the difference in long-vowel responses following fast versus slow contexts. Color coding (online only) is the same as for the left panel. The error bars in both panels show one standard error, taking into account the within-participant design of context condition

Discussion

Experiment 2 tested rate-dependent perception of speech in white noise and with reverberation in a situation where not only the context that provides the critical information about speech rate was degraded (as in Experiment 1) but also the to-be-recognized targets. While, unlike in Experiment 1, the perception of physically different targets is now compared, context and target were coherent within each condition. We hypothesized that this continuity in signal (with regard to degradation) might allow for better grouping of context and target and thereby allow compensation mechanisms to operate. Previous studies on speech perception and comprehension in noise and reverberation have already shown that listeners use prior exposure to compensate for signal degradation (e.g., Cainer et al., 2008; Watkins, 2005). Results of Experiment 2 suggest that when context and target were coherent, the rate effect did not differ (i.e., decrease) under noise and reverberation relative to the clear condition.

However, two differences are worth mentioning here. Firstly, we found a main effect of ConditionNoise, with more long-vowel responses in the noise than in the clear condition. While this is in line with the hypothesis that suboptimal listening conditions “make speech sound fast” (Bosker et al., 2017; Bosker & Reinisch, 2017), no such effect was found for the reverberation condition. Moreover, since no such difference in overall long-vowel responses between conditions was found in Experiment 1, we refrain from strong conclusions about this effect.

Interestingly, however, in both degraded context conditions the effect of continuum was smaller than in the clear condition; that is, identification functions were shallower. This finding is likely explained by the fact that when the targets are degraded, that is, either masked by white noise or distorted through reverberation, the vowel duration is also less easily perceived.

General discussion

The present study tested rate-dependent speech perception under degraded listening conditions, specifically, when white noise or reverberation was added to the signal, relative to a condition in which context and targets were presented in the clear. Previous research has shown that rate-dependent perception is a robust, low-level perceptual mechanism through which listeners take into account temporal properties (i.e., speech rate) of a context to interpret duration cues for spoken-word recognition (Bosker, 2017; Reinisch & Sjerps, 2013; Sjerps & Reinisch, 2015; Toscano & McMurray, 2015). However, previous studies have also shown that listeners modulate their reliance on different types of information when confronted with adverse listening conditions (e.g., Derawi et al., 2022; Mattys et al., 2009; Strauss et al., 2022), suggesting that rate-dependent perception might be modulated if the signal is degraded.

Different possible accounts of rate-dependent perception under signal degradation, specifically in white noise and reverberation, were proposed. Firstly, if the effect of signal degradation was similar to the effect of taxing cognitive resources through a secondary task (i.e., since perception becomes harder overall) then the degraded signals could have led to the impression of the stimuli being overall faster. This would predict more long-vowel responses in the degraded conditions than in the clear. Such a finding has previously been interpreted as supporting a “shrinkage of time” account where reduced cognitive resources lead listeners to miss speech samples in the calculation of rate (Bosker et al., 2017). Alternatively, signal degradation could have led to what has previously been termed “noisy encoding,” since it literally obscures acoustic information that is required to calculate speech rate. This account predicts a reduction in the rate effect under conditions of signal degradation. Although these accounts are not mutually exclusive (see the discussion in Bosker et al., 2017), the present findings mainly support the “noisy encoding” account: a reduced rate effect in categorizing a (clear) vowel duration continuum when the context sentences are distorted by white noise or reverberation.

Experiment 1 compared rate-dependent perception across conditions when only the context was degraded, allowing for a comparison of the perception of acoustically identical targets. We found in both the white noise and the reverberation condition that the rate effect was reduced relative to the clear-context condition, supporting the hypothesis of a reduced rate effect. As discussed in the introduction, the two types of signal degradation were chosen specifically to assess spectral distortion by the flat spectrum of white noise on the one hand, and the effect of temporal distortion through reverberation on the other. Both types of degradation were kept at a moderate level so as to avoid compromising the intelligibility of the sentences (which was confirmed in a separate intelligibility test). Listeners were hence likely still able to access information about speech rate, for instance, by means of entrainment of neural oscillations. This presumably accounts for why we still observed rate effects in the degraded signal conditions, albeit in reduced form. However, compared with situations where listeners were asked to divide cognitive resources during speech processing (Bosker et al., 2017) or experiments in which rate information was provided by non-speech contexts in the form of tones, here the signal that provided the rate information was less readily accessible through degradation.

Although differences between the two conditions of signal degradation could be predicted based on the specific ways in which they degrade the speech signal, the relative degree of (perceived) degradation between the two conditions is hard to quantify. That is, in each condition, the signal is degraded in a qualitatively distinct manner, and—with ceiling intelligibility performance—it cannot be claimed that the two conditions affect perceptual processing equally. This is why we refrain from interpreting a direct comparison between these two context conditions, albeit, if one insists on such a comparison, the magnitude of the reduction of the effect of rate appeared not to differ between the noise and reverberation contexts relative to the clear context in Experiment 1. Future studies may focus on a more thorough exploration of the effects of the level of noise or reverberation, asking at which thresholds of degradation generally robust low-level processes such as rate-dependent perception start to lose impact until they disappear entirely (cf. Bosker & Ghitza, 2018). The main finding of Experiment 1 of the present study was that signal degradation of a context can lead to a reduction of rate-dependent speech perception and might hence be qualitatively different from listening under taxed cognitive load with a clear speech signal (Bosker et al., 2017). Notably, this finding is in line with previous comparisons of effects of energetic masking (i.e., physical signal degradation) and informational masking involving the reduction of cognitive resources (Mattys et al., 2009).

Experiment 1 compared the impact of degraded context sentences on clear targets without signal degradation in order to compare responses to identical target stimuli. Experiment 2 then compared rate-dependent perception across conditions where not only the contexts but also the targets were manipulated. While this necessarily means that we had to compare responses to acoustically different targets, the coherence between context and target was the same across conditions. Previous studies have shown that with coherent signals listeners are able to account for noise (Cainer et al., 2008; Kell & McDermott, 2019; Khalighinejad et al., 2019; Mesgarani et al., 2014) and reverberation (Beeston et al., 2014; Srinivasan & Zahorik, 2013; Stilp et al., 2016; Watkins, 2005; Watkins et al., 2011; Watkins & Makin, 2007) in speech comprehension and phonetic categorization. Based on these previous studies, discussed in the introduction, we hypothesized that the coherence between context and targets might allow listeners to compensate in perception for the signal degradation. Indeed, Experiment 2, with both contexts and targets manipulated, did not find a reduced rate effect in the two degraded conditions relative to the clear condition. Although this null effect for an interaction between Condition and Rate has to be interpreted with caution, it does provide some indication of the robustness of rate-dependent perception.

Note that the reduced rate effects in Experiment 1 (with incoherence between contexts and targets in signal degradation) but no reduction of the rate effect in Experiment 2 (with coherence in signal degradation) may be interpreted in two different ways. One could argue that rate-dependent perception operates most efficiently if the context and target can be perceptually grouped together (i.e., coherence). However, there is evidence against this premise in the earlier literature. For instance, using different talkers in contexts versus targets does not modulate rate-dependent perception (Bosker, 2017; Maslowski et al., 2018). In fact, even the speech rate of an unattended talker in multitalker listening conditions has been found to influence the perception of targets produced by an attended talker (Bosker et al., 2020a). Therefore, we interpret the different outcomes of Experiments 1 and 2 in terms of noisy—that is, imprecise—encoding of the temporal characteristics of the context. While Experiment 2 allowed for listener adaptation to the coherent signal degradations in contexts and targets, this was not the case in Experiment 1. As a result, the temporal properties of the context and target were more difficult to contrast for the listener, reducing the rate effects in Experiment 1. This is in line with findings that rate-dependent perception is robust against noise-vocoding (Jaekel et al., 2017). In fact, cochlear-implant users demonstrate similar if not stronger rate-dependent perception compared with individuals with normal hearing, corroborating that listener compensation for signal degradation maintains rate-dependent perception (Jaekel et al., 2017).

In addition to the observed rate effects under different conditions of signal degradation in both experiments, two additional findings warrant mentioning with regard to previous studies on rate-dependent perception and phonetic categorization more generally. Firstly, despite the fact that processing resources are likely taxed under signal degradation, the present results differ from previous studies testing rate-dependent perception in a foreign language or in a dual-task situation. For instance, Bosker et al. (2017) showed that if context sentences were presented under higher cognitive load, listeners reacted as if the context speech was fast; that is, they gave overall more long-vowel responses in subsequent target categorization. Note that we would only predict a similar “shrinkage of time” effect in our Experiment 1, where the contexts were degraded but the targets were not. However, what we found was a main effect of noise versus quiet in Experiment 2, where it was not predicted to arise. Given this inconsistency, and since in both experiments the factor Condition was involved in further interactions (with Rate and/or Continuum), the present results do not speak to an account in which energetic-masking-induced cognitive load makes listeners miss samples in the speech signal, speeding up time perception.

Secondly, in addition to differences between context conditions in the magnitude of the rate effect (i.e., in Experiment 1), we found differences across conditions in the precision of perceiving the vowel duration continuum, as indicated by the steepness of the categorization functions. With the exception of the noise condition in Experiment 1, the categorization of the vowel duration continuum was less precise in the degraded conditions than in the clear condition. Note that in Experiment 1, where the targets were always presented in the clear, it is not entirely clear why the precision of perceiving the vowel duration continuum following a reverberant context should be reduced. We speculated that the smearing of spectral information over time in reverberation likely reduced listeners’ reliance on temporal cues in general and hence affected the reliance on the actual vowel duration during target categorization. In Experiment 2, the most likely explanation of less precise target categorization is that the targets themselves were also degraded and hence the actual vowel duration could not be assessed as accurately as in the clear condition.

The rate-dependent perception effect tested in the present study is an example of an acoustic context effect. It has also been referred to as a “temporal contrast effect” (Bosker et al., 2020a) and is behaviorally very similar to “spectral contrast effects,” whereby the spectral characteristics of a context sentence (e.g., a relatively high first formant, F1) influence subsequent target perception (e.g., biasing perception of an /ɪ/–/ɛ/ F1 continuum towards /ɪ/; Sjerps et al., 2011; Stilp & Assgari, 2018). Even though both types of acoustic context effects are contrastive in nature and are typically tested using similar experimental designs, recent studies suggest they involve distinct processing mechanisms. For instance, while temporal contrast effects are immune to selective attention (Bosker et al., 2020a), spectral contrast effects are strongly modulated by selective attention (Bosker et al., 2020b; Feng & Oxenham, 2018). This raises the question of whether the present modulation of rate-dependent perception by signal degradation of the context would generalize to spectral contrast effects. Reverberation would be an interesting form of degradation to test in this respect, as it obscures the temporal characteristics of the context (as in Experiment 1) while actually “smearing out” stable spectral properties across time, thus perhaps even enhancing spectral contrast effects (Stilp et al., 2016).

The present outcomes also speak to the mechanisms of acoustic context perception proposed in Bosker et al. (2017). They put forward the idea that acoustic context effects, in both the temporal (tested here) and the spectral domain, involve at least two processing stages: a first stage encompassing early and automatic perceptual normalization processes, and a second stage involving later cognitive adjustments, for instance driven by indexical speech properties (Reinisch, 2016b). We may speculate that the reduction of the rate effect by signal degradation observed here (i.e., perceptual load) arises at the first, perceptual stage, while higher-level influences, such as the perceived acceleration of time induced by cognitive load, arise at a later stage. Eye-tracking experiments quantifying the time course of different types of context effects relative to speech rate have started assessing the value of this idea (cf. Kaufeld et al., 2020; Maslowski et al., 2020; Reinisch & Sjerps, 2013).

The present experiments constitute a first step towards exploring the consequences of different types of listening environments for the functioning of low-level perceptual processes that listeners use during speech perception. The different results in the two experiments reveal the value of experimental control, while also advocating the use of more naturalistic auditory environments. That is, while Experiment 1 revealed some constraints on the temporal encoding of speech rate using artificial stimuli with sudden signal-quality transitions, Experiment 2—in turn—demonstrated that listeners can adapt to challenging listening situations if those are stable within an utterance. Overall, we showed that listeners are able to maintain rate-dependent perception in noisy or reverberant conditions (in both experiments)—be it with small reductions of the effect, depending on the precise experimental setting.