Method
All stimuli, raw data, code for analysis, and software for creating the visual stimuli are available online at https://osf.io/b94yx/.
Participants
One hundred sixty-six native English speakers, ages 18–23 years, with self-reported normal hearing and normal or corrected-to-normal vision were recruited from the Carleton College community. Participants provided written consent and received $5 for 30 minutes of participation. Carleton College’s Institutional Review Board approved all research procedures.
Stimuli
Stimuli were selected from the Speech Perception in Noise (SPIN) database (Kalikow, Stevens, & Elliott, 1977). We included both high-predictability (HP) and low-predictability (LP) sentences to assess whether any effect of the visual signal depends on predictability (see Van Engen, Phelps, Smiljanic, & Chandrasekaran, 2014 for evidence of greater visual enhancement from a face for semantically constrained sentences), and presented sentences in two-talker babble (see Helfer & Freyman, 2005 for evidence of greater visual enhancement in two-talker babble than steady-state noise). A female native English speaker without a strong regional accent produced all target sentences. Stimuli were recorded at 16-bit resolution and a 44,100-Hz sampling rate using a Shure KSM-32 microphone with a plosive screen, and were edited and equated for RMS amplitude using Adobe Audition prior to being combined with the corresponding visual signal. The target speech was delivered binaurally at approximately 66 dB SPL and noise at 70 dB SPL (SNR = −4 dB) via Sennheiser HD 280 Pro headphones. We used a custom JavaScript program to create four types of visual stimuli: audio-only, static, signal, and yoked (see Table 1 for descriptions, and Supplementary Materials for examples of each type).
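The speech and babble levels above imply a fixed target-to-noise ratio; after equating the targets for RMS amplitude, one way to set the ratio is to scale the noise track relative to the target. The sketch below (Python rather than the authors' Adobe Audition workflow; all function names and signals are illustrative) mixes a target with noise at a requested SNR:

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a signal."""
    return np.sqrt(np.mean(np.square(x)))

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so that the target-to-noise RMS ratio equals `snr_db`
    (in dB), then return the mixture. Illustrative sketch only; clipping
    and calibration to absolute dB SPL are not handled."""
    noise = noise[: len(target)]
    scale = rms(target) / (rms(noise) * 10 ** (snr_db / 20))
    return target + scale * noise

# Synthetic demonstration at the experiment's SNR of -4 dB:
fs = 44100
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 220 * t)            # stand-in for a sentence
babble = np.random.default_rng(0).normal(0, 0.1, fs)  # stand-in for babble
mix = mix_at_snr(speech, babble, snr_db=-4)
```

A negative `snr_db` makes the scaled noise louder than the target, as in the experiment (66 dB SPL speech against 70 dB SPL babble).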
Table 1
Four conditions of Experiment 1

In all conditions, the visual stimulus appeared as a small, filled-in circle. In the conditions in which the circle was modulated (signal and yoked), the diameter ranged from 50 to 200 pixels (approximately 1.1–4.5 cm), the amount of time between graphics updates (i.e., the time step) was 50 ms, and the average size of the moving low-pass filter for the acoustic signal was 151 samples. In the conditions in which the circle was unmodulated (audio-only and static), the diameter was fixed at 50 pixels. When the circle diameter was modulated, the luminance of the circle also changed linearly as a function of the amplitude of the acoustic signal, with 100% software luminance corresponding to 100% software sound level and 39% software luminance corresponding to 0% software sound level (i.e., silence). When unmodulated, the circle remained at 39% software luminance. The luminance manipulation was included to more effectively draw the listener’s attention to salient moments in the auditory stream.
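The mapping just described (diameter 50–200 pixels, luminance 39–100%, 50-ms time steps, 151-sample smoothing) can be sketched as follows. This is an illustrative Python reconstruction, not the authors' JavaScript program; a simple moving average stands in for the low-pass filter, whose exact design is not specified, and the peak normalization is an assumption.

```python
import numpy as np

def envelope_to_circle(samples, fs=44100, win=151, step_ms=50):
    """Map an audio waveform onto per-frame circle diameter (50-200 px)
    and luminance (39%-100%), one frame per 50-ms graphics update."""
    # Moving-average smoothing of the rectified signal (151 samples).
    env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    # Normalize to 0-1 (assumed; silence stays at 0).
    env = env / env.max() if env.max() > 0 else env
    hop = int(fs * step_ms / 1000)        # samples per graphics update
    frames = env[::hop]
    diameter = 50 + 150 * frames          # 0 -> 50 px, 1 -> 200 px
    luminance = 39 + 61 * frames          # 0 -> 39%, 1 -> 100%
    return diameter, luminance
```

With this mapping, silence leaves the circle at its unmodulated appearance (50 px, 39% luminance), matching the static condition's fixed values.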
Design and procedure
Each participant was randomly assigned to one of the four conditions. Participants sat at a comfortable distance from a 21.5-inch iMac computer and were presented with the same 140 target sentences in a pseudorandomized order (70 HP and 70 LP, intermixed) in a continuous stream of two-talker babble. Participants were asked to type the target sentence in a response box and then press Enter, and were encouraged to guess when unsure. Participants were instructed to keep looking at the screen throughout the experiment because the circle might provide helpful cues about the content of the target speech. The target speech began a variable amount of time (1,500–3,000 ms in 500-ms steps) after the end of the previous trial.
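The variable onset delay amounts to sampling from four discrete values. A trivial sketch (illustrative only; the presentation software is not specified):

```python
import random

# Onset delays: 1,500-3,000 ms in 500-ms steps, as described above.
DELAYS_MS = list(range(1500, 3001, 500))  # [1500, 2000, 2500, 3000]

def next_trial_delay(rng=random):
    """Pick a random delay (ms) before the next trial's speech onset."""
    return rng.choice(DELAYS_MS)
```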
Responses were scored offline by research assistants. We analyzed recognition accuracy for both the full sentences (given that information about speech onset is likely to be most helpful for items early in the sentence) and the sentence-final words (to assess whether the visual signal benefits HP words more than LP words; see Van Engen et al., 2014). The first three sentences of the pseudorandomized list were counted as practice and were therefore not included in the analyses. At the end of the study, participants were asked to subjectively rate task difficulty and their perceived accuracy at completing the task. Because these ratings were not of primary importance to the experiment, the results are reported in the Supplementary Materials.
Results and discussion
Responses were corrected for obvious typographical and spelling errors, and homophones were counted as correct. Responses that contained both words of a contraction (e.g., “I have”) were scored as correct for the single contracted word. Articles (“the,” “a,” “an”) were excluded from analysis, and compound words (e.g., “bullet-proof,” “household,” “policeman”) were coded as two separate words. One participant was excluded from all analyses due to low accuracy (more than three SDs below the mean), so the final analysis included 165 participants.
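The mechanical parts of these scoring rules can be summarized as a keyword-matching function. This sketch is illustrative only: homophone matching, contraction handling, and spelling correction were done by hand by the research assistants and are not modeled here.

```python
import re

ARTICLES = {"the", "a", "an"}  # excluded from analysis

def score_words(response: str, target: str) -> float:
    """Proportion of non-article target words reproduced in the response.
    Hyphenated compounds are split into two words, mirroring the coding
    scheme described above."""
    def tokens(s):
        words = re.findall(r"[a-z']+", s.lower().replace("-", " "))
        return [w for w in words if w not in ARTICLES]
    tgt = tokens(target)
    resp = set(tokens(response))
    return sum(w in resp for w in tgt) / len(tgt)
```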
Data were analyzed using linear mixed-effects models via the lme4 package in R (Version 3.3.3; Bates et al., 2014). To determine whether condition affected accuracy, we first built two nested models predicting recognition accuracy: one that included only type (HP or LP) as a fixed effect, and one that included both type and condition (audio-only, static, signal, yoked) as fixed effects. For all models, participants and items were entered as random effects, and the maximal random effects structure justified by the design was used (Barr, Levy, Scheepers, & Tily, 2013; see Supplementary Materials for a description of the random effects structure we employed for each set of analyses). A likelihood ratio test provided by the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017) indicated that a model with type as the only fixed effect was preferred to a model with both type and condition as fixed effects for the analysis of all words (χ2(3) = 3.06, p = .38) as well as the analysis of final words only (χ2(3) = 1.49, p = .68); that is, we found that the circle did not affect recognition in either analysis (see Fig. 1). We performed two additional model comparisons for the sentence-final word data to assess the influence of type (HP vs. LP), as well as the interaction between condition and type. We did not conduct these analyses for the full sentence data, as only the final word was predictable from context. A likelihood ratio test indicated that a model with both condition and type was preferred to a model with only condition (χ2(1) = 31.54, p < .001), suggesting that the effect of type was significant. Examination of the summary output for the full model indicated that HP words were recognized more accurately than LP words (β = −1.11, SE = .19, z = −5.93, p < .001).
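The reported p-values follow directly from the likelihood-ratio test statistics and their degrees of freedom, since the statistic is asymptotically chi-square distributed. A sketch in Python with SciPy (the authors used R with lmerTest) recovers them:

```python
from scipy.stats import chi2

def lrt_p(statistic, df):
    """Upper-tail p-value for a likelihood-ratio test statistic,
    asymptotically chi-square with `df` degrees of freedom."""
    return chi2.sf(statistic, df)

# Comparisons reported in the text:
p_condition_all   = lrt_p(3.06, 3)   # ~.38: condition not preferred (all words)
p_condition_final = lrt_p(1.49, 3)   # ~.68: condition not preferred (final words)
p_type            = lrt_p(31.54, 1)  # < .001: significant effect of type
p_interaction     = lrt_p(3.57, 3)   # ~.31: no Condition x Type interaction
```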
Finally, we found that a model without the Condition × Type interaction was preferred to a model that included the interaction (χ2(3) = 3.57, p = .31), indicating that the effect of condition was similar for HP and LP words.
The finding that the abstract visual stimulus used in this study did not facilitate speech recognition is consistent with the results of Schwartz et al. (2004) and Summerfield (1979), and may suggest that some level of phonetic detail is necessary for visual enhancement. However, it is possible that temporal features of the abstract visual stimulus enhanced low-level attentional processes, thereby reducing “listening effort” (LE)—the cognitive resources necessary to comprehend speech (Downs, 1982; see also Pichora-Fuller et al., 2016). If participants were already attending to the speech task to the best of their abilities, then these attentional benefits would not lead to improved recognition, but may instead make the recognition task less cognitively demanding.
Research on LE is based on the assumption that an individual’s pool of cognitive and attentional resources is finite (Kahneman, 1973; Rabbitt, 1968), so as a listening task becomes more difficult, fewer resources remain available to complete other tasks simultaneously. Critically, LE levels cannot necessarily be inferred from recognition scores—some interventions, such as noise-reduction algorithms in hearing aids, may reduce LE without affecting speech recognition (Sarampalis, Kalluri, Edwards, & Hafter, 2009). Thus, it may be that an abstract visual stimulus like a modulating circle reduces LE without improving recognition accuracy. Experiment 2 examined this possibility using a dual-task paradigm, a commonly used method of quantifying LE (see Gagné, Besser, & Lemke, 2017).