Method
All stimuli, raw data, code for analysis, and software for creating the visual stimuli are available online at https://osf.io/b94yx/.
Participants
One hundred sixty-six native English speakers aged 18–23 with self-reported normal hearing and normal or corrected-to-normal vision were recruited from the Carleton College community. Participants provided written consent and received $5 for 30 minutes of participation. Carleton College’s Institutional Review Board approved all research procedures.
Stimuli
Stimuli were selected from the Speech Perception in Noise (SPIN) database (Kalikow, Stevens, & Elliott, 1977). We included both high-predictability (HP) and low-predictability (LP) sentences to assess whether any effect of the visual signal depends on predictability (see Van Engen, Phelps, Smiljanic, & Chandrasekaran, 2014, for evidence of greater visual enhancement from a face for semantically constrained sentences), and presented sentences in two-talker babble (see Helfer & Freyman, 2005, for evidence of greater visual enhancement in two-talker babble than in steady-state noise). A female native English speaker without a strong regional accent produced all target sentences. Stimuli were recorded at 16 bits and 44.1 kHz using a Shure KSM-32 microphone with a plosive screen, and were edited and equated for RMS amplitude in Adobe Audition before being combined with the corresponding visual signal. The target speech was delivered binaurally at approximately 66 dB SPL and the noise at 70 dB SPL (SNR = -4 dB) via Sennheiser HD 280 Pro headphones. We used a custom JavaScript program to create four types of visual stimuli: audio-only, static, signal, and yoked (see Table 1 for descriptions, and Supplementary Materials for examples of each type).
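To make the level equalization concrete, the following R sketch illustrates RMS matching and mixing at the -4 dB SNR described above. It is illustrative only: the original stimuli were prepared in Adobe Audition, and the file names and use of the tuneR package here are assumptions.

    library(tuneR)  # assumed here only for reading audio files

    rms <- function(x) sqrt(mean(x^2))

    # Hypothetical file names; normalize 16-bit samples to the range [-1, 1]
    target <- readWave("target_sentence.wav")@left / 32768
    babble <- readWave("two_talker_babble.wav")@left / 32768

    # Scale the target so its RMS sits 4 dB below the babble RMS (SNR = -4 dB)
    snr_db <- -4
    target_scaled <- target * (rms(babble) / rms(target)) * 10^(snr_db / 20)

    # Mix the scaled target into a matching segment of the continuous babble
    mix <- babble[seq_along(target_scaled)] + target_scaled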
Table 1. Four conditions of Experiment 1.

In all conditions, the visual stimulus appeared as a small, filled-in circle. In the conditions in which the circle was modulated (signal and yoked), the diameter ranged from 50 to 200 pixels (approximately 1.1–4.5 cm), the time between graphics updates (i.e., the time step) was 50 ms, and the moving lowpass filter applied to the acoustic signal had an average size of 151 samples. In the conditions in which the circle was unmodulated (audio-only and static), the diameter was fixed at 50 pixels. When the circle diameter was modulated, the luminance of the circle also changed linearly as a function of the acoustic signal amplitude, with 100% software luminance corresponding to 100% software sound level and 39% software luminance corresponding to 0% software sound level (i.e., silence). When unmodulated, the circle remained at 39% software luminance. The luminance manipulation was included to draw the listener's attention more effectively to salient moments in the auditory stream.
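The mapping from acoustic amplitude to circle size and luminance can be sketched as follows. This is an illustrative R translation; the actual stimuli were generated by a custom JavaScript program, and the specific smoothing implementation (a 151-sample moving average via stats::filter) is an assumption.

    # Moving lowpass (moving-average) filter with a 151-sample window (assumed form)
    smooth_amp <- function(x, n = 151) {
      as.numeric(stats::filter(abs(x), rep(1 / n, n), sides = 2))
    }

    # Map a normalized amplitude (0 = silence, 1 = maximum) to the diameter
    # (50-200 px) and luminance (39-100%) ranges reported above
    amp_to_circle <- function(amp) {
      list(diameter_px = 50 + amp * (200 - 50),
           luminance_pct = 39 + amp * (100 - 39))
    }

    # Example with a placeholder 1-s signal at 44.1 kHz, sampled every 50 ms
    sig <- sin(2 * pi * 5 * seq(0, 1, length.out = 44100))
    env <- smooth_amp(sig)
    env[is.na(env)] <- 0                      # edges of the moving average
    steps <- seq(1, length(env), by = round(0.050 * 44100))
    circles <- lapply(env[steps] / max(env), amp_to_circle)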
Design and Procedure
Each participant was randomly assigned to one of the four conditions. Participants sat at a comfortable distance from a 21.5-inch iMac computer and were presented with the same 140 target sentences in a pseudorandomized order (70 HP and 70 LP, intermixed) in a continuous stream of two-talker babble. Participants were asked to type the target sentence in a response box and then press Enter, and were encouraged to guess when unsure. Participants were instructed to continue looking at the screen throughout the experiment because the circle might provide helpful cues about the content of the target speech. Target speech onset occurred a variable amount of time (1,500–3,000 ms, in 500-ms steps) after the end of the previous trial.
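As an illustration of the trial structure, the R sketch below constructs an intermixed list of 70 HP and 70 LP items and samples the onset delay for each trial. The simple random shuffle and item labels are placeholders; the constraints of the pseudorandomization actually used are not specified here.

    set.seed(1)
    hp_items <- sprintf("HP_%02d", 1:70)   # hypothetical item labels
    lp_items <- sprintf("LP_%02d", 1:70)

    trials <- data.frame(
      sentence = sample(c(hp_items, lp_items)),            # 140 intermixed targets
      onset_delay_ms = sample(seq(1500, 3000, by = 500),   # delay after the previous trial
                              140, replace = TRUE)
    )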
Responses were scored offline by research assistants. We analyzed recognition accuracy for both the full sentences (given that information about speech onset is likely to be most helpful for items early in the sentence) and sentence-final words (to assess whether the visual signal benefits high-predictability words more than low-predictability words; see Van Engen et al., 2014). The first three sentences of the pseudorandomized list were counted as practice, and were therefore not included in the analyses. At the end of the study, participants were asked “On a scale from 1 to 7, how difficult did you find this task?” and “What percentage of the sentences do you think you identified accurately?” These measures were included to assess whether participants’ subjective experience of difficulty was affected by the circle.
Results and Discussion
Responses were corrected for obvious typographical and spelling errors, and homophones were counted as correct. Responses that contained both words of a contraction (e.g., “I have”) were scored as correct for the single contracted word. Articles (“the,” “a,” “an”) were excluded from analysis, and compound words (e.g., “bullet-proof,” “household,” “policeman”) were coded as two separate words. One participant was excluded from all analyses due to low accuracy (more than three SDs below the mean), so the final analysis included 165 participants.
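The participant-exclusion rule can be expressed as a short R sketch. The data frame resp, with one row per scored word and columns participant and correct (0/1), is a hypothetical layout rather than the authors' actual data format.

    # Per-participant accuracy, and exclusion of anyone more than 3 SDs below the mean
    acc <- aggregate(correct ~ participant, data = resp, FUN = mean)
    cutoff <- mean(acc$correct) - 3 * sd(acc$correct)
    keep <- acc$participant[acc$correct >= cutoff]
    resp_trimmed <- subset(resp, participant %in% keep)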
Data were analyzed with linear mixed-effects models in R (version 3.3.3) using the lme4 package (Bates et al., 2014), and we used the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017) to obtain p-values for model parameter estimates. To determine whether condition affected accuracy, we first built two nested models predicting recognition accuracy: one that included only type (HP or LP) as a fixed effect, and one that included both type and condition (audio-only, static, signal, yoked) as fixed effects. For all models, participants and items were entered as random effects, and the maximal random-effects structure justified by the design was used (Barr, Levy, Scheepers, & Tily, 2013; see Supplementary Materials for a description of the random-effects structure we employed for each set of analyses). Given that the data were binomially distributed (1 = correct; 0 = incorrect), we used generalized linear mixed-effects models with a logit link function for this set of analyses. A likelihood ratio test indicated that the model with type as the only fixed effect was preferred to the model with both type and condition as fixed effects for the analysis of all words (χ²(3) = 3.06, p = 0.38) as well as for the analysis of final words only (χ²(3) = 1.49, p = 0.68); that is, the circle did not affect recognition in either analysis (Figure 1). We performed two additional model comparisons for the sentence-final word data to assess the influence of type (HP versus LP) as well as the interaction between condition and type. We did not conduct these analyses for the full-sentence data, as only the final word was predictable from context. A likelihood ratio test indicated that a model with both condition and type was preferred to a model with only condition (χ²(1) = 31.54, p < 0.001), indicating a significant effect of type. Examination of the summary output for the full model indicated that HP words were recognized more accurately than LP words (β = -1.11, SE = 0.19, z = -5.93, p < 0.001). Finally, a model without the condition-by-type interaction was preferred to a model that included the interaction (χ²(3) = 3.57, p = 0.31), indicating that the effect of condition was similar for HP and LP words.
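A minimal sketch of the nested model comparison is shown below, again assuming the hypothetical resp data frame. The by-participant and by-item random intercepts are a simplified placeholder for the maximal random-effects structure described in the Supplementary Materials.

    library(lme4)

    # Model with type only versus model with type and condition (logit link)
    m_type <- glmer(correct ~ type + (1 | participant) + (1 | item),
                    data = resp, family = binomial)
    m_type_cond <- glmer(correct ~ type + condition + (1 | participant) + (1 | item),
                         data = resp, family = binomial)

    # Likelihood ratio test for the contribution of condition
    anova(m_type, m_type_cond)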
Five participants failed to complete the subjective-effort portion of the task, so N = 160 for this analysis. Subjective data were analyzed by comparing ordinary linear regression models, since each participant responded only once. Models that included condition did not provide a better fit than models without it, either for estimates of the percentage of words correctly identified (F(3, 159) = 0.61, p = 0.61) or for difficulty ratings (F(3, 159) = 0.08, p = 0.97), indicating that subjective measures of difficulty did not differ across participant groups (see Table S1 for group means).
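The corresponding comparison for the subjective measures might look like the following, assuming a data frame subj with one row per participant and columns condition, difficulty, and pct_correct (all hypothetical names).

    # F-test for an effect of condition on rated difficulty
    anova(lm(difficulty ~ 1, data = subj),
          lm(difficulty ~ condition, data = subj))

    # Same comparison for the estimated percentage of words identified
    anova(lm(pct_correct ~ 1, data = subj),
          lm(pct_correct ~ condition, data = subj))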
The finding that the abstract visual stimulus used in this study did not facilitate speech recognition is consistent with the results of Schwartz et al. (2004) and Summerfield (1979), and may suggest that some level of phonetic detail is necessary for visual enhancement. However, it is possible that temporal features of the abstract visual stimulus enhanced low-level attentional processes, thereby reducing “listening effort” (LE)—the cognitive resources necessary to comprehend speech (Downs, 1982; see also Pichora-Fuller et al., 2016). If participants were already attending to the speech task to the best of their abilities, then these attentional benefits would not lead to improved recognition, but may instead make the recognition task less cognitively demanding.
Research on LE is based on the assumption that an individual’s pool of cognitive and attentional resources is finite (Kahneman, 1973; Rabbitt, 1968), so as a listening task becomes more difficult, fewer resources remain available to complete other tasks simultaneously. Critically, LE levels cannot necessarily be inferred from recognition scores—some interventions, such as noise-reduction algorithms in hearing aids, may reduce LE without affecting speech recognition (Sarampalis, Kalluri, Edwards, & Hafter, 2009). Thus, it may be that an abstract visual stimulus like a modulating circle reduces LE without improving recognition accuracy. Experiment 2 examined this possibility using a dual-task paradigm, a commonly used method of quantifying LE (see Gagné, Besser, & Lemke, 2017).