Effect of training
Our first research goal was to test the effectiveness of the current implicit training paradigm in improving listeners’ performance in the phonetic categorization task. We hypothesized that trained listeners would outperform untrained participants at posttesting. Figure 2 presents pretest and posttest performance for the different conditions (left-hand vs. central vs. right-hand column) as a function of testing room, separately for the trained and untrained speech tokens from the trained speaker (top and middle rows) and the speech tokens from the untrained speaker (bottom row).
The results in this figure partially support our prediction. Specifically, focusing on performance for trained and untrained speech stimuli from the trained speaker, the results show that learning differed across conditions. Consistent with our hypothesis, participants in the three-room condition (3R) improved their performance at posttesting for both trained and untrained speech tokens, whereas no evidence of learning was observed for the control condition. On the other hand, contrary to our predictions, negligible improvements were observed for participants trained in the one-room (1R) condition, suggesting that the current implicit training paradigm in a fixed environment was ineffective. Finally, for the untrained speaker (bottom row), there was no evidence of learning in any of the conditions, consistent with our previous work and the literature (e.g., Lively et al., 1993; Pruitt et al., 2006; Vlahou et al., 2012).
These results were subjected to statistical testing. We first focused on performance on trained and untrained speech stimuli from the trained speaker, where the strongest learning effects were expected (see Vlahou et al., 2012). A mixed ANOVA with condition (1R, 3R, control) as a between-participants factor and with time (pretest, posttest), room (bathroom, cafeteria, office, classroom, anechoic), and token type (trained tokens, untrained tokens) as within-participants factors found significant main effects of token type, F(1, 38) = 5.45, p = .025, ηp2 = 0.13; room, F(4, 152) = 3.87, p = .008, ηp2 = 0.09; and time, F(1, 38) = 5.40, p = .0255, ηp2 = 0.12; as well as a significant Token Type × Time × Room interaction, F(4, 152) = 2.76, p = .035, ηp2 = 0.07. Importantly, there was a significant Time × Condition interaction, F(2, 38) = 3.93, p = .028, ηp2 = 0.17, suggesting that there was learning for some conditions, but not for others. Furthermore, this learning was not limited to the particular speech tokens presented during training, as no interaction involving time, condition, and token type was significant (with the exception of the Token Type × Time × Room interaction, all interactions involving token type and time were nonsignificant; Token Type × Time: F(1, 38) < 1, ns; Condition × Token Type × Time: F(2, 38) < 1, ns; Condition × Token Type × Room × Time: F(8, 152) < 1, ns). Although the Token Type × Time × Room interaction might be of interest in other contexts, it was not further analyzed here, as the main focus of this study is on the type of training (condition). Instead, in the following analysis, the trained and untrained speech tokens from the trained speaker were pooled together.
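The partial eta-squared values reported above can be recovered from each F ratio and its degrees of freedom through the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The sketch below is only a reader-side consistency check, not the analysis code used in the study; the function name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from a reported F ratio and its degrees of freedom."""
    ss_ratio = f_value * df_effect  # equals SS_effect / MS_error
    return ss_ratio / (ss_ratio + df_error)


# F statistics reported for the omnibus mixed ANOVA above: (F, df_effect, df_error).
reported = {
    "token type": (5.45, 1, 38),
    "room": (3.87, 4, 152),
    "time": (5.40, 1, 38),
    "Token Type x Time x Room": (2.76, 4, 152),
    "Time x Condition": (3.93, 2, 38),
}

for effect, (f_value, df_effect, df_error) in reported.items():
    print(f"{effect}: {partial_eta_squared(f_value, df_effect, df_error):.2f}")
```

Rounded to two decimals, each computed value agrees with the ηp2 reported for the corresponding effect.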
To further investigate the significant Time × Condition interaction, separate ANOVAs were performed focusing on how the amount of learning varied across the different types of training (condition). We first examined performance in the control condition. A repeated-measures (RM) ANOVA with factors of time, room, and speaker found no significant main effects or interactions, F(4, 48) = 1.82, p = .139 for room; F(1, 12) = 1.64, p = .224 for speaker (all other Fs < 1). The lack of main effects or interactions involving time confirmed that repeated testing without training did not yield improvements in performance.
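The degrees of freedom quoted in these analyses follow from the design: for a within-participants effect with k levels tested on N participants split into g groups, the usual values are k − 1 for the effect and (k − 1)(N − g) for the error term. The sketch below is a minimal illustration, assuming no sphericity correction was applied to the degrees of freedom; the helper name is hypothetical.

```python
def within_effect_df(levels, n_participants, n_groups=1):
    """Uncorrected (df_effect, df_error) for a within-participants main effect."""
    df_effect = levels - 1
    df_error = df_effect * (n_participants - n_groups)
    return df_effect, df_error


# Control condition: F(4, 48) for room (5 levels) is what 13 participants would give.
print(within_effect_df(levels=5, n_participants=13))              # (4, 48)
# Omnibus analysis: 41 participants in 3 conditions gives F(4, 152) for room.
print(within_effect_df(levels=5, n_participants=41, n_groups=3))  # (4, 152)
```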
Our second main research goal was to investigate the differences between the two implicit training conditions. Figure 3 presents pretest and posttest performance for the one-room and three-room implicit conditions. A mixed ANOVA, with condition (1R, 3R) as a between-participants factor and time, room, and speaker as within-participants factors, showed that performance varied across rooms (main effect of room), F(4, 104) = 8.02, p = .0002, ηp2 = 0.24. More pertinent to our research questions, there was a significant improvement from pretest to posttest (main effect of time), F(1, 26) = 6.28, p = .019, ηp2 = 0.19, interacting with condition and speaker, F(1, 26) = 4.43, p = .045, ηp2 = 0.15, but not with room. The significant Time × Condition × Speaker interaction, together with the absence of any interaction with room, shows that the improvement was approximately evenly distributed across rooms, but differed depending on the number of rooms used in training and on whether the tested speaker differed from the trained speaker.
To further investigate this, Fig. 3 plots the data collapsed across rooms and separately for the two speakers and the different training conditions. This figure shows that there was no improvement due to 1R training either for the trained or the untrained speaker, but that there was an improvement in performance of approximately 7% due to 3R training for the trained speaker. However, this improvement did not generalize to the untrained speaker.
To confirm these observations, partial ANOVAs were performed separately for each type of training. For the 3R data (see Fig. 3, right-hand panel), an RM ANOVA, with speaker and time as factors, showed a significant main effect of time, F(1, 14) = 8.12, p = .013, ηp2 = 0.37, and a significant Time × Speaker interaction, F(1, 14) = 7.48, p = .016, ηp2 = 0.35. Paired t tests showed significant learning for the trained speaker, t(14) = −4.33, p = .0007, and no improvement for the untrained speaker, t(14) = −0.82, p = .43. This result shows that when implicit training is performed in three varying reverberant and anechoic rooms, participants can learn to discriminate the new nonnative phonetic contrast, while no such learning occurs when only one room is used during training.
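For readers who want to see how a paired pretest versus posttest comparison of this kind is computed, the sketch below evaluates a paired t statistic on per-participant scores. The scores are invented for illustration and are not the study's data. The pretest-minus-posttest direction is an assumption, chosen so that improvement yields a negative t, as in the values reported above.

```python
import math
import statistics


def paired_t(pre, post):
    """Return (t, df) for a paired t test on pre-minus-post differences."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have one score per participant")
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se, n - 1


# Hypothetical percent-correct scores for five listeners (not the study's data).
pretest = [58.0, 61.0, 60.0, 63.0, 59.0]
posttest = [66.0, 65.0, 68.0, 69.0, 67.0]
t_value, df = paired_t(pretest, posttest)
print(f"t({df}) = {t_value:.2f}")
```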
Because many of the participants in the 1R group were trained only in the anechoic room, it is not clear whether the critical feature of the 3R training versus the 1R training was that reverberation was present during a majority of 3R training trials (whereas none was present for the 1R anechoic participants) or that the amount of reverberation varied substantially across the 3R training trials (whereas it was fixed for the 1R anechoic as well as reverberant participants). To distinguish these two options, a partial ANOVA was performed on the 1R training data in which participants were further split into two groups, the anechoic 1R group (seven participants; circles in Fig. 3) and the reverberant 1R group (four participants in classroom, two in bathroom; diamonds in Fig. 3). A mixed ANOVA, with group (1R-anechoic, 1R-reverberant) as a between-participants factor and time and speaker as within-participants factors, found no difference between the two training groups, no improvement from pretest to posttest, and no interactions, condition, F(1, 12) = 1.98, p = .184 (all other Fs < 1, ns). The result that there was no learning for either the 1R-anechoic or the 1R-reverberant group suggests that the critical factor for the implicit learning paradigm for the 3R group was the room variation, as opposed to the mere presence of reverberation. However, due to the small number of participants in the 1R subgroups, the results of this analysis should be interpreted with caution.
Generalization of learning
To summarize the generalization of learning observed in this study, the 3R data were reanalyzed with the trained speaker’s trained and untrained tokens treated separately and with the rooms grouped by whether they were used in the training or not. Figure 4 shows pretesting and posttesting performance for the trained and untrained speakers and speech tokens as a function of the room group. The trained rooms group represents the average of the anechoic, bathroom, and classroom data, and the untrained rooms group represents the average of the office and cafeteria data. Confirming the previous results, for the trained speaker (left-most and middle columns in Fig. 4), participants improved approximately equally across trained and untrained speech tokens and rooms. For the untrained speaker (right-most column), there was no evidence of learning in either room group. Thus, the implicit training in this study generalizes to untrained tokens of the trained speaker and to the untrained rooms, but not to untrained speakers. These results were confirmed by a repeated-measures ANOVA, with time, room group (trained, untrained), and token type (trained tokens, untrained tokens, untrained speaker) as factors, which found significant main effects of time, F(1, 14) = 11.38, p = .005, ηp2 = 0.45, and room group, F(1, 14) = 10.21, p = .007, ηp2 = 0.42, as well as a significant Time × Token Type interaction, F(2, 28) = 4.42, p = .028, ηp2 = 0.24, while there were no interactions involving time and room group: Room Group × Time, F < 1; Room Group × Token Type × Time, F(2, 28) = 1.59, p = .22.
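The room grouping used in this reanalysis can be sketched as follows. The room assignments come from the text above; the accuracy values and the function name are hypothetical placeholders, not the study's data or code.

```python
from statistics import mean

# Room groups for the 3R condition, as described in the text.
ROOM_GROUPS = {
    "trained": ("anechoic", "bathroom", "classroom"),
    "untrained": ("office", "cafeteria"),
}


def average_by_room_group(accuracy_by_room):
    """Collapse one participant's per-room accuracy into trained/untrained room means."""
    return {
        group: mean(accuracy_by_room[room] for room in rooms)
        for group, rooms in ROOM_GROUPS.items()
    }


# Hypothetical percent-correct values for one participant (not the study's data).
example = {
    "anechoic": 66.0, "bathroom": 60.0, "classroom": 63.0,
    "office": 64.0, "cafeteria": 62.0,
}
print(average_by_room_group(example))
```

Averaging within each participant before running the ANOVA keeps one value per participant per room group, which is what a repeated-measures design with a two-level room group factor requires.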
Training
We analyzed performance during training to test whether it could explain the differences in phonetic learning between the single-room and multiple-room conditions. The training game speed was evaluated here as an indirect measure of performance. Bars in Fig. 5 show average speed over the first through fifth sessions for the 1R and 3R conditions, whereas circles show individual participant data. Participants in the three-room training group were consistently faster in the game, but both groups showed similar improvement in performance (increase in the game speed) across sessions. A mixed ANOVA, with condition (1R, 3R) as a between-participants factor and session (1–5) as a within-participants factor, found a main effect of condition, F(1, 25) = 13.30, p = .002, ηp2 = 0.35, confirming that participants in the 3R condition were faster than those in 1R. It also found a main effect of session, F(4, 100) = 56.01, p < .001, ηp2 = 0.69, suggesting that, overall, participants became faster in the game during training. Importantly, however, there was no interaction (F < 1), suggesting that both groups improved equally during implicit training.
The observed faster performance in the three-room condition could be a potential confound. For example, it could mean that the 3R players were, on average, more experienced videogame players, which might have enhanced their abilities on a wide variety of perceptual tasks (e.g., Bejjanki et al., 2014; Green, Li, & Bavelier, 2010) and allowed them to benefit more from implicit training. To examine this further, we analyzed pretesting and posttesting performance after removing the six best performers from the 3R and the six worst performers from the 1R conditions, respectively, based on overall speed in the game, collapsed across sessions (outliers indicated by open circles in Fig. 5). The asterisks in Fig. 5 show average game speed after removing the best and worst performers from the 3R and 1R groups, respectively. The asterisks corresponding to the 1R and 3R groups are well aligned in each session, showing that performance in the game was similar across the groups once these participants were removed. Importantly, the basic effect of 1R versus 3R training was unaffected by removing these participants (data not shown), producing results very similar to those shown in Fig. 3. Thus, it can be concluded that the better training performance (faster average on-screen character motion) in the 3R group is most likely due to a random imbalance between the participant groups, but that it does not confound the greater learning observed in the 3R group compared with the 1R group.
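The speed-matching control described above can be sketched as follows: rank participants by game speed collapsed across sessions, then drop the fastest players from the faster group and the slowest players from the slower group. The data layout, the function name, and the numbers are hypothetical; only the idea of trimming a fixed number of participants per group comes from the text, and the toy example drops one participant per group rather than six.

```python
from statistics import mean


def trim_by_speed(speeds, n_drop, drop_fastest):
    """Drop the n_drop participants with the highest (or lowest) mean speed across sessions."""
    if n_drop == 0:
        return dict(speeds)
    ranked = sorted(speeds, key=lambda pid: mean(speeds[pid]))  # slowest first
    kept = ranked[:-n_drop] if drop_fastest else ranked[n_drop:]
    return {pid: speeds[pid] for pid in kept}


# Hypothetical per-session game speeds (arbitrary units), keyed by participant ID.
speeds_3r = {"a": [5, 6, 7], "b": [8, 9, 10], "c": [6, 7, 8]}
speeds_1r = {"d": [2, 3, 4], "e": [5, 6, 7], "f": [4, 5, 6]}

matched_3r = trim_by_speed(speeds_3r, n_drop=1, drop_fastest=True)   # removes "b"
matched_1r = trim_by_speed(speeds_1r, n_drop=1, drop_fastest=False)  # removes "d"
print(sorted(matched_3r), sorted(matched_1r))
```

The pretest and posttest analysis would then be repeated on the retained participants only.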
Baseline data
To ensure that training effects are not explained by differences in baseline performance, we examined participants’ baseline performance on the phonetic stimuli across the different rooms with varying reverberation and the two speakers. Figure 6 shows pretest performance as a function of testing room, averaged across all 41 participants (i.e., averaged across conditions) and plotted separately for the two Hindi speakers (during training, only one of the two speakers was used for each participant, counterbalanced across participants). Accuracy was slightly higher for the second Hindi speaker, but performance was overall comparable across the two voices. Performance varied across rooms, such that for both speakers it was consistently worse for the bathroom and similar for the other rooms. A two-way repeated-measures ANOVA with factors of Hindi speaker (Speaker 1, Speaker 2) and room (bathroom, cafeteria, classroom, anechoic, and office) confirmed this, showing a main effect of room, F(4, 160) = 3.14, p = .023, ηp2 = 0.07, a trend for an effect of speaker, F(1, 40) = 3.23, p = .079, ηp2 = 0.07, and no interaction between speaker and room (F < 1, ns).
These results suggest that nonnative listeners were initially able to better overcome distortions caused by reverberation in rooms with modest levels of reverberation (office, classroom, and cafeteria) than in the very reverberant bathroom. Specifically, averaged across the two Hindi speakers, performance in the office, classroom, cafeteria, and anechoic room was almost constant at approximately 60.6%. This matched well the initial performance of 61.2% obtained without any room simulation in a previous study that used the same phonetic distinction from the same Hindi speakers (Vlahou et al., 2012). For the more challenging bathroom environment, performance dropped to 56.9%, consistent with acoustic analyses showing that the bathroom had the largest T60 and lowest C50 of the rooms examined here (see Table 2). This shows that, although the bathroom reverberation might not be overly disruptive for understanding conversational speech, in the context of a difficult phonetic identification task performed by nonnative listeners, it caused a significant decline in performance. Importantly, however, even the bathroom performance was still above chance, t(40) = 5.44, p < .0001, suggesting that participants were able to distinguish the phonetic contrast to some extent prior to training.