Mental imagery training
Twenty-three of the 24 participants completed all three of the imagery training sessions, and one participant completed only two of the training sessions. The average number of words recalled by the participants remained fairly consistent across the 14 memory tests used in the imagery training session, even as the encoding task became more difficult (longer word lists, shorter encoding times, more abstract words). Figure 1 shows the average number of words recalled on each memory test. Participants recalled an average of 8.33 words per test list during the first training session, 7.42 words per list during the second training session, and 8.04 words per list during the third training session. A repeated-measures ANOVA showed that there was a significant effect of test number on the number of words recalled on each test, F(13, 284) = 3.68, p < .001, ηp² = 0.14. Paired t tests comparing average performance in each of the three training sessions showed that participants performed significantly worse during the second training session relative to the first session, t(23) = 2.87, p < .01, Hedges’s g_av = 0.63, and significantly better during the third training session relative to the second training session, t(22) = 2.67, p < .01, Hedges’s g_av = 0.38. As the memory tests became more difficult, participants reported that it became more difficult to create mental images for the word lists, and they felt that the imagery strategy was less effective for the more difficult lists. Additional analyses are presented in the Supplemental Materials.
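The partial eta squared reported for the test-number effect can be checked against the F ratio, because the effect size follows from F and its two degrees of freedom: partial eta squared = (F × df_effect) / (F × df_effect + df_error). The following is a minimal Python sketch of that relationship; the function name is illustrative and is not taken from the study's analysis code.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    # F * df_effect equals SS_effect / MS_error, so dividing by
    # (that quantity + df_error) gives SS_effect / (SS_effect + SS_error).
    scaled_effect = f_value * df_effect
    return scaled_effect / (scaled_effect + df_error)


# Test-number effect reported above: F(13, 284) = 3.68
print(round(partial_eta_squared(3.68, 13, 284), 2))  # 0.14
```

The same conversion can be applied to any F test whose degrees of freedom are reported alongside it.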
Working memory training
Twenty-four of the 25 participants in the WM training group completed at least 12 of the 14 WM training sessions, and one participant completed nine of the training sessions. The participants who completed at least 12 of the WM training sessions were included in the analysis of the WM training. On average, the participants’ performance improved across the training sessions for both training tasks. During the first training session, the participants had an average n-back level of 1.81 and an average symmetry span difficulty level of 3.77. On the 12th training session, the participants’ average n level was 4.23 and their average symmetry span difficulty level was 5.43. However, there was a great deal of variability across participants. The average n level achieved by each participant on the 12th training session ranged from 1.0 to 8.83. Similarly, the average level of difficulty achieved by each participant for the 12th session of the symmetry span task ranged from 3.18 to 7.33. Figure 2 shows the changes in performance across the WM training sessions. Repeated-measures ANOVAs showed that there were significant effects of training session on scores for both the symmetry span task, F(11, 242) = 14.76, p < .001, ηp² = 0.40, and the n-back task, F(11, 231) = 32.74, p < .001, ηp² = 0.61. Paired t tests comparing the first and 12th training sessions showed that participants scored significantly higher on the 12th session in both the symmetry span task, t(22) = 8.56, p < .001, Hedges’s g_av = 1.50, and the n-back task, t(21) = 7.64, p < .001, Hedges’s g_av = 1.93.
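The text does not spell out the formula behind Hedges's g_av, so the sketch below rests on an assumption: it uses a commonly used definition in which the mean difference is divided by the average of the two sessions' standard deviations and then multiplied by a small-sample correction based on n − 1 degrees of freedom. The function name and the scores are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean, stdev


def hedges_g_av(first, second):
    """Hedges's g_av for two paired sets of scores (assumed definition).

    Standardizer: average of the two standard deviations.
    Correction:   1 - 3 / (4 * (n - 1) - 1), an assumption about the df used.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    n = len(first)
    avg_sd = (stdev(first) + stdev(second)) / 2
    if avg_sd == 0:
        raise ValueError("scores have no variability")
    d_av = (mean(second) - mean(first)) / avg_sd
    correction = 1 - 3 / (4 * (n - 1) - 1)
    return d_av * correction


# Hypothetical n-back levels for four participants (not data from the study).
session_1 = [1.5, 2.0, 1.0, 2.5]
session_12 = [4.0, 5.5, 2.0, 6.0]
print(hedges_g_av(session_1, session_12))
```

Because the correction factor is always below 1, g_av is slightly smaller in magnitude than the uncorrected standardized difference, and the gap shrinks as the sample grows.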
Baseline memory tasks
Although participants generally improved their performance on the tasks on which they were trained, the key question was whether their training would affect their performance on untrained memory tasks. To address this question, we compared the three training groups’ changes in performance on the three pre- and posttraining baseline tasks. Given the large age range of the participants, the statistical tests were run with and without including age as a covariate. Including age did not change the results of the tests unless otherwise noted.
Listening span task
Five participants (four from the control group and one from the WM training group) were excluded from the analysis of the listening span task due to a problem with the presentation of the sound files during the pretraining session. The average total scores from the remaining participants in each training group are shown in Table 1. A one-way ANOVA showed that the performance of the three groups did not differ significantly on the pretraining test, F(2, 66) = 1.33, p = .27, ηp² = 0.04 (see Footnote 1). Paired t tests were used to assess each group’s change in performance between the pretraining and posttraining sessions. All three groups performed significantly better during the posttraining session, control group: t(20) = 3.23, p < .01, Hedges’s g_av = 0.57; imagery training group: t(23) = 3.36, p < .01, Hedges’s g_av = 0.43; WM training group: t(23) = 1.87, p < .04, Hedges’s g_av = 0.24. However, a one-way ANOVA showed that the three groups were not significantly different in terms of how much their performance improved, F(2, 66) = 1.69, p = .19, ηp² = 0.05.
Table 1. Means and (Standard Deviations) of Baseline WM Task Scores for Each Training Group
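The paired t tests used for the pre- and posttraining comparisons reduce to a one-sample test on each participant's difference score: t equals the mean difference divided by its standard error, with n − 1 degrees of freedom. A minimal Python sketch follows; the function name and the scores are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, df) for a paired t test on post minus pre scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        raise ValueError("difference scores have no variability")
    standard_error = sd_diff / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical span scores for five participants.
pre_scores = [20, 24, 18, 30, 26]
post_scores = [23, 25, 22, 31, 30]
t_value, df = paired_t(pre_scores, post_scores)
print(t_value, df)
```

The sketch stops at the t statistic because the standard library has no t distribution; a statistics package such as SciPy (`scipy.stats.ttest_rel`) would also supply the p value.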
Rotation span task
Two participants were excluded from the analysis of the rotation span task (one from the control group and one from the WM training group) due to failure to complete the posttraining test. The mean accuracy for the remaining participants in each training group is shown in Table 1. A one-way ANOVA showed that the performance of the three groups did not differ significantly on the pretraining test, F(2, 69) = 0.26, p = .77, ηp² = 0.01. Paired t tests were used to assess each group’s change in performance between the pretraining and posttraining sessions. All three training groups performed significantly better during the posttraining session, control group: t(23) = 3.79, p < .001, Hedges’s g_av = 0.59; imagery training group: t(23) = 3.61, p < .001, Hedges’s g_av = 0.41; WM training group: t(23) = 4.09, p < .001, Hedges’s g_av = 0.56. However, a one-way ANOVA comparing the change in performance across all three training groups showed that there were no significant differences between the groups, F(2, 69) = 0.49, p = .61, ηp² = 0.01.
Recognition memory task
The participants’ recognition memory performance (average proportion correct and standard deviation for each condition) is shown in Table S2 in the Supplemental Materials. The bulk of our analysis focused on d′ as a measure of recognition memory performance. The hit and false-alarm rates were used to calculate d′ for each test condition, and the average d′ values are shown in Table 2.
Table 2. Mean and (Standard Deviation) d′ Scores for Each Condition of the Recognition Memory Task for Each Training Group
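In the standard signal detection formulation, d′ is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below illustrates that calculation in Python. The function name and example rates are hypothetical, and the text does not say how rates of exactly 0 or 1 were handled, so the sketch simply rejects them rather than guessing at a correction.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = z(hit rate) - z(false-alarm rate), with z the inverse normal CDF."""
    if not (0 < hit_rate < 1 and 0 < false_alarm_rate < 1):
        # z is undefined at 0 and 1; a correction would be needed here,
        # and which one the authors used is not stated in the text.
        raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates for one participant in one test condition.
print(d_prime(0.80, 0.20))
```

A d′ of 0 means that old and new words were not discriminated at all, and larger values indicate better discrimination regardless of response bias.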
A 3 × 7 repeated-measures ANOVA (Training Group × Memory Test Condition) was used to compare the memory performance of the three groups before and after training. Prior to training, the d′ scores of the three groups did not differ significantly, F(2, 420) = 2.19, p = .12, ηp² = 0.19. However, after training, there was a significant main effect of training group for the d′ scores, F(2, 420) = 6.77, p < .01, ηp² = 0.48. One-way ANOVAs comparing the performance of the three groups on each condition of the posttraining recognition memory test showed that the performance of the groups differed significantly on all seven conditions (all Fs > 4.08, all ps < .02, all ηp² > 0.10). In other words, although the three groups of participants had similar recognition memory performance prior to training, their d′ scores were significantly different for every condition after training. (See the Supplemental Materials for additional analysis of the participants’ pretraining performance on the recognition memory test.)
The crucial comparison for examining the effects of the memory training techniques on recognition memory performance was the difference between pretraining and posttraining performance across the three training groups. For each test condition, one-way ANOVAs were used to compare the change in hit rates (posttraining hit rates minus pretraining hit rates) and the change in d′ scores (posttraining d′ minus pretraining d′, or Δd′) across the three groups. The average d′ difference scores are shown in Fig. 3. The changes in hit rates for the three groups were significantly different for every test condition where participants were tested on old words (once-presented words: F(2, 70) = 3.50, p = .04, ηp² = 0.09; short-lag repeated words: F(2, 70) = 3.33, p = .04, ηp² = 0.09; long-lag repeated words: F(2, 70) = 3.60, p = .03, ηp² = 0.09; short-lag quizzed words: F(2, 70) = 4.23, p = .02, ηp² = 0.11; long-lag quizzed words: F(2, 70) = 8.12, p < .01, ηp² = 0.19). However, the three groups did not differ in terms of the change to their false-alarm rates in response to new, unstudied words, F(2, 70) = 0.22. The changes in d′ for the three groups were significantly different for long-lag quizzed words, F(2, 70) = 5.67, p < .01, ηp² = 0.15, and marginally significant for the short-lag repeated words, F(2, 70) = 2.63, p = .07, ηp² = 0.07, and the short-lag quizzed words, F(2, 70) = 2.92, p = .06, ηp² = 0.08.
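The change-score comparison described above has two steps: subtract each participant's pretraining score from the posttraining score, then compare the resulting difference scores across groups with a one-way ANOVA. The following Python sketch illustrates both steps using the textbook between-groups F ratio. The function names and the d′ values are hypothetical and do not come from the study.

```python
from statistics import mean


def change_scores(pre, post):
    """Posttraining minus pretraining score for each participant."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    return [after - before for before, after in zip(pre, post)]


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical pre/post d' values for three small groups.
control = change_scores([1.2, 1.5, 1.1, 1.4], [1.3, 1.4, 1.2, 1.4])
imagery = change_scores([1.3, 1.0, 1.6, 1.2], [1.6, 1.4, 1.7, 1.5])
wm = change_scores([1.4, 1.2, 1.5, 1.3], [1.1, 1.0, 1.4, 0.9])
print(one_way_anova([control, imagery, wm]))
```

In a between-groups design of this kind, the error degrees of freedom equal the number of participants minus the number of groups, so a test reported as F(2, 70) would correspond to 73 participants across the three groups.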
Post hoc paired t tests were used to assess each group’s change in recognition memory performance between the pre- and posttraining sessions. The control group’s performance in the pretraining and posttraining sessions did not differ significantly for any of the test conditions, whether measured by hit rates (all ts < 1.03, all ps > .16), false-alarm rates, t(24) = 0.60, or by d′ scores (all ts < 0.44, all ps > .33). The participants in the mental imagery training group had nearly identical false-alarm rates before and after training, t(23) = 0.27, but significantly better hit rates in the posttraining session for the once-presented words, t(23) = 1.80, p = .04, Hedges’s g_av = 0.39, the short-lag repeated words, t(23) = 2.40, p = .01, Hedges’s g_av = 0.49, and the short-lag quizzed words, t(23) = 2.17, p = .02, Hedges’s g_av = 0.35. When measured by d′ scores, the participants in the imagery training group had significantly better performance after training for the words in the short-lag repetition condition, t(23) = 1.74, p < .05, Hedges’s g_av = 0.40, but no significant differences in d′ scores for any of the other test conditions, once-presented words: t(23) = 1.19, p = .12, Hedges’s g_av = 0.30; long-lag repetitions: t(23) = 0.94, p = .18, Hedges’s g_av = 0.24; short-lag quizzes: t(23) = 1.43, p = .08, Hedges’s g_av = 0.32; long-lag quizzes: t(23) = 1.33, p = .10, Hedges’s g_av = 0.28.
In contrast to the other two groups, the participants in the WM training group had significantly lower hit rates in the posttraining session relative to the pretraining session for the long-lag repeated items, t(24) = 2.81, p < .01, Hedges’s g_av = 0.46, the short-lag quizzed words, t(24) = 2.26, p = .02, Hedges’s g_av = 0.40, and the long-lag quizzed words, t(24) = 4.55, p < .01, Hedges’s g_av = 0.72. Their performance was marginally worse for the once-presented words, t(24) = 1.65, p = .06, Hedges’s g_av = 0.21, but the difference was significant when controlling for age, p = .01. Their false-alarm rates did not differ significantly before and after training, t(23) = 0.52. The participants in the WM training group also had significantly worse d′ scores after training for the long-lag repetition, short-lag quiz, and long-lag quiz conditions, long-lag repetitions: t(23) = 2.48, p < .02, Hedges’s g_av = 0.41; short-lag quizzes: t(23) = 2.21, p < .02, Hedges’s g_av = 0.33; long-lag quizzes: t(23) = 4.04, p < .001, Hedges’s g_av = 0.68. Their posttraining performance was marginally worse for the once-presented words, t(23) = 1.61, p = .06, Hedges’s g_av = 0.27, and there was no significant difference in performance for the short-lag repetition condition, t(23) = 1.10, p = .14, Hedges’s g_av = 0.18.