Cognitive Load Theory, Resource Depletion and the Delayed Testing Effect
The testing effect occurs when students, given information to learn and then practice during a test, perform better on a subsequent content post-test than students who restudy the information as a substitute for the practice test. The effect is often weaker or reversed if immediate rather than delayed post-tests are used. The weakening may be due to differential working memory resource depletion on immediate post-tests with resource recovery due to rest following a delayed post-test. In three experiments, we compared an immediate post-test with a 1-week delayed post-test. Experiment 1 required the students to construct a puzzle poem and found working memory resource depletion occurred immediately after learning compared to a delay. Experiment 2 using text-based material tapping lower element interactivity information and experiment 3, again using a puzzle poem, compared study-only with the study and test groups. A disordinal interaction was obtained in both experiments with the study-only groups superior to the study–test groups on immediate content post-tests and reverse results on delayed tests. Working memory capacity tests indicated a non-significant increase in capacity after a delay compared to immediately after learning with medium size effects, but in experiment 2, there were no working memory differences between the study-only and the study and test groups. Experiment 3 increased element interactivity and found an increased memory capacity for the study-only group compared to the study and test group with the immediate test contributing more of the difference than the delayed test. It was concluded that increased working memory depletion immediately following learning with a test contributes to the failure to obtain a testing effect using immediate tests.
Keywords: Cognitive load theory; Testing effect; Resource depletion; Element interactivity
The testing effect occurs when students, having been given information to learn and then practice on a test during that learning, perform better on a subsequent post-test than students who restudy the information as a substitute for practice on the test (Roediger et al. 2011; van Gog and Sweller 2015). Interestingly, various researchers have obtained weaker or even reversed effects when the subsequent post-test is administered immediately after learning rather than after a delay (Roediger and Nestojko 2015). Currently, there is no agreed rationale for this increase of the testing effect after a delay. One possibility is associated with working memory resource depletion, a function that recently has been incorporated into cognitive load theory (Chen et al. 2018). Working memory resource depletion assumes that a depletion of limited working memory resources may occur following cognitive effort. Extensive cognitive effort may result in decreased performance compared to conditions requiring less extensive effort. As indicated below, incorporating this assumption into cognitive load theory may allow us to account for the differential results of immediate and delayed post-tests when studying the testing effect.
Cognitive Load Theory
Cognitive load theory (Sweller 2011, 2012; Sweller et al. 2011) is an instructional design theory that has been used to devise instructional materials and formats. The theory is based on the evolutionary structure of knowledge and on human cognitive architecture. Both are closely related and underpin the experiments described in this paper.
The Evolutionary Structure of Human Knowledge
Human knowledge can be divided into biologically primary and secondary knowledge (Geary 2012; Geary and Berch 2016). We have evolved to acquire primary knowledge. It has several characteristics that flow from its evolutionary origins. For students without learning disabilities, it cannot be taught because it is acquired easily, automatically and largely unconsciously (Paas and Sweller 2012; Sweller 2015, 2016a). Examples are general problem solving, thinking skills and learning to listen to and speak a native language.
Secondary knowledge is knowledge that we need for cultural reasons. We can acquire secondary knowledge, but it is usually acquired with deliberate, conscious effort. It usually needs to be taught explicitly (Kirschner et al. 2006; Sweller et al. 2007). Virtually every topic that is taught in educational institutions consists of biologically secondary knowledge.
There is another knowledge distinction that is closely related to the distinction between biologically primary and secondary knowledge: the distinction between generic-cognitive and domain-specific skills (Sweller 2015, 2016b; Tricot and Sweller 2014). Because of their critical importance to human functioning, most generic-cognitive skills such as problem solving, thinking and learning are biologically primary, while most domain-specific skills are biologically secondary. For example, while we have had to evolve general strategies for solving problems, we have not specifically evolved to multiply out the denominator first when presented the problem, (a + b)/c = d, solve for a. We need to be taught that first move.
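As a brief illustration of that taught first move (our own worked steps, not notation taken from the source), multiplying both sides by the denominator c and then subtracting b isolates a:

```latex
\begin{align*}
\frac{a + b}{c} &= d \\
a + b &= dc \\
a &= dc - b
\end{align*}
```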
Human Cognitive Architecture
There is a cognitive architecture associated with the acquisition of biologically secondary, domain-specific knowledge. The architecture provides an example of a natural information processing system that is analogous to the processes of natural selection (Sweller and Sweller 2006). It can be specified by five major principles, summarised as follows.
Knowledge Acquisition and the Borrowing and Reorganising Principle
We usually borrow knowledge from others. Almost all we know is borrowed from other people’s memory via, for example, information from educational institutions, texts, the media, information technology and conversations. The borrowed information is reorganised by combining it with information we already hold.
Problem Solving and the Randomness as Genesis Principle
Knowledge first must be created before it can be borrowed from others (Sweller 2009). Knowledge is created while problem solving using a random generation and testing procedure.
Working Memory and the Narrow Limits of Change Principle
Long-Term Memory and the Information Store Principle
While working memory is limited when dealing with novel information, a very large amount of familiar information can be held in long-term memory for unlimited periods of time.
Relations Between Working and Long-Term Memory and the Environmental Organising and Linking Principle
Based on cues from the environment, large amounts of organised information can be transferred from long-term to working memory to generate appropriate action. Working memory has no known limits when processing previously organised information from long-term memory (Ericsson and Kintsch 1995).
Based on this cognitive architecture, the central goal of instruction is to generate knowledge in working memory to be retained for subsequent use in long-term memory. If the limitations of working memory when processing novel information are not taken into consideration, then instructional designs are likely to be ineffective resulting in minimal changes to long-term memory for the learner. If there are no changes in long-term memory, then no learning can occur. The experiments of this paper are based on the assumptions of this cognitive architecture.
To-be-learnt information can be placed on a continuum of element interactivity where the degree of interaction between elements is dependent on the number of separate elements that need to be processed simultaneously in working memory (Sweller 2010). An example is comprehending the grammar of a language. The correct and understandable construction of a sentence, phrase or clause in English is dependent on rules. The sentence, “I am eating a green apple” must be stated according to rules of English grammar. A sentence such as, “Eating I am an apple green” is understandable but grammatically incorrect. Each word or element is constructed in relation to each other component or element that enables correct grammar. If a word is missing or the syntax is incorrect, it affects the entire sentence. For a complete novice learning English as a second language, when learning this correct sentence structure, element interactivity would be high and so, due to the narrow limits of change principle, the load on working memory is equally high. This load is intrinsic to the rules of English grammar and imposes an intrinsic cognitive load. However, learning English vocabulary for English as a second language students is low in element interactivity because each word or element can be learned in isolation from each other word. The word “apple” can be learned independently of the word “eating” and “green”. As a result, working memory load is intrinsically low in this learning task. The task may be quite difficult if there are many vocabulary words to learn, but this difficulty is not due to interactions between words or elements and so the element interactivity and intrinsic cognitive load associated with learning a second language vocabulary remains low even though the task may be difficult.
Element interactivity due to intrinsic cognitive load can be varied by changing what has to be learned or by changing the expertise of learners. Once learning has occurred, the information store principle embeds interacting elements within knowledge held in long-term memory. That knowledge then can be transferred to working memory using the environmental organising and linking principle as a single element that does not impose a heavy working memory load. For example, readers can read the words of this text easily and quickly because their knowledge, held in long-term memory, can be transferred to working memory as a very limited number, or even as a single element, rather than as multiple, interacting elements. Without that knowledge, the text is likely to overwhelm working memory due to the narrow limits of change principle.
Element interactivity also can vary due to extraneous cognitive load that is imposed by instructional procedures (Sweller 2010). For example, showing someone how to solve a problem reduces the number of interacting elements that they have to process compared to having learners search for a solution during problem solving.
The Testing Effect
The testing effect occurs when people learning content have a test during the learning and then perform better on a subsequent test of that content than those who restudied during learning (see Roediger et al. 2011; van Gog and Sweller 2015 for reviews). There has been a variety of instructional content that demonstrated the testing effect, including factual information (e.g. Carpenter et al. 2008; McDaniel et al. 2007), word translation (e.g. Pashler et al. 2005), word lists (e.g. Carpenter and De Losh 2006; Wheeler et al. 2003), map reading (Carpenter and Pashler 2007), videotaped lectures (Butler and Roediger 2007), animations (Johnson and Mayer 2009) and prose passages (e.g. Roediger and Karpicke 2006a, b).
Despite the robustness of the testing effect, Hanham et al. (2017) and van Gog and Sweller (2015) indicated that the effect is more likely to be evident with content tapping low element interactivity knowledge such as content that can be memorised serially. Content that needs to have the elements of information processed simultaneously, such as high element interactivity information used in some studies, has produced the effect, but not with the same consistency of results. Work using problem solving that is usually high in element interactivity frequently results in a failure of a testing effect when tests occur immediately after learning rather than after a delay.
For example, Van Gog et al. (2011) compared the effects of worked example-only and example–problem pairs. Example-only pairs provide learners with two worked examples of a problem while example–problem pairs provide a worked example followed by a problem that needs to be solved. It can be noted that comparing the two conditions precisely mirrors the testing effect paradigm. Van Gog et al. (2011) found no significant differences between the two conditions on an immediate test. In contrast, a study by Van Gog and Kester (2012) found that an example-only condition outperformed an example–problem pair condition, indicating a reverse testing effect on a test delayed by 7 days. There were no differences on an immediate test, providing evidence that the reverse testing effect that was found may be more likely on delayed rather than immediate tests using these high element interactivity materials. Other work using high element interactivity information, such as children learning to read the diagrammatic material of a timetable, failed to obtain a testing effect (see Leahy et al. 2015). As well, Van Gog et al. (2015) found no evidence of a testing effect over four problem-solving experiments using a variety of conditions and domains. This recent work indicating difficulty obtaining a testing effect with complex information replicates much older research (see Kühn 1914 and Gates 1917).
These results can be explained by cognitive load theory by assuming that for high element interactivity information, more than one exposure to the information is required before it can be understood and learned. It follows that seeing the information twice should be superior to seeing it once followed by a test. For low element interactivity information, the information is more likely to be fully understood and learned after a single exposure. Two exposures inherent in a study–restudy procedure may result in a redundancy effect that interferes with learning (Kalyuga et al. 2003).
There is considerable evidence (Roediger and Nestojko 2015) that the testing effect is more likely to be obtained on a delayed than on an immediate test although, as previously outlined, Van Gog et al. (2015) found no delayed testing effect. Currently, there appears to be no satisfactory explanation why a delayed test should enhance the testing effect whereas an immediate test does not. Contemporary versions of human cognitive architecture as used by cognitive load theory do not categorise knowledge held in long-term memory on this dimension. We are not explicitly aware of one category of knowledge that displays an improved immediate performance on a test that then subsequently declines rapidly over time compared to another category of knowledge that displays an improved delayed test performance because it declines slowly over time. Neither is it clear in research on the testing effect why testing should favour the acquisition of the slowly decaying category, while studying favours the acquisition of the rapidly decaying category. Working memory resource depletion may provide an answer that explains why delayed testing reverses the results of immediate testing.
Working Memory Resource Depletion
There has been little research on the effects of cognitive effort and working memory depletion. Chen et al. (2018) initiated work in this area within a cognitive load theory framework by providing evidence that the spacing effect could be explained by working memory resource depletion. In an earlier work, Schmeichel (2007) conducted a series of experiments in which undergraduate students were required to complete self-control tasks that were expected to deplete their working memory. A working memory test was completed after the task. The results indicated that the more difficult conditions increased cognitive effort and depleted working memory resources. Schmeichel et al. (2003) used independent variables similar to those in Schmeichel’s (2007) study. The dependent performance variables were reasoning, problem solving or reading comprehension tasks; working memory capacity was not used as a dependent variable. Under resource depletion conditions, the test performances were lower. In addition, Schmeichel et al. (2003) found a lowering of performance following mental effort on complex but not simple tasks. Nonsense syllable memorisation showed no effect compared to more complex memory tasks.
Healey et al. (2011) extended the previous findings by Schmeichel et al. (2003) and Schmeichel (2007). Using span tasks, they found that depletion effects were only evident when the first task’s to-be-ignored stimuli were matched with the to-be-remembered stimuli in working memory tasks. In their experiments, for example, ignored arrows impaired memory for arrows but not for words, and ignored words impaired memory for words but not for arrows. That is, subsequent task effects depended not on similarities among the resources needed for the tasks but rather on similarities of the stimuli. Note, however, that these tasks were not complex nor similar to the instructional material used in our current experiments.
Experiment 1 tested the hypothesis that an immediate test after cognitive effort reduces working memory capacity compared to a delayed test that allows working memory recovery. The working memory test used in this and the subsequent experiments was based on the Daneman and Carpenter (1980) procedures. Experiment 2 tested two hypotheses associated with the testing effect. The first concerned relative performance on an immediate test following a study–restudy sequence compared to a study–test sequence. It was predicted that because the study–test sequence might be more demanding than the study–restudy sequence, a study–test group should exhibit a lower working memory capacity due to greater working memory resource depletion than a study–restudy group and that lower working memory capacity by the study–test group should prevent the testing effect from being exhibited on a final test. In contrast, after a delay, it was hypothesised that working memory resources should recover, resulting in no difference between groups on a working memory capacity test. If more learning occurs because of testing during learning, and with both groups having recovered from working memory depletion effects, that learning should be exhibited on the final, delayed test by the study–test group exhibiting enhanced performance compared to the study–restudy group.
Experiment 2 used relatively low element interactivity information. Experiment 3 replicated experiment 2 using higher element interactivity information, testing the same hypotheses. It was predicted that larger working memory depletion effects might be obtainable using the higher element interactivity information of experiment 3.
The participants were 73 first year educational psychology students from four tutorial classes at a university in Sydney, Australia. University ethics approval was granted (no. 5201600947) and written approval was received from the participants. Ages ranged from 18 to 30 years. The classes were un-streamed in ability levels and had no experience in writing this type of text before the experiment. The students were previously randomly assigned into tutorial groups. Each tutorial class was taught by the same tutor during four consecutive blocks. The instructional material was part of the tutorial content related to the topic of problem solving. Students in each tutorial class were allocated to an immediate working memory test or a delayed working memory test group. There were 39 participants in the immediate group and 34 participants in the delayed group. The unequal numbers were due to student absences from class.
The participants had to produce an original puzzle poem. During the learning phase, a sheet of paper divided into two columns was presented to all participants. Instructions were listed as rules in the right-side column with separate arrows pointing to an example in the left side column. By explicitly following these rules, a correct puzzle poem would ensue. Figure 1 indicates the information provided. An estimation of element interactivity when studying this worked example was conducted.
Estimation of number of interacting elements for experiment 1
1. Note the right side column and read bold text at top.
2. Read first dot point and note there must be six lines in the poem.
3. Read second dot point and follow arrow.
4. Note the first letter of the first underlined word from ‘slug’.
5. Read third dot point.
6. Follow the two arrows and note rhyming words for the first and third lines.
7. Read fourth dot point.
8. Follow the two arrows and note rhyming words for fourth and sixth line.
9. Read fifth dot point.
10. Note that ‘SELMAN’ matches first letter of each first word in each line.
11. Read sixth dot point.
12. Note that last line ‘matches’ answer ‘SELMAN’.
13. Also note repetitive wording throughout the poems for example, ‘The x letter starts with x but this word is not a x’.
For most learners, any part of the task that required them to process all of these elements simultaneously was likely to create a high working memory load due to high element interactivity. An example of knowledge tapping high element interactivity is recalling that the indicated first letter, in the indicated first word, in each line, must be part of a vertical word, to match the answer. In contrast, a part of the task that required consideration of only one isolated element will tap knowledge that is low in element interactivity and should not overextend working memory. An example of tapping isolated elements thus low element interactivity knowledge is recalling that there must be a final line description of the vertical word or, the distinctive repetitive wording of the poem is ‘The first letter starts with…but this word is not a …’.
The resource depletion task was a computer-paced PowerPoint presentation consisting of 71 slides. A sheet was given to the students to record their answers. Slides 1 to 3 were instructions totalling 75 s. This was followed by slides 4 to 7 totalling 25 s which were practice slides that contained two single sentence statements. Students had to write the statement’s last word and indicate if that statement was correct or incorrect. Answers could only be written when the following slides 6 and 7 appeared stating “Memorized last words” and “Circle whether each statement was Correct or Incorrect”. In the practice set, only two statements had to be held in working memory before the slides appeared to write answers. After the introduction and practice slides (100 s in total), the presentation proper commenced for a total of 360 s. The testable answers then proceeded as three sets of 3, 4 and 5 statements, respectively. For example, in one of the sets of 3 statements, learners were presented the 3 statements and had to remember whether each statement was correct or incorrect and remember the last word for each statement. In the sets as previously outlined, statements had to be retained before students were allowed to write. The time available for writing of answers was gradually increased from 5 to 12 s to reflect the number of statements in a set. Students were monitored closely to ensure that they did not write answers while the particular statement was on the screen as this would not reflect working memory retention.
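The timings and set sizes described above can be tallied in a short sketch. The durations come from the text; the constant and function names are ours, and the reading that there were three sets at each of the sizes 3, 4 and 5 is an assumption.

```python
# Durations in seconds, as stated in the text for experiment 1.
INSTRUCTION_SLIDES_S = 75   # slides 1 to 3
PRACTICE_SLIDES_S = 25      # slides 4 to 7
TEST_PROPER_S = 360         # scored part of the presentation

# Assumption: "three sets of 3, 4 and 5 statements" means three sets at each size.
SET_SIZES = (3, 4, 5)
SETS_PER_SIZE = 3


def total_task_seconds():
    """Total running time of the resource depletion task in seconds."""
    return INSTRUCTION_SLIDES_S + PRACTICE_SLIDES_S + TEST_PROPER_S


def scored_statement_count():
    """Number of statements whose last words could be scored."""
    return SETS_PER_SIZE * sum(SET_SIZES)


print(total_task_seconds())                 # 460
print(round(total_task_seconds() / 60, 1))  # 7.7
print(scored_statement_count())             # 36
```

Under these assumptions the task runs for 460 s, or about 7.7 min, with 36 scored statements.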
Three examples of the types of statements are, “We take into consideration that nurture and nature are parts of moral development”, “A moral dilemma for one child would be the same for all children” and “Reasoning is an important part of morality according to various writers”. Data were collected only for the number of last words remembered.
The post-test given to all groups was a blank sheet of paper. Both groups had to compose an original puzzle poem without access to the instructions given in the instruction phase.
The experiment consisted of a pre-instruction phase of 5 min, an instruction phase of 10 min, the resource depletion task of 7.7 min and a post-test phase of 15 min. All the students were informed from a memorised script presented by the researcher. In the pre-instruction phase, students were told they were going to be taught how to write a type of poem by being shown a worked example on a worksheet. They were further told that, after the instructions, the delayed group would continue the experiment in the next week’s tutorial 7 days later and the immediate group would continue the experiment during the current tutorial. The resource depletion task was explained in detail, and the students were informed that sheets would be given out after the learning phase according to their group allocation. After the 15-min instruction time, both groups were told to stop writing and the instructions were collected. After collection, the delayed test group were thanked and directed to continue with a pre-prepared tutorial activity in the room but facing away from the PowerPoint screen.
The immediate group was then given the common post-test phase of 15 min, in which the participants composed a puzzle poem. The test sheet was a blank sheet of paper. Students were instructed to write a new poem that was not the same as the poem in the instructions by recalling the rules without access to the instructions. The delayed testing group followed the same procedure 7 days later in the same tutorial room. They had their resource depletion task and common post-test (also without access to the instructions) during this time.
The poem written during the test phase was assessed according only to content explicitly taught in the instruction phase. We divided the content into that estimated to be tapping higher element interactivity knowledge and that estimated to be tapping lower element interactivity knowledge (see Table 1 for the marking criteria). Marks were not awarded for grammar or spelling, but nonsense words at the end of lines were accepted if they rhymed. Skills in writing and knowledge of correct rhyming words were not pertinent. Marking was objective as students either matched the criteria or did not. Nonetheless, the writing was marked independently by two markers. An intra-class correlation coefficient was calculated to assess reliability across the two sets of original marks, ICC = 0.91, p < 0.01. A total of ten essential criteria were developed that were needed to correctly compose the poem: six for knowledge tapping lower element interactivity and four for knowledge tapping higher element interactivity. The scores were converted to percentages. Also, Cronbach’s (1951) alpha for internal consistency of the ten-item subscale was α = 0.62.
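For readers unfamiliar with the internal consistency statistic, the following is a minimal sketch of the standard Cronbach's alpha formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the marks in the example are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item marks per student)."""
    k = len(scores[0])                      # number of items (criteria)
    items = list(zip(*scores))              # transpose: one tuple per item
    item_vars = [variance(col) for col in items]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical marks (1 = criterion met, 0 = not met) for five students
# on four criteria; illustrative only.
example_marks = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(example_marks), 2))
```

Sample variances are used for both the items and the totals, so the choice of denominator cancels in the ratio.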
The following are some examples of the students’ test writing and the marking procedure: Student 7 from the immediate test group completed this poem and received full marks. These were 4 marks for criteria tapping high element interactivity knowledge and 6 marks for criteria tapping low element interactivity knowledge.
The first letter starts with self but it is not a star
The second letter starts with octopus but it is not a boat
The third letter starts with cat but it is not a car
The fourth letter starts with calf but it is not a dog
The fifth letter starts with ear but it is not a human
The sixth letter starts with rabbit but it is not a frog
The word is a ballgame
In contrast to this example, some students met or failed to meet the strict criteria in various ways. Some poems had no rhyming words at the end of each line or in the incorrect lines. Some poems had no vertical highlighted first letter to compose a word. Some poems had a mismatch between the final word and the first letters. Thus, there was a wide variation in poems. Only students explicitly following the rules were given marks. The following is an example of a student’s poem not completely following the instruction.
Family ensures friendships
A tale is always told
Many decide to take trips
Insecurity can be seen
Loving relationships all around
You never know where they’ve been
Those in which a person grows up with sharing blood relation - family
Results and Discussion
[Table: Results (means and standard deviations) for the dependent variables of experiment 1. Rows: resource depletion test; post-test scores tapping high element interactivity knowledge; post-test scores tapping low element interactivity knowledge.]
Resource Depletion Task
A one-way ANOVA was conducted for the resource depletion task using percentage of the last words correct. There was a significant difference between groups favouring the delayed group, F(1, 71) = 5.44, MSe = 257.89, p = 0.023, ηp² = 0.07.
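The reported partial eta squared can be recovered from the F ratio and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the values above; the function name is ours.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F(1, 71) = 5.44 for the resource depletion task.
print(round(partial_eta_squared(5.44, 1, 71), 2))  # 0.07
```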
A 2 (immediate or delayed group) × 2 (element interactivity) ANOVA was conducted on the post-test results with repeated measures on the last factor. All students finished within the 15-min test period. There was no significant difference between the immediate and delayed test instructional groups, F(1, 71) = 1.54, MSe = 573.22, p = 0.219, ηp² = 0.02. There was a significant difference between high and low element interactivity questions, F(1, 71) = 84.82, MSe = 296.86, p < 0.001, ηp² = 0.54. There was a significant interaction, F(1, 71) = 4.10, MSe = 296.86, p = 0.047, ηp² = 0.05. Because of the significant interaction, t tests were conducted to compare the groups at each level of element interactivity. There was no significant difference between groups on scores tapping high element interactivity knowledge, t(71) = 0.17, p = 0.433, Cohen’s d = 0.04. There was a significant difference between groups on scores tapping low element interactivity knowledge, t(71) = 2.10, p = 0.020, Cohen’s d = 0.49, with immediate test scores higher than delayed test scores.
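The Cohen's d values are consistent with the usual conversion from an independent-samples t statistic, d = t × √(1/n1 + 1/n2), using the group sizes of 39 and 34 given in the method. Whether the authors computed d this way is an assumption; the sketch only checks that the reported numbers agree.

```python
import math


def cohens_d_from_t(t_value, n1, n2):
    """Cohen's d for two independent groups, recovered from t and group sizes."""
    return t_value * math.sqrt(1 / n1 + 1 / n2)


# Group sizes from the method: 39 immediate, 34 delayed.
print(round(cohens_d_from_t(0.17, 39, 34), 2))  # 0.04 (high element interactivity)
print(round(cohens_d_from_t(2.10, 39, 34), 2))  # 0.49 (low element interactivity)
```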
The results of experiment 1 confirmed a significantly depleted working memory score for the immediate compared to the delayed group, indicating that resource depletion occurred immediately after learning but recovered during the delay. With respect to test scores, immediate test scores were higher than delayed test scores using low element interactivity information, presumably due to forgetting.
Experiment 1 confirmed that working memory depletion occurs during learning and is apparent on an immediate test. That depletion is reversed during rest. Given these results, it is possible that the differential results that often occur for testing effect experiments depending on whether the final test is presented immediately or after a delay may be due to working memory depletion effects. This hypothesis depends on testing imposing a greater cognitive load than restudy and on this increased load depressing performance on an immediate but not a delayed test. Experiment 2 tested these hypotheses.
In many syllabi, primary school students learn how to write various types of text. Our instructional material content for experiment 2 was composed of the steps to producing the text type of a persuasive argument. It was a template built on a series of rules. The use of these rules results in relatively low levels of element interactivity. The rules could be learned serially and could be learned in any order.
The hypotheses tested in experiment 2 were as follows.
1. The working memory test task would indicate increased working memory depletion using the immediate compared to the delayed test.
2. The working memory test task would indicate an increased working memory depletion by the study with test groups compared to the study-only groups. This would result in an increased depletion on the immediate post-test by the study with test group compared to the study-only group and that difference would disappear on the delayed test.
3. There would be a learning group (test vs. no test) × time of post-test (immediate vs. delayed) interaction, with the immediate post-test indicating a reverse testing effect (study-only superior).
4. The delayed post-test would indicate a testing effect (study–test superior).
The participants were 56 grade 4 students from two un-streamed ability level classes at a Sydney private school aged from 9 to 10 years. The university and the school system granted ethics approval for the study (no. 5201600619). In addition, written parental permission was granted as well as verbal consent from the teachers and students. The students had some very limited experience in writing this type of text before the experiment. The experiment was conducted in the second term of the school year and during the first lesson periods of the day. The students were randomly assigned into four groups on the day of the experiment after the pre-instructions by the researcher. There were 13 (9 female and 4 male) participants in the study-only group immediate condition, 16 (9 female and 7 male) participants in the study with test group immediate condition, 12 (7 female and 5 male) participants in the study-only group delayed condition and 15 (8 female and 7 male) participants in the study with test group delayed condition. Note that some students who were originally put into groups had to be withdrawn unexpectedly due to other lessons. This resulted in the four groups being unequal in number.
The materials provided instruction on constructing a persuasive argument. The year 4 primary school English curriculum in New South Wales requires the reading of material similar to the material used in the experiment.
The four groups were provided with an introduction to the text type of a persuasive argument and its components on a single sheet of paper (see Fig. 2). For the study-only groups, the paper consisted of two columns. In the left column at the top was the question: “Should school canteens only sell healthy food?” The right-hand column had the second question: “Should schools have a compulsory school uniform?” Both sides used the same template with headings and a sample response for each heading. There were five headings and four sub-headings: the introduction, outlining the opinion and what is covered in the essay; the first reason and a supporting point for that reason; the second reason and a supporting point for that reason; the opposite view and a countering point; and the conclusion. For the study with test groups, the paper consisted of the same two columns with the same headings, except that there were spaces to fill in the template in the second column. It was here that the study with test groups had to write their argument: “Should schools have a compulsory school uniform?” In contrast, the study-only groups had this argument provided.
The resource depletion task was similar to that used in experiment 1 but was modified by the area of content and reduction of duration to reflect the lower ability levels of the younger aged students. Timing was assessed by a previous pilot study with a small group of students. Statements were of a language level suited to this grade of students. An example of two of the statements was “September is the ninth month of the year” and “Eighteen divided by two always equals eight”. The computer-paced PowerPoint presentation consisted of 51 slides. The introduction about the task was explained verbally by the researcher instead of instructions being presented by slides as was the case in experiment 1. The slides commenced immediately after these instructions with two practice sets of 2 statements. The presentation for required answers then had three sets of 2, three sets of 3 and three sets of 4 statements. The total presentation time was 469 s.
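The set structure described above can be tallied with a short sketch. The set sizes come from the text; the percentage scoring in `percent_correct` is an assumption made for illustration, since the text does not state how last-word recall was scaled, and the function names are hypothetical.

```python
# Set sizes as described in the text: two practice sets, then nine scored sets.
PRACTICE_SETS = [2, 2]
SCORED_SETS = [2, 2, 2, 3, 3, 3, 4, 4, 4]


def count_statements(set_sizes):
    """Total number of statements across a list of set sizes."""
    return sum(set_sizes)


def percent_correct(n_recalled, set_sizes=SCORED_SETS):
    """Last-word recall as a percentage of all scored statements.

    Assumption: recall is scored as a simple proportion of scored statements.
    """
    total = count_statements(set_sizes)
    if not 0 <= n_recalled <= total:
        raise ValueError("n_recalled must be between 0 and the total")
    return 100.0 * n_recalled / total


print(count_statements(PRACTICE_SETS))  # 4 practice statements
print(count_statements(SCORED_SETS))    # 27 scored statements
```

The scored total of 27 agrees with the number of statements reported for the answer sheet.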
The same format sheet used in experiment 1 was given to the students to record their answers. Students had to write the statement’s last word. No differences were found on the correct/incorrect option, so it is not reported here. There were 27 statements in total. Answers could only be written when the slides appeared that stated “Write”. The writing time gradually increased from 10 to 15 s dependent on the number of statements. The researcher ensured that the students did not write answers while the particular statement was on the screen. The post-test, whether immediate or delayed, was a blank sheet of paper where students had to write a persuasive argument about the topic “Should cars be allowed in cities”.
The experiment consisted of (1) pre-instruction (5 min), (2) instruction (15 min), (3) resource depletion task (10 min) and (4) post-test (15 min).
All the students were informed from a memorised script presented by the researcher that they were going to be taught how to write a persuasive argument by being shown worked example/s on a worksheet. They were further told that, during the entire instruction phase, they would have to concentrate carefully by reading the worksheet. The resource depletion task and the post-test were explained and the students were then randomly allocated to four groups: immediate study only, immediate study with test, delayed study only and delayed study with test. The two immediate test groups were told they would be having their writing test (the post-test) after the activities. The two delayed test groups were told they would be having the resource depletion test and post-test in 7 days. All groups were given the instructional material for the persuasive argument and all groups had the same time of 15 min to learn their material. After the 15-min instruction time, the instruction sheets were removed and the two delayed test groups were directed to complete a pre-prepared reading activity. They completed this work while facing away from the PowerPoint screen. The immediate test groups continued in the experiment by participating in the resource depletion task and completing the post-test of writing a persuasive argument.
Seven days later, the two delayed test groups received their resource depletion task and post-test of writing a persuasive argument during morning classes. This occurred during a silent reading activity for the two immediate test groups. Apart from the delay, the procedure for the immediate and delayed groups was identical.
Knapp and Watkins (2005) outlined structural features of persuasive texts. These are statement of position (introduction), reasons for the position, opposite view and restatement of the original position (conclusion). The written responses produced by the participants were evaluated according to these structures. Language, spelling and punctuation were not graded. Marks were awarded for knowledge explicitly taught in the instruction phase. Our concern was whether responses were in the right sections, and therefore marks were assigned for adhering to the template correctly. While scoring was largely objective, the writing was marked independently by two markers. An intra-class correlation coefficient was calculated for reliability on both sets of original marking, ICC = 0.89, p < 0.01. Marking subjectivity was limited because the students either met the criteria or not. Cronbach’s (1951) alpha for internal consistency of the five-item subscale was calculated, α = 0.62.
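For readers unfamiliar with the internal consistency statistic, the following is a minimal sketch of the standard Cronbach's alpha formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The marks in `EXAMPLE_MARKS` are invented for illustration and are not the study's data; the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows (one row of item marks per student)."""
    if len(rows) < 2:
        raise ValueError("need at least two students")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("each row needs the same number (>= 2) of items")
    # Sample variance of each item across students.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each student's total mark.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical marks for five rubric sections (NOT the study's data).
EXAMPLE_MARKS = [
    [3, 2, 2, 1, 2],
    [1, 1, 0, 0, 1],
    [2, 3, 2, 2, 1],
    [0, 1, 1, 0, 0],
    [3, 3, 3, 2, 2],
]
print(round(cronbach_alpha(EXAMPLE_MARKS), 2))
```

Alpha rises as the items covary more strongly, so a modest value with only five items is not unusual.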
The Introduction paragraph was marked out of 3. One mark was given for commencing with a signifying question, 1 mark for stating their position and 1 mark for indicating that they would give reasons for their position. Students were awarded no marks if they stated reasons for their position in this segment. A test example written by one student for the Introduction was, “Do you have a job in the city and drive to work? I am going to give you three reasons why we should not only allow public transport into cities”. This answer was awarded 3 marks out of 3. The student had a signifying statement/question, stated his/her position and was going to give three reasons. This response followed the worked examples.
An incorrect example was from a student who wrote “My opinion is that we should not allow public transport because it is bad for the environment”. This response was awarded no marks out of 3. The student had no signifying question or statement, did not indicate how many reasons would be presented and proceeded to give a reason. This student did not follow the worked example for an Introduction.
The next paragraph was the First Reason and it was marked out of 3 with 1 mark for stating “Firstly” or “First”, 1 mark for the reason and 1 mark for a supporting point for the reason. This superior response was awarded 3 marks as it had a “Firstly”, a reason and a supporting point for that reason: “Firstly, if you have bought a car to drive then what’s the use of it if have to catch a train or taxi. We all know that buying a car is expensive and we have to have to get rid of it and it’s even more costly”.
The following test example from the First Reason was only awarded 1 mark out of 3 as it did not comply with the instructions. This student wrote: “If you hoped (sic) on a bus and it was really hot it would be cramped and stuffy and bus stops would be full and you have know (sic) where to sit down”.
The next section was the Second Reason, which was marked out of 3 with the same criteria as the First Reason. The next paragraph was the Opposite View and a point about it. This was marked out of 2. One mark was given for providing an opposite view and 1 mark for the supporting statement. The following example was awarded 1 mark for the opposite view and 1 mark for the single supporting statement: “Some people say that it’s good because it costs less and you don’t have to fill up with petrol which cost lots. But after time all that money you spend builds up to so much more”. Accordingly, this response was awarded 2 out of 2.
The Conclusion was marked out of 2 consisting of 1 mark for denoting a conclusion, 1 mark for re-affirming the initial position and 1 mark for any other relevant concluding comments. Similar to the Introduction, no marks were awarded if another new reason was given.
The following student’s conclusion was awarded 2 marks out of 2 as s/he reaffirmed the initial position and gave no new reasons: “In conclusion, I think that public transport should only be used in the cities. Hopefully I persuaded you”. In total, the test was marked out of 14 and converted to percentages.
Results and Discussion
[Table: Results (means and standard deviations) of experiment 2. Columns: immediate group study; immediate group study with test; delayed group study; delayed group study with test. Row: resource depletion test.]
A 2 (immediate or delayed) × 2 (study only or study with test) ANOVA was conducted on the resource depletion task for the correct last word in each statement. There was no significant difference between the immediate and delayed tests, although the means favoured the two delayed groups on memory accuracy of the last word, F(1, 52) = 3.71, MSe = 576.88, p = 0.059, ηp2 = 0.06. There was no significant difference between the study-only and the study with test groups, F(1, 52) = 0.15, MSe = 576.88, p = 0.700, ηp2 = 0.002, nor was there a significant interaction, F(1, 52) = 0.49, MSe = 576.88, p = 0.487, ηp2 = 0.009.
A 2 (immediate or delayed) × 2 (study only or study with test) ANOVA was conducted on the post-test results. There was no significant difference between immediate and delayed post-test scores, F(1, 52) = 0.58, MSe = 362.77, p = 0.450, ηp2 = 0.01. There was no significant difference between the study-only and the study–test groups, F(1, 52) = 1.24, MSe = 362.77, p = 0.271, ηp2 = 0.02. The interaction was significant, F(1, 52) = 11.50, MSe = 362.77, p = 0.001, ηp2 = 0.18. Because of the significant interaction, t tests were conducted between groups. There was no significant difference between the two immediate groups although the means favoured the study-only group, t(28) = 1.64, p = 0.056, Cohen’s d = 0.56, indicating no testing effect. There was a significant difference between the two delayed groups favouring the study with test group, t(27) = 3.13, p = 0.002, Cohen’s d = 1.51, indicating a testing effect.
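The partial eta squared values reported for the post-test can be recovered from the F ratios and degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three post-test effects; the function name is illustrative, and the error degrees of freedom of 52 correspond to 56 participants minus 4 groups.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Post-test effects from experiment 2, each tested as F(1, 52).
for label, f_value in [("timing", 0.58), ("study group", 1.24), ("interaction", 11.50)]:
    print(label, round(partial_eta_squared(f_value, 1, 52), 2))
# timing 0.01
# study group 0.02
# interaction 0.18
```

The interaction accounts for far more variance than either main effect, which is the pattern expected of a disordinal interaction.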
These results provided limited evidence that working memory resources were depleted and recovered after the 7-day delay. The content post-test displayed a common testing effect result. Studying non-significantly improved the immediate group results with an intermediate effect size, but reading and practising significantly improved the delayed group results with a large effect size. There was no significant difference on the working memory test between the study and test groups possibly because the test was not sufficiently sensitive to show differences using low element interactivity information. It was hypothesised that those differences may be detected using higher element interactivity information which provided the rationale for experiment 3.
Estimation of elements for experiment 3
1. Note the center column and read bold text at top.
2. Read first dot point and note there must be six lines in the poem.
3. Read second dot point and follow arrow.
4. Note the first letter of the first underlined word from ‘fish’.
5. Read third dot point.
6. Follow the two arrows and note rhyming words.
7. Read fourth dot point.
8. Follow the two arrows and note rhyming words.
9. Read fifth dot point.
10. Follow the two arrows and note rhyming words.
11. Note that line six is related to the ‘answer is’ word.
12. Note that line 6 ‘matches’ answer ‘FRUIT’.
13. Read sixth dot point.
14. Note that ‘FRUIT’ matches first letter of each first word in each line.
15. Also note repetitive wording throughout the poems, for example, ‘My first letter is in…but I am not a…’.
The participants were 50 grade 4 students from two classes at a Sydney private school aged from 9 to 10 years. The university and the school system granted ethics approval for the study (no. 5201600619), the same approval that covered experiment 2. In addition, written parental permission was granted as well as verbal consent from the teachers and students. The two classes were un-streamed in ability levels and had no experience in writing this type of text before the experiment. The experiment was conducted in the fourth term of the school year and during the first lesson periods of the day. The students were randomly assigned from the two classes into four groups on the day of the experiment. There were 13 (6 female and 7 male) participants in the immediate study group condition and 12 (6 female and 6 male) participants in the immediate study with test group condition. There were 15 (7 female and 8 male) participants in the delayed study-only group condition and 10 (7 female and 3 male) participants in the delayed study with test group condition. Note that some students in the two delayed testing groups were absent in the second phase of the experiment resulting in uneven numbers.
The instructional materials were similar to experiment 1. We used a puzzle poem with minor differences in the format and wording compared to experiment 1. A reduction of lines and sequence of rhyming words was used to reflect the younger age of the students and the content of the poem was not related to the tertiary content of experiment 1. Other minor changes were the use of seven not eight lines in total and the last word of lines 1 and 2, 3 and 4, and 5 and 6 had to rhyme in pairs, respectively. This procedure was a less complex order of rhyming than used in experiment 1. Two worked examples were on the sheet for the study-only groups. In contrast, the study with test groups had a worked example with a blank column on the right where they could write their responses during the instruction phase.
In summary for this experiment, the instructional material was high in element interactivity with similar rules to those used in experiment 1. The resource depletion task was the same one used in experiment 2. The post-test was writing an original puzzle poem.
The experiment consisted of (1) pre-instruction (5 min), (2) instruction (15 min), (3) the resource depletion task (7.6 min) and (4) a post-test (15 min).
All the students were informed from a memorised script presented by the researcher that they were going to be taught how to write a puzzle poem by being shown worked example/s on a worksheet. They were further told that, during the entire instruction phase, they would have to concentrate carefully by reading the worksheet. Dependent on which group they were in, they were instructed to either study two worked examples (the study-only group) or study one worked example and practice what they had learnt by writing in the far-right column (the study with test group). The resource depletion task and post-test were explained. The immediate test groups were told they would be having the resource depletion task and post-test after their learning time. The delayed test groups were told they would be having these components in 7 days. All groups were given the instructional material for the puzzle poem and all groups had 15 min to learn their material. After the 15-min instruction time, the instruction sheets were removed and the delayed groups were taken to another area to complete a pre-prepared reading activity. The immediate test groups continued in the experiment by participating in the resource depletion task and completing the post-test.
The post-test sheet was a blank sheet of paper. Students were instructed to write a new poem that was not the same as the poem in the instructions by recalling the rules without access to the instructions.
The delayed test group received their resource depletion task and post-test during morning classes 7 days later. During this time, the students in the immediate groups were withdrawn to another class by a class teacher.
The scoring was out of 10 marks converted to percentages. An estimation of elements was completed using the same approach as in experiment 1. It was not possible to correctly construct a poem unless students followed all of the rules as explicitly shown in the worked example/s. Each mark awarded was for having a criterion correct; therefore, marking was straightforward. As for the previous experiments, however, writing was marked independently by two markers. An intra-class correlation coefficient was calculated for reliability on both sets of original marking, ICC = 0.87, p < 0.01. Cronbach’s (1951) alpha for internal consistency of the ten-item subscale was calculated (α = 0.82).
The following are some examples of the students’ test writing and the marking procedure: Student 3 from the immediate study with test group completed this poem.
My first letter is in snake but I’m not a cake
My second letter is in orange but I’m not a fake
My third letter is in lid but I’m not a ice
My fourth letter is in igloo but I’m not a mice
The fifth letter is in duck but I’m not a thong
My whole is strong
Answer is solid
The student matched all criteria and scored 10 marks out of 10. In contrast, student 5 from the immediate group wrote:
All my friends are nice
Parker likes rice
Paul sits on the mat
Lyla is my cat
Eggs are yum
I like fruit flavoured gum
Answer is fruit
This student matched only 6 out of 10 of the explicit criteria outlined in the instructions. The poem was awarded 1 mark for having the last word of each line rhyme in pairs, 1 mark for having the first letter being part of the vertical word (apple), 1 mark for having the first letter make a word, 1 mark for having a 6th line explanation, 1 mark for having six lines in the poem itself and 1 mark for “answer is”.
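One of the poem rules, that the clue words spell out the stated answer, can be checked mechanically. The sketch below assumes, based on the “solid” poem quoted above, that each clue line contains the phrase “letter is in” followed by a clue word whose first letter contributes to the answer, and that the poem ends with an “Answer is” line. It covers only this single criterion, not the full ten-point marking scheme, and the function names are hypothetical.

```python
import re

# Assumed line formats, inferred from the quoted student poem.
CLUE = re.compile(r"letter is in (\w+)", re.IGNORECASE)
ANSWER = re.compile(r"^\s*answer is (\w+)", re.IGNORECASE)


def spelled_word(poem_lines):
    """Join the first letter of each clue word, in line order."""
    letters = []
    for line in poem_lines:
        match = CLUE.search(line)
        if match:
            letters.append(match.group(1)[0].lower())
    return "".join(letters)


def stated_answer(poem_lines):
    """Return the word given on the 'Answer is' line, or None if absent."""
    for line in poem_lines:
        match = ANSWER.match(line)
        if match:
            return match.group(1).lower()
    return None


def acrostic_is_consistent(poem_lines):
    """True when the clue-word initials spell the stated answer."""
    answer = stated_answer(poem_lines)
    return answer is not None and spelled_word(poem_lines) == answer


SOLID_POEM = [
    "My first letter is in snake but I’m not a cake",
    "My second letter is in orange but I’m not a fake",
    "My third letter is in lid but I’m not a ice",
    "My fourth letter is in igloo but I’m not a mice",
    "The fifth letter is in duck but I’m not a thong",
    "My whole is strong",
    "Answer is solid",
]
print(spelled_word(SOLID_POEM))            # solid
print(acrostic_is_consistent(SOLID_POEM))  # True
```

A poem that omits the “letter is in” wording, such as the “fruit” poem, yields an empty spelled word and fails this check, in line with its lower mark.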
Results and Discussion
[Table: Results (means and standard deviations) of experiment 3. Columns: immediate group study; immediate group study with test; delayed group study; delayed group study with test. Row: resource depletion test.]
A 2 (immediate or delayed) × 2 (study only or study with test) ANOVA was conducted on the resource depletion task. There was a non-significant main effect between the immediate and delayed test although the means favoured the two delayed groups, F(1, 46) = 3.56, MSe = 554.63, p = 0.065, ηp2 = 0.07. There was a significant difference between the study groups favouring the study-only groups, F(1, 46) = 4.07, MSe = 554.63, p = 0.049, ηp2 = 0.08. The test timing by study group interaction was not significant, F(1, 46) = 1.70, MSe = 554.63, p = 0.199, ηp2 = 0.03.
The significant difference favouring the study-only groups was analysed further. The immediate study-only group had a significantly greater working memory capacity than the immediate study–test group, t(23) = 2.36, p = 0.014, Cohen’s d = 0.93. There was no significant difference between the two delayed groups, t(23) = 0.5, p = 0.311, Cohen’s d = 0.22. As can be seen by the effect sizes, the immediate test with its large effect size contributed more to the significant result than the delayed test with a much smaller effect size. These results suggest that the major contributor to the significant difference in memory capacity between the study-only and the study–test groups was the difference between the two immediate test groups.
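As a rough cross-check on the effect sizes, Cohen's d for two independent groups can be approximated from the t statistic and the group sizes as d = t × √(1/n₁ + 1/n₂). This sketch assumes a pooled-variance independent-samples t test, which the text does not state explicitly, and uses the group sizes reported in the participants section; the function name is illustrative. Because the reported t values are rounded, the results only approximate the published d values (0.93 and 0.22).

```python
from math import sqrt


def cohens_d_from_t(t_value, n1, n2):
    """Approximate Cohen's d from an independent-samples t statistic."""
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two participants")
    return t_value * sqrt(1.0 / n1 + 1.0 / n2)


# Immediate groups: 13 study-only vs 12 study with test, t = 2.36.
print(round(cohens_d_from_t(2.36, 13, 12), 2))  # 0.94
# Delayed groups: 15 study-only vs 10 study with test, t = 0.5.
print(round(cohens_d_from_t(0.5, 15, 10), 2))   # 0.2
```

Both approximations fall close to the reported values and preserve the contrast between a large immediate effect and a small delayed one.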
A 2 (immediate or delayed) × 2 (study only or study with test) ANOVA was conducted on the post-test results. There was no significant difference between the immediate and delayed post-test results, F(1, 46) < 1, MSe = 761.02, p = 0.975, ηp2 < 0.01, nor an overall testing effect, F(1, 46) = 0.51, MSe = 761.02, p = 0.479, ηp2 = 0.01; however, the interaction was significant, F(1, 46) = 6.88, MSe = 761.02, p = 0.012, ηp2 = 0.13. Because of the significant interaction, t tests were conducted between groups. There was no significant difference between the two immediate groups although the means favoured the study-only group, t(24) = 1.36, p = 0.093, Cohen’s d = 0.51. There was a significant difference between the two delayed groups favouring the study with test group, t(24) = 2.34, p = 0.014, Cohen’s d = 1.07.
These results showed that by using high element interactivity information, there was some evidence of differential resource depletion immediately following a test compared to restudy. The same 2 × 2 experimental design was used as in experiment 2. The study plus testing groups had a reduced working memory compared with the studying plus restudying groups with a larger effect size on the immediate test compared to the delayed test. That difference may have precluded a testing effect on the immediate test but allowed a testing effect on the delayed test resulting in the significant post-test interaction.
The testing effect provides an interesting phenomenon that suffers from inadequate theoretical explanations. It has been clear for over a century that including a test during learning can facilitate subsequent post-test performance compared to restudying. Furthermore, there is substantial evidence that the effect is far more likely during a delayed rather than an immediate test of performance. The effect can reverse on immediate tests. As far as we are aware, there is no obvious reason that has been presented to indicate why testing during learning can facilitate long-term learning but retard immediate learning.
Recent enhancements to cognitive load theory (Chen et al. 2018) may explain why the testing effect can reverse using immediate rather than delayed tests. The theory is based on the assumptions that educationally relevant instructional material consists largely of domain-specific, biologically secondary information and that the cognitive architecture that governs the processing of that information requires a limited capacity and duration working memory to deal with novel information, a large long-term memory to store the products of learning, an ability to obtain most of the required information from other people and the remaining information by generating it during problem solving. Once stored in long-term memory, information can be retrieved to working memory and used to govern action appropriate to the external environment.
This categorisation of knowledge and the cognitive architecture on which cognitive load theory is based assumed a fixed capacity working memory. This assumption needed to be modified (Chen et al. 2018). Working memory resources can be depleted with activity and restored with rest. With this addition, phenomena that otherwise seemed uninterpretable could be explained. Working memory resource depletion and recovery can be used to account theoretically for the reversal of the testing effect using immediate or delayed tests. The current experiments tested the relevant hypotheses.
Experiment 1 tested the hypothesis that working memory resources were depleted after cognitive effort and replenished after rest. The results indicated a smaller capacity working memory was available immediately after learning than after rest, an essential requirement if the reversal of the testing effect depending on timing is to be explained by working memory resource depletion. Experiments 2 and 3 directly tested the hypothesis that the use of testing rather than restudy increased working memory resource depletion immediately after the learning phase with recovery following a delay. Both experiments found an interaction on post-test results, indicating a superiority of restudy on an immediate post-test (a reverse testing effect) and a superiority of testing on a delayed post-test (a testing effect). In both experiments, the working memory tests provided some evidence of a depletion of working memory immediately after learning. Contrary to our hypothesis, on immediate tests, experiment 2 found no evidence of decreased working memory capacity following testing compared to restudy during learning. In contrast, experiment 3, using higher element interactivity information, found more depletion following testing during learning with most of the effect occurring on the immediate test.
These results provide support for the hypothesis that working memory resource depletion may be a factor in the reversal of the testing effect using immediate rather than delayed post-tests. While this reversal effect has been known for some time (Roediger and Nestojko 2015), there has been little consensus on the reasons for the reversal effect. Working memory resource depletion may provide a promising candidate hypothesis.
While experiments 2 and 3 used groups with low numbers of participants, the fact that experiment 3 was a replication of experiment 2 with the only difference being the levels of element interactivity reduces the probability of chance results. Nevertheless, it would be useful to run similar experiments with increased sample sizes.
From a theoretical perspective, working memory resource depletion after cognitive exertion is an important addition to cognitive load theory. The assumption that for any individual, working memory capacity remains largely consistent, with the only change being due to the distinction in functioning between the narrow limits of change principle and the environmental organising and linking principle, is unviable. Working memory capacity can change substantially due to depletion after cognitive effort and recovery after rest.
From an educational perspective, it is clear from a large number of studies that testing during learning can enhance post-test performance. The timing of those post-tests is critical. Entirely different results may be obtained if tests are presented immediately after cognitive exertion rather than after cognitive rest.
- Carpenter, S. K., Pashler, H., Wixted, J. T., & Vul, E. (2008). The effects of tests on learning and forgetting. Memory & Cognition, 36, 438–448. https://doi.org/10.3758/MC.36.2.438.
- Gates, A. I. (1917). Recitation as a factor in memorizing. Archives of Psychology, 6(40).
- Geary, D. C. (2012). Application of evolutionary psychology to academic learning. Applied Evolutionary Psychology. https://doi.org/10.1093/acprof:oso/9780199586073.003.0006.
- Geary, D. C., & Berch, D. B. (2016). Chapter 9: Evolution and children's cognitive and academic development. Evolutionary Psychology, 217–249. https://doi.org/10.1007/978-3-319-29986-0_9.
- Kirschner, P., Sweller, J., & Clark, R. (2006). Why minimal guidance during instruction does not work: an analysis of the failure of constructivist, discovery, problem-based, experiential and inquiry-based teaching. Educational Psychologist, 41(2), 75–86. https://doi.org/10.1207/s15326985ep4102_1.
- Knapp, P., & Watkins, M. (2005). Genre, text, grammar: technologies for teaching and assessing writing. Sydney: UNSW Press.
- Kühn, A. (1914). Über Einprägung durch Lesen und durch Rezitieren [On imprinting through reading and reciting]. Zeitschrift für Psychologie, 68, 396–481.
- Roediger, H. L., Putnam, A. L., & Smith, M. A. (2011). Ten benefits of testing and their applications to educational practice. Psychology of Learning and Motivation: Advances in Research and Theory, 55, 1–36. https://doi.org/10.1016/b978-0-12-387691-1.00001-6.
- Roediger, H. L., & Nestojko, J. F. (2015). The relative benefits of studying and testing on long-term retention. In J. G. W. Raaijmakers, A. H. Criss, R. L. Goldstone, R. M. Nosofsky, & M. Steyvers (Eds.), Cognitive modeling in perception and memory: a festschrift for Richard M. Shiffrin (pp. 99–111). New York: Psychology. https://doi.org/10.1037/e633262013-206.
- Sweller, J. (2011). Cognitive load theory. In J. Mestre & B. Ross (Eds.), The psychology of learning and motivation: cognition in education (Vol. 55, pp. 37–76). Oxford: Academic. https://doi.org/10.1016/b978-0-12-387691-1.00002-8.
- Sweller, J. (2012). Human cognitive architecture: why some instructional procedures work and others do not. In K. Harris, S. Graham, & T. Urdan (Eds.), APA educational psychology handbook (Vol. 1, pp. 295–325). Washington: American Psychological Association. https://doi.org/10.1037/13273-011.
- Van Gog, T., Kester, L., Dirkx, K., Hoogerheide, V., Boerboom, J., & Verkoeijen, P. P. J. L. (2015). Testing after worked example study does not enhance delayed problem-solving performance compared to restudy. Educational Psychology Review, 27(2), 265–289. https://doi.org/10.1007/s10648-015-9297-3.