Introduction

The retention of foundational knowledge is crucial in learning and teaching mathematics (Hopkins et al., 2016). Since mathematics teachers are required to explain their understanding of mathematical concepts and procedures to students during instruction, they need a deep, well-connected understanding of fundamental mathematics (Ma, 2020). Advanced mathematics, like abstract algebra, serves to deepen and confirm more rigorously the specific mathematical ideas secondary teachers will teach (Wasserman, 2018). Pre-service mathematics teachers, therefore, have to be taught in a way that promotes understanding and long-term retention of mathematical concepts and the relationships between them. In other words, in university mathematics courses, the learning strategies applied should prevent forgetting and enhance understanding and the creation of long-term knowledge. Retrieval practice (the strategic use of retrieval to enhance memory) and worked examples (showing students a step-by-step demonstration of how to solve a problem) are both recommended as effective methods for improving learning (Adeniji & Baker, 2023; Dunlosky et al., 2013). However, they are based on different underlying cognitive processes. In this article, the effectiveness of worked examples and that of a specific type of retrieval practice were compared in an abstract algebra course, concentrating on medium- and long-term retention. We conducted an experiment with second-semester pre-service mathematics teachers in the context of the “Algebra and Number Theory 2” course (ANT 2) in the 2017/18 spring term. During the semester, we divided the class into two groups of sizes 39 and 37. One of the groups studied the material using retrieval practice, while the other one studied with worked examples. They took two tests on the material they learned on the topic of polynomials: one in the 6th week of the semester and another one at the end of the semester.
This way, we measured what students could retrieve in the medium term. To measure the long-term effects of these two methods, students took a post-test on the topic of polynomials five months after their final test. This test was taken by all the students who had completed the course “Algebra and Number Theory 2” and who had enrolled in the continuation of the course, Algebra and Number Theory 3 (ANT 3). In this paper, we analyze the results of these three tests.

Literature review

Testing and test-potentiated learning

Unless the learned material is consciously reviewed, people forget about half of the newly gained knowledge within days or weeks (Averell & Heathcote, 2011; Ebbinghaus, 1913; Murre & Dros, 2015). Testing works against forgetting. Several studies showed that knowledge acquired by testing is more resistant to interference effects and leads to a lower forgetting rate (Kliegl & Bäuml, 2016; Racsmány & Keresztes, 2015; Szpunar et al., 2008). In addition, it has a promoting effect on learning: it produces better organization of the acquired knowledge, enhances its transfer to new contexts, and produces faster access to learned information (Chan et al., 2018; Jacoby et al., 2010; Racsmány et al., 2018; Zaromb & Roediger, 2010). Testing, also known as retrieval practice, is a learning technique that involves recalling to-be-remembered information from memory. It can refer to any activity (such as questions during class, quizzes, flashcards, brain dump, and examination questions) that requires retrieving information from memory without the help of any external sources. Although it was more than 100 years ago that testing was first studied (Abbott, 1909), it is only in the last 20 years that the testing effect—the fact that active retrieval produces better retention than passive rereading—has become the focus of research (Rowland, 2014). The direct effects of retrieval can be seen when students study a set of materials and then practice retrieval without restudying or receiving feedback after retrieval (Karpicke, 2017). Any gains in learning from practicing retrieval, without restudy or feedback, represent the direct effects of retrieval processes on learning.

Also, testing can have an indirect effect on learning: unsuccessful retrieval attempts can enhance the effectiveness of subsequent restudy. Attempting to retrieve information from memory may improve the later restudy of items even when the retrieval attempt fails and no feedback is given (Arnold & McDermott, 2013; Grimaldi & Karpicke, 2012; Izawa, 1966; Wissman & Rawson, 2018). This effect is known as test-potentiated learning. Since in an authentic mathematics education context it is difficult to separate the indirect and direct effects of testing (and we do not necessarily aim to distinguish them), we did not separate the two in this study but measured direct and indirect effects together.

Testing and test-potentiated learning have proved to be successful in several learning environments, such as text memorizing, foreign language vocabulary, general knowledge facts, learning materials that include visual or spatial information, and skill learning (Rowland, 2014) and have been demonstrated as powerful tools for improving learning not only in laboratory circumstances but in real educational environments as well (Dunlosky et al., 2013; McDermott et al., 2014; Roediger et al., 2011). However, more applied research is needed in mathematics learning (Agarwal et al., 2021). The literature on the application of retrieval practice in real mathematics educational environments, in particular in learning higher mathematics, is limited. Since several factors can affect retention, such as context, subject, and material complexity, it is legitimate to ask whether testing is an effective tool for learning higher mathematics.

Testing and test-potentiated learning when learning complex materials

Some researchers argue that retrieval practice is not beneficial for “complex” materials (e.g., Van Gog & Sweller, 2015). They suggest that the testing effect decreases or disappears as the complexity of learning material increases. By “complex material,” they mean “high in element interactivity, containing various information elements that are related and must therefore be processed simultaneously in working memory” (Van Gog & Sweller, 2015, p. 248). Also, the research of Van Gog and Kester (2012) points to the direction that in the field of science, the testing effect might not work in acquiring problem-solving skills from worked examples. On the other hand, a wealth of studies, such as the research of Butler (2010), Chan (2010), and McDaniel et al. (2009), emphasize the benefits of retrieval practice for learning complex materials. Although only a few studies investigated the testing effect on mathematical problem-solving in a classroom environment, recent studies suggest that increasing retrieval practice may be an effective way of learning (Fazio, 2019; Lyle & Crawford, 2011; Lyle et al., 2016, 2020; Szeibert et al., 2022). The study of Lyle and Crawford (2011) was carried out in a statistics for psychology course, where students had to answer a small set of questions at the end of each lecture to retrieve information learned the same day. This method significantly and substantially increased exam scores. Lyle et al. (2020) investigated spaced versus massed retrieval practice in a precalculus course for engineering students and their impact on long-term retention by independently manipulating the amount and the spacing of retrieval practice. The long-term retention of the students on the material learned in the precalculus course was assessed one month after the end of the course. On this post-test, the memory was significantly better for precalculus objectives that had been the target of spaced quizzing versus massed quizzing. 
Also, increasing the number of quiz questions did not significantly affect retention. In similar research by Lyle et al. (2016), across-semester retention was measured less directly. In this study, researchers came to the same conclusion as in the abovementioned experiment. Finally, the study of May (2021) investigated spaced-retrieval practice in a mathematics course for second-year pre-service mathematics teachers. The topics of the course were analytical geometry, functions, remainder and factor theorems, Euclidean geometry, and matrices. He found that the intervention enhanced retention in those categories of knowledge and reasoning that were relatively similar to what was presented in class and at the same time, it required imitative reasoning and well-established procedural knowledge. In categories where flexible and creative use of conceptual and procedural knowledge was required, students’ performance was less impressive. The results of May’s study show that repeated and spaced retrieval was effective in enhancing near transfer but was not that effective for far transfer.

Worked examples

The effect of testing when learning complex material is not obvious. Numerous studies report that when learning flexible procedures, retrieval practice is no more effective than repeated studying (Leahy et al., 2015; Van Gog & Kester, 2012; Van Gog & Sweller, 2015; Van Gog et al., 2015). Moreover, it can even be suboptimal: some researchers found that in the case of math and science problem-solving, using worked examples instead of problem-solving tasks was more beneficial (Atkinson et al., 2000; Renkl, 2014; Van Gog & Rummel, 2010). By worked examples, we mean a step-by-step demonstration of how to perform a task or how to solve a problem (Clark et al., 2006). In terms of problem-solving, studying carefully selected worked examples can often be a more effective way of learning than solving further problems. This worked example advantage is often called the worked example effect. The effectiveness of using worked examples is usually explained by the cognitive load theory. It helps in reducing extraneous cognitive load and in increasing germane cognitive load in the initial phase of learning. In other words, it frees up working memory resources so that the learner can pay more attention to constructing and automating schemas (Kalyuga et al., 2010; Sweller, 1988, 2010). Typically, it is beneficial for novice learners, and it becomes less effective as expertise increases (Kalyuga et al., 2010).

Testing versus worked examples

Testing and studying worked examples can be both effective learning methods in the field of problem-solving. However, which one is more effective in the medium and long term when learning to solve abstract mathematical problems is still a question. Regarding mathematics education, one of the most relevant works on the topic of comparing testing and worked examples is the work of Yeo and Fazio (2019). In their research, the efficacy of retrieval practice versus studying worked examples was examined on the topic of Poisson distribution. They argue that both testing and studying worked examples are required to effectively learn complex material. Their experiment included a learning phase and a test phase. The test phase occurred either five minutes or one week after the learning phase. During the one-session learning phase, one part of the participants learned the material using retrieval practice, while the other part learned it by (re)studying worked examples. They found that the optimal learning strategy depends on various factors. Namely, it depends on the learning goal, the retention interval, and the kind of knowledge being learned (stable facts or flexible procedures). The learning processes involved (schema induction vs. memory and fluency building) are also a major factor. They argue that learning objectives determine when it is better to test or study. When the learning goal was to remember stable facts in the text of a worked example, repeated testing resulted in higher recall performance than repeated studying one week after the intervention. When the learning goal was to learn a novel math procedure, the optimal learning strategy depended on the retention interval (five-minute or one-week delay) and the nature of the material (identical or nonidentical learning problems). Five minutes after the learning phase, repeated studying was more effective than repeated testing both with identical and nonidentical problems. 
One week after the learning phase, repeated testing was as effective as repeated studying with nonidentical learning problems and it was more effective than repeated studying with identical learning problems. Their research findings suggest that the testing effect can occur for flexible procedures as well.

Research focus and research questions

The “Literature review” section focused on studies that have not only investigated retrieval practice in real educational environments but explicitly studied it within the subject of mathematics. In mathematics, problems often require developed problem-solving skills: the application of procedures and deep conceptual understanding are necessary. Also, mathematical problems themselves are usually complex. The present research was carried out in an abstract algebra course. We investigated prospective mathematics teachers’ retention of ANT 2 problem-solving skills. University algebra courses play a major role in improving students’ abstraction ability, making connections within different parts of mathematics and science, and developing a deeper understanding of concepts, such as formal calculations, introducing algebraic structures, and understanding their behaviors. By the level of abstraction, we mean the degree of connection between pieces of information; it refers to the extent to which a unit of knowledge is bound to a given context (Hiebert & Lefevre, 1986). In other words, the level of abstraction increases as the knowledge moves away from a specific context. Even though abstract algebra is taught in university-level mathematics, it has numerous connections to primary and secondary school mathematics (Even, 2011; Wasserman, 2018). Understanding concepts related to abstract algebra and uncovering its connections to primary and secondary school mathematics has a significant impact on mathematics teachers’ beliefs and their practices of teaching (Wasserman, 2014). We believe that understanding algebra at a higher level is essential for mathematics teachers. However, many studies showed that students have difficulties learning abstract algebra (Agustyaningrum et al., 2021; Veith et al., 2022; Wasserman, 2018). Even if students pass their algebra exam, they have trouble remembering the learning material later.
The question arises as to whether continuous testing helps retain information in abstract algebra in the long term. It would be a big step forward for university mathematics education if it turned out that testing can be leveraged to increase long-term retention in studying and solving problems in abstract mathematics. In this research, we aim to explore and gain a better understanding of how testing affects studying and solving problems in abstract algebra, if at all.

Research questions

In this study, we were curious about the medium-term (within-semester) effects and long-term (across-semester) effects of testing in solving complex problems in abstract mathematics.

The research questions were the following:

  • Do students who study abstract algebra in a testing way (described below) perform better on complex mathematical problems than those who study by worked examples (described below) on the midterm and the final test (in the medium term)?

  • Do students who study abstract algebra in a testing way (described below) perform better on complex mathematical problems than those who study by worked examples (described below) on the post-test five months after the final test (in the long term)?

  • To what extent do students forget abstract algebra in the long term—five months after the final test—studying it in a testing way (described below) versus studying by worked examples (described below)?

Methods

Sample and study design

The experiment consisted of two parts: in the first part, we investigated the medium-term effect of the interventions in an undergraduate abstract algebra course, while in the second part, we focused on the long-term effect.

The first part of the study was conducted in the spring semester of the academic year 2017/2018 at ELTE Eötvös Loránd University in the capital of Hungary. Participants in the study were N = 76 prospective high-school teachers enrolled in the “Algebra and Number Theory 2” course (ANT 2); that is, they were students who had already attended a semester at the university.

The lectures were presented by one of the researchers. He was an experienced university mathematics lecturer who had taught this course many times. The course consisted of a 90-min lecture and a 90-min practice session and lasted for 13 weeks. The students attended the same lectures, while their practice sessions were taught in six groups of 15 students on average. All students learned the same material and solved the same types of problems during their practice sessions. Students were not allowed to miss more than two practice sessions; therefore, students attended almost all practice sessions.

If one wants to implement retrieval practice in the classroom, one needs to consider a few requirements: the first retrieval should take place within 24 h, and it must be an activity where students do not use any external help (for example, their books, notes…). Students should be involved in it and should not be exposed to a high level of stress. Also, it should not take a lot of time. Furthermore, the form of testing and the types of questions have to be properly chosen. Taking all these aspects into account, the design of the experiment was the following. There were two types of groups: a testing group and a worked example group. The testing group and the worked example group were selected from the different practice groups of the academic year 2017/18: half of the practice groups were chosen to be testing groups (N = 39), and the other half of the practice groups were selected to be worked example groups (N = 37). Two practice groups were run by professors, two by younger members of the department, and two groups were taught by a Ph.D. student. One of each pair of groups was a testing group and the other one was a worked example group. This way, the teacher’s effect was eliminated as much as possible. These groups did the same exercises during the practice sessions. However, there was a difference between the structures of the two types of groups’ practice sessions in the last 5–10 min of each session.

At the end of the practice sessions, students from the testing group took a test on the material of the given practice session without any aid or external help. These end-of-class tests consisted of two open-ended questions related to the topic of the given lesson. This way, they had to retrieve immediately what they had just studied. These tests were corrected (in a simple way: correct or incorrect solution) and evaluated (0–1–2 points). Students did not get the evaluated tests back, and they did not get any feedback about how to solve the problems of the end-of-class tests unless they asked for it. During the semester, the corrected tests were shown to the students if they were curious about their end-of-class test results. However, this happened only sporadically. To make sure that students took these tests seriously, they needed to score 50 percent of the maximum points by the end of the semester to pass the course. It was not difficult to achieve 50 percent; all the students managed to do so. To give the reader an impression of these end-of-class tests, six questions are presented (see Table 1). We intended to ask questions that were neither too easy nor too hard for the students, that is, questions of moderate difficulty. Solving these tasks required the same knowledge content and ideas as the tasks of the midterm and final tests, but the tasks were easier than those on the midterm and final tests and not identical to them.

Table 1 Sample questions of the end-of-class tests

The same two problems that the testing group solved in the end-of-class tests were shown to the worked example group at the end of the practice sessions. However, in this group, the teacher solved these problems, not the students—in the worked example group, the solutions to these problems were presented step-by-step by the teacher. This way, students could concentrate on the solution pathway. Students’ topic-related problem-solving skills were measured two times during the semester: they took a midterm and a final test on the 6th and the 13th week of the semester, in March and May 2018.

The second part of the study was carried out in the fall semester of the academic year 2018/2019 at ELTE Eötvös Loránd University in the capital of Hungary during the course ANT 3. In the teacher training program, ANT 3 can be taken in either the third or the fifth semester. Participants of the study were those pre-service teachers who took the course ANT 3 in their third semester and had participated in the first part of the study. For example, those who did not pass ANT 2, or who completed ANT 2 and had another class instead of ANT 3, did not participate in this part of the study. In this part of the study, we investigated the long-term effect of the two methods. We measured the long-term retention of ANT 2 material by a post-test. Students took the post-test on the topic of polynomials in October 2018, five months after they took their final exam in ANT 2. Altogether, 33 students took this post-test: N = 13 had studied ANT 2 in a testing way (testing group), and N = 20 came from the worked example group. Students had been informed that they would take a test on what they had learned so far; however, they expected to be asked only about materials learned in ANT 3, not ANT 2. Tests on announced topics can influence student motivation because knowing about an upcoming test often leads students to increase their study efforts (Roediger & Karpicke, 2006). In that case, the effect of retrieval on learning is mediated by enhancements in subsequent restudy, which is why we did not explicitly tell students about the upcoming post-test. All the participants agreed to take the post-test, and not doing well on the post-test did not have any negative consequences.

The material covered by the study

The topics covered by the course Algebra and Number Theory 2 can be found in the Appendix. Topics for the study were a subset of the ones presented in the course, namely, Polynomials over fields; Equalities, leading coefficient, monic polynomial; Operations with polynomials (addition, subtraction, multiplication, Euclidean division), the zero polynomial; Degree of a polynomial; Roots and factorizations of polynomials, Horner’s method; The maximum number of roots; Identity theorem of polynomials over an infinite field; The relation between polynomial functions and polynomials; Multiplicity of roots, root factor form; Lagrange interpolation; Rational roots of polynomials with integer coefficients; The relation between coefficients and roots of a polynomial; Fundamental theorem of symmetric polynomials, power sums, Viète’s formulas; The fundamental theorem of algebra; Polynomials over \({\mathbb{Z}}\); Gauss’s lemma; Reduction modulo p; Polynomials over \({\mathbb{Z}}_{p}\); Irreducible polynomials; Irreducibility criteria for polynomials; and Cyclotomic polynomials.

Instruments

In the first part of the study, students took two tests during the semester: one in March 2018 (in the 6th week of the semester) and another one in May 2018 (in the 13th week of the semester). We concentrated on their problem-solving abilities on polynomials. The first test—which students took in March—consisted of seven problems. Out of the seven problems, three tasks were on polynomials. On the second test—which the students completed in May—we asked students five problems, and all of them were on the topic of polynomials (see Appendix). The tests have essentially been the same every second year since 2003.

In the second part of the study, we were curious about students’ long-term retention of the topic of polynomials. We measured it by a post-test at the end of October 2018 which consisted of three problems (see Appendix).

In the Appendix, we also highlight some of the most relevant problems and their solutions presented in the course ANT 2. We argue that all of these problems can be considered complex. Following the definition of Van Gog and Sweller (2015), by “complex” we mean “high in element interactivity, containing various information elements that are related and must therefore be processed simultaneously in working memory” (p. 248). Most of these problems asked about factoring or reducibility of high-degree polynomials over various algebraic structures, such as \({\mathbb{Z}},{\mathbb{R}},{\mathbb{Q}},{\mathbb{C}}\). As Table 1 shows, each problem requires a different approach, and most of them can be approached from different directions. After choosing and applying an initial trick, in almost all cases, we obtain a lower-degree polynomial to factor, and so on.

Data analysis and results

To measure students’ achievement and development, we used a combination of qualitative and quantitative research methods. To answer our first two research questions (RQ1 and RQ2) we used quantitative research methods. All statistical analyses were conducted using R 4.2.3 software (R Core Team, 2023). Qualitative research methods were used to answer the third research question (RQ3).

RQ1: Do students who study abstract algebra in a testing way perform better on complex mathematical problems than those who study by worked examples on the midterm and the final test (in the medium term)?

We investigated the medium-term effect of retrieval practice versus worked examples in learning abstract algebra in the first part of the study. Students took a midterm and a final test in the 6th and the 13th week of the semester. N = 76 students took both tests, N = 39 from the testing group and N = 37 from the worked example group. To assess relationships between test scores, all scores were calculated as percentages. We used independent-samples t-tests to compare the results of the midterm and the final test between the testing and the worked example groups. The normality of the distribution of test scores was checked on quantile-quantile (QQ) plots for the results of both groups on both tests. The assumption of equal variances was checked with F-tests for the results of the midterm (F(1,38) = 0.67, p = 0.21), and the final test (F = (1,38), p = 0.057). There was no significant difference between the midterm (t(74) = 1.19, p = 0.21) and final test (t(74) = 1.25, p = 0.24) scores of the two groups. That is, the scores of the testing group and the worked example group did not differ significantly on either the first or the second test.
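To make the comparison above concrete, the following is a minimal sketch of an equal-variance (pooled) independent-samples t statistic. The study's analysis was run in R and the raw scores are not reproduced in this article, so the function name `pooled_t_statistic` and the sample values are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, df) for an equal-variance independent-samples t-test.

    Hypothetical helper for illustration; not the authors' R code.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2  # e.g. 39 + 37 - 2 = 74, as in t(74)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / std_error
    return t, df


# Hypothetical percentage scores, NOT data from the study
testing_scores = [55.0, 62.5, 70.0, 47.5, 80.0]
worked_example_scores = [50.0, 60.0, 65.0, 45.0, 72.5]
t_value, degrees_of_freedom = pooled_t_statistic(testing_scores,
                                                 worked_example_scores)
print(t_value, degrees_of_freedom)
```

With group sizes 39 and 37, the degrees of freedom computed this way are 74, which agrees with the reported t(74).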

RQ2: Do students who study abstract algebra in a testing way perform better on complex mathematical problems than those who study by worked examples on the post-test five months after the final test (in the long term)?

In order to answer RQ2, in the second part of the study, we explored the long-term effect of retrieval practice in learning abstract algebra by a post-test. For this analysis, we chose to examine the scores of all four problems (two from the midterm and two from the final) that were asked again in the post-test. To obtain pre-test scores, we summed up the results of the four repeated tasks from the midterm and final. Participants of the study were those 33 pre-service teachers who completed all three tests: N = 13 students from the testing group and N = 20 from the worked example group. Test results were converted to percentages. Both pre- and post-test scores showed normal distribution in both groups on QQ-plots, so we used Pearson correlation to assess the relationship between pre- and post-test scores. Pre- and post-test scores were moderately correlated (r(31) = 0.47, p = 0.006). We used one-way analysis of covariance (ANCOVA) to determine the effect of testing on post-test scores while controlling for pre-test scores. We validated our model and checked model residuals using the DHARMa package (Hartig, 2022). There was a significant difference between the testing and worked example group (F(1,30) = 8.9; p = 0.006). We performed a post hoc analysis using the emmeans package (Lenth, 2024) with pairwise comparisons applying Bonferroni adjustment. The testing group scored significantly higher (20.4 ± 3.3; adj. mean ± std. error) in the post-test than the worked example group (7.9 ± 2.6), adj. p = 0.0057 (see Fig. 1).
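For readers less familiar with ANCOVA, the analysis above can be sketched with the standard one-way model with a single covariate. The notation is an assumption introduced here for illustration, since the model equation is not stated in the text: \(y_{ij}\) is the post-test percentage of student \(j\) in group \(i\), \(x_{ij}\) is the corresponding pre-test percentage, \(\tau_{i}\) is the group effect, and \(\beta\) is a slope assumed to be common to both groups.

\[
y_{ij} = \mu + \tau_{i} + \beta\,(x_{ij}-\bar{x}) + \varepsilon_{ij},
\qquad
\bar{y}_{i}^{\,\mathrm{adj}} = \bar{y}_{i} - \hat{\beta}\,(\bar{x}_{i}-\bar{x})
\]

Under this model, the adjusted mean \(\bar{y}_{i}^{\,\mathrm{adj}}\) is the group mean corrected for differences in pre-test scores. With N = 33 students and three estimated parameters (intercept, one group contrast, one slope), the residual degrees of freedom are 33 − 3 = 30, which agrees with the reported F(1,30). Likewise, the degrees of freedom of the Pearson correlation are 33 − 2 = 31, as in r(31).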

Fig. 1 Post-test results (adjusted mean ± standard error) of students in the testing and the worked example group. The dots represent the adjusted mean values; the whiskers show the standard deviation (figure generated using R 4.2.3 software)

RQ3: To what extent do students forget abstract algebra in the long term—five months after the final test—studying it in a testing way versus studying by worked examples?

In order to answer RQ3, we investigated the post-test scores. We measured how well students remembered the topic of polynomials—which was a topic they covered in the ANT 2 course. Participants of the study were those 33 pre-service teachers who passed the course ANT 2 and took all the tests: the midterm, the final, and the post-test. Among the 33 students, N = 13 studied ANT 2 in a testing way (testing group), and there were N = 20 students from the worked example group.

On the post-test, students had to solve three problems related to polynomials. The three problems of the post-test resembled the ones that we asked in either the midterm or the final test. Each problem involved an initial trick or method that was typical in the theory of polynomials and necessary to solve the problem. In the first pair, there were two high-degree polynomials that students had to factor over \({\mathbb{Z}}_{2}\). The initial trick was that a sum can be squared termwise over \({\mathbb{Z}}_{2}\), or one could observe that 1 is a root of the polynomial. The second problem was Problem 2 in both cases; the task was to factor the polynomial \(p\left(x\right)={x}^{8}+{x}^{4}+1\) into a product of irreducible factors over \({\mathbb{R}}\). There are several ways to start the solution. The third problem was to factor a polynomial with integer coefficients. The first step in the solution was to realize that we had to apply the rational root test.
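To illustrate what such an initial trick looks like, two short derivations follow. They are sketches of one possible route and are not necessarily the solutions presented in the course. Over \({\mathbb{Z}}_{2}\), the mixed term of a square vanishes because \(2ab=0\):

\[
(a+b)^{2} = a^{2}+2ab+b^{2} = a^{2}+b^{2}.
\]

For the second problem, completing the square three times gives a factorization over \({\mathbb{R}}\):

\[
\begin{aligned}
x^{8}+x^{4}+1 &= (x^{4}+1)^{2}-x^{4} = (x^{4}+x^{2}+1)(x^{4}-x^{2}+1),\\
x^{4}+x^{2}+1 &= (x^{2}+1)^{2}-x^{2} = (x^{2}+x+1)(x^{2}-x+1),\\
x^{4}-x^{2}+1 &= (x^{2}+1)^{2}-3x^{2} = (x^{2}+\sqrt{3}\,x+1)(x^{2}-\sqrt{3}\,x+1).
\end{aligned}
\]

Each of the four quadratic factors has a negative discriminant (\(1-4=-3\) or \(3-4=-1\)), so none has a real root and all are irreducible over \({\mathbb{R}}\).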

Those who managed to find the key idea of a problem got at least 1 point. Those who were not able to remember the key idea of a problem got 0 points. We define “forgetting” as achieving 0 points altogether on the post-test. In Table 2, we present the “forgetting rate” in the two groups. By forgetting rate, we mean the number of students who got 0 points on the post-test divided by the number of students in the group. In the testing group (N = 13), 23.08% of the students forgot the material. In the worked example group (N = 20), 55.00% of the students got 0 points on the three problems of the post-test.
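The forgetting rate defined above is a simple proportion, sketched below. The head counts of 3 and 11 students are not stated in the text; they are inferred here from the reported percentages and group sizes (3/13 ≈ 23.08% and 11/20 = 55.00%), and the function name `forgetting_rate` is a hypothetical helper.

```python
def forgetting_rate(n_zero_scores, n_students):
    """Share of students who scored 0 points on the post-test."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return n_zero_scores / n_students


# Counts inferred from the reported percentages (an assumption, see above)
testing_rate = forgetting_rate(3, 13)
worked_example_rate = forgetting_rate(11, 20)

print(f"{testing_rate:.2%}")         # 23.08%
print(f"{worked_example_rate:.2%}")  # 55.00%
# Ratio of the two rates: the worked example group's rate is more than double
print(round(worked_example_rate / testing_rate, 2))  # 2.38
```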

Table 2 Forgetting rate of the testing and the worked example group according to the post-test

Summary, discussion, and outlook

Summary and contribution to the field

The retention of foundational knowledge is crucial in learning and teaching mathematics. Retrieval practice, the strategic use of retrieval to enhance memory, and worked examples, showing students a step-by-step demonstration of how to solve a problem, are both recommended as effective methods for improving learning (Adeniji & Baker, 2023; Dunlosky et al., 2013). In this article, we compared the effectiveness of a particular way of using worked examples and a particular type of testing in an abstract algebra course, concentrating on medium-term (within-semester) and long-term (across-semester) retention. We conducted an experiment in an authentic mathematics education context involving pre-service mathematics teachers in an “Algebra and Number Theory 2” course. We found no significant difference in problem-solving between the two groups in the medium term: on the midterm and the final, the scores of the testing group did not differ significantly from those of the worked example group. On the other hand, we found that testing was more beneficial in the long term than studying with worked examples. Controlling for pre-test scores, the post-test scores of the students who studied ANT 2 with testing were significantly higher than those of the other group. At the same time, the forgetting rate was more than twice as high in the worked example group as in the testing group.

The method investigated in this study (at the end of each lesson, students take a short test on the material of the given day without any aid or external help) does not need financial resources, takes only a small amount of time, and is simple; thus, it is easy to implement in classrooms. Quizzes and tests are already commonly used assessment tools in tertiary mathematics education, and they can have a positive effect on students’ achievement. However, the impact of tests on students’ learning also depends on how educators (and students) use them (Feudel & Unger, 2022). Our results show that taking short tests that require students to retrieve the material studied on the given day can have a powerful effect in the long term, even when learning abstract mathematics. A further advantage of the method is that it does not require teachers to reorganize their lessons to a large extent: they only have to spend the last few minutes of the lesson on the test, and no special equipment is needed. It is important that educators do not consider the time spent on the tests as “lost time.” Instead, they should see a 5–10 minute test at the end of the lesson as a method that enhances long-term retention and substantially reduces the forgetting rate.

In Hungary—where this research was conducted—students’ attitudes to learning are often such that if an assignment is not compulsory, or if there is no disadvantage in not completing it, students tend not to take it seriously. That is why in this experiment, students from the retrieval practice group had to reach at least 50 percent of the maximum points on the end-of-class tests by the end of the semester—it was a requirement in order to pass the course. This way, we made sure that students took these tests seriously. Still, we wanted to minimize the stress factor of the end-of-class tests: 50 percent was not that hard to achieve, and with a little effort, students could easily reach the minimum level to complete this criterion.

Limitations of the study and outlook for further research

In this study, we investigated the medium- and long-term effects of a certain type of test-enhanced learning in an abstract algebra course. Although this method seems to be an effective way to retain abstract mathematics in the long term, further research is needed in this area before far-reaching conclusions can be drawn. We believe it would be important to test the method in different educational environments: with different students, at different universities, and in other abstract mathematics courses as well.

In this experiment, the teacher corrected all the end-of-class tests. It would be practical to know how we can minimize teachers’ tasks while maintaining the effectiveness of the method. For instance, is it important to give feedback to students about their performance? If the answer is yes, then what is the best way to give feedback? From whom should the feedback come, and how? Can it be given by a student, a group of students, or a machine? The question may also arise as to whether the same efficiency can be achieved if students are not tested in class, under controlled circumstances, but are allowed to take the test at home. It would also be useful to know whether the outcome depends on the scoring system we use (or on whether we use one at all). Achieving 50 percent was not too difficult; still, the requirement may have had an impact on students’ stress levels, which might have affected their performance.

Furthermore, the form of the test should be considered as well. Several forms of testing can be applied, for instance, multiple-choice questions, short-answer questions, or true-or-false questions. The question is: are they as effective as open questions? We believe that true-or-false or multiple-choice questions do not provide the same results: we can imagine that if a question needs more complex argumentation, but the final answer requires only a selection process (circling, matching, crossing, underlining), the student may simplify and not think deeply about the question.

Another limitation of our study concerns the decreased sample size for the post-test. We sought to mitigate this issue by using ANCOVA instead of a t-test.

Last but not least, we cannot say that the beneficial effect of the retrieval practice method on long-term retention is entirely due to the retrieval itself. Retrieval practice has several indirect effects, as shown by McDaniel et al. (2015). For example, the end-of-class tests can give learners a fairly clear picture of what they can retrieve and apply in mathematical problem-solving, and what concepts, procedures, and skills they still need to acquire. Another indirect effect can be that test-taking in itself helps students develop test-taking skills, which may improve performance on high-stakes exams (Adesope et al., 2017). One might also assume that there is a latent factor, namely that the method exerts its effect through the active engagement of students rather than via the retrieval itself. We have no valid information on whether students’ attention was in fact higher.

The abovementioned limitations open up several pathways for future research. We believe it would be important to investigate how we can minimize teachers’ tasks while maintaining the effectiveness of the testing method. Future research could explore whether giving feedback to students should be a key element of the method, and whether it has to come from the teacher, from a student, or from a machine. Changing the form of the end-of-class test from open questions to multiple-choice, short-answer, or true-or-false questions could also reduce teachers’ work. However, it remains to be examined whether the same long-term performance on complex learning material can be maintained with these types of questions.

Another possible direction of future research is the separation of direct and indirect effects of testing in a similar context, for example, the “attention effect” and the “activation effect.”

Also, it might be fruitful to explore more deeply whether there is a relationship between the level of abstraction of the material and the effectiveness of the method. In previous experiments, the authors examined the effect of end-of-class tests in learning high-school mathematics and in studying Number Theory problems at the university level as well (Szabó et al., 2023; Szeibert et al., 2022). The authors’ experimental results so far support the positive effects of retrieval practice in abstract environments and suggest that the more abstract the curriculum is, the later the testing effect becomes apparent. Finally, it would be interesting to explore to what extent the use of spaced or cumulative retrieval practice has a positive impact on students’ long-term retention in mathematics and on solving complex abstract problems.