Introduction

The testing effect is the phenomenon that retrieving information from short- or long-term memory can—under many circumstances—strengthen one’s memory of the retrieved information (Donoghue & Hattie, 2021; Roediger & Butler, 2011; Rowland, 2014). Benefits of the retrieval effect (testing effect) compared to re-reading have been investigated several times in the past decade (Rowland, 2014; Adesope et al., 2017). Test-enhanced learning has been proved efficient on diverse learning materials including foreign word learning (e.g., Keresztes et al., 2014), text memorizing (Roediger & Karpicke, 2006), and skill learning (Kromann et al., 2009). Research on test-enhanced learning has been conducted in laboratory circumstances (Butler et al. 2007), in simulated school environments and in real school situations (e.g., McDaniel et al., 2013). However, the application of the testing effect in real mathematics educational environments needs to be examined from several aspects (Buchin & Mulligan, 2019; Dunlosky et al., 2013; Lyle et al., 2020; Peterson & Wissman, 2018). For this reason, our aim is to gain a better understanding of how to incorporate the testing effect in classroom settings and to lay the groundwork for further experiments. Therefore, our research is an exploratory case study, where we test a method based on test-enhanced learning in mathematics lessons in a high-needs vocational school.

Mathematics and long-term retrieval

Calderón and Caterino (2016) found a significant connection between long-term retrieval skills and mathematics achievement among middle and high school students. Based on their findings, they claim that research should be carried out in secondary schools on the effectiveness of educational interventions to improve long-term retrieval skills in general and on arithmetic facts and problem-solving procedures, in particular. Also, more empirical studies are required on interventions that include the development of cognitive skills (Kearns & Fuchs, 2013).

The question arises whether continuous retrieval helps with constructing long-term knowledge in mathematics. If the answer is yes, in what form can the method be effectively included in in-class teaching, and how can it be brought into the classroom? When we translate test-enhanced learning into the classroom, we need to consider the following: retrieval must take place within 24 h, there should be no copying or cheating, it should not take a lot of time, and students should be involved in it. The form of the testing and the type of questions have to be considered as well. There are many possible ways of implementing test-enhanced learning in high school; we would like to show that a specific kind of formative assessment can be a fertile area for it.

Literature Review

Over the past decade, there has been a growing interest in the effects of retrieval practice on learning, and what the best way is to apply the benefits of retrieval (or testing) in complex tasks, materials, and assessments found in educational settings.

A large-sample investigation by Avvisati and Borgonovi (2020) examined the relationship between the testing effect and problem-solving in mathematics on authentic educational material. They demonstrated that the number of mathematical problems in the first test had a small positive effect on the average mathematics performance on the second test among fifteen-year-old students. However, their setting was not a real educational environment in the sense that they measured the effect of a single test practice. We cannot rely directly on their result and on the above-mentioned literature concerning the efficiency of the testing effect in teaching mathematical problem-solving within real classroom settings, because the existence of the testing effect is not universally evident. Below, we describe some reasons why this particular aspect needs further research.

According to Karpicke et al. (2015), practicing retrieval is highly effective for being able to retrieve, use, and apply knowledge in the long term. However, some researchers have suggested that retrieval practice is not beneficial for “complex” materials (e.g., van Gog & Sweller, 2015). They defined “complex material” as “high in element interactivity, containing various information elements that are related and must therefore be processed simultaneously in working memory” (van Gog & Sweller, 2015, p. 248). Karpicke et al. (2015) criticized this position for lacking an objective measure of complexity and pointed to evidence demonstrating a positive testing effect with complex materials. They also emphasized the benefits of retrieval practice for learning complex materials by citing a wealth of recent research on the topic, such as the work of McDaniel et al. (2009), Chan et al. (Chan, 2010; Chan et al., 2006), and Butler (2010).

Concerning mathematics education, Yeo and Fazio (2019) examined the efficacy of retrieval practice versus studying worked examples. The optimal learning strategy depended on the retention interval and the nature of the materials. They found that when students’ goal was to remember the text of a worked example, then after a 1-week delay, repeated testing was more effective than repeated studying. Meanwhile, when learning a novel math procedure and measuring performance immediately, repeated studying was more effective than repeated testing, regardless of the nature of the materials. For long-term retention (on a 1-week delayed test), repeated testing was more effective than repeated studying with identical learning problems. With nonidentical learning problems, repeated testing was as effective as repeated studying.

Another aspect that needs further investigation is the efficiency of the testing effect in case of tasks requiring deductions. In the experiment of Tran et al. (2015) and its replication by Wissman et al. (2018), participants had to make deductions based on the sentences of a scenario as premises. The results of the tested group were not better than those of the re-reading group. Wissman et al. (2018) explained the results with the fact that the tested group had only 2 repetitions, while re-reading happened 10 times (as in Tran et al., 2015). Another possible explanation can be that test-enhanced learning can be efficient only if the retrieval is successful during the test (Karpicke & Roediger, 2007), and in Wissman et al., retrieval during the test was only 50% successful. In a second experiment, both the testing and the re-reading happened four times during the learning phase and the final test was a delayed one. In this case, the testing effect was detected. Similar results were obtained by Eglington and Kang (2018).

Each time a student is tested on a material, they get feedback on their performance (either formally from the teacher or only by their own personal impression), which can affect the students’ self-efficacy. By self-efficacy, we mean a student’s belief that they are capable of successfully performing a task—as defined by Corkett et al. (2011)—which is also in line with the definition of Fast et al. (2010). Self-efficacy must be taken into consideration when designing an experiment involving students’ achievement, because it is a strong predictor of academic achievement, course selection, and career decisions across domains and age levels (Britner & Pajares, 2006).

Several studies suggest that using educationally relevant retrieval activities and assessments improves the acquisition of the learning material (e.g., Agarwal et al., 2012; Butler et al., 2014; Dobson & Linderholm, 2015; Jensen et al., 2014; Lyle & Crawford, 2011; McDaniel et al., 2013; McDermott et al., 2014; Roediger et al., 2011). Although there have been only a few investigations about the testing effect on mathematical problem-solving in a real school environment, recent studies suggest that increasing retrieval practice may be an effective way of learning (Fazio, 2019; Lyle & Crawford, 2011; Lyle et al., 2016, 2020).

Using test-enhanced learning as formative assessment can be a possible way of implementing retrieval practice in high school. In this study, we use the term formative assessment according to the definition of Suurtamm et al. (2016, p. 14), i.e., for every type of activity that teachers use to get information about the current state of students’ knowledge, to provide feedback to students about their own learning, and to plan future instruction, including informal assessments that teachers might do as a part of daily instruction as well as more formal classroom assessments.

Recently, there has been a growing interest in investigating different kinds of assessment strategies that teachers can use regularly to clarify student thinking, such as the use of practical worksheets (Toh et al., 2011), the work with rubrics (Elrod & Strayer, 2015; Smit & Birri, 2014), and the use of a learning progression-oriented assessment system (Kim & Lehrer, 2015).

There are teachers who consider learning as loading information into memory, and in their point of view, the role of retrieval is only to examine the efficacy of this input. This perspective affects the practice as well: they do not use tests to potentiate learning by evoking the retrieval effect through formative assessments; retrieval is required only at the end of (or after) the learning process, via writing a test or having an oral examination (Martínez-Sierra et al., 2020).

Methodology

Context of study

In Hungary—similarly to the USA—there is a mathematics achievement gap. Higher-income students consistently outperform lower-income students; also there is a huge gap concerning the performance between high schools: students from urban grammar schools consistently outperform students from urban vocational schools (Auguste & Miller, 2009; Bailey & Dynarski, 2011). It is important to address this gap for social as well as economic reasons (Auguste & Miller, 2009).

We aim to narrow the above-mentioned gap by using the testing effect as a special way of formative assessment. The method is the following: at the end of each mathematics lesson, students write a test on the material learned that day. These tests contain two problems: one theoretical and one practical problem. Tests are evaluated (0–2 points), students receive feedback on their tests and results, and the results affect the end-of-term grades. Using this method, the material learned that day is retrieved; also, the teacher can see how students understood the given material and he/she can use this information when planning the next lesson.

Participants

The research was conducted in two high schools in Budapest, the capital city of Hungary. Participants of the experiment were ninth-grade students from a vocational school and from an elite grammar school. Besides examining their previous results, we asked their teachers to give us a description of the students’ (social) background and motivation. The teacher at the elite grammar school informed us that their students come from well-organized and supportive families, are highly motivated to study, usually take part in a lot of extracurricular activities, and that the admission rate to university is nearly 100%. According to their teacher, the situation in the vocational school is the opposite: students face many serious family difficulties (divorce of parents, having to take care of their younger siblings, poverty), which can also be a reason for their lack of motivation and the very low rate of applying for tertiary studies. In this vocational school, mathematics lessons are held in groups, not for classes. Ninth graders are assigned to math groups based on their scores on the entrance examinations. Our experimental group was from the vocational school (N = 9); this group was the weakest group in mathematics, with the lowest entrance examination scores.

There were two control groups in the experiment. One control group was formed by all the other groups from ninth grade of the vocational school (N = 23; control-vocational), the other (N = 34; control-grammar) consisted of students from the elite school, i.e., all the members of two classes of 9th grade.

Preliminary data collection

First, teachers were chosen for the experiment considering the following criteria: we were looking for teachers who teach in different types of high schools but in the same grade. Then, we assigned the experimental and the control groups. To measure students’ achievement and development, we used both qualitative and quantitative methods of data collection and analysis.

Part 1: Teacher interviews

There have been several pieces of research on teachers’ beliefs in Mathematics Education focused on the mathematical beliefs (beliefs about mathematics, mathematics teaching, and mathematics learning) of mathematics teachers (Beswick, 2007; Cross, 2009; Liljedahl, 2009; Maasz & Schlöglmann, 2009; Philipp, 2007; Žalská, 2012). These studies showed that beliefs about the nature and function of mathematics play a central role in teachers’ conceptions of teaching and learning mathematics. Thus, these beliefs have a high impact on teachers’ pedagogical practice. Considering this phenomenon, we were looking for teachers for the experiment who think of mathematics, learning mathematics, and teaching mathematics in a similar way. Further criteria were that they teach in the same grade and, last but not least, that they be willing to participate in the experiment. The teacher from the vocational school was determined at the outset by availability. We were looking for another teacher from an elite school for the experiment. We followed the concept of Martínez-Sierra et al. (2020) to find a teacher from an elite high school who has the same beliefs on mathematics, learning mathematics, and teaching mathematics. We conducted semi-structured interviews, which consisted of open questions, to collect data. The authors of this paper conducted the interviews in Zoom meetings, which were between 20 and 30 min long. All interviews were recorded. Besides collecting some personal and professional data, three questions focused on the beliefs in mathematics, its teaching, and learning: (1) What does mathematics mean to you? (2) What does learning mathematics mean to you? And (3) what does teaching mathematics mean to you? (See Table 1.)

Table 1 Participants’ mathematical beliefs (part 1 of 2)

Tables 1 and 2 show the mathematical beliefs of the eight teachers who were interviewed. In Table 1, the types of beliefs and their codes used in Table 2 can be seen. In Table 2, there is an X mark in the jth column of the ith row, if the belief marked by Bj was identified as a strong belief during the interview with teacher number i. For example, T1 and T8 both have an X in the column B5, which demonstrates that they both think “Learning mathematics is learning to reason and how to solve problems and take decisions” (belief B5).
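The X-mark layout described for Table 2 can be read as a teacher-by-belief incidence structure. The following is a minimal sketch of that reading, not the authors' coding instrument: the names `beliefs`, `has_mark`, and `shared_beliefs` are hypothetical, and only the two marks explicitly stated in the text (T1 and T8 both holding B5) are recorded.

```python
# Sketch of the Table 2 layout: each teacher maps to the set of belief codes
# identified as strong beliefs during the interview.
# Only the marks stated in the text are recorded; nothing else is assumed.
beliefs = {
    "T1": {"B5"},
    "T8": {"B5"},
}


def has_mark(teacher: str, belief: str) -> bool:
    """True if an X is recorded here for the given teacher and belief code."""
    return belief in beliefs.get(teacher, set())


def shared_beliefs(teacher_a: str, teacher_b: str) -> set:
    """Belief codes recorded as strong for both teachers."""
    return beliefs.get(teacher_a, set()) & beliefs.get(teacher_b, set())


print(shared_beliefs("T1", "T8"))  # {'B5'}
```

Under this representation, selecting two teachers with similar beliefs amounts to comparing their sets of marked codes.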

Table 2 Participants’ mathematical beliefs (part 2 of 2)

In summary, the two teachers chosen for the experiment were teachers from different types of high schools, teaching in the same grade. Also, as demonstrated during the interviews, they were teachers who think of mathematics, learning mathematics, and teaching mathematics in a similar way.

Part 2: Experimental and control groups

After choosing the teachers—one from an elite high school and one from a socially handicapped vocational school—the experimental and control groups were selected for the research. The experimental and control groups both consisted of 9th graders, who learned the geometry material determined by the National Core Curriculum. We compared the core curriculum for vocational schools and grammar schools and found that for this part of geometry the learning material is identical for the two types of schools. Hence, the foundations of the acquirable knowledge were the same; students had to learn the same concepts.

In the vocational school, the allotted time for this topic was 4 weeks, 3 lessons per week, altogether 11 lessons. In the grammar school, they had 6 weeks, 4 lessons per week, altogether 24 lessons for the same topic. In each group (controls and experimental), the structure of the lessons followed the well-known model: checking the homework, learning something new, practicing the new material through tasks and problems. Furthermore, both teachers consulted constantly with us and with each other to make sure that the experiment ran smoothly and that students became familiar with the same concepts, types of problems, and tasks. As mentioned, the structure of the lessons was the same except for one thing: in the experimental group, students wrote a test at the end of each lesson on the material learned that day. These tests contained two problems, mostly one theoretical and one practice problem to be solved; in some cases, the problems were easy calculations. The control groups were taught in a “reported” way, which means that the solutions of the exact same problems were presented by the teacher at the end of the lessons. In the grammar school, there were 13 extra lessons. In these lessons, students practiced the material; no new concepts were taught. At the end of the topic, each group wrote the same final test, which contained a sequence of theoretical questions and four problems to be solved (see Appendix). This final test examines long-term retention as well.

Research Questions and hypothesis

The previous phenomena and theoretical considerations lead us to transform the main goal of this research—investigate whether using test-enhanced learning in high school mathematics as a special kind of formative assessment can be an effective way of learning—into the following research questions:

Research Question 1

Do students from urban high-needs vocational schools perform better and achieve a higher level of geometrical understanding if they are taught with test-enhanced learning as formative assessment (described below) instead of the reported way (described above)?

Research Question 2

If students of an urban high-needs vocational school use test-enhanced learning as formative assessment, are they able to achieve a mean score as high as the students of an elite grammar school in the final test?

Our hypothesis, inferred from previous theoretical considerations, is that one way of addressing this mathematics achievement gap is to feed test-enhanced learning into mathematics lessons as a sort of formative assessment.

Data Collection on Students’ Achievements

Prior knowledge

The data collection was carried out from early March to early June 2016. Participants of the experiment were students from the ninth grade of a vocational school (the experimental group and the control-vocational group) and of an elite grammar school (the control-grammar group) in Budapest, the capital of Hungary. Altogether there are 1401 high schools in Hungary, 260 of them located in Budapest. The chosen grammar school is the fourteenth-best high school in Hungary and the fifth-best in Budapest, while the vocational school takes 506th place on the national ranking list (based on the results of the maturity examinations and the competency tests) and 125th place in Budapest. Moreover, regarding the score of the entrance examination, the experimental group was the weakest of the three groups (one experimental and two control groups); they also had the lowest scores among all classes in the vocational school.

The scores of the entrance examination are strong indicators of students’ geometry knowledge, since three of the ten tasks on the entrance examination in that year were geometry tasks and one more task required some visual skills. Students of the control-grammar group must have reached a nearly maximal score to be admitted to the school, meaning that they scored near the maximum in geometry, too. Control-vocational and experimental group students reached low scores, which caps their geometry score at a low value. This demonstrates a gap in the knowledge they should have acquired in grades 1–8 according to the national frame curriculum concerning the following topics and skills: Grade 1–4: “The creation, recognition, and characteristics of triangles, squares, rectangles, polygons, and circles.” Grade 5–8: “Triangles and their categories. Quadrilaterals, special quadrilateral (trapezoids, parallelograms, kites, rhombuses). Polygons, regular polygons. The circle and its parts. Sets of points that meet given criteria.”

Theoretically, each student learned the names and properties of these shapes, the names and properties of their parts, and the hierarchy among them. Practically, the great difference in their entrance examination scores shows that it is true only for the control-grammar group, much less true for the control-vocational group, and even less true for the experimental group (being the weakest one according to their scores).

End-of-class tests

All tests that the experimental group wrote at the end of the 11 lessons were registered. The tests were evaluated (0–2 points), and students received feedback on their tests. The results of the tests affected the end-of-term grades. In the control groups, the solutions to the exact same problems were presented by the teacher at the end of the lessons. To give the reader an impression of these end-of-class tests, we present five tests out of the eleven (see Table 3). The solution of these tasks required the same knowledge and ideas as the tasks of the final test, but they were not identical to them.

Table 3 End-of-class tests: some examples

Quantitative Data Analysis and Results

All members of the experimental group performed well on these end-of-class tests; their final results were 78%–96%. It means that retrieval attempts were successful.

At the end of the learning/teaching of a topic, each group wrote a final test that contained a sequence of theoretical questions (Task 1) and four problems (Tasks 2–5) to solve (see Appendix). First, we compared the total scores of the final test and then the scores task by task for each group, using one-way analysis of variance (ANOVA) for independent samples with Sidak pairwise comparisons. We detected a significant difference in the total scores of the final test, F(2, 63) = 11.03, p < 0.01, \(\eta_p^2\) = 0.259.

Scores of the experimental group do not differ significantly from the scores of the elite grammar school, and both groups performed much better than the control-vocational group. For the theoretical tasks, the difference is also significant, F(2, 63) = 23.69, p < 0.01, \(\eta_p^2\) = 0.429. Results of the experimental group and the control-grammar group are statistically indistinguishable, and the control-vocational group scored significantly lower than both.
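As a consistency check on the reported statistics, partial eta squared in a one-way design can be recovered from the F value and its degrees of freedom as F·df1 / (F·df1 + df2). The sketch below applies this identity to the two reported F values; the function names are hypothetical. The Sidak-adjusted level is shown for three pairwise comparisons at a family-wise level of 0.05, which is an assumption, since the text does not state the level used.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def sidak_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under the Sidak correction."""
    return 1 - (1 - alpha) ** (1 / n_comparisons)


# F values and degrees of freedom as reported in the text.
total_scores = partial_eta_squared(11.03, 2, 63)
theory_task = partial_eta_squared(23.69, 2, 63)
print(round(total_scores, 3), round(theory_task, 3))  # 0.259 0.429

# Assumed family-wise level of 0.05 over the three pairwise group comparisons.
print(sidak_alpha(0.05, 3))
```

Both recovered effect sizes agree with the reported values to three decimals, and the degrees of freedom (2, 63) match three groups with 9 + 23 + 34 = 66 students.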

Qualitative Data Analysis and Result of the Final Test

According to Dey (1993), there are three main levels of analysis in qualitative data analysis, namely description, classification, and association. During descriptive analysis, it is an important process to identify and explain the possible features of the individual, case, and events in the study. First, the solutions of children were described in detail; then, these were systematically categorized, and finally, the answers of the children from the experimental group were compared to the answers of the children from the other two groups.

In Table 4, it can be seen that the largest (although statistically non-significant) difference in the average score between the experimental group and the control-grammar group was in the first task. Therefore, in the following, we present an overview of the answers given by the students to this specific task and a more detailed analysis of this task.

Table 4 Quantitative Data Results
Table 5 Selected students from each group

Task 1 focuses on theory: students needed to remember and apply the definitions of symmetries, transformations, and polygons, and evaluate sentences containing symmetry problems. In symmetry problems, one must know and consider that different types of symmetries have different effects on the properties of shapes. This knowledge must be applied in two logical directions: first, one must be able to identify which symmetry preserves which property; second, thinking in the inverse direction, given the preservation of a property, one must be able to identify which symmetry could have been applied. For example, some important properties of reflection (about an axis) are that points of the axis do not move and that distances are preserved (equal segments are transformed into equal segments). In addition, reflection is bijective: each point has exactly one reflected image, which does not belong to any other point, so we can search for the “pre-image” of each image point. For example, in the first sentence, we have a given property, length, but without knowing which side of the triangle it refers to. In order to solve this task, one must see how each possible case can be obtained. Then, one needs to determine which points of the plane remain unchanged in the transformation (finding the axis) and which points’ images have to be examined.
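The three properties of axial reflection named above (points of the axis stay fixed, distances are preserved, and every image has a unique pre-image) can be checked numerically. The following is a minimal sketch; the function `reflect`, the chosen axis y = x, and the sample points are illustrative assumptions and are not taken from the final test.

```python
import math


def reflect(point, a, b):
    """Reflect `point` about the line through the distinct points a and b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    # Parameter of the foot of the perpendicular dropped from the point.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    fx, fy = ax + t * dx, ay + t * dy
    return (2 * fx - px, 2 * fy - py)


axis = ((0.0, 0.0), (1.0, 1.0))  # the line y = x, chosen for illustration
p, q = (3.0, 1.0), (0.0, 2.0)
p_img, q_img = reflect(p, *axis), reflect(q, *axis)

# Distances are preserved.
print(math.isclose(math.dist(p, q), math.dist(p_img, q_img)))
# Points of the axis do not move.
print(reflect((2.0, 2.0), *axis))
# Reflecting the image again returns the original point (the pre-image).
print(reflect(p_img, *axis))
```

The last check is what makes the inverse direction of reasoning possible: applying the same reflection to an image point recovers its pre-image.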

Below, we present six students’ solutions as examples (Table 6). These students were selected by taking the following factors into account: on the one hand, we wanted to choose students who passed the final test with an average score in their own groups; on the other hand, we tried to select test sheets in which all types of errors appeared. Table 7 shows the analysis (Dey, 1993) of all of the students’ solutions to the first task. Although there were no identical solutions, the generalization and the classification of the solutions were possible during the analysis. If an error/phenomenon occurred in the solutions of at least 30 percent of the students, we mark it as “common”; an error/phenomenon occurred “rarely” if it appears in less than 30 percent of the students’ solutions.
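The 30 percent rule for labeling an error type can be stated as a small function. This is a sketch only: the function name `frequency_label` is hypothetical, the counts in the example are invented for illustration, and computing the share within each group (here a group of 9, the size of the experimental group) is an assumption about how the rule was applied.

```python
def frequency_label(n_with_error: int, n_students: int, threshold_percent: int = 30) -> str:
    """Label an error type "common" if it appears in at least
    `threshold_percent` percent of the solutions, otherwise "rare"."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    # Integer comparison avoids rounding issues exactly at the threshold.
    if 100 * n_with_error >= threshold_percent * n_students:
        return "common"
    return "rare"


# Illustrative counts only, for a group of 9 students.
print(frequency_label(3, 9))  # common (3 of 9 is about 33 percent)
print(frequency_label(2, 9))  # rare (2 of 9 is about 22 percent)
```

With a group as small as nine students, the label changes between two and three affected solutions, so the category boundaries should be read with that granularity in mind.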

Table 6 Presentation of the selected students’ work in the theory part (first task)
Table 7 Description of the students’ solutions in task number one

Checking Tables 4, 6, and 7, it can be noticed that in a certain sense the qualitative results match the quantitative results. The solutions of the experimental group are better than the solutions given by the students from the control-vocational group. Namely, more types of errors occurred in the latter, and the frequency of errors was higher as well. (For example, we can see in Table 7 that the error types “Inversing the statement,” “Not knowing the definitions,” and “Other logical error” are rare in the experimental group and common in the control-vocational group.) Although the answers given by the students from the experimental group are similar to the answers from the control-grammar group, students from the control-grammar group gave more precise and complete solutions. Furthermore, the errors “Thinking in examples, not in general” and “Inversing the statement” appeared more frequently in the experimental group than in the control-grammar group, as shown in Table 7.

For the second and third (problem-solving) tasks, students needed to apply a mathematical rule, e.g., \(angle \left(radian\right)=angle \left(degree\right)\cdot \frac{\pi }{180}\). For the fourth task, they needed to be able to orient themselves within a coordinate system, to know what a position vector and a translation are, and to have at least one correct method of performing a translation in a coordinate system (e.g., “adding coordinates”: knowing which one should be added to which; “making steps as the vector indicates”: knowing which coordinate is on which axis). The experimental group reached the level of the control-grammar group in all three tasks. They knew the rules, and they were able to apply them. In contrast to the experimental group, errors of the types “do not know the rules” or “unable to apply the rules” were typical among the students from the control-vocational group.
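The two rules needed for Tasks 2–4 can be illustrated briefly: converting between degrees and radians (radians = degrees · π/180, and conversely degrees = radians · 180/π) and translating a point by “adding coordinates”. The sketch below uses hypothetical function names and sample values that are not taken from the final test.

```python
import math


def degrees_to_radians(degrees: float) -> float:
    """radians = degrees * pi / 180"""
    return degrees * math.pi / 180


def radians_to_degrees(radians: float) -> float:
    """degrees = radians * 180 / pi"""
    return radians * 180 / math.pi


def translate(point, vector):
    """Translate a point by a vector: x is added to x, y is added to y."""
    return (point[0] + vector[0], point[1] + vector[1])


print(degrees_to_radians(180))          # equals pi
print(radians_to_degrees(math.pi / 2))  # 90.0
print(translate((1, 2), (3, -1)))       # (4, 1)
```

The translation example also reflects the “making steps” method: three steps along the x-axis and one step down along the y-axis lead from (1, 2) to (4, 1).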

The second largest gap between the experimental group and the control-grammar group was in the fifth task. The solution of the fifth task is complex in the sense that it requires the joint application of knowledge on triangles and circles. Students must see clearly that the same segment has different roles at once (it is the hypotenuse in the right triangle and a chord in the circle). The same types of errors (calculation errors, errors in rearranging the equations, thinking the perimeter is equal to the arc belonging to 90°) appear in the solutions of all three groups. The main difference between the experimental group and the control-grammar group was in the algebra part of the task. In the experimental group, all students knew how to begin the task; however, they made serious algebraic errors (such as \(\frac{a}{b}\cdot c=\frac{a\cdot c}{b\cdot c}\)) and many calculation and rearranging errors; the task was too complex for many students. Meanwhile, the students from the elite school did not make serious algebraic errors; they made algebraic mistakes only a few times. In the control-vocational group, it was typical not to reach the calculation part of the task, and even if somebody reached this part, they usually made calculation and rearranging errors or errors with the units.
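To see why the rule \(\frac{a}{b}\cdot c=\frac{a\cdot c}{b\cdot c}\) is an error, a single counterexample suffices. The sketch below uses exact rational arithmetic; the values of `a`, `b`, and `c` are arbitrary illustrations and do not come from any student's sheet.

```python
from fractions import Fraction

a, b, c = 1, 2, 3  # arbitrary illustrative values

correct = Fraction(a, b) * c        # (a / b) * c = (a * c) / b
mistaken = Fraction(a * c, b * c)   # the erroneous form, which reduces to a / b

print(correct)   # 3/2
print(mistaken)  # 1/2
print(correct == mistaken)  # False
```

Multiplying both numerator and denominator by c leaves the fraction unchanged, so the erroneous form equals a/b; it coincides with the correct product only when c = 1 or a = 0.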

While analyzing the solutions of the students in these critical tasks, we generated clusters that help organize the phenomena found in the solutions—types of errors, methods of thinking, the extent to which the solution is correct, etc. These phenomena and their frequency in the different groups are presented in Tables 7 and 8.

Table 8 Classifications of the students’ solutions in task number five

Result of the Research Question 1

The results of the final test showed that the method presented in this study, using test-enhanced learning as a special kind of formative assessment, was successful in the topic of geometry in an urban high-needs vocational school. The experimental group performed outstandingly thanks to the method applied in the experiment. Those students who learned the material with the help of test-enhanced learning as formative assessment achieved significantly higher scores on each task and understood the geometrical definitions found in the test better, and they committed fewer errors in their argumentation and problem-solving strategies than those who were taught in a “reported” manner.

Result of the Research Question 2

According to the results of the final test, students of the urban high-needs vocational school were able to achieve a mean score as high as the students of an elite grammar school in the final test using test-enhanced learning as formative assessment. They were able to perform in the topic of geometry nearly as well as the students of an elite grammar school. Although the experimental group performed well on the final test, in the first and fifth task there was a difference between the experimental and the control-grammar groups. This difference can be attributed to the experimental group’s deficiency in algebra (miscalculations and errors in rearranging equations) and in formal logic. By means of the method, we were able to reduce the performance gap in long-term retention between students from the elite grammar school and students from the urban high-needs school: the long-term knowledge of vocational school students was comparable to that of grammar school students.

Finally, it is important to add that we monitored the experimental group’s performance in subjects other than mathematics, and in those subjects, there was no improvement in the test results during the experiment. This suggests that the good results of the experimental group are not based on simple training in taking tests, but on testing and thus reinforcing the Geometry knowledge itself.

Discussion

This research is an exploratory case study, where we tested a method—using test-enhanced learning as a special kind of formative assessment—in a high-needs vocational school. The purpose of the case study was to see whether the mathematics achievement gap between vocational schools and grammar schools can be reduced by using this specific method.

We found that high school mathematics can be a fertile area for retrieval practice (the strategic use of retrieval to promote learning) when test-enhanced learning is used as a special kind of formative assessment. We showed the efficiency of the method on secondary-school mathematics learning material in a real school environment: members of the experimental group outscored their schoolmates and reached statistically the same scores as the control-grammar group from an elite school. By means of the method, we were able to reduce the performance gap in a given geometry topic between students from the elite grammar school and students from the urban high-needs school.

One limitation of our study is that, for practical reasons, we could not balance all features of the groups, especially those of the experimental and the control-grammar groups. The main differences are the number of students in the group (N = 9 vs. N = 34), the time allotted for the experimental topic (11 lessons vs. 24 lessons), and the teachers who participated in the experiment. In order to reduce the latter effect, we chose teachers for the experiment who think of mathematics, learning mathematics and teaching mathematics in a similar way; however, it cannot be assumed that the teaching instructions of the two teachers were comparable. The learning process of the groups was tightly synchronized so that they became familiar with the same concepts, types of problems and tasks; however, the control-grammar group could do more tasks, among them some more difficult ones, because they had more lessons. These differences can be considered weaknesses from a purely scientific point of view, but we cannot omit the fact that the above-mentioned features of the groups are dictated by reality itself. If we had artificially modified the circumstances in order to get a more laboratory-like, sterile experiment, we would have alienated our research from the reality of these schools and from our aim of investigating the possible benefits of test-enhanced learning in real school circumstances, benefits that may help close the gap between students in vocational schools and those in grammar schools.

One might also say that this good result is not only a consequence of test-enhanced learning but also has a strong connection to beliefs and self-efficacy. Our experimental group had already heard from some of their teachers that they were the weakest group. However, in these lessons, they performed well, not only relatively (compared to their average performance), but also in absolute terms: on the end-of-lesson tests, their average result was 80%. Thus, the good result of the final topic test might have been caused by their positive belief that they could do it, in other words, by the improvement of their self-efficacy (Bandura, 2008). Moreover, we can say that test-enhanced learning helps improve not only the results of the students but also their self-efficacy. Because the testing effect is a memory process while self-efficacy is an attitude, we do not consider the above-mentioned concerns to be real limitations. We would add that our results and the circumstances show that test-enhanced learning not only helps students learn the given material but also helps them improve their general learning abilities. As we mentioned before, students in this school are unmotivated and do not do homework (based on the reports of their teacher); in brief, there is no learning process outside of the school. Hence, any result must be related to our experiment.

The good performance of the experimental group might also be related to the fact that there are several indirect effects of testing, as McDaniel et al. (2015) demonstrate. For instance, tests presumably provide students with a fairly accurate gauge of what they know and what they do not know, thereby potentially allowing more effective study allocation (what to study) in preparation for a final (summative) assessment, and tests may also potentiate learning on subsequent study. Based on Adesope et al. (2017), we suggest that the use of retrieval practice learning activities will help students develop test-taking skills that may improve performance on high-stakes tests.

This result may not be attributable to test-enhanced learning alone; still, we see the strength of this study in its practical implications. Although further investigation of the topic is needed to draw far-reaching conclusions, this method seems to be an effective way of learning mathematics that mathematics teachers can incorporate into their lessons to foster durable learning. Mathematics lessons already typically require students to retrieve key information multiple times and solve different kinds of problems. Using this method would not be a dramatic change in the structure of the lessons; however, it should help students retrieve and organize information.

In conclusion, our results also affirmed that retrieval practice enhances learning in “educationally relevant tasks that are closer to the ultimate goal of education” (Karpicke et al., 2015) in the field of teaching and learning high school geometry, where memorization is not enough, many different methods are in use, and there is a high need for analogical and deductive reasoning.

Future research should implement a similar experiment with a larger number of participants and examine this kind of test-enhanced learning method in different school environments, in order to explore how effective it is in different parts and at different levels of teaching and learning mathematics. We believe that well-applied testing can strengthen retention in mathematics as well, and that findings such as those of this study could help make mathematics education more effective.