It was unclear what difference to expect in terms of recall accuracy between groups and between sessions. We selected a basic science topic and 4th and 5th grade medical students, in order to maximize the odds of a low degree of prior knowledge. We chose the Golgi Complex because the majority of the curriculum does not build directly on this concept, and thus it was likely a forgotten topic. This was important because the lowest the a prior knowledge before our intervention, the smaller student sample would be required to discriminate significant differences in recall accuracy during the study sessions, thus rendering this study feasible.
Evolution of recall accuracy across sessions
There is an effect on recall accuracy reported by students along sessions. It was expected that the study-quiz group would out-perform the quiz group in terms of recall accuracy, at least on s1. Since the quiz task provides the learning materials as the correct answers to the OEQs and additional feedback at the end of the task, it has high learning value. Because we used a 4 point scale to grade recall accuracy, it was reasonable to consider the hypothesis that the quiz task provides enough learning value to master the content and thus expect both groups to report similar recall accuracy results.
The recall accuracy increase was stronger in session s1 for the study-quiz group. It was expected to see an increase in this session since the content was tailored to be fully covered within the 20 minute time limit. The strong gain indicates that this session was the one that accounted for the greatest increase in recall accuracy.
Findings by Karpicke et al. suggest that the testing effect plays an essential role in memory retention, and that after an initial contact with the learning material it is more beneficial to test rather than re-study the material . In addition, since using open-ended assessment questions as a means to learn improves knowledge retention [37,39,47], it was unclear how strong would that increase be in the quiz group. However that increase was only a modest one. That finding might be explained, at least in part, by minimization of the cueing effect - the ability to answer questions correctly because of the presence of certain questions elements [64,65] - through the usage of different questions for each information piece. OEQs are known to minimize cueing [65,66] and in addition, the different questions, although having the same content as answer, minimized that effect. This shows that pairing OEQs with LOs increases the value of the learning material.
In our study we found that recall accuracy increased more in the study-quiz than in the quiz group. If we assume that recall accuracy represents knowledge, then the most likely explanation for higher the increase in recall for the study-quiz group is the additional time-on-task. We were concerned that, because the metric is a subjective one, repeated contact with the content would cause the recall accuracy value to overshoot to nearly 100% after the first contact, regardless of prior knowledge or the time-on-task. However, recall accuracy evolved along sessions according to the underlying variables: recall accuracy at s0 was low because the student cohort did not have any formal contact with the Golgi over 2 years; the study-quiz group - with longer time-on-task - had higher results than the quiz group; recall accuracy improved along the sessions for both groups in part because of the effect of previous sessions.
Thus recall accuracy evolved in accordance to the factors influencing learning.
Adequacy of recall accuracy as a measurement of knowledge
The consistent differences in recall accuracy between groups give and indication that this measurement, although being of subjective nature, seems to be positively related with knowledge acquisition.
Karpicke et al. has shown that in a controlled setting, students cannot reliably predict how well they will perform on a test based on their JOL . Other studies conducted in ecological settings also have shown that the relationship of knowledge self-assessment with motivation and satisfaction are stronger than with cognitive learning [67-69]. Additional research found that in a blocked practice situation learners tend to be overconfident and JOLs are often unreliable .
Our study design differed from the classical designs for studying the effects of spaced repetition, knowledge retention and JOLs  because it was intended to describe recall accuracy evolution in a use-case similar to the real-world use of the system. Therefore, available evidence may not be completely applicable to this study. However, based on our results, we cannot completely refute the hypothesis that recall accuracy is independent of knowledge acquisition and dependent on affective factors. It is possible, though unlikely, that affective factors introduce a systematic error in recall accuracy grading. The colorful nature and intensity of such factors would most likely lead to a random error rather than systematic variation. This finds support in our results regarding recall accuracy variance components, since the flashcard component contributed substantially more than the participant component to the total variance. In addition, it is well known that higher time-on-task is one of the most important determinants of learning . Because recall accuracy was higher on the study-quiz group - with greater time-on-task - this is likely mainly explained by the learning effect.
Furthermore, other studies have measured JOLs differently than in this study. While other approaches typically measure JOL by requiring the subject to predict how well would they perform when tested in the future [29,40,70], our approach focuses on requiring subjects to compare their answer with the flashcard containing the correct information. Because our approach does not require a future projection and is additionally performed in the presence of both the recalled and correct answers, it is unlikely to vary independently of the learning effect.
Thus, we hypothesize that measuring recall accuracy immediately after the recall effort and in the presence of the correct answer may help students make sound JOLs. However further work is needed to compare recall accuracy with an objective measurement of knowledge, such as a MCQ test, in order to prove that hypothesis. Assuming a relationship between both variables is found, it would also be relevant to understand how different degrees of recall accuracy map to different degrees of knowledge.
Recall accuracy components of variance
Regarding the quiz group, the recall variance was mainly affected by the differences in flashcard and by the differences in participants. This indicates, firstly, that systematic differences in the flashcards were mainly responsible for the variation in recall scores, and secondly, to a smaller extent, differences between participants, possibly regarding affective and knowledge factors also played a role. The effect of the multiple sessions accounted little for the increase in recall accuracy over the sessions. The high G-coefficient for the flashcard variance component indicates the flashcards are very well characterized in terms of recall accuracy under these circumstances. Thus, factors intrinsic to the content, such as its size, complexity, or presentation, are very likely responsible for differences in recall accuracy between flashcards.
Assuming the recall accuracy is related to knowledge acquisition, systematic differences in recall accuracy between flashcards can indicate which materials are harder to learn and which materials are easy. Using this information to conduct revisions of the learning material may be useful to find content that would benefit from redesign, adaptation, or introductory information.
With respect to the study-quiz group, the contact with the content over multiple sessions was the main driver of recall accuracy improvement. Participant features had little effect in the increase recall accuracy over sessions and the flashcard features also accounted for less effect than in the quiz group. This suggests that the students in the study-quiz group increased their knowledge about the content and their prior knowledge had little effect in the learning process when using the study tools. This effect is most likely explained by the additional time-on-task of the study-quiz group. In addition, some of the effect may also be explained by findings in other studies that show that there is benefit in using repeated testing with study session in order to enhance learning [37,39,47].
Potential implications to educators
The way in which content can be organized to optimize learning has been extensively studied [26,52,54,72-74]. This study demonstrates how LOs can be of value for both study and self-assessment when combined with OEQs. The detailed insight on recall accuracy can be used by educators to classify LO difficulty and estimate the effort of a course. By providing a diagnostic test on the beginning a course in the form of the quiz task, educators can get a detailed snapshot of the material difficulty for the class. This data can be useful to evaluate educational interventions at a deeper level . Because the platform can be used by the students to guide learning on their own, educators can access real-time information of recall accuracy and use it to tailor the structure of the class to better meet the course goals. Furthermore, research has identified the delivery of tailored learning experiences as one of the aims that blended education approaches have yet fully reached .
In a hypothetical scenario where students repeatedly study and quiz, it is expected that the main component of recall accuracy variance is the session count. Deviation from such a pattern could suggest flaws in content design, excessive course difficulty or other inefficacies in teaching and learning methodologies. Sustained increases in recall accuracy mainly explained by the session would inform the educator of a continuous and successful commitment of the students. If educators take constructive action from such observations then a positive feedback cycle between student engagement and the success of the learning activity would be established. Because students know educators can take real-time action based on their progress, they engage more strongly in the learning activities. Stronger engagement will lead to better learning outcomes, that will lead to further tailored action by the teacher. Indeed, student engagement is the main driver of learning outcomes . Providing tools that can foster such engagement is key to achieve successful learning [77,78].
Potential implications to learners
Students need tools to help retain knowledge for longer periods and easily identify materials that are more difficult to learn . This goal may be achieved by providing learners with personal insight on their learning effectiveness, using personal and peer progress data based on self-assessment results .
The past recall accuracy can be used as an explicit cue to guide the learning process and help managing study time. Since JOL measurements are implicitly used by learner to guide the learning task [29,41], an explicit recall accuracy cue displayed for each flashcard in the form of a color code can improve the value of the JOL . The feedback that is thus formed between the quiz and the study task further promotes the spaced repetition of study and self assessment sessions and can improve student engagement, the main driver of successful learning. This is even more important at a time where students need to define tangible goals that allow them cope with course demands .
Each flashcard holds the recall accuracy for each student for each assessment. Increasing spaced repetitions of study and quiz increase the available recall accuracy data. Since notebooks can be constructed using any available flashcard, it is possible to create notebooks that include flashcards for which recall accuracy is already available. Therefore, advanced notebooks requiring background knowledge can include an introductory section composed of the most relevant flashcards about the background topics. This implies that without previous contact with the advanced notebooks, an estimate of how well the student recalls the background topics is already available. This increases the value of learning materials by fostering reutilization and distribution of LOs between different courses, educators and students [53-55,80] and promoting educator and student engagement .
Proposal for curricular integration
In recent years multiple educational interventions have described the benefits of implementing blended learning methodologies in medical education, namely in radiology , physiology , anatomy  and others [82,83]. However, the design of these interventions varies widely in configuration, instructional method and presentation . Cook asserted that little has been done regarding Friedman’s proposal  of comparing computer based approaches rather than comparing against traditional approaches .
The platform ALERT STUDENT intends to add value to the blended learning approach, through the collection of recall accuracy data, and prescription of a method that can be systematically applied in most areas of medical knowledge. Over this platform, interventions with different configuration, instructional method or presentation can be developed, and thus allow sound comparison between computer assisted interventions and comparison between different fields of medical knowledge. The platform does not intend, however, demote the usage of other tools, rather it intends to potentiate their usage. As an example, the platform could be used to deliver the learning materials and provide the study and quiz features, that would act in concert with MCQ progress tests during class. Educators could use information about recall accuracy and number of study and quiz repetitions to gain insight on the relationship between test results and student effort. That information would be relevant to help educators mentor students more effectively. Again, the information brought by recall accuracy could be helpful to tailor other instructional methods and thus drive student satisfaction and motivation.
Limitations and further work
This work has several limitations. Recall accuracy cannot be granted to correspond to knowledge retention. As previously mentioned, additional research is required to investigate the relationship between the two. In the light of our findings, it also becomes relevant to characterize recall accuracy in ecological scenarios and multiple areas of medical curriculum, under larger learning workloads.
We have indirectly characterized the effect of the study task on the recall accuracy. We expect however that an equivalent time on the quiz task alone would yield higher effects in recall accuracy, in consonance with the findings by Larsen et al. [36,37]. That is also a matter that justifies further investigation.
The system works around factual knowledge, therefore it is only useful in settings that require acquisition of such knowledge. Complex competences such as multi level reasoning and transfer cannot be translated in terms of recall accuracy. Ways in which the system could be empowered to measure such skills would constitute important improvements of the platform.