Flipped classrooms (FCs) are rapidly gaining popularity (Fredriksen 2020; Karabulut-Ilgu et al. 2018; Turan and Akdag-Cimen 2020). In FCs, the delivery of instructional content takes place before class and outside the classroom. Students often need to prepare by watching a video-recorded lecture online at home, so that classroom time can be spent on student-centered learning activities (Lage et al. 2000). Because generally positive effects for FCs have been found (e.g., Akçayır and Akçayır 2018; Bond 2020; Strelan et al. 2020), systematic research is being conducted on the main contributing factors.

The video-recorded lecture is one such key factor, and much of the research has focused on what makes it (more) effective (e.g., Lin and Chen 2019; Toftness et al. 2018a; Zhang et al. 2006). One design feature under study is quizzing, in which a video-recorded lecture is complemented with questions to stimulate more active or deeper lecture processing (e.g., Christiansen et al. 2017; Cummins et al. 2016; Kovacs 2016). The present study investigates the impact of an understudied variant of quizzing, namely, the inclusion of open-ended, embedded questions.

Research on adjunct questions with texts (e.g., Andre 1981; Ozgungor and Guthrie 2004; Uner and Roediger 2018) suggests that the inclusion of embedded questions is likely to enhance the effectiveness of recorded lectures. Such questions can stimulate students’ retrieval practice and can help them realize that they did not comprehend a message, or cannot remember key facts. This should prompt a mental review or replay of the video-recorded lecture until the content is understood or can be recalled. Empirical research on embedded questions in video-recorded lectures has already shown promising results (e.g., Cummins et al. 2016; Leisner et al. 2020; Rice et al. 2019), but important design variations and effects of such questions appear to have been understudied thus far (e.g., Brinton et al. 2016; Haagsman et al. 2020; Ketsman et al. 2018; Kovacs 2016).

One design variation that merits systematic research is question format. Many studies on embedded questions in live lectures involve Audience Response Systems that generally present multiple-choice questions (e.g., Buil et al. 2016; Crandall et al. 2019; Hunsu et al. 2016; Khan et al. 2019; Pan et al. 2019; Shapiro and Gordon 2012; Shapiro et al. 2017). While it might be tempting to also use such questions in video-recorded lectures, these may not be the most effective kinds of questions to ask. If the aim is to stimulate student recall of the learning material, open-ended questions are probably more suitable than multiple-choice questions, which rely more on recognition (Rawson and Dunlosky 2012).

As yet, though, there have been few empirical studies involving open-ended embedded questions in video-recorded lectures (Cummins et al. 2016; Szpunar et al. 2013, 2014; Thomas et al. 2018). Three of the four studies found did not support a firm conclusion that open-ended questions result in better learning than no questions because these questions were blended with other stimuli for active video processing. That is, in the study of Cummins et al. (2016) open-ended and multiple-choice questions were mixed and in the studies of Szpunar et al. (2013, 2014) there were practice items along with open-ended questions. To our knowledge, the only investigation comparing open-ended questions with a non-question condition is a recent study by Thomas and colleagues (Thomas et al. 2018). That study found a positive effect of these questions on learning. In the present study, the intervention consists of asking open-ended embedded questions within a video-recorded lecture, which is compared to the lecture with no questions included.

Another design issue that should probably be further investigated is feedback. Empirical research generally shows that the presence of feedback enhances learning (Fiorella and Mayer 2018; Shute 2008). However, feedback (in the form of identification and possibly explanation of the correct answer) can pre-empt the students’ active processing of the lecture that should be promoted by the presence of questions. That is, the adjunct questions research suggests that the presence of feedback induces students to invest less effort in responding to the quiz questions and hence reduces learning (e.g., Hamaker 1986; Roelle et al. 2017). In the present study, there is no feedback on the responses to quiz questions. This distinguishes the present study from the research by Thomas and colleagues (Thomas et al. 2018), where feedback was given.

As related above, the effort to improve processing of instructional content presented in video-recorded lectures by including embedded questions is connected to the long-standing tradition of providing adjunct questions with texts (Hamilton 1985) and ties in as well with more recent research on quizzing with lectures (Brink 2013; Shapiro et al. 2017). These lines of research are described next. First, there is a discussion on the main kinds of dependent variables that are investigated in the present study. Thereafter, a detailed account is given on why open-ended embedded questions may be particularly well-suited to enhance learning from video-recorded lectures.

Adjunct questions and quizzing

Adjunct questions are questions added to an instructional text to enhance what is learned from that text (Rothkopf 1970). Meta-analyses have reported robust effects of adjunct questions on learning (Anderson and Biddle 1975; Hamaker 1986; Hamilton 1985). Texts with adjunct questions generally yield higher test outcomes than texts without such questions. Quizzing is an emergent educational trend that is reminiscent of the adjunct questions method. Quizzing is an instructional approach in which questions are included in live or video-recorded lectures to increase their effectiveness. The literature has indicated that quizzing can have a positive effect on engagement (e.g., Cummins et al. 2016; Mayer et al. 2009), appreciation and motivation (e.g., Buil et al. 2016; Zhu 2008) and learning (e.g., McDaniel et al. 2013; Shapiro et al. 2017).

Research generally shows that quizzing promotes active student engagement (e.g., Draper and Brown 2004; Khanna 2015; Shapiro 2009; Trees and Jackson 2007; Zhu 2008). The added presence of quizzing in live lectures has almost invariably been found to increase classroom attendance and students’ active participation during class (e.g., Caldwell 2007; Khan et al. 2019; van Daele et al. 2017; Wang 2020). Quizzing in video-recorded lectures has likewise often yielded desirable engagement-related outcomes such as lower in-video dropout (Vural 2013) and persistence in processing the quiz questions (Kovacs 2016). The present study focused on processing time as a measure of engagement, as that can play a key role in learning (e.g., Rice et al. 2019; Shinaberger 2017). The hypothesis tested is that engagement is higher for the embedded-question videos, just as has been found in other studies (e.g., Kovacs 2016; Vural 2013). In addition, the study explores whether video engagement is related to learning.

Comparisons between quizzing and non-quizzing conditions in live and video-recorded lectures have further shown that quizzing often results in higher appreciation and stronger motivation (e.g., Buil et al. 2016; Hunsu et al. 2016). For instance, research generally indicates that students favorably appreciate the usability of video-recorded lectures (e.g., Baker et al. 2018; O’Callaghan et al. 2017; Spanjers et al. 2015). Research has not yet investigated whether such appraisals are affected by the presence of open-ended embedded questions. The present study explores this effect for the three usability measures proposed in the (extended) Technology Acceptance Model (Davis 1989), namely, usefulness, ease of use and satisfaction (Davis 1989; Davis et al. 1989; Joo et al. 2014). A recent meta-analysis on blended learning showed that quizzing was an important positive moderator for satisfaction (Spanjers et al. 2015). Accordingly, the experimental condition was expected to yield higher appraisals for this construct. No specific hypothesis was tested for the two other usability perceptions.

A motivational characteristic that is likely to influence students’ willingness to study video-recorded lectures is self-efficacy, which is a person’s belief in the capacity to organize and execute the actions necessary to manage particular task outcomes (Bandura 1997). Self-efficacy has been found to be a predictor of future persistence and effort expenditure in comparable settings (Bandura 2012; Bandura and Locke 2003). Also, a recent meta-analysis on clickers (a technology that allows teachers to pose questions and to process the student responses during a live lecture) found that the largest non-cognitive effect of their presence was their contribution to self-efficacy development (Hunsu et al. 2016). This suggests that the added presence of quizzing (even when it involves multiple-choice questions) is likely to positively affect self-efficacy. Similarly, a recent experiment found that embedded questions in video-recorded lectures significantly increased self-efficacy (Tweissi 2016). The questions are likely to increase engagement with the lecture and support students’ confidence in their capacity to comprehend the message that is conveyed. Accordingly, the present study tests the hypothesis that open-ended embedded questions enhance self-efficacy.

Finally, quizzing in live and video-recorded lectures has generally been found to increase learning (Gier and Kreiner 2009; Lawson et al. 2006; Morling et al. 2008; Vural 2013). The literature has offered two main explanations for this effect on knowledge development.

One account is that questions encourage active processing (Mayer et al. 2009). The questions may change a more passive reception of knowledge during a lecture into a more active knowledge construction mode. That is, they can stimulate students to be more selective in the information they attend to (Mayer et al. 2009). In addition, they may induce students to (re)structure the information to make it more comprehensible, leading to development of a schema or mental model of the lecture content (see Jing et al. 2016). Finally, the presence of questions may activate prior knowledge that is connected to the new information (see Carpenter 2011). The questions then serve an integrative role. Students connect new with existing knowledge, relating lecture content to what they already know on a topic.

Another explanation comes from the testing effect (e.g., McDaniel et al. 2011, 2013). This is the finding that students are better at remembering previously presented information on which they have been tested than they are at remembering untested information. The testing effect is ascribed to retrieval practices (McDermott et al. 2014). That is, questions may stimulate students to recall or reconstruct information that addresses the quiz question. This active retrieval of lecture content more positively affects learning than other, more passive strategies such as summarizing or note-taking.

Research also shows that there can be important moderating factors such as context, placement and question format (e.g., Khanna 2015; Mayer et al. 2009; McDaniel et al. 2012; Toftness et al. 2018b). Many empirical studies on quizzing have been conducted in ecologically valid settings, namely, actual classrooms (e.g., Barr 2017; Shapiro et al. 2017). The studies have often included intact classes and involved existing courses that ran for weeks or even months (e.g., Batchelor 2015; Brink 2013; Shapiro et al. 2017). In addition, these studies have involved questions before, during and/or after the lectures (e.g., Carpenter et al. 2018; Khanna 2015; Shapiro and Gordon 2013). Furthermore, the questions that were asked included multiple-choice, short answer open-ended questions and combinations of the two (e.g., Mayer et al. 2009; McDaniel et al. 2012). These factors make it hard to draw firm conclusions about the effectiveness of specific quizzing arrangements (e.g., Mayer et al. 2009; Papadopoulos et al. 2018).

The present study is set up as a controlled true experiment involving a video-recorded lecture, in which only the presence of questions varies between conditions. The placement of the questions vis-à-vis the video-recorded lecture is important. Research shows that pre-questions posed before a lecture have limited effect on learning (e.g., Carpenter et al. 2018; Toftness et al. 2018b), and that embedded questions are more effective than post-questions asked after the lecture has been completed (e.g., Rice et al. 2019; Szpunar et al. 2014). The experiment therefore investigates embedded questions. These questions appeared automatically after each segment of a lecture.

Moreover, the study investigates the effectiveness of open-ended embedded questions. Multiple-choice is the most typical question type in video-recorded lectures (e.g., Garcia-Rodicio 2015; Jolley et al. 2016; Vural 2013); open-ended questions are rarely used (e.g., Szpunar et al. 2014; Thomas et al. 2018). The limited usage of open-ended questions appears at odds with their relative effectiveness. Empirical research suggests that open-ended questions may be more effective than multiple-choice questions for learning (e.g., Butler and Roediger 2007; McDaniel et al. 2007). For instance, Butler and Roediger (2007) investigated three learning techniques for processing the material presented in a lecture: studying a summary, taking a multiple-choice test or taking a short answer test. The findings revealed that the short answer test improved recall the most. An explanation for this effect was that answering open-ended questions requires students to engage in more taxing information retrieval attempts than answering multiple-choice questions, which hinges on recognizing the right answer among a number of alternatives. More generally, this research suggests that quiz questions that involve retrieval rather than recognition increase learning more (see Rawson and Dunlosky 2012).

As mentioned earlier, only a few controlled studies have investigated effects of embedded open-ended questions on learning from video-recorded lectures (Cummins et al. 2016; Szpunar et al. 2013, 2014; Thomas et al. 2018). These studies combined open-ended questions with multiple-choice questions, practice items, or feedback. The present study had no other support for learning than the open-ended questions. The tested hypothesis is that open-ended embedded questions increase knowledge of lecture content.

In short, the present study compared an experimental condition in which open-ended questions were asked in a video-recorded lecture with a control condition without such questions. Three research questions were investigated:

  • RQ1: What is the effect of condition on video engagement?

  • RQ2: What is the effect of condition on technology acceptance and self-efficacy?

  • RQ3: What is the effect of condition on knowledge development?



Forty social science students from the University of Twente volunteered to participate in the study. All students were fluent German speakers. The study included 10 male and 30 female students, with a mean age of 21.6 years (SD = 1.96). Students were randomly assigned to the control or experimental condition. Students received one credit point and a cash payment of €7.50 for participation. Approval for the study was obtained from the Ethical Committee of the University. All instructional materials were in German.

Instructional instruments

The recorded lecture was drawn from YouTube (inCITI Singen 2015). It presented a public talk by Prof. Dr. Manfred Spitzer. The talk’s setting resembled a conference keynote speech, with the lecturer standing before a lectern on a platform facing a large audience.

The lecture consisted mainly of a narrative supported by a few PowerPoint slides that were presented on a large screen visible to the audience. The recording primarily displayed the speaker and tended to switch briefly to slide view when a new issue was brought up. The lecture dealt with the topic of “cyber illness”. It addressed the health risks of digitalization, especially for the development of young people. It was chosen because it was deemed an engaging presentation on a topic that was presumed to interest the participants.

The whole lecture lasted 28 min, 26 s. It was split into sections to create room for the embedded questions. Based on what seemed meaningful event boundaries, this led to four separate video sections: video 1 (7 min, 28 s), video 2 (8 min, 48 s), video 3 (7 min, 2 s), and video 4 (5 min, 8 s). Video 1 introduced the term “smomby,” which refers to a smartphone zombie. It explained the speaker’s claim that extensive exposure to phones, games and computers can cause serious physical and emotional health problems, including reduced empathy. Video 2 briefly discussed a slide with an overview of brain development over the course of a lifetime (see Fig. 1). The narrative that followed mainly concentrated on early language and sensorimotor development. Video 3 discussed two experiments that showed the negative effect of computer-based compared to tactile learning by young children. Video 4 discussed brain development at the other end of the age scale. The narrative concentrated on dementia, its consequences and antecedents. It ended in the speaker’s plea for brain training.

Fig. 1 Slide with the theoretical model in the lecture (translated version)

The embedded questions included in the experimental condition were: (Q1) What skill diminishes when people spend a long time viewing a computer screen? (Q2) What are the key dimensions that are responsible for how well our brain develops? (Q3) What do babies do when they see something that does not fit with what they know? (Q4) What are the five technical features that are connected with reduced brain development? These questions all addressed a key aspect of the video segment that they followed. For instance, Q2 asked about an important aspect of the theoretical model featured in video 2. The answer is the three factors presented in bold across the top in Fig. 1.

Videos in the experimental condition ended with an automatically presented embedded question. Videos in the control condition simply ended at the end of each segment. Participants needed to select the next video to move the lecture forward. Both conditions saw the lecture in four videos rather than as a whole, to avoid confounding question-asking and segmentation (see Cheon et al. 2014; Spanjers et al. 2012a, b).

Research instruments

The lecture was presented on a specially created website connected to a logging instrument that recorded time-stamped viewer actions for each video. Three engagement measures were gathered: basic play time, total time and replays. Basic play time was the percentage of unique video seconds set into play mode. A score of 100% for basic play time provides tentative evidence that a video has been viewed in full, insofar as it has at least been played through in its entirety. Total time was the mean amount of time (in s, including pauses) that participants spent on each video. Due to a software glitch, total time could be computed only for the first three videos. Therefore, the comparison between conditions for this engagement measure did not include the time spent on the fourth video. Replays were operationalized as actions that follow an initial viewing of the complete video, in which the user returns to an earlier part of the video and plays a segment again before closing it to move to the next video or the question. Replays are likely to be affected by embedded questions and are a signal of restudying activities. Both the frequency and the duration of replays were measured.
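As a rough illustration of how these engagement measures could be derived from a log, the sketch below assumes a hypothetical log format in which every play action is stored as a (start, end) interval in whole video seconds, with the end exclusive. The format of the study's own logging instrument is not described here, so the function names, the interval representation and the example log are all assumptions made for illustration.

```python
def basic_play_percentage(play_intervals, video_length_s):
    """Percentage of unique video seconds that were set into play mode.

    play_intervals: list of (start, end) positions in whole seconds,
    end exclusive. This is a hypothetical log format, not the study's own.
    """
    seen = set()
    for start, end in play_intervals:
        # Clamp to the video bounds so stray log values cannot inflate the score.
        lo, hi = max(0, start), min(video_length_s, end)
        seen.update(range(lo, hi))
    return 100.0 * len(seen) / video_length_s


def replay_stats(play_intervals, video_length_s):
    """Number and total duration (s) of plays that start after the video
    has been played through in full for the first time."""
    seen = set()
    complete = False
    count, duration = 0, 0
    for start, end in play_intervals:
        lo, hi = max(0, start), min(video_length_s, end)
        if complete and hi > lo:
            count += 1
            duration += hi - lo
        seen.update(range(lo, hi))
        if len(seen) == video_length_s:
            complete = True
    return count, duration


# Video 1 lasted 7 min 28 s = 448 s; the log itself is invented.
example_log = [(0, 300), (300, 448), (120, 131)]
print(basic_play_percentage(example_log, 448))  # 100.0
print(replay_stats(example_log, 448))           # (1, 11)
```

Counting unique seconds, rather than summing play durations, is what keeps the basic play score at or below 100% when a participant replays a segment.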

A paper questionnaire measured technology acceptance and self-efficacy ratings. Its construction was based on the original questionnaires constructed by Davis (1989) and Vollmeyer and Rheinberg (2006), respectively, with modifications to fit the specific context of the study. The questionnaire consisted of a total of 30 statements. There were six distractor items, and six items per construct. Usefulness was defined as the degree to which a person generally believes that viewing recorded lectures enhances learning (compare Davis 1989). Examples of usefulness statements are “Recorded lectures like these are useful for studying” and “Students benefit from having recorded lectures available.” Ease of use was defined as the degree to which a person generally believes that viewing recorded lectures is relatively effortless (compare Davis 1989). For ease of use, statements such as “Recorded lectures require less effort to follow than real lectures” and “Recorded lectures are easy to use” were presented. One item for this construct correlated poorly with the others and was dropped from further analyses. Satisfaction was defined as the degree to which a person experiences a positive emotion from viewing a specific recorded lecture (compare Joo et al. 2014; Shin et al. 2011). Satisfaction was measured with statements such as “I enjoyed viewing the video” and “It was a satisfying experience to view the video.”

For self-efficacy, statements about retention and comprehension of the content of the video-recorded lecture were presented (e.g., “I can write a good summary of the recorded lecture” and “I can remember the content of the recorded lecture quite well”). Responses indicating degree of agreement with each statement were given on a 7-point Likert scale, with scale values that ranged from completely disagree (1) to completely agree (7). Reliability analyses showed that there were satisfactory to good Cronbach’s alpha scores for the four constructs (usefulness = 0.81; ease of use = 0.66; satisfaction = 0.93; self-efficacy = 0.80).
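For readers who want to see how such a reliability coefficient is obtained, the following minimal sketch computes Cronbach's alpha with the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the summed scale). The ratings in the example are invented for illustration and are not the study's data; the function name is likewise an assumption.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one construct.

    responses: list of rows, one per participant, each row holding that
    participant's ratings on the k items of the construct.
    """
    k = len(responses[0])
    item_columns = list(zip(*responses))
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(column) for column in item_columns)
    # Sample variance of each participant's summed score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Invented 7-point ratings from five participants on three items.
ratings = [
    [6, 5, 6],
    [4, 4, 5],
    [7, 6, 7],
    [3, 4, 3],
    [5, 5, 6],
]
print(round(cronbach_alpha(ratings), 2))
```

Alpha rises when the items covary strongly relative to their individual variances, which is why dropping a poorly correlating item, as was done for ease of use, tends to raise it.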

A computer-based knowledge test measured retention and comprehension of the lecture. The test was presented on the same website as the videos and consisted of six open-ended, brief response items that asked for facts or concepts. Only one test item was the same as an embedded question (i.e., Q2 is the same as T4). The test questions were: (T1) What percentage of learning loss was found in a study where W-lan was installed in the classroom? (T2) What examples illustrate the effect of “smomby” behavior on young people’s character? Please also mention how bystanders reacted. (T3) What percentage of the brain is underused when motoric tasks are viewed on a computer screen as opposed to actual manipulation? (T4) What are the key dimensions that are responsible for how well our brain develops? [This item repeated the second embedded question] (T5) What types of knowledge should be tested to prove that babies need to feel rather than view on a computer screen? (T6) Five technology or technology-related aspects were mentioned for which extensive use, or exposure, could have negative consequences. These aspects were: (1) TV, DVD and video, (2) arcade games, (3) computer games, (4) continuously being online, (5) stress and multitasking. Mention as many of these consequences for each aspect as you can.

Just like the embedded questions, the test items referred to important lecture content, and each of the four video sections was covered by at least one item. For instance, items T4 and T6 concerned the theoretical model. Item T5 asked for the kinds of knowledge tested in an extensively discussed experiment. A codebook provided clearly defined correct and incorrect responses and there were no difficulties in identifying them as such when scoring the responses. Items varied in the number of points that could be obtained. The score for each item was converted to the percentage of possible points obtained, and the overall test score is the mean percentage for all items on the knowledge test.
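The scoring rule can be made concrete with a small sketch. The maximum points per item are not reported in the text, so the values below are hypothetical, as is the function name; only the procedure (per-item percentage, then the mean over items) follows the description above.

```python
def knowledge_test_score(points_obtained, points_possible):
    """Mean of the per-item percentages of possible points obtained."""
    item_percentages = [
        100.0 * obtained / possible
        for obtained, possible in zip(points_obtained, points_possible)
    ]
    return sum(item_percentages) / len(item_percentages)


# Hypothetical maxima and scores for items T1..T6 (not the study's codebook).
possible = [1, 4, 1, 3, 2, 10]
obtained = [1, 2, 0, 3, 1, 5]
print(round(knowledge_test_score(obtained, possible), 1))  # 58.3
```

Averaging percentages rather than raw points gives every item the same weight, so an item with many obtainable points does not dominate the overall score.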


The experiment took place in a small room that seated four participants at a time (all from the same condition). Each participant worked on a laptop with a touchpad and mouse and wore earphones during the experiment. The experimenter told participants that they would view a recorded lecture consisting of four short video segments and that they would be tested on what they understood and remembered. Participants in the experimental condition were also alerted to the presence of questions at the end of each video. They were told that the questions could be used to prepare themselves for the knowledge test. The participants were told to view the videos one after the other in the indicated sequence. They could process each video as they wanted as long as it was open, but they were not allowed to revisit a video they had closed. Note-taking was not allowed. After viewing all videos, participants first completed the questionnaire (on paper) and then took the knowledge test (on the laptop).

Data analysis

Tests revealed that the control and experimental conditions had the same distribution for gender (i.e., 5 males and 15 females) and did not differ in age. Assumption testing revealed violations of the normality assumption for the video engagement measures. Therefore, an effect of condition on scores on those measures was assessed with a Mann–Whitney test. ANOVAs could be used for the questionnaire and knowledge test data. Testing was two-tailed with α set at 0.05. For effect sizes, the r-statistic is reported (Field 2013). This statistic tends to be qualified as small, medium, and large for the values r = 0.10, r = 0.30, and r = 0.50, respectively.


What is the effect of condition on video engagement?

Analyses for basic plays yielded scores of 100% (or very close) for all videos and participants, indicating that all videos in both conditions were played at least once in full. The experimental group had a significantly higher total time score (Mdn = 26 min) than the control group (Mdn = 24 min 18 s), U(40) = 52.00, z = 4.01, p < 0.001, r = 0.63. Also, the experimental group had a significantly higher number (Mdn = 1.83) and duration (Mdn = 10.67 s) of replays than the control group (Mdn = 0.00; Mdn = 0.00). For number of replays, U(40) = 58.50, z = 4.23, p < 0.001, r = 0.67; for duration of replays, U(40) = 56.50, z = 4.38, p < 0.001, r = 0.69.
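The reported effect sizes can be checked with a short sketch. It assumes the common conversion for rank-based tests, r = z / √N, with N the total number of participants (40). The text does not state the conversion explicitly, so this is an assumption, although it reproduces the values given above.

```python
import math


def r_from_z(z, n_total):
    """Effect size r for a z-approximated test, assuming r = z / sqrt(N)."""
    return abs(z) / math.sqrt(n_total)


# z-values as reported for the three Mann-Whitney comparisons (N = 40).
for label, z in [("total time", 4.01),
                 ("number of replays", 4.23),
                 ("duration of replays", 4.38)]:
    print(label, round(r_from_z(z, 40), 2))
# total time 0.63
# number of replays 0.67
# duration of replays 0.69
```

All three values exceed the r = 0.50 benchmark for a large effect.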

What is the effect of condition on technology acceptance and self-efficacy?

Table 1 shows the mean scores for the technology acceptance constructs and self-efficacy. The scores were uniformly positive and high, lying almost 2 standard deviations above the mid-scale value of 4. There were no differences between conditions, with all F-values < 1.00.

Table 1 Means (SDs) for self-efficacy, usefulness, ease of use, and satisfaction per condition

What is the effect of condition on knowledge development?

Table 2 shows the mean scores on the knowledge test. There was an overall effect of embedded questions on learning, F(1, 39) = 4.40, p = 0.043, r = 0.32. However, on closer inspection, this effect was limited to the one item (T4) that asked for information that was tested in an embedded question (Q2), F(1, 39) = 4.20, p = 0.047, r = 0.31. There was a positive but non-significant effect of embedded questions on the items that asked for other information that had not already been tested in the embedded questions, F(1, 39) = 1.81, p = 0.19.
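A similar check is possible for the ANOVA effect sizes. The sketch assumes the conversion r = √(F / (F + df_error)), which applies only to an F-test with one numerator degree of freedom. The text does not say that this formula was used, but with the reported error degrees of freedom it yields the reported values.

```python
import math


def r_from_f(f_value, df_error):
    """Effect size r for a one-df F-test, assuming r = sqrt(F / (F + df_error))."""
    return math.sqrt(f_value / (f_value + df_error))


print(round(r_from_f(4.40, 39), 2))  # overall effect: 0.32
print(round(r_from_f(4.20, 39), 2))  # item T4: 0.31
```

Both values sit just above the r = 0.30 benchmark for a medium effect.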

Table 2 Means (SDs) for the knowledge test per condition

Exploration of the relationships between engagement (total time and replays-duration) and the knowledge test scores yielded low, non-significant rank correlations overall, as well as within conditions.

Discussion and conclusion

Video engagement

Participants engaged significantly and substantially longer with the video-recorded lecture that included embedded questions. The presence of quizzing resulted in higher overall scores for total time and replays. The findings are aligned with the results in two empirical studies that also reported that more time is spent on lectures with embedded questions compared to non-quizzed lectures (Kovacs 2016; Vural 2013).

Total time is a general signal of participant interactions with the video. In automated data analyses of video usage, total time is considered one of the most important signals of active processing (Guo et al. 2014). The data on replays complement these data. The finding that these replays were both more frequent and longer in duration with the presence of questions matches our expectations, but even so, merely attests to the effectiveness of questions as a stimulus for video processing.

Logging systems enable researchers to mine a variety of user interactions with the video. In this study, these measures were restricted to general records of video processing (i.e., basic play and total time), and a specific record of video processing that is a probable signal of a remediating action (i.e., replays). Future research might want to use more refined data mining techniques to probe more deeply into the effects of embedded questions on video interaction events. For instance, records could be examined for the number of users who attempt to answer the embedded questions, the correctness of the answers, whether replays occur before or after giving an answer, and the time spent on answering (e.g., Kovacs 2016; Li et al. 2015; Li and Baker 2018). Such information can provide answers to questions such as whether embedded questions affect video navigation, and whether there are interaction event peaks around these questions.

Technology acceptance and self-efficacy

The mean scores for usefulness, ease of use and satisfaction were about 5.5 on a 7-point scale in each condition. Participants believed that video-recorded lectures generally provide a useful resource for learning and are easy to process. In addition, they felt that the specific lecture yielded a satisfying experience. These findings are in line with a large number of studies that have reported positive student appraisals of video-recorded lectures (e.g., Baepler et al. 2014; Burgoyne and Eaton 2018; Kim et al. 2014).

There was no effect of condition on usability perceptions. The findings thus did not support the positive contribution of quizzing to satisfaction reported by Spanjers et al. (2015). The absence of such a contribution could have been due to the positive overall appraisals of the lecture itself. Voluntary comments from participants after the experiment indicated that they enjoyed the topic and how it was presented. These positive comments may have overshadowed any perceived benefits of quizzing within the lecture.

In both conditions, the self-efficacy score was considerably above the neutral midscale value. This suggests that participants were fairly confident about their knowledge development. They believed that they remembered and understood the video-recorded lecture well. The positive appraisals for self-efficacy hold considerable promise for students’ future engagement with video-recorded lectures. That is, current self-efficacy has been found to be a predictor of future persistence and effort expenditure in comparable settings (Bandura 2012; Bandura and Locke 2003).

There was no effect of condition on self-efficacy. This finding thus did not corroborate the outcome of a recent meta-analysis on clickers (Hunsu et al. 2016), nor did it replicate the laboratory study by Tweissi (2016) that found a significant effect of embedded questions on self-efficacy. One possible explanation is that, unlike in the present study, in most clicker studies, as well as in Tweissi’s research, feedback was given for the responses to the questions. Another explanation is that the intervention may have been too short to influence the students’ self-efficacy. It takes time to hone one’s skills and increase confidence in grasping the content of a video-recorded lecture; a single 30-min lecture is unlikely to effect a major change in these facets. Future research might therefore want to investigate whether repeated exposure to video-recorded lectures with embedded questions benefits students’ self-efficacy development more than repeated exposure to such lectures without questions.

Knowledge development

The presence of the embedded questions had a significant, medium-sized effect on what was learned from the lecture. By and large, this concurs with the findings on adjunct questions and on quizzing (e.g., Jing et al. 2016; Smith et al. 2010; Uner and Roediger 2018; Vural 2013). To our knowledge, the present study is one of the few controlled experiments on open-ended questions in video-recorded lectures, and it is the only study in which no feedback was given for responses to these questions (see Cummins et al. 2016; Szpunar et al. 2013, 2014; Thomas et al. 2018). The absence of feedback, in combination with asking open-ended instead of multiple-choice questions, was considered to be a strong stimulus for students to engage in information retrieval, which would thereby enhance their learning.

Dunlosky et al. (2013) have argued that open-ended questions are more likely to trigger elaborate retrieval processes than multiple-choice questions do. That is, their review of learning techniques indicated that practice tests that require more generative responses (such as recall or short-answer) are more effective than tests that require less generative responses (such as recognition). They also mentioned, however, that this conclusion is tentative and that further work is needed. Karpicke’s (2017) more recent review indicated that comparative studies from the last 10 years have yielded mixed outcomes and that finding proof for this claim is more complex than initially thought. Among other things, he pointed out that initial retrieval success can play a mediating role, because it is often higher for multiple-choice questions. In addition, he mentioned that feedback plays a mediating role and that the presence of feedback seems especially relevant for open-ended questions.

The present study found that even without feedback, the presence of open-ended questions enhanced learning. An important reason for the absence of feedback was that we wanted to avoid the risk that students might engage less in constructing their own answers when feedback was present. That is, research on adjunct questions warns that feedback can forestall the students’ attempt to retrieve or reconstruct the answer from memory, as students are inclined to depend more on the feedback as a way to obtain the correct answer (Hamaker 1986; Roelle et al. 2017). In addition, a recent meta-analysis on the testing effect provided a similar explanation for the surprising finding that feedback did not moderate learning (Adesope et al. 2017). In view of the substantial evidence in favor of feedback (e.g., Fiorella and Mayer 2018; Shute 2008), Adesope et al. (2017) suggested that more research is needed to reveal when feedback does or does not enhance learning. One feature of feedback that the meta-analysis could not analyze, for lack of sufficient research, was timing (i.e., immediate versus delayed). To enhance the effectiveness of quizzing, future studies might want to investigate the contribution of delayed feedback, because this design can serve the dual goal of both stimulating and supporting the students’ thought processes, using a different time point for each. That is, the absence of immediate feedback for responses to the open-ended questions may keep the students challenged to construct their own answers, while information about the correct response that is available after a delay may help substantiate, enrich or challenge their own constructed answers in a productive way.

The logged records of the students’ actions involving video play revealed that the presence of questions led to more extensive playing, and hence potential viewing, of the videos. Unfortunately, these data could not be linked to the students’ answers to the embedded questions, because those answers were not recorded in the present study. This seems like an important issue for further research. If student answers to the embedded questions are known and (delayed) feedback is given, then plausible effects of both of these on additional video play can be evaluated (e.g., correct responses yield little or no extra engagement, incorrect responses stimulate repeated video play). A simpler set-up for such a study would be to pre-assess the difficulty level of the questions and then correlate that with video engagement.

Video engagement was not related to test performance. This outcome was unexpected, because empirical studies generally report that more video engagement leads to higher learning outcomes (e.g., Morris et al. 2005; Wei et al. 2015).

The finding stresses the point that video engagement is a proxy for video processing. It is a valuable, unobtrusive record that is a necessary but not sufficient prerequisite for comprehension and learning. As mentioned earlier, more refined records of video interaction events can provide a more detailed view on the effects of embedded questions on video engagement. Future research might want to complement these records with interview data to gather information on the reasons why users do, or do not, engage with embedded questions (e.g., Shin et al. 2018). Such studies could also consider recording verbal protocols or using other observational methods to obtain insights into how users process embedded questions. Such data could reveal, among other things, whether the questions prompt users to reflect on the lecture and whether they connect the new information with prior knowledge.

The experimental condition had a significantly higher score for the single repeated question, but did not differ from the control condition on the remaining composite test score with that question removed. This finding indicates that while embedded questions have a moderate effect on learning of questioned content, effects on learning of non-questioned content may be more limited. A similar cautionary note has been voiced for quizzing in applied settings using authentic educational materials (Agarwal et al. 2012; Nguyen and McDaniel 2015; Wooldridge et al. 2014). Future research might therefore want to test this by systematically varying the number and type of previously questioned or non-questioned items. Such research might also contribute to gathering information on the learning strategies involved in video replays. In addition, it might want to measure students’ self-regulated learning skills, as this appears to be an ignored moderator in research on quizzing (see Shapiro et al. 2017).

Some limitations of the study have already been mentioned, such as incomplete data on total time, and the absence of information about the answers to the embedded questions. One other limitation has not yet been mentioned, namely, the topic of segmentation. Embedded questions obviously need to be positioned somewhere during a video-recorded lecture. The issue is where such breaks can best be created. By their very nature, embedded questions break down a lecture into parts. This can induce a segmenting effect (see Mayer and Pilegard 2014). To disentangle the two factors (i.e., questions and segmentation), in the present study, both conditions received separate parts of the complete lecture. The video-recorded lecture was split into four segments (videos) and each new segment required a user action to start it playing, as recommended in the multimedia literature (e.g., Biard et al. 2018). In the control condition, each video simply ended when an event was completed; in the experimental condition, there was an open-ended question. Given this parallel construction, it is possible to draw conclusions about the effect of embedded questions. Since embedded questions automatically split up a complete lecture into sections, future research on their effects might want to turn to the multimedia literature for a principle-based approach to creating meaningful segments (e.g., Khacharem et al. 2013; Mura et al. 2013; Spanjers et al. 2012b).