1 Introduction

With the rise of video platforms and the easy availability of digital tools, an increasing number of teachers use videos in their teaching. While many teachers draw on existing videos from the internet, more and more also create short video explanations themselves, as these can be tailored specifically to their teaching and, consequently, to their students’ learning goals (Jaekel et al., 2021).

Scholars have used a variety of labels for videos used in the instructional context, such as instructional videos, demonstration videos, Khan Academy videos, classroom videos or video explanations (Köse et al., 2021). To highlight their explanatory character, we use the term video explanations to describe videos that combine spoken language and visualizations to explain a topic, a principle, or a process to learners. Thus, these videos merge features of traditional instructional explanations (in the classroom) with the advantages of a multimedia learning environment (Kulgemeyer, 2018a).

As the quality of video explanations is relevant for students’ learning outcomes (Kulgemeyer, 2018a), it is important to establish a comprehensive framework for evaluating their quality that researchers and practitioners can use to make informed decisions. Several researchers have already summarized recommendations for effective video instruction; these mostly rely on multimedia principles and often focus on videos in the STEM domains (e.g., Brame, 2016; Kay, 2014). Another strand of research analyzes video quality based on research on explanations in the classroom (Kulgemeyer, 2018a).

Using these frameworks directly to analyze the quality of different video explanations, however, falls short for at least two reasons. First, most frameworks summarize criteria from a single research strand, e.g., instructional explanations (e.g., Kulgemeyer, 2018a) or multimedia design (e.g., Brame, 2016), but do not connect different research traditions. Thus, the focus is either only on content (instructional explanations) or only on the design of the videos (multimedia), although both are relevant and, more importantly, interdependent factors of video quality. As a result, certain aspects of a video explanation are considered in isolation rather than taking a more holistic view that includes different aspects of quality. Second, some frameworks focus on a specific video type and/or a specific discipline (e.g., worked examples in math) and, consequently, include criteria that are not relevant when evaluating a video with different content in another discipline (e.g., the principle of the division of labor in economics).

Accordingly, the overall aim of this paper is to connect the different research traditions by identifying criteria that—when combined—can be used to evaluate the overall instructional quality of different video explanations. The second aim is to operationalize these criteria in a reliable measure, i.e., a rating framework, which researchers and practitioners can use to evaluate video explanations. To this end, in a first step, we identify relevant criteria in a theoretical framework by integrating existing theoretical and empirical literature on instructional explanations in economic education (Findeisen, 2017; Schopf & Zwischenbrugger, 2015; Schopf et al., 2019), video explanations (Kulgemeyer, 2018a), and findings from technology-enhanced learning research that highlight the effective use of multimedia principles to reduce cognitive load (e.g., Brame, 2016; Kay, 2014; Mayer et al., 2020). In a second step, we develop a rating framework and test its validity and reliability with 36 videos created by preservice economics teachers. For the generation of the theoretical framework, we mostly rely on reviews that compile results from different research disciplines and content areas; the theoretical framework and the rating framework are therefore to a large extent interdisciplinary. Although the rating framework was refined using videos from future economics teachers, we contribute a valid and reliable framework that researchers and practitioners from different domains can use to better assess the quality of video explanations.

As we draw mostly from the literature on instructional explanations in the classroom and on multimedia design principles, we first summarize the literature in these two fields in a more general manner. As our research aim is to identify broad quality criteria for videos that explain a certain principle, rather than demonstrate a certain behavior, we leave aside specific video features that are mainly analyzed in the context of observational learning and modeling examples, such as instructor characteristics (e.g., van Gog et al., 2014) or video perspective (e.g., Fiorella et al., 2017). Because our goal is to assess the quality of the video itself rather than its use within a more complex learning setting, we also do not go into the literature on interactive elements in videos (e.g., Delen et al., 2014) or on combining videos with generative learning tasks (e.g., Fiorella et al., 2020).

After reviewing the literature and contrasting existing measurements for video explanations, we explain how our framework was developed based on previous work before describing the framework in detail. In the following sections, we describe the context of the videos that were used to develop the rating framework and provide an overview of the psychometric quality of the instrument and the results.

2 Theoretical Background and Literature Review

2.1 Quality Criteria for Instructional Explanations

In different strands of research, criteria for the quality of instructional explanations have been developed and discussed (Findeisen, 2017; Kulgemeyer, 2018b; Lee & Anderson, 2013; Leinhardt, 2001; Schopf & Zwischenbrugger, 2015; Schopf et al., 2019; Wittwer & Renkl, 2008). Findeisen (2017) developed an explanation quality framework for economic and business education, which we use in the current study because it allows us to integrate the different research strands and to organize the previous research on the quality of instructional explanations. Accordingly, we structured the state of research regarding quality criteria along the categories of content, learner orientation, representation/design, language, and process structure.

The content of instructional explanations should be correct, accurate, and complete. In economic education, Schopf et al. (2019) argued that a learner should be able to understand what the content of the explanation—for example, a concept or a principle—is about, why it works, how it works, and what it is useful for within the domain. In this regard, the relevance of the explained content for the domain should be clear to the learner by, for instance, exemplifying in which context a certain topic or principle is needed. Besides the correct use of technical terms, it is relevant to explain a topic stepwise and to give reasonable explanations about why certain steps are necessary and how they relate to the domain principle (Wittwer & Renkl, 2008).

The second category, learner orientation, means that an explanation should be adapted to the learner group. Prior knowledge is one of the most relevant factors in this regard (Wittwer & Renkl, 2008). Kulgemeyer (2018b) called this criterion “adaptation to the explainee” (p. 120). It is important to consider the learners since explanations can easily be too difficult or too easy. In addition to the learners’ prior knowledge, the content of an explanation should also be connected to other learner characteristics by, for example, taking their perspective on a certain topic into account and connecting the topic to their (daily) life experiences (Schopf et al., 2019). Lastly, explanations can be further adapted through interactions between explainer and learner when, for example, an explainer uses questions to assess the learner’s understanding and adjusts an explanation to the learner's needs (Sevian & Gonsalves, 2008).

Third, the use of different representations, such as examples, visualizations, analogies, and models, is often included in criteria for effective explanations (e.g., Findeisen, 2017; Geelan, 2013; Kulgemeyer, 2018a; Schopf et al., 2019). Visual representations such as graphs, diagrams, and charts, in particular, are regularly used in explanations (for economic education, cf. Ring & Brahm, 2020; Vazquez & Chiang, 2014), as they provide structure and can guide learners in constructing their own internal representations. Examples can support understanding as they promote the connection of domain knowledge to the learners’ everyday experiences. When examples are used, they must clearly represent the principle, should cover all aspects of the topic, and should ideally be taken from the students’ everyday lives (Schopf et al., 2019).

The fourth category, language, pertains to an appropriate complexity level as explainers need to translate between domain language and everyday terms (Kulgemeyer & Schecker, 2013). Furthermore, avoiding vagueness as well as using body language and gestures might further add to the quality of the explanation (Brown, 2006).

Finally, a clear and coherent process structure helps learners to follow explanations. A short introduction—clarifying the topic or question—might set expectations and activate prior knowledge (e.g., Charalambous et al., 2011). A summary at the end as well as coherent argumentation in between reduce strain on the cognitive capacities of learners and might help them in their understanding.

2.2 Multimedia Design for Video Explanations

Video explanations can be seen as multimedia material since they combine spoken words and visual representations (Mayer, 2014). Before presenting specific frameworks in the next section, it is important to illustrate their theoretical basis. Most fundamentally, Mayer’s (2014) cognitive theory of multimedia learning describes the process of learning with text and pictures, with the central conclusion that the integration of the information in the visualization and the spoken text is a prerequisite for successful learning. Cognitive load theory assumes that learning is associated with different kinds of cognitive load, which are influenced by the inherent complexity of the task as well as by the design of the learning material (Sweller, 2020). In combination, these two theories result in a very general recommendation: multimedia material leads to higher learning outcomes when it is designed in a way that helps learners to integrate audio and visual elements and, at the same time, reduces unnecessary cognitive load.

More specifically, researchers have identified multiple design principles that can be used as guidelines for the development of multimedia learning material (for an overview, see Mayer, 2014). We now focus on the design guidelines most important for our research goal, namely principles concerning the relationship of visual and audio elements in video explanations.

First, according to the signaling principle, multimedia signals can help guide learners’ attention toward the most relevant information and help them connect different modalities (van Gog, 2014). In a video explanation, one possible application of this principle is highlighting the part of the visualization that is currently being explained. Second, rather than presenting the same information in different modalities, visual and audio elements should complement each other in a meaningful way (Low & Sweller, 2014). Video creators violate this so-called modality principle, for example, when the visual element includes complete sentences and the narration consists of reading these sentences aloud. Third, according to the temporal contiguity principle (Mayer & Fiorella, 2014), visual and spoken information should be presented at the same time, rather than first talking about a visual and then showing it (or the other way around). Fourth, video creators should refrain from using visuals with irrelevant (but possibly interesting) details, as they might distract the learner from the important content (Mayer & Fiorella, 2014).

Having outlined these foundational multimedia principles, we now review different frameworks and measurements for video explanations.

2.3 Frameworks and Measurements for the Quality of Video Explanations

Multiple frameworks and measurements (see Table 1) describe and analyze the quality of video explanations; these are mainly influenced by two research traditions: multimedia and cognitive load research (e.g., Brame, 2016; Kay, 2014) as well as research on instructional explanations (e.g., Kulgemeyer, 2018a).

Table 1 Overview of studies providing recommendations or measures for the quality of video explanations

The different studies in Table 1 can be divided into two groups in terms of their objective. We use the term “guideline” to refer to studies that aim at providing recommendations to support instructors in developing and selecting more effective videos (Brame, 2016; Kay, 2014; Kulgemeyer, 2018a; Schopf, 2020; Siegel & Hensch, 2021). In comparison, we categorize studies as “measures” when they operationalize criteria in a coding manual to assess existing videos (Kay & Ruttenberg-Rozen, 2020; Kulgemeyer & Peters, 2016; Marquardt, 2016). As we have already discussed the research underlying the quality of video explanations (instructional explanations as well as multimedia and cognitive load research), we do not go into detail regarding the design recommendations. Instead, we shift our focus to the measures—and, thus, to the question of how the quality of video explanations has been assessed so far.

Based on earlier frameworks for “traditional” explanations, Kulgemeyer and Peters (2016) analyzed the quality of video explanations for physics on YouTube. To measure the quality of the videos, they applied a dichotomous approach, awarding each video one point if a certain criterion was met (or subtracting one point for “negative” criteria, such as scientific mistakes), and used the total number of points as a measure of quality. Although they obtained high inter-rater reliability (Cohen’s kappa of κ = 0.860) and satisfactory internal consistency (Cronbach’s alpha of α = 0.69), the dichotomous instrument only captures whether a certain event occurs (such as whether an equation is used to explain the content); any subsequent use of the same category does not affect the score. This has two implications: First, it is difficult to transfer the measure to a context where some category might not be relevant (e.g., where there is no equation). Second, it is not possible to identify more subtle qualitative differences between videos that meet the same criteria. For instance, video quality most likely differs not only because a video does (or does not) use visualizations but also because of the kind of visualization used and how it is connected to the (verbally explained) content (Mayer & Fiorella, 2014).

In the context of mathematics, Marquardt (2016) developed a rating scheme for video explanations with 22 criteria in four categories (overview in Table 1). The author operationalizes most criteria on a five-level scale; thus, the rating scheme should theoretically be able to capture differences between videos regarding the same criterion. Although the resulting measure combines multiple theoretical approaches, the author did not test the rating scheme with videos and therefore does not report internal consistency or reliability.

Table 2 Framework for the quality of video explanations: overview

Also for mathematics, Kay and Ruttenberg-Rozen (2020) had students in teacher education generate video-based worked examples. Based on Kay’s (2014) framework, the quality of the student-generated video explanations was rated in four categories: establishing context (n = 3 items), creating effective explanations (n = 7 items), minimizing cognitive load (n = 4 items), and engagement (n = 5 items). All items were assessed on a three-point scale. The authors report acceptable internal consistency for each category scale (Cronbach’s alpha between 0.60 and 0.85) but no inter-rater reliability. Due to the focus on worked examples, some aspects of quality, such as technical correctness or adaptation to prior knowledge, are not part of the instrument.

In summary, previous instruments have different limitations, which motivated the development of a new measure: First, from a theoretical point of view, they do not include all relevant criteria or they include criteria that are not easily transferable to other contexts (e.g., Kay & Ruttenberg-Rozen, 2020). Second, from a methodological perspective, they lack evidence regarding inter-rater reliability and internal consistency (e.g., Marquardt, 2016) or, due to a dichotomous approach, do not provide enough information about variance within a criterion (Kulgemeyer & Peters, 2016).

3 Development of the Rating Framework

3.1 Development Process

The rating framework was developed in two steps: Before the video rating, we theoretically derived an initial framework based on research on instructional explanations in economic education (Findeisen, 2017; Schopf & Zwischenbrugger, 2015), multimedia design principles (overview in Mayer, 2014), and video explanations (Kulgemeyer, 2018a). These resources were not chosen through a systematic search but because they combine different research approaches and thus, taken together, provide valid criteria for video explanations, as each offers a unique and relevant perspective: Findeisen (2017) and Schopf and Zwischenbrugger (2015) were relevant because of their focus on the content of explanations in economic education, the design principles in Mayer (2014) explain the effects of multimedia learning in a more general manner (see Sect. 2.2), and Kulgemeyer (2018a) made use of the instructional explanation literature to analyze the quality of video explanations. From these resources, we identified relevant criteria and structured them according to the five categories that Findeisen (2017) already used to describe the quality of instructional explanations in the classroom: content, learner orientation, representation/design, language, and process structure (see Sect. 2.1). Although the categories were developed for explanations in the classroom, we used them because they fit the overall aim to develop a comprehensive framework and provided categories under which criteria from the chosen literature could be subsumed. We then compared the results with some of the frameworks and instruments described in Sect. 2.3 to check whether we arrived at similar criteria (see Table 2).

In the second step, five of the videos described in Sect. 4.1 were chosen at random to test and inductively revise the rating framework in order to arrive at a usable coding manual. With this procedure, we strove to ensure that (a) most criteria were developed before viewing the material and that (b) it was still possible to change, add, or omit criteria based on the actual material. This procedure as well as the theoretical framework itself were preregistered to make the changes resulting from the second phase more transparent and verifiable.

With the resulting coding manual, all videos were then rated by two raters: the first author of this paper and a research assistant. The five videos that were used to develop the coding manual served as anchor examples. For all criteria, the raters documented not only their final rating but also their reasoning whenever their assessment differed from the highest possible rating (see Appendix 4 for the rating sheet template). After the first ratings, deviating from our preregistered analysis plan, we sent the coding manual to one expert on video ratings as well as one expert on instructional explanations in economic education and asked for feedback regarding validity and comprehensibility. The expert feedback was added to our procedure because the first ratings resulted in low inter-rater reliability for some criteria. Based on the feedback, we adjusted the manual again.

In total, the videos were rated three times. An overview of the whole procedure can be found in Fig. 1, which also shows that the rating framework remained unchanged from the second rating of all videos onward. The final coding manual as described in Sect. 3.2 differs from the theoretical framework (which was developed before analyzing the data) in several regards; these differences are made transparent in Appendix 5.

Fig. 1 Overview of the development procedure of the rating framework. Note: The grey areas show the framework (and the resulting data) that are reported in this manuscript

3.2 Rating Framework for Video Explanations

After the final adjustments of the theoretical framework based on the five preselected videos, the coding manual consisted of twelve criteria in five categories (see Table 2 for an overview): (1) content, (2) learner orientation, (3) representation and design, (4) language, and (5) process structure. All criteria were rated on a scale with four levels to assess to what extent each criterion was fulfilled (0 = not or only barely fulfilled, 1 = partly fulfilled, 2 = mostly fulfilled, 3 = always fulfilled). For instance, when a video included a mistake at the beginning but was otherwise flawless, we rated technical correctness as mostly fulfilled (= 2). A rating of 0 was only assigned when the criterion was not fulfilled throughout the whole video. For all criteria, a definition and the relevant conditions were outlined in bullet points in the coding manual. The complete coding manual can be found in Appendix 2.
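To illustrate how the coding manual can be operationalized for data collection, the following minimal sketch (in Python) represents the twelve criteria, the four-level scale, and a single rating sheet as simple data structures. It is an illustration only: the criterion keys are shorthand labels for the criteria described below, and the example values are hypothetical.

```python
# Minimal sketch of a digital rating sheet based on the coding manual.
# Criterion keys are shorthand labels; the example scores are hypothetical.
from dataclasses import dataclass, field

SCALE = {0: "not or only barely fulfilled",
         1: "partly fulfilled",
         2: "mostly fulfilled",
         3: "always fulfilled"}

CRITERIA = {
    "content": ["technical_correctness", "technical_completeness"],
    "learner_orientation": ["relevance_to_learners", "linking_to_prior_knowledge",
                            "active_engagement", "direct_addressing"],
    "representation_design": ["use_of_examples", "design_of_visualizations",
                              "matching_text_and_visualizations"],
    "language": ["comprehensible_language", "precise_language"],
    "process_structure": ["structure"],
}

@dataclass
class Rating:
    video_id: str
    rater: str
    scores: dict = field(default_factory=dict)      # criterion -> 0..3
    reasoning: dict = field(default_factory=dict)   # documented when score < 3

    def add(self, criterion: str, score: int, reasoning: str = "") -> None:
        assert score in SCALE, "scores must lie on the four-level scale"
        self.scores[criterion] = score
        if score < 3:
            self.reasoning[criterion] = reasoning

# Hypothetical example of use:
r = Rating(video_id="V07", rater="R1")
r.add("technical_correctness", 2, "error in the example at the beginning")
r.add("structure", 3)
```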

Regarding content, the first criterion was technical correctness, i.e., no errors in the explained content. Videos received a lower rating when they included technical errors or imprecise statements. One video, for instance, described the stock market and used an example of a very small company to illustrate its stock market launch. In the description of the initial public offering, the company sold seven shares for 1,000€ each. This, however, would not be possible in Germany as a share capital of at least 50,000€ is necessary for a company to go public. As this is clearly a technical error, technical correctness was seen as only mostly, not completely fulfilled.

The second criterion, technical completeness, was achieved when no relevant information or subject-specific terms were missing. As all videos had different content, expectations were not predefined; instead, the raters assessed the relevant information for the topic while rating the video, based on their expertise and the economics curriculum. Although a video cannot encompass all information on a certain topic, the most important information needed to be included. One video, for instance, explained how the Gross Domestic Product (GDP) is calculated and used as a measure of economic growth. Typical limitations and criticisms of the GDP, however, were not part of the video, although the raters expected this based on the curriculum.

In the category learner orientation, the first criterion was relevance to the learners. A high rating was achieved when the learner’s perspective was considered in the video by, for example, connecting the content to the learner’s everyday life or introducing a fictitious character that might represent the learner’s perspective. In a video that explained the effect of taxes, for instance, the topic was not connected to the learners’ experiences with taxes—which could have been done, for example, by starting with the question of where and how the learners might pay taxes. Instead, the video used gasoline taxes as an example, although the targeted learners are probably not familiar with them because they are not yet legally allowed to drive.

Regarding the criterion linking to prior knowledge, the raters evaluated to what extent the content was related to the learners’ prior knowledge in terms of complexity and scope. Again, prior knowledge was not predefined but was assessed by the raters based on the grade level indicated for the video explanation and the curriculum. Videos received lower ratings when new subject-specific terms or principles were introduced without explanation or when already known principles were discussed in unnecessary detail.

The third criterion in this category, active engagement, was fulfilled when learners were given a task that might lead to active participation; for instance, they were asked to pause the video to consider examples or assess their understanding. Not only the number of such integrated tasks but also their quality was used to form a rating.

The criterion direct addressing was met when the learner was directly addressed by the speaker, e.g., by using the second-person singular or the first-person plural instead of the passive voice. This was used consistently in only a few videos. For example, in a video about the calculation of the GDP, the speaker started by connecting the content to prior knowledge (“as you already know, the GDP…”) and used the first-person plural when explaining the new content (“we should not forget, however, that intermediate inputs must be deducted before…”).

In the category representation and design, the criterion use of examples was fulfilled when the video used appropriate, comprehensive, and authentic examples to illustrate the content. Note that whether examples fit the learners’ everyday experiences was rated under the criterion relevance to the learners described above. Both the number and the quality of the examples were considered in the rating. A suboptimal rating was often due to the use of very general rather than specific examples. In a video explaining the labor market, for instance, an example illustrated how employers and employees discuss wages. Instead of using a specific company, a specific employee, and a specific wage, the video remained abstract.

All aspects of the visualizations that could be assessed without considering the audio track were rated under design of visualizations. Here, different aspects of the visualizations were considered together to form a rating. The highest rating was achieved if the visualizations were error-free and the video was neither overloaded with too many visualizations nor contained barely any. For more complex visualizations, a missing step-by-step construction and the absence of signals also led to a lower rating. Accordingly, a video with too many visualizations received a low rating (example in Fig. 2, left panel), and a video with only one complex visualization also received a lower rating when no signals and/or no step-by-step construction were used to guide the learner’s attention.

Fig. 2 Screenshots of two videos with low ratings regarding design (A, left panel) or matching of visualizations and spoken text (B, right panel)

For the criterion matching of spoken text and visualizations, the raters assessed whether spoken text and visualizations were linked in a way that promoted learning. For this criterion, the videos received a lower rating when general visualizations were used that did not match the specific video content (example in Fig. 2, right panel) or when temporal contiguity was not met. Other conditions in this criterion were coherence (no unnecessary/seductive details in visualization) and redundancy (only keywords of spoken text were allowed as written text).

Regarding language, the criterion comprehensible language was fulfilled if syntax and word choice were kept as understandable as possible. Videos received lower ratings when the speaker used unnecessary foreign or complex words or when sentences contained long or multiple subordinate clauses.

A video received a high rating for precise language when the speaker’s voice was accent- and mostly dialect-free, when the sentences were complete and free of errors, and when appropriate pauses were used. Videos with computer-generated voices, for example, received a lower rating as they were characterized by unusual intonation and unclear pauses.

In terms of process structure, videos received the highest rating regarding structure when the objective, topic, or question was clearly defined at the beginning of the video, when there was a coherent argumentation structure, and when there was a clear ending with a summary, follow-up task, or transition to a new topic.

4 Applying the Rating Manual with Videos from Preservice Teachers

4.1 Sample: Videos and Their Creators

For this study, we used videos created by N = 36 preservice economics teachers (16 female; mean age = 24.71 years, SD = 3.00). During one semester (April–July 2020), the students participated in two courses that are part of the curriculum for preservice economics teachers. As part of the seminar, they were asked to design a video explanation. Their task was to choose a tool in which spoken text and visualizations could be combined and to create a three- to seven-minute video. Most videos were created with Simpleshow (http://simpleshow.com/) or with presentation slides and recorded voice-over. Some students also used Adobe Spark (adobe.com) or recorded their screens (while working with a visualization app). The video rating made up part of the grade for the seminar to motivate students to create a high-quality video that they might later use as part of their teaching. Before the video assignment, the participants received information regarding the relevance and quality of instructional explanations (based on Schopf et al., 2019) but no information regarding high-quality video explanations or multimedia design principles. They received a short introduction to the tools and were given three weeks to create the video. To make sure that the videos had different topics and, thus, would be an appropriate sample to test the applicability of the framework, the students could choose their topic from a list based on the economics curriculum. In total, 36 videos were used to validate the framework for video explanations in economic education. A list of all videos, including topic and length, can be found in Appendix 1.

4.2 Results

4.2.1 Reliability

In line with the preregistration, Fleiss’ kappa was used to determine the inter-rater reliability for all categories separately after the second and third ratings (see Table 3). Additionally, we report a two-way mixed, agreement, average-measures intraclass correlation (ICC) as this is more suited to the ordinal nature of our variables (Hallgren, 2012).

Table 3 Inter-rater reliability for the video quality ratings
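As an illustration of how such reliability coefficients can be obtained, the following sketch (in Python, using the statsmodels and pingouin libraries) computes Fleiss’ kappa and an intraclass correlation for a hypothetical 36 × 2 matrix of ordinal ratings. It is a sketch under assumptions, not a reproduction of our analysis; in particular, reading the ICC2k row as the average-measures, absolute-agreement variant is our approximation of the ICC described by Hallgren (2012).

```python
# Sketch: reliability statistics for a matrix of ordinal ratings
# (videos x raters). The data below are hypothetical.
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 4, size=(36, 2))   # 36 videos, 2 raters, scores 0-3

# Fleiss' kappa: convert raw ratings into a subjects x categories count table.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")

# Intraclass correlation: pingouin returns all six ICC variants; we read the
# average-measures, absolute-agreement row (ICC2k) as an approximation of the
# "two-way, agreement, average-measures" ICC described by Hallgren (2012).
long = pd.DataFrame({
    "video": np.repeat(np.arange(36), 2),
    "rater": np.tile(["R1", "R2"], 36),
    "score": ratings.flatten(),
})
icc = pg.intraclass_corr(data=long, targets="video", raters="rater", ratings="score")

print(f"Fleiss' kappa = {kappa:.2f}")
print(f"ICC (average measures, agreement) = "
      f"{icc.set_index('Type').loc['ICC2k', 'ICC']:.2f}")
```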

After the second rating (i.e., the first rating with the final manual), we did not reach high inter-rater reliability for all categories. Especially for the ratings of technical completeness and linking to prior knowledge, agreement was rather low. As we had not predefined the aspects that should be included or the prior knowledge that could be assumed, respectively, these criteria were difficult to assess objectively. Since we could not attribute our differences to an unclear understanding of the rating framework, we decided not to revise the framework but rather to discuss our understanding of all criteria and rate the videos a third and final time, which led to an increase in reliability. We discuss reliability as a limitation in more detail in the last section of the paper. After the third rating, the reliability for most criteria was substantial (Fleiss’ kappa > 0.61) or almost perfect (Fleiss’ kappa > 0.81; Landis & Koch, 1977). Based on the intraclass correlation, agreement was already excellent for most criteria after the second rating (ICC > 0.75; Cicchetti, 1994).

All remaining disagreements after the third rating were resolved through discussion between the raters. Not only the ratings but also the reasoning behind the ratings was used as a basis for the final decisions. The final ratings were used to assess the internal consistency of the complete scale. Cronbach’s alpha for the scale consisting of all 12 criteria was 0.73. Due to the small sample size and the small number of criteria in some categories, a factor analysis did not yield useful results. An overview of the relationships between the different ratings is provided in a correlation matrix in Appendix 3.
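For transparency, the following minimal sketch shows the standard Cronbach’s alpha formula applied to a hypothetical matrix of final ratings (36 videos × 12 criteria, each criterion treated as an item). It illustrates the computation only and does not reproduce our data.

```python
# Sketch: Cronbach's alpha over the final ratings, treating each of the 12
# criteria as an item and each video as a case. Data here are hypothetical.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (n_cases, n_items) with the final ratings."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
final_ratings = rng.integers(0, 4, size=(36, 12))   # 36 videos x 12 criteria
print(f"alpha = {cronbach_alpha(final_ratings):.2f}")
```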

4.2.2 Overview of the Ratings

Figure 3 shows an overview of the range of the quality ratings by visualizing the relative frequency of the four levels for each criterion. For some criteria, such as technical correctness (C1), the lowest rating was not used at all for the videos in this study and half of the videos received the highest rating. Active engagement (L3) was rated as barely or not fulfilled for almost all videos. For process structure (P1), all possible ratings were used in a similar frequency.

Fig. 3 Overview of the relative frequency of the different ratings for all criteria

5 Discussion and Conclusion

The aim of this study was twofold: First, we identified relevant criteria from the literature to develop a theoretical framework based on existing theoretical and empirical work on instructional explanations in economic education (Findeisen, 2017; Schopf & Zwischenbrugger, 2015), video explanations (Kulgemeyer, 2018a), and multimedia design principles (overview in Mayer, 2014). We thus contribute to the literature a theory-driven instrument for assessing the quality of video explanations in economic education which, due to its rather broad definition of criteria, might also be used in other domains. Second, we investigated the psychometric quality of a coding manual with twelve criteria based on 36 videos that had been created by preservice teachers in a university seminar. We found the coding manual to be (mostly) reliable, and the range of quality ratings fit the context of our videos. The results—both the instrument and its application—contribute to the current literature in several ways and have implications for future research and practice.

The instrument can be used in future research as well as in practice to evaluate the quality of existing video explanations. It goes beyond earlier frameworks by integrating conceptual and empirical research from multimedia research and research on instructional explanations in different domains (Findeisen, 2017; Kulgemeyer, 2018b; Lee & Anderson, 2013; Leinhardt, 2001; Schopf & Zwischenbrugger, 2015; Schopf et al., 2019; Wittwer & Renkl, 2008). All criteria that we had identified as relevant and included in our instrument have in the meantime (i.e., since the development of the instrument) also become part of newer frameworks dealing with video explanations (Kay & Ruttenberg-Rozen, 2020; Schopf, 2020; Siegel & Hensch, 2021), which can be taken as an indicator of the instrument’s content validity. In line with similar frameworks and recommendations (Brame, 2016; Kay, 2014; Schopf, 2020; Siegel & Hensch, 2021), the instrument can also be used as a guideline for the creation of new video explanations by, for example, teachers, teacher educators, or students. One advantage of the framework presented here is that it is not limited to a certain topic or a certain context (such as worked examples) and is thus flexible in its application.

Before the framework and the coding manual are used in future research, however, it is important to discuss the potential boundaries of the instrument. The overall focus of the instrument is on the content of the videos. Thus, a clear limitation is that other relevant conditions, such as how the video is embedded in the greater learning context or how it should be adjusted to different learners, were not considered.

Nevertheless, the instrument is valuable because it enables teachers to systematically evaluate stand-alone videos which are now widely available on various platforms. For video explanations that are used in combination with other material or as part of a certain educational setting (such as a flipped classroom), the framework could be altered in future research to include aspects that are relevant to the respective setting.

Since the videos had different topics, we decided not to define expectations regarding content and learners’ prior knowledge before the rating. Although this results in broader applicability of the framework, it also increases the subjectivity of the assessment, as the raters need to evaluate these criteria based on the curriculum and their expertise. This, in turn, leads to a higher need for rater coordination—which was also visible in our data, as the inter-rater reliability of two criteria was very low after the first and second ratings and only increased after the last rater discussion. If multiple videos with the same content are evaluated with the framework, a clear definition of expectations regarding content and learners’ prior knowledge would increase the objectivity of the instrument.

Furthermore, we often combined multiple conditions for a criterion to be met. For example, for matching of visualization and text, we rated not only temporal and spatial proximity but also coherence, consistency, and redundancy. Even though this makes it easier to compare the videos with regard to the use of visuals in general and leads to a higher variance concerning the criterion, it still means a loss of information compared to a separate rating for all of the conditions. For instance, when a video has a lower rating regarding the criterion, the rating alone cannot be used to identify which of the conditions were not fulfilled, i.e., whether the video lacked coherence or temporal proximity. Especially when concentrating on the role of visualizations and their relation to the spoken text, it seems advisable to further develop the criterion by splitting it into different criteria.

To better evaluate the results and the validity of the instrument, we can compare our results to existing empirical research on the quality of instructional explanations in the classroom (Findeisen, 2017). In the category content, most videos received high ratings, whereas this category was challenging for preservice teachers in authentic explanation settings (Findeisen, 2017). One reason for this could be the difference between video explanations and instructional explanations: For the video explanations, the preservice teachers had the time and opportunity to check the technical correctness of the content and to repeat the “production” process if necessary. Thus, errors seem to be less likely in video explanations.

In the category learner orientation, our results are somewhat comparable to the findings for instructional explanations in the classroom. Although Findeisen (2017) stated that most preservice teachers were able to adapt the content to the learners, she also argued that the evaluation of prior knowledge is one of the major challenges. Besides adapting the content to the learners’ prior knowledge, establishing the relevance of the topic by connecting it to the learners’ everyday experiences was an additional challenge for the preservice teachers who created the videos. Actively involving, i.e., cognitively activating, the learners through tasks was not a priority for the preservice teachers in our sample. A potential explanation is that the preservice teachers do not see the need for such cognitive activation even though the literature highlights it (Brame, 2016). Furthermore, actively engaging learners was not prompted by the tools used in this study. If other software were used—for example, H5P (h5p.org)—the share of videos with more interaction would probably be higher. Although direct addressing could further highlight the relevance of the content to the learners (Kulgemeyer, 2018a), it might be a technique that the preservice teachers are not accustomed to from existing educational videos on the internet and, consequently, do not apply in their own videos.

Regarding representation and design, visualizations are often seen as a more challenging aspect of instructional explanations (Findeisen, 2017; Schopf et al., 2019). We could only partly replicate this finding for video explanations: visualizations were often used adequately, while the videos were rated lower regarding their design and their combination with spoken text. The smaller number of errors in visualizations could be explained by the fact that most visualizations used in the videos were not created by the preservice teachers while explaining (and thus were more likely the result of a systematic search process rather than a spontaneous byproduct of the explanation). Regarding the lower ratings for the combination of text and visualizations, it could be argued that preservice teachers have little knowledge of how to design multimedia material in a way that promotes learning. Here, the deployed tools could also have some impact, as they provide different opportunities or have certain default settings that might influence the combination of spoken text and visualizations. One tool, for example, uses certain spoken words as cues for the appearance of visualizations and thus automatically leads to high temporal contiguity. As a practical implication of our research, it seems helpful for future teachers to develop knowledge of multimedia principles as part of their teacher training, because the development (or evaluation/augmentation) of learning material will most likely be part of their job in light of continuing digitalization.

While precise language was a challenge for the preservice teachers, comprehensible language did not pose a problem for the majority of them. The lower ratings for precision can be partly explained by the fact that not all future teachers wanted to record their own voices, and videos with computer-generated voices automatically received a lower rating.

In terms of process structure, most videos had a clear structure. A possible explanation for not including a clear ending might be that the preservice teachers deemed the videos to be too short to provide a summary at the end.

Up to now, the assessment of video quality has been based only on the videos themselves. Consequently, further analyses are necessary to determine whether the aspects that (theoretically) should influence the quality of a video explanation are indeed beneficial for learners. Following Kulgemeyer (2018a), one possible approach might be to develop videos that systematically differ regarding the criteria and to test their effects on learners. It could be assumed that different criteria affect learners to different degrees. Such a study might therefore help to further develop the measure, as the criteria could be weighted according to their empirical effects.