A Systematic Meta-analysis of the Reliability and Validity of Subjective Cognitive Load Questionnaires in Experimental Multimedia Learning Research

For more than three decades, cognitive load theory has been addressing learning from a cognitive perspective. Based on this instructional theory, design recommendations and principles have been derived to manage the load on working memory while learning. The increasing attention paid to cognitive load theory in educational science quickly culminated in the need to measure its types of cognitive load — intrinsic, extraneous, and germane cognitive load which additively contribute to the overall load. In this meta-analysis, four frequently used cognitive load questionnaires were examined concerning their reliability (internal consistency) and validity (construct validity and criterion validity). Results revealed that the internal consistency of the subjective cognitive load questionnaires can be considered satisfactory across all four questionnaires. Moreover, moderator analyses showed that reliability estimates of the cognitive load questionnaires did not differ between educational settings, domains of the instructional materials, presentation modes, or number of scale points. Correlations among the cognitive load types partially contradict theory-based assumptions, whereas correlations with learning-related variables support assumptions derived from cognitive load theory. In particular, results seem to support the three-factor model consisting of intrinsic cognitive load, extraneous cognitive load, and germane cognitive load. Results are discussed in relation to current trends in cognitive load theory and recommendations for the future use of cognitive load questionnaires in experimental research are suggested.


Introduction
In psychological research, subjective measurements are often used to assess constructs that are not directly observable. Such scales include multiple items each of which aims to provide information about the construct under study (McNeish, 2018). From a psychometric viewpoint, measuring can be defined "as assigning of numbers to observations in order to quantify phenomena" (Kimberlin & Winterstein, 2008, p. 2276). Within educational psychology, cognitive load theory, postulating that learning is associated with a cognitive burden imposed on the learner's working memory (e.g., Mayer & Moreno, 2010; Sweller, 2020), is seen as one of the most influential frameworks. In recent years, there has been ongoing debate surrounding how to measure types of cognitive load reliably and validly in experimental settings (e.g., Brünken et al., 2003, 2010; Naismith et al., 2015; Schmeck et al., 2015). This meta-analysis aimed to examine the quality of the four most frequently used cognitive load questionnaires measuring types of cognitive load. This is done by examining the two quality criteria of reliability and validity, methodological requirements that questionnaires need to meet before they can be classed as high-quality measuring instruments (Kimberlin & Winterstein, 2008). Furthermore, the goal of this work is to quantitatively verify theoretical assumptions (i.e., the types of cognitive load and their interrelationships; Kalyuga, 2011; Sweller, 2010) of the cognitive load theory (CLT).

Working Memory and Cognitive Load
CLT, introduced in the 1980s by John Sweller (1988), is an established theoretical framework in educational psychology research, which applies our knowledge of human cognitive architecture and evolutionary educational psychology to instructional design (Sweller et al., 1998, 2019; Sweller, 2020, 2021). A central assumption of this cognitive theory is that learning arises from the interplay of working memory and long-term memory processes (Cowan, 2008; Sweller, 2016). Based on Cowan's (1999) embedded-processes model of working memory, both cognitive systems are not to be considered separately. Working memory is argued to be the activated part of the long-term memory indicating that a focus of attention is paid to learning processes (for an overview, see Schweppe & Rummer, 2014). Relatively uncontroversial is the assumption that the working memory system (or short-term memory system) is limited in its capacity implicating that only a limited number of elements can be processed simultaneously (Baddeley, 1986; Cowan, 2001; Miller, 1956). Furthermore, empirical findings suggest that novel information that is unknown to the learner is lost after a certain time if not repeated (Jonides et al., 2005; Peterson & Peterson, 1959). These constraints on capacity and duration hamper information processing because only a certain amount of information can be processed in the working memory simultaneously. In contrast, long-term memory stores retrievable information organized into schemata (Plass & Kalyuga, 2019; Schweppe & Rummer, 2014). Such schemata can help to overcome working memory limitations by chunking a certain amount of information into one element (Paas et al., 2003). To build such a schema, new information has to be selected, organized, and integrated into a coherent model (i.e., the SOI model; Mayer, 1996).
Finally, schemata are stored in long-term memory and can be retrieved, if necessary, into the working memory in order to facilitate learning with a complex task. Assuming that each element would have to be processed individually, this would exceed the capacity of the working memory. The automation of interacting elements leads to the fact that these can be processed unconsciously in the future and thus reduces the load on the working memory (Paas et al., 2003).
Cognitive load can be viewed as a multidimensional construct involving both mental load and mental effort (Paas et al., 2003). Both constructs play an important role in our understanding of cognitive load, though it is still unclear how or even whether these are related. For instance, Krell (2017) developed a questionnaire that explicitly distinguishes between mental effort and mental load. Generally, it is assumed that mental effort is assignable to the active investment of germane resources when learning (Klepsch & Seufert, 2021). Therefore, to transfer knowledge to long-term memory and to integrate it with previously gained prior knowledge, learners must themselves become active by directing their cognitive resources toward learning-relevant activities (Klepsch & Seufert, 2020; Krell, 2017). This process can be encouraged by the design of the learning material. In contrast, learning materials' inherent characteristics like the complexity and the presentation format are experienced passively by the learner through what Krell (2017) described as task-related load or mental load. Following this line of reasoning, Sweller et al. (2011) argue that mental load and mental effort should be seen as two distinct constructs, which usually correlate positively. As previously mentioned, the cognitive load imposed on working memory is caused by the learning task. To complete a task, that is, to learn, requires an amount of invested mental effort. According to Paas (1992), mental effort is characterized by the usage and allocation of cognitive resources, indicating that the amount of mental effort is a reliable estimate of someone's motivation to acquire new information. Instructional designs and procedures should support the learner to efficiently use the available working memory resources for schemata acquisition while optimizing the information processing ability (Chen & Kalyuga, 2020; Korbach et al., 2018).
For this purpose, CLT proposes design principles for instructional materials and procedures that aim at reducing unnecessary load on working memory and freeing up capacity for learning-related processing (Anmarkrud et al., 2019). As pointed out by Sweller et al. (2019), cognitive load is increased when unnecessary demands that tend to impede effective learning need to be processed. For instance, distracting elements within the learning environment can be a source of unnecessary cognitive load. However, working memory's efficacy can be enhanced when the learner has a certain level of domain-specific prior knowledge. Consequently, the learner is cognitively less burdened by the task (Feldon, 2007). The overriding goal formulated within the CLT is therefore to avoid cognitive overload which manifests itself in "that the processing demands evoked by the learning task may exceed the processing capacity of the cognitive system" (Mayer & Moreno, 2003, p. 45).

Types of Cognitive Load
Perhaps the best-known version of CLT stipulates three additive types of cognitive load (cf. Sweller et al., 1998). In contrast, the original version of CLT did not distinguish between the different types of cognitive load (e.g., Sweller, 1988). The CLT was consequently further developed by undertaking a subdivision into the intrinsic cognitive load (ICL) and extraneous cognitive load (ECL) in order to better explain the phenomenon that some learning materials are more difficult to learn than others (cf. Sweller & Chandler, 1994). The three-factor model including intrinsic load, extraneous load, and germane load was then developed by Sweller et al. (1998) in the mid- to late 1990s (see Fig. 1). The germane cognitive load (GCL) type was added to the model in order to account for findings in which germane load was increased for schema construction processes. Recent approaches have suggested that intrinsic and germane loads share the same theoretical foundation and can be classified as one type of load (e.g., Kalyuga, 2011; Sweller, 2010). Both versions, however, distinguish between intrinsic and extraneous cognitive load (Sweller, 2010; Sweller et al., 2019). In this vein, a factor analysis by Jiang and Kalyuga (2020) revealed that the intrinsic-extraneous model is suitable for assessing cognitive load. In contrast, a recent confirmatory analysis (Zavgorodniaia et al., 2020) has found strong support for the three-component model of ICL, ECL, and GCL. Furthermore, the most commonly used cognitive load questionnaires refer to the three-factor model and accordingly measure the three types of load separately. Therefore, for the aim of this work, the three-factor model was used.

Intrinsic Cognitive Load
Intrinsic cognitive load (ICL) describes the learning task's inherent complexity (Klepsch et al., 2017). Accordingly, this load is determined by the learning material's level of element interactivity and the learner's domain-specific prior knowledge (Leppink et al., 2013). Hereby, the element interactivity can be described as the number of elements that have to be processed in the working memory at the same time (Chen & Kalyuga, 2020). It is assumed that low prior knowledge combined with high element interactivity results in a high ICL. In contrast, learners with high prior knowledge have already formed schemata, which help them solve problems or learning tasks without placing an excessive load on their working memory. In conclusion, ICL can only be changed by modifying the complexity of what has to be learned or by enhancing the learner's domain-specific prior knowledge (Sweller et al., 2019).

Extraneous Cognitive Load
Extraneous cognitive load (ECL) is determined by how the learning materials are presented and organized (Sweller et al., 2019). When information is more difficult to process (e.g., through task-irrelevant details; Sundararajan & Adesope, 2020), not enough cognitive resources might be available for processing information relevant for learning as working memory capacity is exceeded. In this case, the learner is forced to compensate for the unfavorable presentation by additional cognitive effort (e.g., search processes to overcome split-attention effects; Schroeder & Cenkci, 2018). Consequently, a CLT-based recommendation is to keep the ECL as low as possible in order to enable successful learning. However, the additive character of the cognitive load types suggests that ECL in particular becomes important when the learning material generates a large amount of intrinsic load (Paas et al., 2003). In contrast, if ICL is low, the learner will have enough cognitive resources to also handle higher levels of ECL.

Germane Cognitive Load
However, over the years, the concept of germane cognitive load (GCL) has been undergoing revision (Kalyuga, 2011; Sweller, 2010). Because learning aims to build up schemata in long-term memory, the GCL refers to the working memory's resources needed to handle the intrinsic cognitive load imposed (Sweller et al., 2019). Accordingly, the learner should carry out activities like self-explanation or note-taking, which in turn contribute to learning. In contrast to the other two loads, the GCL represents a productive load (Moreno & Park, 2010). Following these assumptions, a high GCL is an indicator for engaged learners directing their cognitive resources to the learning process (Klepsch et al., 2017). In this vein, Kalyuga (2011) argues that the germane load is indistinguishable from the intrinsic load as both categories share the same theoretical background. Thus, the GCL does not represent a load in itself but rather has a distributive function, so that available working memory resources are free to handle the complexity of the learning material. The proposed intrinsic-extraneous cognitive load model removes the GCL as an autonomous type of cognitive load indicating that learning-relevant activities can be attributed to the ICL (Kalyuga, 2011; Sweller, 2010). With this assumption in mind, this work takes up the academic discourse regarding the number of cognitive load types in multimedia learning research.

Cognitive Load in Multimedia Learning
Empirical findings have shown that learning with multimedia can be enhanced when instructions follow the principles of CLT (Mayer & Moreno, 2003; Sweller et al., 2011). Hereby, multimedia learning is generally defined as learning from both pictures and words (Mayer, 2014). Learning is hence more effective when people actively construct coherent models from verbal and pictorial representations (Fletcher & Tobias, 2005). However, learning is not automatically encouraged just because instructions may include words and pictures. Thus, not all multimedia learning settings are considered equally conducive to learning. As pointed out by Mayer (2014), instructional designers should build on assumptions of human cognitive architecture and thus also consider CLT when providing multimedia learning environments and materials for learners.
On the one hand, several design recommendations have been made regarding how best to support learners to transfer new information into long-term memory and integrate it with pre-existing knowledge (Sweller et al., 2019). These recommendations primarily focus on reducing extraneous processing while encouraging learners to manage any ICL induced by the difficulty of the learning material (Mayer & Moreno, 2010). To assist learners in managing the inherent complexity of certain tasks, the principles of segmenting (Rey et al., 2019) and pre-training (Mayer et al., 2002) have been formulated within the CLT framework. Presenting the learning content in learner-paced segments or providing learners with relevant information for the upcoming learning material is assumed to make learning easier, even when the learning materials are highly complex. In addition, the isolated elements effect describes the learning benefit that arises when information with high element interactivity is first learned in an isolated form (Pollock et al., 2002). Once the elements are stored in long-term memory, interactions between them can be learned in order to create a coherent model (Sweller et al., 2019).
On the other hand, several design principles have also addressed how to reduce ECL as it does not support and can even impede learning (Sweller, 2010). Because extraneous processing may occur due to unnecessary search processes, the split-attention effect describes the learning-hindering effect when corresponding information sources need to be cognitively integrated by the learner (Ayres & Sweller, 2014). To counter such additional cognitive processes, CLT recommends following the principles of spatial and temporal contiguity (Ginns, 2006). Accordingly, related pieces of information should be presented as close to each other as possible with respect to spatial and temporal proximity. A more integrated format thus facilitates integration of learning-relevant information. Another way to save working memory resources is to avoid redundancies such as presenting the same information both aurally and visually. Accordingly, the redundancy effect (Kalyuga & Sweller, 2014) refers to the negative impact on learning of presenting the same information in multiple ways.

Connections Among Cognitive Load Types
The CLT is an instructional framework finding wide application in multimedia learning research (Brünken et al., 2003; Paas & Sweller, 2014). In general, the different cognitive load types (ICL, ECL, and GCL) are presumed to add to the overall load (additivity hypothesis; Moreno & Park, 2010). In this vein, the different types of cognitive load sum to the total cognitive load, although this assumption only applies if the capacity of working memory is not exceeded (Paas et al., 2003). When the cognitive load is approaching the limit of the working memory's capacity, the cognitive load types and their relationships can dynamically change. The theoretically based expectation that one load decreases when the other increases makes it easier to understand why learning materials are easy or difficult to learn. For example, when cognitive resources are depleted and ECL increases, fewer resources are available for germane processes and, thus, GCL decreases. Consequently, inconsistent connections between cognitive load facets can be assumed depending on the learning task including its complexity and presentation. Furthermore, the assumptions formulated very clearly in the CLT may differ from the subjective perceptions of the learners. It is questionable whether learners are able to differentiate between the different types of cognitive load and whether questionnaires can differentiate between the types of cognitive load because of item construction and formulation. Consequently, methodological issues may cause the types of cognitive load to be related differently than formulated in the additivity hypothesis.
For instance, it can be assumed that the ICL and ECL should not be correlated since both loads are associated with different aspects of the learning materials (Sweller et al., 2019). Learners should therefore be able to differentiate between the tasks' inherent complexity (ICL) and the presentation of the learning material (ECL). Nevertheless, it can be argued that both sources of cognitive load cannot be assessed in a differentiated manner by learners. In this vein, it seems plausible that a complex learning content (e.g., biochemical processes) cannot be represented in a simple way and, thus, increases the ECL because of the complex presentation.
Based on the assumption that the ICL and GCL share a common theoretical background (Kalyuga, 2011), both variables should show an interdependency (measurable as correlation). The GCL is therefore not a load in itself but rather allocates available working memory resources to activities relevant to learning, that is, to dealing with the intrinsic load (Sweller, 2010). However, this assumption is also questionable in light of the active load vs. passive load perspective (Klepsch & Seufert, 2021). This perspective argues that the ICL results from the complexity of learning materials and is experienced passively by the learner, while the GCL relates to the allocation of cognitive resources and is, therefore, of an active nature. The distinction between passive and active load could result in both variables not correlating with each other.
Predicting relationships between the ECL and GCL is difficult at first glance. As the GCL refers to the allocation of cognitive resources to learning-relevant activities (Bannert, 2002), its active character is evident. Thus, learners are responsible for investing cognitive resources in germane processes actively (i.e., active load). In contrast, learners experience ECL passively as a result of how learning materials are presented (i.e., passive load; Klepsch & Seufert, 2021). This distinction should result in the two loads not correlating with each other. Alternatively, a learning material not optimally designed (causing higher ratings of the ECL) could be related to a lower GCL because learners are less motivated to learn and hence make less of an effort. In this vein, the cognitive load caused by the learning material can be categorized as a motivational cost (e.g., Feldon et al., 2019). To sum up, the additivity hypothesis of the CLT is unlikely to hold fully in practice since methodological as well as theoretical restrictions have to be considered.
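The additivity hypothesis discussed in this section can be written compactly. The following is only a sketch of the verbal statement given above: the symbols CL_total (overall load) and C_WM (working memory capacity) are notation assumed here for illustration and do not come from the cited sources.

```latex
% Additivity hypothesis: the three load types sum to the overall load,
% which is assumed to hold only while working memory capacity is not exceeded.
\[
  \mathrm{CL}_{\text{total}} = \mathrm{ICL} + \mathrm{ECL} + \mathrm{GCL},
  \qquad \mathrm{CL}_{\text{total}} \le C_{\mathrm{WM}}
\]
% At the capacity limit, the resources left for germane processing are
\[
  \mathrm{GCL} = C_{\mathrm{WM}} - \mathrm{ICL} - \mathrm{ECL},
\]
% so, with ICL held fixed, an increase in ECL implies an equal decrease in GCL.
```

Read this way, the trade-off between ECL and GCL described above follows only at the capacity limit. Below that limit, the formulation by itself does not force any particular correlation between the load types.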

Connections of Cognitive Load Types with Theory-Related Concepts
It is common to conduct cognitive load research in connection with learning tests to understand to what extent the intervention has contributed to successful learning. In line with Mayer (2001), knowledge gained in multimedia learning can be divided into two categories: retention and transfer. Retention is defined as remembering when the information explicitly mentioned in the learning material is asked for. In contrast, transfer is related to the application of acquired knowledge, for example, in new contexts (Mayer, 2001). Learning in both categories is typically assessed in experimental studies through multiple-choice or open question formats, among others. Accordingly, the retention-transfer differentiation is adopted in a wide range of experiments in multimedia learning research (e.g., Albus et al., 2021; Beege et al., 2019b; Bender et al., 2021; Stárková et al., 2019).
Theoretical foundations of CLT postulate direct relationships between the cognitive load types and learning outcomes. Thus, it is assumed that learning materials that are difficult to encode lead to more extraneous processing, which in turn reduces learning outcomes because additional cognitive resources, irrelevant for learning, are wasted. Concerning ICL, learning materials should be designed so that the task's inherent element interactivity is easier to handle (Mayer & Moreno, 2010). Furthermore, it can be hypothesized that ECL negatively affects learning performance and that this can be justified on the basis of theoretical assumptions. Thus, inappropriately designed or organized learning materials require additional cognitive resources, which are consequently no longer available for actual learning (Sweller, 2010). In terms of GCL, it can be derived that a higher GCL leads to higher learning outcomes because it represents an active load (Klepsch & Seufert, 2021; Sweller, 2010). Accordingly, instructional designs and procedures should challenge and motivate the learner to invest cognitive resources for understanding (Mayer & Moreno, 2010). In line with this reasoning, attempts have been made within CLT to increase GCL (Paas & Van Merriënboer, 1994). Because increasing learning performance is the goal, greater GCL could lead to higher learning scores.
It is further assumed that the learner's domain-specific prior knowledge affects cognitive load and learning outcomes (Chen et al., 2017; Zu et al., 2021). Hereby, it is common to classify learners as novices or experts depending on the amount of their prior domain knowledge (Kalyuga & Renkl, 2010). In this vein, the expertise reversal effect states that the learner's domain-specific knowledge has a moderating effect on the effectiveness of CLT-based design recommendations (Chen et al., 2017). Consequently, the expertise reversal effect can be an additional source of ECL. Design decisions can enhance or reduce ECL perceptions depending on the prior knowledge of the learner. Accordingly, the interaction between the learner's expertise and the instructional procedures can lead to a reversal effect indicating that novices benefit more from an instructional intervention (reduced ECL), whereas experts may not benefit due to redundancies and associated inferences (no change or even enhanced ECL; Kalyuga, 2007).
Following the generally accepted definition of the ICL (e.g., Leppink et al., 2013), the domain-specific prior knowledge should correlate negatively with this cognitive load type. The more prior knowledge someone has, the less complex the learning material is perceived to be, and vice versa. In this vein, one can assume that experts (with high prior knowledge) would assess a task involving a high ICL as less complex than novices (with low prior knowledge; Artino, 2008). Furthermore, learners with a high domain-specific prior knowledge can use already formed schemata while learning, making them less susceptible to poorly formatted learning materials that would tend to induce a high ECL (Paas et al., 2003). Accordingly, the domain-specific prior knowledge and the ECL should show a negative correlation indicating that learners with relatively high expertise report fewer ECL perceptions. Lastly, relationships between prior knowledge and GCL can also be postulated. With the assumption in mind that the GCL is indirectly related to the element interactivity of the learning material (Zu et al., 2020), it can be assumed that the prior knowledge and the germane load should correlate positively with each other. The more domain-specific prior knowledge (in the form of schemata) the learner has, the easier it is to allocate germane resources to learning-relevant activities.

Measuring Cognitive Load with Subjective Scales
Because working memory load is a key component of the CLT framework, measuring this load has been a high priority for researchers (Paas et al., 2003; Sweller, 2018; Sweller et al., 2011). However, cognitive load measurement is still an ongoing challenge in educational research (e.g., Ayres, 2018; de Jong, 2010; Kirschner et al., 2011; Moreno, 2010). In recent years, cognitive load research has adopted several measurement methods, with approaches divided into subjective scales and objective measures (Brünken et al., 2003). While self-reports are highly subjective, dual-task paradigms, learning outcomes, and physiological data are relatively objective methods. For example, cognitive load can be measured by asking learners to estimate their perceived cognitive load based on a Likert scale (direct) or by measuring indicators that are assumed to be related to cognitive load (indirect). In this vein, dual-task approaches and physiological measures are promising alternatives to rating scales for measuring cognitive load but are beyond the scope of the current study.

Unidimensional Measurement of Cognitive Load
Subjective measures are still the most frequently used approach in educational research (e.g., Schmeck et al., 2015). Hereby, the learner is asked to assess and self-report the perceived amount of cognitive load while learning or working on a task (Sweller et al., 2011). This assessment is usually made after learning has taken place (Jiang & Kalyuga, 2020; Paas & Van Merriënboer, 1994). Accordingly, such instruments are applied on the assumption that individuals can give an accurate assessment of their experienced cognitive load, even if the questionnaire is conducted with a time delay (Ayres, 2006). In practical research, cognitive load is typically measured with numerical Likert-type rating scales in order to carry out statistical analyses (Ouwehand et al., 2021). The most popular rating scale for subjective measurement of cognitive load for educational purposes was proposed in the early 1990s by Paas (1992). Hereby, learners are asked to rate their invested mental effort while learning on a nine-point single-item scale ranging from "very, very low mental effort" to "very, very high mental effort." It is assumed that mental effort is an indicator of cognitive load. It has been shown that the Paas scale is sensitive for measuring differences in intrinsic cognitive load while learning (Naismith et al., 2015; Sweller et al., 2011). It should be noted that several studies adapted the scale by asking participants to rate the difficulty of the learning task (van Gog & Paas, 2008). This difference between invested mental effort and perceived task difficulty can quickly lead to problems of interpretation because learners are less motivated to invest mental effort when the learning is perceived as extremely difficult (e.g., Cennamo, 1993). Nevertheless, this scale enjoys frequent use as it is easy to implement and fast to handle for learners (Sweller, 2018).
However, while it seems to be methodologically economical to measure this variable with one item, this is questionable from a psychometric point of view (e.g., Jiang & Kalyuga, 2020; Klepsch et al., 2017). Moreover, a measuring instrument consisting of only one item makes it impossible to calculate internal consistency, an important indicator for an instrument's reliability. With the proviso that the questionnaire is applied several times in one experiment (e.g., after each chapter of the learning material), it is possible to calculate test-retest reliability. However, since cognitive load can vary during learning and is therefore dynamic in nature, calculating test-retest reliability could lead to misleading reliability values. In this vein, test-retest reliability is only valid when constructs stable over time are examined (e.g., Baumeister, 1991).
To avoid such methodological problems, Leppink et al. (2013, 2014), for example, have recommended using multiple items, which allows for a more precise cognitive load measurement. Accordingly, measuring cognitive load without differentiating between the individual cognitive load types seems to be insufficient when evaluating the effectiveness of multimedia learning environments or interventions (e.g., van Gog & Paas, 2008).
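Internal consistency of a multi-item scale is commonly estimated with Cronbach's alpha. This section does not say which coefficient the included studies report, so treating alpha as the estimate here is an assumption. The following minimal sketch shows why at least two items are required. The function name `cronbach_alpha` and the ratings are hypothetical and purely illustrative.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    if k < 2:
        # A single-item scale has no inter-item covariance to analyse.
        raise ValueError("internal consistency needs at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([resp[i] for resp in item_scores]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings of three items by five learners (illustrative data only).
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [6, 5, 6],
    [7, 8, 7],
    [3, 2, 3],
]
print(round(cronbach_alpha(ratings), 2))
```

Calling the function on a single-item scale raises an error instead of returning a value, which mirrors the point that a one-item instrument does not permit an internal consistency estimate.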

Multidimensional Measurement of Cognitive Load
Taking up this criticism, several studies have introduced measurements that target the cognitive load types separately (e.g., Eysink et al., 2009; Klepsch et al., 2017; Leppink et al., 2013, 2014). What these scales have in common is that they focus on certain cognitive load types and can therefore differentiate load more precisely. The instrument by Eysink et al. (2009) consists of six items (see Appendix A). What makes this questionnaire unique is that besides targeting ICL, ECL, and GCL, one additional item is included that measures the perceived overall cognitive load. However, this item asks learners to indicate the amount of effort they invested in following the learning material. The constructs ICL and GCL are only measured with one item each, so that no conclusions can be drawn about their internal consistency. While the ICL item asks participants to estimate the perceived difficulty of the learning material, the GCL item asks how easy or difficult it was to understand the learning content. In addition, three items concerning extraneous load refer both to the navigation and to design of the learning task, as well as to the accessibility of information.
Four years later, Leppink et al. (2013) developed a multidimensional scale (see Appendix B) including ten items referring to ECL (three items), ICL (three items), and GCL (four items). The authors conducted four studies with participants learning statistics to validate the questionnaire. Concretely, items representing ICL asked participants to estimate the complexity of presented topics, the learning activity, and covered formulas and definitions. In addition, the items representing ECL refer to the instruction and explanations in terms of their unclearness and ineffectiveness. The items concerning the GCL asked the participants to assess the extent to which the learning activity has enhanced their understanding and knowledge of the learning topic. Generally, the three-factorial structure with ten items was supported in the study by Leppink et al. (2013), though with some limitations. In particular, the GCL and learning outcomes did not correlate, a finding which contradicts theoretical assumptions of CLT. Moreover, the proposed model could only be partially supported in two studies. These issues encouraged the authors to review their proposed measurement (cf. Leppink et al., 2014; Appendix C). Since the learning topic of statistics was chosen in the first validation approach, two follow-up studies were conducted in order to examine whether the instrument is also reliable for other learning contexts. In addition, these studies should provide further evidence that the instrument can distinguish between the three types of cognitive load. First of all, the two studies supported the differentiation between items measuring intrinsic and extraneous cognitive load. However, in line with the recent reconceptualization of GCL (e.g., Kalyuga, 2011; Sweller, 2010), "the assumption that the third factor in the psychometric instrument represents or closely relates to germane cognitive load is limited" (Leppink et al., 2014, p. 40) indicating that the three-factor model may not be fully adopted. In addition, Leppink et al. (2014) criticize the measurement by arguing that item responses on ICL, ECL, and GCL give no indication of how much mental effort the learners invest. Thus, no conclusions can be drawn about the load imposed on the working memory. Addressing this problem, one item was added to each type of cognitive load, which targets the mental effort invested in each factor to examine more directly the relationship between cognitive load and learning outcomes. The added mental effort items increase the reliability of the intrinsic and extraneous load factors, but not of the germane load factor. Both versions of the measurement (Leppink et al., 2013, 2014) enjoy great popularity and are frequently used in experimental studies dealing with multimedia learning settings (as shown in a review by Mutlu-Bayraktar et al., 2019).
Klepsch et al. (2017) introduced another cognitive load self-report measurement trying to eliminate potential inconsistencies (see Appendix D). Hereby, ICL (two items) and ECL (three items) items refer to the task's complexity and the design of the learning material, while the GCL (three items) "should focus on the additional investment of cognitive processes into learning" (Klepsch et al., 2017, p. 5). In contrast to the Leppink questionnaires (2013, 2014), the instrument from Klepsch et al. (2017) is not specific to the learning topic and can therefore be easily adapted to the material used in a specific study (e.g., animation, video, or text). Like the scales from Leppink and colleagues (2013, 2014), the self-report measures differentiate all three cognitive load types. The authors validated the instrument with two different strategies. First, they used an informed rating: Students were trained to understand and differentiate between the types of cognitive load.
The training consisted of a PowerPoint lecture that introduced students to the notion of cognitive load. After the training, the students were expected to be able to detect cognitive load types and to distinguish them from one another. Second, they used a naïve rating: students had to rate the same learning situations without being informed about the cognitive load types beforehand. Overall, informed ratings seem to be a promising instrument to measure the different types of cognitive load. However, giving participants an introduction to the CLT framework is not always possible in experimental studies. The naïve rating scale is much easier to handle and less time-consuming. The authors emphasize the broad possibilities for applying this method in several learning domains and studies. However, the results suggested that the GCL items should be used with caution because of varying levels of understandings on the part of respondents. The authors hence recommend conducting a reliability test. In the past few years, the naïve rating has been frequently used in experimental studies and is therefore part of this analysis.

Reliability and Validity of Subjective Cognitive Load Measurements
In general, the quality of a psychological test or measurement can be evaluated by means of three primary indicators: objectivity, reliability, and validity (Adams, 1936; Moosbrugger & Kelava, 2020). These quality criteria must be met in order to adequately measure a psychological construct using a questionnaire. For the aim of this work, the reliability and validity (Drost, 2011) of the cognitive load questionnaires are of central importance. To approach these constructs, several methods have been suggested in recent research.

The Internal Consistency of Subjective Cognitive Load Measurements
Reliability describes the consistency of a measuring instrument (Schuurman & Hamaker, 2019). Test theory assumes that a reliable instrument contains as little measurement error as possible, which maximizes the proportion of variance that arises due to actual differences in the construct to be measured (Kimberlin & Winterstein, 2008). Given a hypothetical situation in which the measurement is replicated, a reliable measurement should produce the same results under the premise that the measured construct remains unchanged (Heale & Twycross, 2015). Reliability scores can be calculated with the help of statistics that give an indication of the extent to which the instrument is free from measurement errors. Several well-known methods for estimating reliability have been established -internal consistency, parallel test, and test-retest (Schuurman & Hamaker, 2019), while various authors also rely on the split-half method to measure reliability (e.g., Cho, 2016;Thompson et al., 2010). To measure parallel test reliability, two different versions of an instrument measuring the same construct are presented to the participants several times with the two measuring instruments only differing in their wording. Concerning test-retest reliability, the procedure is similar to the parallel test method. However, the same instrument (with identical wordings) is given to participants more than once (Heale & Twycross, 2015). The third method, split-half, involves splitting the scale into two parallel halves which are then correlated (Cho, 2016). This procedure assumes that a test that is supposed to measure a construct should do this consistently across the scale with each item. The internal consistency, which is central to the aim of this work, is an estimate of the degree to which the items of the scale measure the same concept (Drost, 2011). 
Cronbach's alpha (α; Cronbach, 1951) is the most frequently used indicator for internal consistency (Cho, 2016; Hogan et al., 2000; Osburn, 2000; Streiner, 2003). The alpha value represents the average of all split-half reliabilities (Cortina, 1993; Ferketich, 1990). The instrument is randomly split into two halves, whereby the correlation between the sum scores estimates the reliability of the half test (Warrens, 2015). To infer the reliability of the full test, an estimate correction is needed (Revelle & Zinbarg, 2009). The resulting Cronbach's alpha value should thus reach comparatively high values since a high value is equivalent to a high proportion of explained systematic variance. In general, it can have values between zero and one, but negative values are also possible when some items of the scale are negatively correlated (Vaske et al., 2017). Nevertheless, specifications are rare regarding how high the value must be in order to meet the requirement of representing a reliable test. Recommended minimum values for an adequately reliable scale range from 0.65 to 0.80 (Green et al., 1977) up to 0.90 (Streiner, 2003). In the social sciences, a value above 0.70 is generally accepted (Nunnally, 1978; Taber, 2018). At the same time, a very high alpha value can indicate redundancy between the items. Following Streiner (2003), an internal consistency of 0.90 and above, indicating high correlations between the items, may suggest that some items of the scale are redundant. These items are assumed to test the same question or statement in another guise (Tavakol & Dennick, 2011a). Concerning the cognitive load types, reliable scales should be able to measure the different sub-types with high internal consistency. Accordingly, this work focuses on internal consistency.
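The computation behind Cronbach's alpha can be illustrated with a minimal sketch. It assumes the common variance-based form of the coefficient, α = k/(k − 1) × (1 − sum of item variances / variance of the sum score); the function name and the rating data are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the respondents' sum scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings of five learners on three ICL items (7-point scale)
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [5, 6, 6],
    [3, 3, 4],
    [6, 7, 6],
]
print(round(cronbach_alpha(ratings), 3))

# Negatively related items can push alpha below zero
print(cronbach_alpha([[1, 3], [2, 2], [3, 2]]))
```

The second call shows why negative values are possible: when items are negatively related, the sum of the item variances exceeds the variance of the sum score.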

The Validity of Subjective Cognitive Load Measurements
A reliable measurement should not automatically be assumed to be of high quality. It must also meet the standard of validity, which is generally defined "as the extent to which an instrument measures what it purports to measure" (Kimberlin & Winterstein, 2008, p. 2278; see also Kane, 2001). In this vein, Cook and Beckman (2006) as well as Kane (2013) pointed out that validity is not a property of the measuring instrument, but rather of the interpretation of what it measures. The results of a test or instrument are valid when the interpretations are justifiable in the context of the test's intended use (Kimberlin & Winterstein, 2008). As outlined by Kane (2001), resulting evaluative judgments are based on the degree to which theoretical evidence supports the interpretation of the test. Hereby, competing interpretations should also be considered (e.g., Messick, 1989). Validating a measuring instrument is therefore not a routine task but rather a close linkage of theory-based assumptions and test data. Accordingly, the underlying construct (e.g., cognitive load while learning) can never be perfectly reflected by the test, where the aim is to achieve correlations that are as high as possible (Cook & Beckman, 2006).
It is generally accepted that validity is a multifaceted construct. Consequently, in the literature, there are three main approaches to investigate the validity of a psychological test -content validity, construct validity, and criterion validity (Heale & Twycross, 2015). Content validity describes the extent to which the items of a scale are representative of the targeted construct so that the scale measures all relevant aspects (Almanasreh et al., 2019). Assessing the content validity is mostly conducted with the help of expert opinions. In this vein, the content validity index (CVI) is calculated based on item relevance ratings made by experts (Polit & Beck, 2006). Consequently, content validity plays an important role in the instrument development phase (ideally on the basis of expert surveys and reviewers in the publication process) and is therefore not part of this analysis. Instead, construct validity and criterion validity (including explicitly measurable variables) were the main focus.

Construct Validity
Introduced by Cronbach and Meehl (1955), construct validity refers to the concordance between the results of the measurement and the underlying theory. In this vein, the instrument should measure all relevant facets of the concept adequately. With the introduction of the construct validity, our understanding of the concept validity has changed. The question was no longer whether a psychological test measures what it is supposed to measure, but how it fits into the nomological network with other theoretically related constructs (Borsboom et al., 2004;Colliver et al., 2012). Quantifying the construct validity of an instrument is mostly conducted by identifying correlations with several measures. The resulting correlation patterns provide information about the degree of conformity between the measure and theoretically predictable variables (Westen & Rosenthal, 2003). Based on Campbell and Fiske (1959), construct validity can be divided into convergent and discriminant validity. Convergent validity is given when different instruments which aim to measure a common construct correlate highly with each other. In contrast, when instruments measuring different constructs do not correlate with each other, this effect is known as discriminant validity.

Criterion Validity
In terms of achieving criterion validity, psychological constructs like cognitive load should have a high degree of compliance with practically relevant external criteria (Dunn, 2020). Accordingly, the scale is not considered separately but in connection with other practically significant variables (Drost, 2011). Criterion validity can be classified into two types based on the timing of the measurements. If data is collected using the measuring instrument before data collection on the criterion, one speaks of predictive validity. Hereby, the scale's ability to predict the criterion variable is tested (Kimberlin & Winterstein, 2008). A measure can also be assessed in relation to relevant criterion variables at the same time (concurrent validity; Westen & Rosenthal, 2003).

The Present Work
Subjective judgments about perceived cognitive load during learning with multimedia can be associated with certain weaknesses. In this vein, there is an ongoing debate among researchers on how to assess cognitive load reliably and validly with self-rating scales (e.g., Brünken et al., 2003, 2010; Schmeck et al., 2015). Reliably and validly assessing the three types of cognitive load "has become the holy grail of CLT research" (Kirschner et al., 2011, p. 104). Nonetheless, subjective cognitive load measures are the most frequently used approach in educational research, as they can be easily implemented in experimental settings without taking up too much time. In order to verify the extent to which questionnaires provide reliable and valid insights about the construct of interest, the measuring instrument must meet the quality requirements already explained. The aim of this work is therefore to conduct a meta-analysis of cognitive load questionnaires with regard to their internal consistency and validity across studies. As questionnaires are always constructed in a theory-driven manner, it is also of great importance to examine the extent to which the cognitive load types correlate (1) among each other and (2) with important external criteria (e.g., learning outcomes) for validity purposes. In order to check the reliability and validity of cognitive load questionnaires, the four widely used instruments were chosen (see "Measuring Cognitive Load with Subjective Scales"): Eysink et al. (2009); Leppink et al. (2013, 2014); and Klepsch et al. (2017).
The first part of the analysis focuses on internal consistency as a sub-type of reliability. In addition, moderator analyses were conducted to examine whether and how the internal consistency value Cronbach's alpha is influenced by relevant CLT-related factors (Hall & Rosenthal, 1991). In this vein, the moderators of educational setting, the domain of the instructional material, the presentation mode of the learning material, and the number of response options (i.e., the number of scale points) were considered. Based on the insights gained, recommendations will be formulated on how to use subjective cognitive load scales in experimental multimedia learning settings. The second part focuses on the validity of cognitive load questionnaires, specifically, construct validity (i.e., how well the different questionnaires represent the theoretical assumptions underpinning CLT) and criterion validity (i.e., how cognitive load relates to important external variables, which in our case are the learning measures known as retention and transfer, as well as domain-specific prior knowledge).

Search Strategy
This study focused on articles published in peer-reviewed journals, which used the cognitive load measurements from Eysink et al. (2009), Klepsch et al. (2017), and Leppink et al. (2013, 2014). These scales were selected as they measure all cognitive load types with multiple items. To find suitable studies, a literature search was carried out and finished on August 25, 2021. The "cited by" function of Google Scholar (for a review proving the adequacy for literature search, see Martin-Martin et al., 2017) was consulted to access the listing of the respective study. In this way, an overview could be gained of all works which cited the respective scale validation studies (Eysink et al., 2009; Klepsch et al., 2017; Leppink et al., 2013, 2014). All of these studies (N = 1193) were then evaluated in terms of the following inclusion criteria (for an overview of the search process, see Fig. 2).
To be included in the meta-analysis, experimental studies had to be carried out in the field of multimedia learning indicating that a learning material or setting was intentionally manipulated. In the fields of multimedia learning and cognitive load, controlled and randomized experiments are the common and ideal ways to research and sustainably improve instructional scenarios and materials (Borman, 2002; Cobb et al., 2003; Sweller, 2021). Consequently, only experimental studies were included. In this vein, it is important to rely on reliable and valid measurement methods and, therefore, experimental studies are part of the analysis. At least one multimedia learning medium had to be included in the experimental setting of the respective study. Thus, experimental studies were included in which a multimedia learning material was intentionally manipulated (e.g., varying the font of the learning text) or in which the handling of the learning material was intentionally varied (e.g., tracing the content of the learning material with the fingers; Tang et al., 2019). Only studies published in English were included to ensure transparency in the scientific community. Moreover, only peer-reviewed journal papers were included to ensure methodological quality (see Castro-Alonso et al., 2019). In addition, at least one cognitive load facet was measured with a pre-set list of subjective questionnaires (Eysink et al., 2009; Klepsch et al., 2017; Leppink et al., 2013, 2014). All articles included were scanned for relevant data (reliabilities and correlations; see supplementary material). While most of the studies reported reliability values, correlation matrices were often missing. In the case of missing data, the corresponding author of the respective study was contacted by email and asked to fill in a prepared matrix and to send it back to the authors of this study.
The matrix included a correlation matrix in which the authors should complete correlations on construct level between the concepts relevant for this work. Specifically, this matrix involved the constructs ICL, ECL, GCL, prior knowledge, retention, and transfer. When studies reported no relevant data within the manuscript and the supplementary material, or the authors did not reply to our email, the study was excluded from the analysis. Even if data was incomplete (e.g., a study reported reliability values, but no correlations for validity analyses), it was nevertheless included in the analysis in order to collect as much data as possible. This resulted in the different numbers of effect sizes used in the calculation of the meta-analysis. An overview of all studies included in the meta-analysis is given in Table 1.

Sample Characteristics
Overall, 53 articles including 67 experiments (N = 7413 participants) were considered for meta-analysis. Sample sizes, which were relevant for this work, ranged from N = 20 to N = 485. The mean age of the participants was 20.4 years, and the overall percentage of women was 63.5%. The mean sample size was M = 103.3 (SD = 68.0). Most experiments (N = 33) included the questionnaire from Leppink et al. (2013) while the related questionnaire from Leppink et al. (2014) was used in seven experiments to measure cognitive load. Moreover, the Klepsch et al. (2017) questionnaire was used in 20 experiments while 10 experiments used the questionnaire from Eysink et al. (2009). In three studies (Skulmowski & Rey, 2018, 2020a; Thees et al., 2021), two different cognitive load questionnaires were used.

Measures of Reliability and Validity
For this meta-analysis, theory-based dependent variables (i.e., the reliability and validity) were defined in advance and related data was subsequently collected in the course of the literature search. Concerning reliability, this work focused on the internal consistency of the cognitive load questionnaires (Ferketich, 1990). Accordingly, the Cronbach's alpha values (Cronbach, 1951;Osburn, 2000;Streiner, 2003) of the respective cognitive load types were collected and meta-analytically calculated. Following the three-factor model (Klepsch et al., 2017;Sweller et al., 1998;Zavgorodniaia et al., 2020), the alpha values were collected separately for the ICL, ECL, and GCL. In terms of construct validity, correlations between the individual cognitive load types were collected and meta-analytically calculated. Following the already mentioned retention-transfer approach (Mayer, 2001), correlation calculations between the cognitive load types ICL, ECL, and GCL were conducted with retention and transfer performance to gain deeper insights into the criterion validity of the cognitive load questionnaires. Because retention and transfer have a different focus on knowledge gain (e.g., Mayer, 2001), correlations were calculated separately.
Since the learner's domain-specific prior knowledge plays an important role within the CLT framework (e.g., Chen et al., 2017), correlations between this construct and the cognitive load types ICL, ECL, and GCL were also calculated meta-analytically.

Moderating Variables
Besides calculating main effects in a first step for the internal consistency meta-analysis, moderator analyses were conducted in a second step (Hall & Rosenthal, 1991). For this purpose, sub-groups were formed based on predefined criteria, the moderating variables. As cognitive load perceptions may depend on the learner's age and educational background, the educational setting in which the respective study was conducted was captured. Hereby, the classifications school education (involving primary and secondary education) and adult education (involving higher education and adult education) were chosen. Assuming that learners with increasing age are better able to make metacognitive assessments such as for the perceived cognitive load (van der Stel & Veenman, 2010), it could be hypothesized that the educational setting affects the consistency of the respective questionnaires. In this vein, Leahy (2018) pointed out that subjective questionnaires are problematic when these are used with children. As the second moderating variable, the domain of the instructional material was considered important as the CLT finds application in a wide range of learning domains (e.g., Alpizar et al., 2020; Brom et al., 2018; Rey et al., 2019). Hereby, four different sub-groups were defined: natural sciences, social sciences, logic and mathematics, and other. Within multimedia learning research, instructional material can be classified based on its presentation mode. Thus, four sub-groups were built describing the learner's intervention and control options of the learning material (e.g., Weidenmann, 2002). First, static learning material includes non-moving text and/or pictures (e.g., Schneider et al., 2015). Second, dynamic materials are characterized by moving images, as is the case with videos or animations.
However, in some dynamic learning materials, the learner has no control over the progress, indicating that the video or animation cannot be paused or rewound (i.e., system-paced materials). Third, multimedia learning materials presented in an interactive presentation mode allow learners to have full control over the progress (i.e., learner-paced materials). Hereby, the possibilities range from playing an educational video game (e.g., Nebel et al., 2016) to moving the learner's head around in a virtual reality environment (e.g., Albus et al., 2021). Fourth, when a study experimentally manipulated the presentation mode (e.g., Andrade et al., 2015; Dervić et al., 2019), this study was classified as mixed. The same classification was used in a meta-analysis by Schroeder and Cenkci (2018). From a psychometric point of view, the number of response options (i.e., the number of scale points) plays an important role and has been examined in a variety of psychological studies (e.g., Lissitz & Green, 1975; Matell & Jacoby, 1972; Simms et al., 2019; Wakita et al., 2012). For instance, there is empirical evidence that scales involving an odd number of options are not suitable to measure a construct (Dalal et al., 2013). Nevertheless, there are no clear recommendations on how many answer options should be provided when learners are asked to rate their perceived cognitive load on a subjective scale. Therefore, the number of answer options was collected for each study included in this meta-analysis.

Analysis Methods
In general, a meta-analysis collects empirical studies addressing the same research question to calculate the mean and variance of a population effect (Field & Gillett, 2010). Meta-analyses from the research field of educational psychology usually report weighted average effect sizes (e.g., Cohen's d; Hedges' g) when examining the effectiveness of learning formats or principles. The focus of this work, however, is to examine (a) the internal consistency via Cronbach's alpha and (b) the validity of the cognitive load questionnaires designed by Eysink et al. (2009), Klepsch et al. (2017), and Leppink et al. (2013, 2014). Both analyses were computed following the same pattern based on correlations. The meta-analytical procedure was carried out using JASP version 0.15 (JASP Team, 2021). An effect size was considered significant (p < 0.05) when the associated confidence interval (CI) did not include zero (Hedges et al., 1992; Nakagawa & Cuthill, 2007). In addition, moderator analyses were calculated with SPSS version 28.0 (IBM Corp, 2021). To compare the effect sizes (i.e., aggregated reliability and validity estimates) for significant differences, the 95% confidence intervals (CI) were consulted (Cumming & Finch, 2005). If an effect size was not included in the confidence interval of another effect size to be compared, it was assumed that a significant difference exists.
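The two decision rules based on confidence intervals can be written down as a short sketch. The function names and the interval bounds in the example are hypothetical; only the rules themselves are taken from the description above.

```python
def is_significant(ci_lower, ci_upper):
    """An effect counts as significant when its 95% CI excludes zero."""
    return not (ci_lower <= 0.0 <= ci_upper)


def differs_from(estimate, other_ci_lower, other_ci_upper):
    """An estimate counts as different when it lies outside the other estimate's CI."""
    return not (other_ci_lower <= estimate <= other_ci_upper)


# Hypothetical example: an aggregated alpha of 0.773 compared against
# another estimate whose (assumed) 95% CI runs from 0.80 to 0.85
print(is_significant(0.10, 0.30))        # CI excludes zero
print(differs_from(0.773, 0.80, 0.85))   # estimate lies outside the CI
```

Note that the comparison rule is asymmetric by construction: it checks a point estimate against another estimate's interval, so each direction of a comparison has to be checked separately.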

Internal Consistency and Methods of Generalization
Aggregating reliability estimates from different studies with meta-analytic methods is described as reliability generalization (Vacha-Haase, 1998). Hereby, a meta-analysis can also be used to identify moderators of the alpha value (Bonett, 2010). If alpha shows similar values across different samples and experimental conditions, strong evidence of reliability generalization can be provided. In practical applications, usually only sample estimates of Cronbach's alpha are provided by scientific articles. This is mainly because many measurements are conducted with too small sample sizes making it difficult to estimate alpha with adequate precision (Bonett, 2010). As this work includes a larger sample size, a more accurate estimate with confidence intervals can be calculated. Following previous studies that have cumulated estimates of reliability (e.g., Graham & Christiansen, 2009; Graham et al., 2011; Pentapati et al., 2020; Piqueras et al., 2017), internal consistency of cognitive load scales was calculated for ICL, ECL, and GCL separately. Hereby, the reliability generalization framework allows the comparison of reliability estimates across a variety of studies (Vacha-Haase, 1998). As Cronbach's alpha is variance-adjusted, it corresponds to the value of the squared correlation (Thompson & Vacha-Haase, 2000). In detail, the square root of the reliability coefficients was calculated to obtain an r-equivalent correlation (Graham et al., 2011). However, correlations are not normally distributed because the bounded value range [−1, +1] can lead to a skewed sampling distribution (Silver & Dunlap, 1987). Accordingly, the Fisher's r-to-z transformation was then applied to this value to prepare it for further computations as is usual for meta-analyses (Hedges & Olkin, 1985).
This way, the skewed distribution is transformed into a normal distribution:

$$z = 0.5 \times \ln\left(\frac{1 + r}{1 - r}\right)$$

The resulting values were then weighted using the inverse variance weight of the coefficients as suggested by Graham and Christiansen (2009). This procedure takes account of different sample sizes in the various studies:

$$V_z = \frac{1}{n - 3}$$

The standard error can therefore be calculated from this variance using the following formula (Borenstein et al., 2009):

$$SE_z = \sqrt{V_z}$$

On the assumption that reliability estimates represent different populations, a random-effects model was preferred to a fixed-effects model for the meta-analysis. Besides, Field and Gillett (2010) recommend a random-effects model when conducting meta-analyses in social sciences. Because many studies failed to report reliability coefficients, it is even more important to determine the influence of those missing data on the overall mean of Cronbach's alpha. As pointed out by Graham and Christiansen (2009), researchers intentionally do not report coefficients that are of too low a reliability, and work that has reported non-significant studies because of poorly reliable measuring instruments has often not been published. Such circumstances result in a publication bias which has been quantitatively expressed using the rank correlation based on Kendall's tau (τ; Begg & Mazumdar, 1994). This approach quantifies the relationship between the ranks of effect sizes and the ranks of their variances. The lower a correlation is, the more effect sizes are independent of the sample sizes of the studies. Since Fisher's z transformation was used for meta-analytic calculation of the summary effect and the confidence intervals' limits, these values were converted back into correlations (Hafdahl & Williams, 2009):

$$r = \frac{e^{2z} - 1}{e^{2z} + 1}$$

In the final step, the r values were transformed to the metric of coefficient alpha (α) in order to facilitate interpretation.
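The chain of transformations described in this section (square root of alpha, Fisher's z, inverse variance weighting, random-effects pooling, back-transformation to r and then to alpha) can be sketched as follows. This is a minimal illustration and not the JASP procedure itself: the between-study variance is estimated here with the DerSimonian-Laird method, which is an assumption, since the estimator used in the analysis is not named. Function names and the example data are hypothetical.

```python
import math


def alpha_to_z(alpha):
    """Map a Cronbach's alpha in (0, 1) to Fisher's z via its square root."""
    r = math.sqrt(alpha)   # r-equivalent of the reliability coefficient
    return math.atanh(r)   # Fisher's r-to-z: 0.5 * ln((1 + r) / (1 - r))


def pool_alphas(alphas, ns, z_crit=1.96):
    """Random-effects pooled alpha with an approximate 95% CI.

    Assumption: tau^2 is estimated with the DerSimonian-Laird method.
    Each sample size n must exceed 3.
    """
    zs = [alpha_to_z(a) for a in alphas]
    vs = [1.0 / (n - 3) for n in ns]   # sampling variance of each z
    ws = [1.0 / v for v in vs]         # inverse-variance weights
    # Fixed-effect mean and Q statistic, needed for the tau^2 estimate
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each sampling variance
    ws_re = [1.0 / (v + tau2) for v in vs]
    z_mean = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))

    def to_alpha(z):
        # Back-transform z to r (tanh), then square r to return to alpha
        return math.tanh(z) ** 2

    bounds = (z_mean, z_mean - z_crit * se, z_mean + z_crit * se)
    return tuple(to_alpha(z) for z in bounds)


# Hypothetical alphas and sample sizes from three experiments
estimate, lower, upper = pool_alphas([0.78, 0.84, 0.81], [60, 120, 95])
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

Squaring in the last step only works as intended while the lower bound in the z metric stays positive, which holds for reliability coefficients in the range reported here.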

Validity
In order to gain deeper insights into how the individual cognitive load types are interrelated and how these relate to relevant criterion variables (domain-specific prior knowledge, retention, and transfer), corresponding correlations were analyzed (e.g., Field, 2005). By this, correlations between the variables of interest could be meta-analyzed. As suggested by Glass et al. (1981), all effect sizes were retrieved in the form of Pearson's product-moment correlations, a standardized and prominent effect size. Since not all studies in our sample reported r values, conversion procedures had to be carried out. Following Gilpin (1993), the raw effect sizes of Spearman's rho (ρ) were transformed to Pearson's r. When the study reported standardized beta coefficients (β), the correlation coefficient (r) was estimated with the formula proposed by Peterson and Brown (2005). Hereby, the mathematical relationship between the two coefficients is shown in a multiple regression model with two predictor variables:

$$r_{y1} = \beta_1 + r_{12}\,(r_{y2} - \beta_1 r_{12})$$

Since bivariate correlations were calculated including one predictor, it can be assumed that r = β. Afterward, the r values were transformed into Fisher's z for calculating average correlations because Pearson's correlation coefficient cannot be interpreted as an interval-scaled measure (Silver & Dunlap, 1987). This procedure was also applied by several meta-analyses that have aggregated correlation coefficients (e.g., Capaldi et al., 2014; Edwards & Holtzman, 2017; Richardson et al., 2012). Hereby, the same procedure as for reliability was used (cf. Hedges & Olkin, 1985). A random-effects model was also calculated in this meta-analysis (Field & Gillett, 2010; Higgins et al., 2009). In addition, the rank correlation for publication bias was calculated (Begg & Mazumdar, 1994). The means and confidence intervals were then back-transformed into the correlation coefficient r to simplify interpretation. Therefore, the same formula proposed by Borenstein et al. (2009) was used.
To interpret the correlation coefficients, normative effect size guidelines for individual differences researchers were followed (cf. Gignac & Szodorai, 2016). Accordingly, r = 0.10 is relatively small, r = 0.20 is typical, and r = 0.30 is relatively large. Since prior knowledge is an important factor influencing cognitive load facets (i.e., element interactivity; Chen & Kalyuga, 2020 or expertise reversal effect; Chen et al., 2017) and, consequently, correlations among cognitive load facets as well as facets and learning scores, detailed analyses with regard to prior knowledge can be found in Appendix I.
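The step from standardized beta coefficients to correlations described in this section can be checked numerically. The sketch below assumes the standard identity for a regression with two standardized predictors, r_y1 = β1 + r12 × (r_y2 − β1 × r12); the function names and the numbers are hypothetical. It shows that the identity recovers the bivariate correlation and that it reduces to r = β when the predictors are uncorrelated, which corresponds to the single-predictor case.

```python
def beta_from_correlations(r_y1, r_y2, r_12):
    """Standardized weight of predictor 1 in a two-predictor regression."""
    return (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)


def r_from_beta(beta_1, r_12, r_y2):
    """Recover the bivariate correlation r_y1 from beta_1 (assumed identity)."""
    return beta_1 + r_12 * (r_y2 - beta_1 * r_12)


# Hypothetical correlations among a criterion y and two predictors
beta_1 = beta_from_correlations(r_y1=0.5, r_y2=0.3, r_12=0.2)
print(round(r_from_beta(beta_1, r_12=0.2, r_y2=0.3), 6))

# With uncorrelated predictors the correlation equals the beta weight
print(r_from_beta(0.4, r_12=0.0, r_y2=0.3))
```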

Overall Internal Consistency
With regard to the internal consistency of the cognitive load questionnaires, the analyses showed satisfactory results. First, the reliability values of all four questionnaires were accumulated to examine whether the questionnaires are capable of measuring cognitive load consistently (see Table 2). Hereby, the ICL showed an alpha value of α + = 0.823, while the internal consistency of the ECL was lowest (α + = 0.773). The GCL showed the highest value (α + = 0.860). Considering confidence intervals of the effect sizes, significant differences between all three cognitive load types concerning the internal consistency estimates could be found. All rank correlations (Begg & Mazumdar, 1994) were not significant indicating that a publication bias does not seem to be present for the internal consistency of the four investigated cognitive load questionnaires.
Table 2 Aggregated effect sizes and confidence intervals for the internal consistency across the four cognitive load instruments. k = number of studies (or reliability coefficients); α + = accumulated Cronbach's alpha across studies; Lb and Ub = lower and upper bounds, respectively, of the 95% confidence interval around the overall reliability estimate; rank correlation = test for publication bias.
In order to gain deeper insights into whether the individual questionnaires can reliably measure cognitive load, the previous analysis was repeated separately for the four questionnaires (see Table 3). For the Leppink et al. (2013) questionnaire, the ICL showed an alpha value of α + = 0.845, while an internal consistency of α + = 0.759 was found for the ECL. Again, the GCL showed the highest alpha value (α + = 0.909). Based on the effect sizes and the confidence intervals, it could be derived that all three cognitive load types differ significantly in their internal consistency. The questionnaire from Klepsch et al. (2017) also produced satisfactory Cronbach's alpha values for the individual cognitive load types. The internal consistency of the ICL amounts to a value of α + = 0.776 and α + = 0.798 for the ECL. The GCL showed a value of α + = 0.734. Consequently, the highest internal consistency was found for the ECL. In addition, this value was significantly higher than the value of the GCL. The Cronbach's alpha of the ICL also differed significantly from the GCL. ICL and ECL did not differ significantly from each other. The cognitive load questionnaire by Leppink et al. (2014) joins the ranks of questionnaires with good internal consistency. Although comparatively few effect sizes were included in the analysis, the Cronbach's alpha values for ICL (α + = 0.851), ECL (α + = 0.788), and GCL (α + = 0.806) can be considered satisfactory in terms of commonly used benchmarks (e.g., Nunnally, 1978). 
The three cognitive load types did not differ significantly from each other. Because the questionnaire from Eysink et al. (2009) only measures the ECL with multiple items, the internal consistency could only be calculated for this cognitive load type. Hereby, the ECL showed a satisfactory internal consistency across studies (α + = 0.740). The non-significant rank correlations across all examined cognitive load scales seem to indicate that there is no publication bias (Begg & Mazumdar, 1994).

Internal Consistency -Moderating Variables
To investigate the influence of additional variables on the reliability of cognitive load questionnaires, moderator analyses were carried out. The results of the moderator analyses across all questionnaires are displayed in Table 4. According to the confidence intervals of the effect sizes, learners' age and educational background, the domain of the material, and the presentation mode were not significant moderators of the reliability of the ICL subscale. Only the number of scale points resulted in significant differences. According to the effect sizes, a 10-point Likert scale was most reliable. A 7-point Likert scale at least should be used to ensure stronger reliability. Concerning ECL, learners' age and educational background, as well as the presentation mode were not significant moderators. With respect to the domain of the learning material, the highest reliability was achieved in the field of mathematics and logic; reliability did not differ with regard to the other learning domains. Regarding the number of scale points, an odd number (i.e., 5 or 9 response options) resulted in higher reliabilities than an even number of scale points. Considering the GCL, learners' age and educational background, as well as the domain of the learning material were not significant moderators. Regarding the presentation mode, reliability was reduced when interactive learning media were used. Regarding GCL, the use of an 11-point Likert scale was associated with the highest reliability, whereas the use of a 7-point Likert scale led to the lowest reliability.

Construct Validity
To examine the construct validity of the subjective cognitive load questionnaires, the generally accepted definition was followed, which proposes an ideally high level of agreement between the measurement results and the underlying theoretical assumptions (Cronbach & Meehl, 1955; Westen & Rosenthal, 2003). In consequence, construct validity was evaluated by considering the correlations between the cognitive load types of the investigated questionnaires. Regarding all questionnaires, meta-analytic correlations between the cognitive load types are displayed in Table 5. For the questionnaire by Leppink et al. (2013), a positive correlation between ICL and ECL and a negative correlation between ECL and GCL were found. ICL and GCL did not correlate with each other. When examining the questionnaire by Klepsch et al. (2017), a positive correlation between ICL and ECL as well as between ICL and GCL was found. ECL and GCL were negatively correlated with each other. Regarding the questionnaire by Leppink et al. (2014), a positive correlation between ICL and ECL was found. ECL and GCL as well as ICL and GCL were correlated negatively with each other. Concerning the questionnaire by Eysink et al. (2009), all types were positively correlated with each other. It should be mentioned that the number of included effect sizes from the questionnaires by Leppink et al. (2014) and Eysink et al. (2009) was relatively small. Across all four questionnaires, positive correlations between the constructs ICL and ECL were found. The largest correlation was found for the questionnaire by Eysink et al. (2009; r + = 0.53). Except for the questionnaire by Eysink et al. (2009), all questionnaires showed negative correlations between ECL and GCL. The largest negative correlation was found for the questionnaire by Leppink et al. (2014; r + = − 0.33). The correlations between ICL and GCL were ambiguous across the investigated questionnaires.

Criterion Validity
Considering criterion validity, relevant variables in cognitive load research were included in order to investigate interrelationships of the cognitive load types with practically relevant external criteria (i.e., the domain-specific prior knowledge as well as learning outcomes of retention and transfer; Drost, 2011). Meta-analytic correlations between learning scales and cognitive load types of all questionnaires are displayed in Table 6. Meta-analytic correlations between domain-specific prior knowledge and cognitive load types of all questionnaires are displayed in Table 7. First, the criterion validity of the questionnaire by Leppink et al. (2013) was investigated. ECL negatively correlated with both learning scales and GCL positively correlated with retention and transfer. ICL negatively correlated with transfer but not with retention performance. Furthermore, ICL and ECL negatively correlated with domain-specific prior knowledge. GCL did not correlate with prior knowledge. Second, the criterion validity of the questionnaire by Klepsch et al. (2017) was investigated. ICL and ECL negatively correlated with both learning scales and GCL positively correlated with retention and transfer. Furthermore, ICL and ECL negatively correlated with prior knowledge. GCL did not correlate with prior knowledge. Third, the criterion validity of the questionnaire by Leppink et al. (2014) was investigated. ICL negatively correlated with both learning scales. ECL negatively correlated with transfer but not retention performance. GCL did not correlate with retention. A meta-analytic correlation between GCL and transfer could not be computed because of the lack of data (k = 1). Furthermore, ICL and ECL did not correlate with prior knowledge. A meta-analytic correlation between GCL and prior knowledge could not be computed because of the lack of data (k = 1). Overall, the number of studies that used the questionnaire from Leppink et al. (2014) was very small. 
Thus, interpretation of the results is restricted. Finally, the criterion validity of the questionnaire by Eysink et al. (2009) was investigated. All cognitive load types negatively correlated with retention and transfer. Furthermore, GCL negatively correlated with prior knowledge. In addition, prior knowledge did not correlate with ICL and ECL. Summarizing the results, significant negative correlations between the sub-facet ICL and both learning scores occurred across all four questionnaires. Regarding retention, the largest correlation was found for the questionnaire by Eysink et al. (2009; r + = − 0.23). Regarding transfer, the largest correlation was found for the questionnaire by Leppink et al. (2014; r + = − 0.25). Furthermore, across all questionnaires, significant negative correlations between the sub-facet ECL and both learning scores occurred. Regarding retention, the largest correlation was found for the questionnaire by Eysink et al. (2009; r + = − 0.22). Regarding transfer, the largest correlation was found for the questionnaire by Leppink et al. (2014; r + = − 0.18). These results are in line with the theoretical implications derived from the CLT, but effect sizes were rather moderate (cf. Gignac & Szodorai, 2016). Correlations between GCL and learning scores as well as between cognitive load types and prior knowledge were rather ambiguous across the questionnaires. Consistent with predictions based on CLT, positive correlations between GCL and learning scores and negative correlations between prior knowledge and ICL as well as ECL could be observed regarding the questionnaires by Leppink et al. (2013) and Klepsch et al. (2017). Missing correlations between prior knowledge and ICL as well as ECL and negative correlations between GCL and prior knowledge as well as learning scales could be observed regarding the questionnaire by Eysink et al. (2009).

Discussion
The aim of this meta-analysis was the investigation of subjective cognitive load questionnaires in terms of their validity and reliability in experimental multimedia learning settings. In the following, the results of these analyses will be discussed with regard to whether they comply with the assumptions of CLT.

Internal Consistency of the Examined Cognitive Load Scales
The cognitive load questionnaires from Klepsch et al. (2017) and from Leppink et al. (2013, 2014) showed satisfactory results, indicating that the respective items for ICL, ECL, and GCL seem to have a high internal consistency. Hence, the questionnaires can all be recommended to measure the different cognitive load types because a high reliability can be observed although usually only three to five items per construct are used. All values are higher than 0.70 and consequently can be considered satisfactory. Moreover, it can be concluded that no items of the scales are redundant, as the aggregated alpha value did not reach 0.90 for any cognitive load type (Tavakol & Dennick, 2011a). In addition, the satisfactory values of the cognitive load scales are associated with a small measurement error (Tavakol & Dennick, 2011b). For instance, the alpha value (α + = 0.827) for the ICL involves an error variance of 0.316 (0.827 × 0.827 = 0.684; 1.00 − 0.684 = 0.316). The alpha values of the ECL scales (α + = 0.773) as well as of the GCL scales (α + = 0.860) also reaffirm the use of the scales in experimental multimedia learning research. However, it must be noted that for the Eysink et al. (2009) questionnaire, only the internal consistency for the ECL could be calculated, as the ICL and GCL are measured with single items. Besides the impossibility of calculating the internal consistency, it seems insufficient from the point of view of measurement theory to measure complex constructs like intrinsic and germane load with a single item.
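The error variance calculation in the example above can be reproduced with a short sketch. It follows the arithmetic shown in the text (one minus the squared alpha value); the function name `error_variance` is hypothetical.

```python
def error_variance(alpha):
    """Error variance as computed in the text: 1 - alpha squared."""
    return 1.0 - alpha ** 2


# Alpha value for the ICL used in the worked example above
print(round(error_variance(0.827), 3))  # 0.316
```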

Validity of the Examined Cognitive Load Scales
With respect to the validity of the cognitive load questionnaires, this work examined both construct validity and criterion validity (Cronbach & Meehl, 1955; Drost, 2011; Westen & Rosenthal, 2003). In this way, connections of the cognitive load types among each other can be calculated on the one hand (construct validity) and connections with theory-related concepts (i.e., domain-specific prior knowledge, retention, and transfer) on the other (criterion validity). The following conclusions have to be interpreted under the assumption that the types of cognitive load can change dynamically, in particular when the working memory's capacity is exceeded. Furthermore, cognitive load is a latent, not directly observable construct, so theoretical assumptions cannot be transferred congruently to the real world. Across all four cognitive load questionnaires, positive correlations between the ICL and ECL were found. Hence, the positive correlation supported by relatively large effect sizes between these two cognitive load types seems inconsistent with the additivity hypothesis of the CLT. While the ICL refers to the task's inherent complexity, the ECL is determined by how the information is presented and formatted (Sweller et al., 2019). Therefore, the ICL and ECL should describe different aspects of the learning material. However, apparently, ICL and ECL cannot be completely separated from each other by the learner. When learners perceive a learning material to be high in ECL because of a poor presentation of information, they might also perceive the learning material to be more complex. In addition, it seems plausible that complex learning content can only be represented with a rather complex design. The positive correlation between the ICL and ECL thus tends to contradict the theoretically stated additivity hypothesis. On the one hand, this could be based on the already mentioned problem that CLT assumptions may differ from subjective evaluations. 
On the other hand, it could be a measurement problem, so that item formulations do not accurately reflect the types of cognitive load. This could be addressed in two ways. First, developing and validating a new cognitive load questionnaire could help to overcome the missing differentiation. Second, referring to Klepsch et al. (2017), learners could be informed before the experiment about which aspects of the learning material are described by the ICL and ECL and how to distinguish between them. Based on this "meta"-knowledge, more accurate judgments about perceived cognitive load could be made. In a similar vein, Zu et al. (2021) found empirical evidence that learners' ability to distinguish between the ICL and ECL depends on their domain-specific prior knowledge. In summary, the construct validity of the questionnaires examined may be relatively limited when considering underlying theoretical assumptions. Particularly salient are correlations between the ICL and ECL, which suggest that the two constructs cannot be grasped separately by learners. In this vein, the available cognitive load questionnaires seem to lack sufficient discriminant validity because ICL and ECL showed relatively high correlations. The two differently labeled measures, ICL and ECL, thus tend to assess the same construct (extrinsic convergent validity; Gonzalez et al., 2021). The construct validity analyses of the questionnaires by Klepsch et al. (2017) as well as by Leppink et al. (2013, 2014) revealed a negative correlation between the cognitive load types ECL and GCL. In line with our understanding of cognitive load, a higher ECL could be accompanied by a lower GCL because cognitive resources are wasted to compensate for the sub-optimal design or presentation of the learning material (Sweller et al., 2019). 
This connection could also be based on motivational influences, suggesting that unfavorably designed learning materials could lower an individual's motivation to learn (Feldon et al., 2019). The more that learners are motivated, the more germane (or learning-relevant) resources are invested by the learner to master the task. In this vein, there is empirical evidence that motivated learners reported higher levels of GCL (Cook et al., 2017; Costley & Lange, 2018). One could consequently argue that higher ECL perceptions are related to lower investments of mental effort, indicating that passive load affects active load (Klepsch & Seufert, 2021). However, these conclusions are limited by the fact that the questionnaire by Eysink et al. (2009) showed a positive correlation between the ECL and GCL, limiting the construct validity of this questionnaire. In recent years, the three-factor model of cognitive load has been widely discussed in view of its theoretical justifiability (Kalyuga, 2011; Sweller, 2010; Sweller et al., 2011, 2019). The results of this meta-analysis can add momentum to this discussion. Theoretical assumptions suggest that both the ICL and GCL have the same theoretical basis so that the notion of the GCL might be redundant (Kalyuga, 2011). Consequently, both variables should correlate with each other as the GCL distributes cognitive resources to handle the task's element interactivity (described as ICL). However, the questionnaire from Leppink et al. (2013) showed a non-significant correlation between these cognitive load types, indicating that they are not statistically connected, which seems to support the additivity hypothesis. This is another point that tends to support the three-factor model (e.g., as in the factor analysis by Zavgorodniaia et al., 2020). The Klepsch et al. (2017) as well as Eysink et al. (2009) questionnaires showed significant but rather small positive correlations between ICL and GCL. 
Even if correlations do not, of course, allow causal statements, it can at least be hypothesized that both variables measure two related but not identical constructs. In general, the questionnaire from Eysink et al. (2009) makes it difficult for the learner to differentiate between the cognitive load types as the item formulations dissolve the theoretical boundaries of the CLT. All items include the term "easy or difficult", which rather evokes associations with the ICL (task difficulty) and thus seems to be insufficient for the evaluation of the ECL and GCL.
Concerning criterion validity, correlations with theory-related concepts (i.e., domain-specific prior knowledge, retention, and transfer) revealed interesting insights. First, across all four cognitive load questionnaires, negative correlations were found between the ICL and the learning outcomes in both retention and transfer. However, the correlation between ICL and retention failed to reach significance in the Leppink et al. (2013) questionnaire. It can be assumed that higher ICL perceptions are related to worse learning outcomes. In the light of CLT, this unsurprising finding can be explained by the task complexity (Ayres, 2006; Chen & Kalyuga, 2020). The lower the learner estimates the task's inherent complexity (i.e., element interactivity), the better the result achieved in the learning test. This effect can probably be explained by domain-specific prior knowledge. Learners with expertise in the domain relevant for the learning material can rely on previously generated (automated) schemata including interacting elements which help them to deal with the task's complexity (Paas et al., 2003; Sweller et al., 2011). In contrast, learners with low prior knowledge have not acquired schemata, so that each element needs to be processed separately while learning. In terms of ECL, negative correlations between this cognitive load type and learning outcomes in retention and transfer were found. As correlations between ECL and retention (Leppink et al., 2014) as well as between the ECL and transfer (Eysink et al., 2009) failed to reach significance, the following conclusion must be made with some caution. In general, the results regarding the criterion validity support theoretical assumptions of the CLT (Sweller et al., 2019). Accordingly, the negative relationship between the ECL and learning outcomes indicates that learners who perceive the learning material as unfavorably designed for learning (causing higher ECL ratings) perform worse in the learning test. 
In this case, cognitive resources needed to cope with the complexity of the task are wasted on processing the poorly designed instructions (Klepsch & Seufert, 2020). This is in line with the common recommendation to reduce ECL while learning in order to enhance learning outcomes (Beckmann, 2010; Leppink & Heuvel, 2015). Reducing ECL can free cognitive resources to deal with the task's inherent element interactivity (Paas et al., 2003). In terms of GCL, positive correlations with the learning outcomes retention and transfer are also explainable based on CLT's tenets. However, correlations between GCL and the learning indicators retention and transfer were non-significant for the Leppink et al. (2014) questionnaire. In general, the positive correlations (questionnaires by Klepsch et al., 2017; Leppink et al., 2013) indicated that higher GCL perceptions go hand in hand with higher learning outcomes. As GCL is held to arise from learning-relevant activities such as taking notes or remembering previously acquired knowledge, this cognitive load can be seen as supporting successful learning. In line with Paas and van Gog (2006), a high GCL indicates that learners are engaged to learn and direct their mental resources to learning-relevant activities. Increasing GCL is thus a central challenge within CLT (Klepsch & Seufert, 2020; Moreno & Park, 2010; Paas & van Gog, 2006), which is underlined by the results of this work. Although the positive correlation is a logical consequence of the theoretical assumptions of the CLT, the relatively low level of the correlations is surprising. A higher correlation would be expected because learners who report a high GCL should also have a comparatively high learning gain. The small correlations indicate a gap between the learners' subjectively evaluated GCL and their actual objective result in the learning test. 
Assuming that there is strong evidence for meta-cognitive beliefs and learning outcomes being related (e.g., Al Khatib, 2010; Nelson & Dunlosky, 1991; Sungur, 2007), cognitive load questionnaires should be able to better capture the relationship between the GCL and learning achievements. In contrast, the negative correlation between the GCL and learning outcomes, reported for the Eysink et al. (2009) questionnaire, gives a further indication that this questionnaire does not adequately measure the GCL due to unfavorably formulated items. It is important to add here that it is also possible for a learner to invest a high level of GCL which ultimately does not pay off, so that little learning takes place. This could happen particularly if the ICL (i.e., the complexity of the learning material) is too high and/or the learner has too little domain-specific prior knowledge. Thus, a relatively high GCL is not necessarily associated with better learning performance. In this vein, motivational beliefs should not be neglected, but are, according to Feldon et al. (2019), a result of the instruction and could affect the GCL and related concepts such as mental effort.
Regarding domain-specific prior knowledge, negative correlations between the ICL and prior knowledge occurred in the questionnaires from Leppink et al. (2013) and Klepsch et al. (2017). In line with our current understanding of cognitive load, the learner's expertise affects ICL perceptions (Artino, 2008;Bannert, 2002). Learners with high expertise can draw on schemata during learning, which help to cope with the complexity of the task (Kirschner et al., 2009;Leppink & Heuvel, 2015). In this vein, it could also be possible that learners reporting a high ICL have less domain-specific prior knowledge. Similar results were found between ECL and prior knowledge. When learners can rely on domain-specific prior knowledge, they perceive a lower ECL. Accordingly, learners are less susceptible to poorly formatted and designed learning materials, when enough expertise is available for learning. This is based on the additive relationship between ICL and ECL. Counterintuitively, non-significant correlations between the GCL and prior knowledge were found. In view of CLT, it could be assumed that learners with a certain level of prior knowledge are better able to allocate their cognitive resources to learning-relevant activities (Paas & van Gog, 2006;Paas et al., 2003).

Recommendations for Further Use of Cognitive Load Scales in Experimental Research
Subjective questionnaires play an important role in experimental cognitive load research, as they help us to better understand cognitive processes during learning. These findings can help to further improve learning materials or procedures. Therefore, cognitive load questionnaires should meet the highest psychometric requirements (Embretson, 2013). On the one hand, the reliability analyses showed satisfactory Cronbach's alpha values justifying the use of these scales in experimental settings. On the other hand, based on the moderator analyses, recommendations can be made that should be considered in the future when using cognitive load questionnaires. In terms of the number of scale points, moderator analyses suggest that at least a 7-point scale should be used to ensure high reliability when ICL is to be measured. A scale with 10 response options was associated with the significantly highest reliability when measuring the ICL. The ECL could be measured with 5-point or 9-point scales to ensure high reliability. For the ECL, scales with an even number of response options (i.e., 6 or 10 response options) were associated with lower reliability and should therefore not be used, which contradicts findings from Dalal et al. (2013). In terms of GCL, using an 11-point scale was associated with the highest reliability, but all numbers of scale points except a 7-point scale resulted in a high reliability. From a pragmatic perspective, however, researchers are likely to measure ICL, ECL, and GCL with the same number of scale points. This also makes it easier for the participant to understand the scale. Taken together, moderator analyses support using a 9-point scale. With respect to the domain of the instructional material and the presentation mode, the internal consistency of the ICL showed only slight differences, indicating that this scale can be used in various learning settings. 
Furthermore, the ECL showed the highest reliability when the experimental studies took place in the instructional domain of logic and mathematics. Possibly, the perception of the presentation and format of the learning material is particularly sensitive when dealing with complex learning topics. In interactive learning environments, the GCL showed lower reliability, indicating that learners are less able to monitor and report on their learning process. Across all cognitive load types, the absence of notable differences between school education and adult education seems to suggest that the questionnaires can reliably measure cognitive load over a wide age range. However, researchers should ensure, prior to the experiment, that learners can understand the item formulations and can thus make suitable meta-cognitive judgments. In this vein, Leahy (2018) warns against using subjective cognitive load questionnaires with children. However, Wang, Ardasheva, et al. (2021a) and Wang, Ginns, et al. (2021b) were successful in using a cognitive load questionnaire with 10- to 12-year-old participants, which underlines the need for future research. The literature review uncovered noticeable differences in the descriptions of the cognitive load scales. For example, there is often a lack of information on the number of levels of the scale, the labels of the scale points, or even reliability. However, because these points are essential to ensure the fit of the scale to the experimental purpose, researchers are encouraged to specify as precisely as possible which scales of cognitive load were used, including the number of options (e.g., 10-point scale), and to report reliability values such as Cronbach's alpha (Cronbach, 1951), McDonald's omega (McDonald, 1999), or Revelle's omega (Revelle & Zinbarg, 2009). 
However, researchers should be aware of the ongoing debate concerning methodological weaknesses of Cronbach's alpha (Christmann & Aelst, 2006;Hayes & Coutts, 2020;McNeish, 2018;Panayides, 2013;Sijtsma, 2009;Taber, 2018) and should critically evaluate the reliability values also with regard to the construct to be measured. In this vein, Deng and Chan (2017) emphasize that Cronbach's alpha tends to misestimate true reliability unless τ-equivalent items are involved.
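For researchers who want to report reliability per cognitive load type, a minimal sketch of the standard Cronbach's alpha formula is given below. The function name `cronbach_alpha` and the example ratings are hypothetical; the sketch assumes complete data with one row per participant and one column per item of a single subscale.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one subscale.

    scores: list of rows, one per participant, each holding the item
    ratings of that participant (assumes no missing values).
    """
    k = len(scores[0])  # number of items in the subscale
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item across participants
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the participants' sum scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings of five participants on a three-item subscale
ratings = [[7, 8, 7], [3, 4, 2], [5, 5, 6], [9, 8, 9], [2, 3, 3]]
print(round(cronbach_alpha(ratings), 3))
```

Computing the coefficient separately for the ICL, ECL, and GCL items, rather than once for all items together, matches the recommendation to report reliability per cognitive load type.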
It is particularly noticeable that researchers tend to interchange the cognitive load questionnaires from Leppink and colleagues (2013, 2014). This circumstance is probably due to the similarity of the two questionnaires. However, they differ substantially because of the additional mental effort items and the resulting higher number of items included in the questionnaire published in 2014. This interchange complicated the data synthesis for this work and also complicates the interpretation of the results in the respective primary studies. Moreover, articles were found reporting the reliability for the three cognitive load types together, which is not coherent from a theoretical point of view. Although all three types measure the burden on the working memory (or at least the ECL and ICL), they concentrate on different aspects of the learning material. In addition, the individual cognitive load types can vary in terms of their reliability. If only the alpha value of the entire scale is given and this value is poor, the results of this measurement should not be used further for a generalization. If the reliability is calculated individually, a cognitive load type with a poor Cronbach's alpha could be removed, whereas types with a more satisfactory value can be included for interpretation.

Limitations and Future Directions
Although the present work is the first of its kind in the history of CLT research, some remarks must be made that may partially limit the results but should also encourage researchers to follow up on this work. The Cronbach's alpha values of the various cognitive load questionnaires examined are satisfactory but should nonetheless be considered with caution. It is a more or less unwritten rule that very poor Cronbach's alpha values might lead to studies not being published by (highly ranked) journals. In this vein, it is also possible that studies reporting non-satisfactory values for internal consistency are not even submitted by the authors, with the resulting "file-drawer problem" distorting the values seen in the published literature. Accordingly, there is reason to assume a bias towards overly high values. Moreover, the cognitive load questionnaires differed with respect to the number of items used. As Cronbach's alpha is sensitive to the number of items in the scales (e.g., Tavakol & Dennick, 2011a, b; Vaske et al., 2017), comparing questionnaires with different numbers of items makes interpretation difficult. Furthermore, responses in Likert scales can lead to inflated inter-item correlations (i.e., estimates of internal consistency). These dependencies arise above all when people classify similar statements with similar values using a Likert scale. Positive correlations between the items thus lead to increased reliability, which can be at the expense of the measurement's validity (Eisinga et al., 2013). Unclear items or items that do not appear meaningful to the learner reinforce this tendency (Schuman et al., 1981).
The basis of these meta-analyses was primary studies from the field of multimedia learning. Consequently, the studies differ in terms of information presentation, multimedia design, and the learning topic covered, which may influence cognitive load perception. To account for these influences, moderator analyses for reliability were calculated. However, problems concerning the heterogeneity of the primary studies cannot be fully ruled out (e.g., Thompson, 1994). Thus, slight changes in the multimedia design (and in the content to be learned) could also have led to changes in the answers on the Likert scale and therefore in cognitive load perceptions. However, not all possible influences can be calculated, which is a common problem in meta-analyses (Borenstein et al., 2021).
Probably the most important conclusion of this meta-analysis is that there is still a lot of research to be done in the field of CLT measurement. Particularly striking here was the age of the participants, which ranged (when considering the standard deviation) between 16 and 24 years. Consequently, the majority of them were affiliated with a secondary school or university. Assuming that learners with increasing age are better able to estimate their perceived cognitive load (e.g., van der Stel & Veenman, 2010), more research with under-represented demographics is needed. For instance, developing reliable and valid cognitive load questionnaires for younger target groups (i.e., elementary school students) could be fruitful (Leahy, 2018). In general, the possibility of using questionnaires in different languages (i.e., test adaptation; Ercikan & Lyons-Thomas, 2013) is a desirable goal but associated with challenges: adapting questionnaires to different cultures is a difficult task (Hambleton & Patsula, 1998) requiring knowledge of the cultural as well as linguistic circumstances. In terms of the cognitive load questionnaires, the original versions of the scales are available in English (Eysink et al., 2009; Leppink et al., 2013, 2014) or bilingually in German and English (Klepsch et al., 2017). However, many studies in cognitive load research have been conducted in European or Asian countries with other mother tongues. It can be assumed that the original questionnaires were translated and re-interpreted. This might limit the results of this meta-analysis because it may also affect reliability and validity. Thus, researchers should stick to questionnaire translation rules since even small changes in the item formulation can affect the respondents' understanding (Harkness et al., 2004). However, studies involved in this meta-analysis often fail to indicate whether questionnaires were translated.
In sum, it would be desirable to make translations accessible to the research community.

Conclusion
Over the years, CLT has become a major theory in educational psychology research. Results of this meta-analysis revealed that cognitive load during learning can be reliably measured with currently available subjective questionnaires. In contrast, significant correlations between cognitive load types might call into question the construct validity of the cognitive load questionnaires and/or the additivity hypothesis postulated by CLT. Correlations of the cognitive load types with relevant criterion variables tend to support the three-factor model of cognitive load comprising ICL, ECL, and GCL. Overall, multimedia learning researchers should be encouraged to use cognitive load questionnaires in their research while paying attention to the precise designation of the scale, the number of response options, and the correct reporting of the scale's reliability.