Introduction

Instructional feedback consists of evaluations that (dis)confirm a learner’s performance and/or provide additional information to adapt a learner’s behaviour and learning performance (Butler & Winne, 1995), and it frequently results in improved performance (Hattie & Timperley, 2007; Winstone et al., 2017a). Studies on instructional feedback typically generate guidelines to enhance the effectiveness of feedback in face-to-face situations, but the question remains whether these guidelines transfer to other feedback situations, for example when feedback is delivered digitally (Van der Kleij et al., 2015). Digitally delivered instructional feedback involves a different set of interactions within and between relevant factors (e.g. feedback timing and assessment type), as well as different psychological and logistical barriers. In the case of digitally delivered feedback, personal factors of the source that are essential in face-to-face settings are replaced by a fixed set of characteristics of the digital medium (Winstone et al., 2017a; Wu et al., 2019).

To examine the effects of digitally delivered instructional feedback on learning performance, studies can be described in terms of factors that are considered to influence performance. Examples of these factors are context factors, such as the educational sector; content factors, such as the feedback message itself (e.g. feedback function); and task factors, such as the assignment or discipline in which the assessment was administered (Narciss, 2013). However, so far, studies on digitally delivered instructional feedback have mainly focused on either context factors, content factors, or task factors, without considering interactions within and between factors. For example, Debuse et al. (2007) examined manual, semi-automated, and automated feedback to optimize the grading process, and Mandernach (2005) investigated various levels of computerized feedback, such as no feedback, knowledge of result, and knowledge of correct result. Such an approach can only provide general conclusions about the effectiveness of the cluster of interest (context, content, or task factors) whilst simultaneously contributing to a unidimensional view of the processing of instructional feedback and its influence on learning performance. As a result, transfer to other learning contexts may fail to occur, as all three clusters of factors combined constitute a learning environment. Hence, the present meta-analysis investigates the impact of context, content, and task factors on digitally delivered instructional feedback and subsequent learning performance, as well as the interactions between these factors.

Digitally delivered instructional feedback

Commonly, instructional feedback is described as confirming the learner’s performance and/or providing corrections or suggestions for adapting performance. This general description of feedback is adopted by many researchers, albeit in different wordings and with varying foci and degrees of specificity (e.g. Butler & Winne, 1995; Hattie & Timperley, 2007; Narciss, 2013). One of the most detailed models of instructional feedback is proposed by Narciss (2013, p. 10). This model distinguishes two feedback loops: the internal feedback loop, focused on feedback generated by the learner, as part of which the learner assesses his or her own performance or current level of comprehension according to internally defined standards (internal assessment), and the external feedback loop, focused on feedback provided by an external feedback source, such as teachers, peers, or digital systems, as part of which the learner’s performance or current level of understanding is compared to externally defined standards (external assessment; Clark, 2012; Debuse et al., 2007). The feedback processing in both loops is influenced by context, content, and task factors (see Butler et al., 2013; Gordijn & Nijhof, 2002; Hattie & Timperley, 2007; Kluger & DeNisi, 1996; Timms et al., 2016). These factors, and the interactions between them, may invoke different control actions by the learner (Narciss, 2013).

Even though various terminologies are used for digital systems, each with their own distinct features, they can all be considered technical solutions for supporting learning, teaching, and delivering feedback (Suhonen, 2005). Henceforth, the term digital learning environments (DLEs; Debuse et al., 2013) will be used to refer to any technical solution (typically a digital system) used for delivering feedback. DLEs offer unique possibilities compared to their face-to-face and/or pen-and-paper counterparts, such as learner-controlled frequency, timing of feedback, and feedback display (Cheung & Slavin, 2012; Mandernach, 2005). In addition, when learners work in a DLE, they interact with a relatively fixed set of factors as programmed by the DLE designers, compared to the variety of personal factors of both the feedback source and the feedback recipient in face-to-face instructional feedback.

Instructional feedback and learning performance

If the feedback is adequate, as a result of the comparisons between internal standards, internal feedback, external standards, and external feedback, the discrepancy between the current and desired level of performance and/or comprehension will be reduced (Narciss, 2013; Orsmond et al., 2005; Price et al., 2011). Reducing this discrepancy might positively affect learning performance, although research has shown mixed results regarding the influence of (digitally delivered) instructional feedback. Nevertheless, the possibility that (digitally) delivering instructional feedback can enhance learning performance is widely acknowledged (Hattie & Timperley, 2007; Narciss, 2013; Nicol & MacFarlane-Dick, 2006; Winstone et al., 2017a). Furthermore, the effect of (digitally delivered) instructional feedback increases when it provides relevant information to correct ineffective strategies and targets the knowledge and/or skills relevant for successful task performance (Narciss, 2013). In addition, the necessity and suitability of the feedback and the assignment determine to what extent the feedback is implemented. For example, a task that requires learners to link different information sources requires different feedback than a memorization task (Kluger & DeNisi, 1996). In a similar vein, when a learner understands the task and does not face any difficulties, feedback is often seen as unnecessary and is less likely to be used (Winstone et al., 2017b).

Instructional feedback factors in the present meta-analysis

Thus far, feedback studies typically include either context, content, or task factors to determine their effectiveness for enhancing learning performance. However, this leads to an incomplete and simplistic rendering of feedback situations because context, content, and task factors are typically studied in isolation despite the fact that these factors—including the numerous interactions between factors—constitute the learning environment. Alongside this simplistic rendering, feedback terminology can vary. For example, feedback can be referred to as help-seeking, feedback-seeking, support, or feedforward, depending on the conceptual framework (see Boud et al., 2018; Panadero et al., 2016; Tanaka et al., 2002). This lack of consensus in terminology and labelling, for example, concerning definitions of immediate and delayed feedback, makes it difficult to fully cover the complexity of influential feedback factors as well as their effects on learning performance.

The implementation of external feedback is determined by various factors, such as the assessment, the timing of feedback, and the amount of information or detail of that feedback (Gordijn & Nijhof, 2002; Price et al., 2010). These factors can be grouped into broader clusters relating to context, content, and task (e.g. Butler et al., 2013; Gordijn & Nijhof, 2002; Kluger & DeNisi, 1996). As context, content, and task factors are easier to manipulate and to set as relatively fixed in DLEs, these clusters of factors were chosen as the focus of the present meta-analysis. More specifically, context factors are related to the source of feedback and create the setting in which the learning task is administered (see e.g. Gordijn & Nijhof, 2002; Narciss, 2013). Context factors include aspects such as learner control, educational sector, rewards, study setting, and timing. The second cluster comprises content factors, including the feedback message, feedback form, feedback function, and feedback focus (see e.g. Butler et al., 2013; Hattie & Timperley, 2007). The third cluster is related to whether the learner will be provided with instructional feedback as part of the assessment approach and consists of factors that are embedded in the learning setting (Timms et al., 2016). Examples are the type of assessment, the discipline in which the assessment is conducted, the assessment developers (i.e. who developed the assessment), and the feedback display. The interaction between factors within and across these clusters can be illustrated by a student who completes mathematical equations and then receives mandatory, immediate verification feedback (typically referred to as ‘Knowledge of Result’; KR) on what they did: this example links the amount of learner control (mandatory feedback) with the feedback timing (immediate), the feedback form (KR), and the type of assessment (calculations).

Research aim

This meta-analysis aims (a) to synthesize research on context, content, and task factors of digitally delivered instructional feedback and their effect on learning performance, and (b) to identify which context, content, and/or task factors of digitally provided instructional feedback in DLEs—directed towards children, adolescents, and (young) adults covering primary, secondary, and higher education—are most effective in improving learning performance. More specifically, the following research questions (RQs) are addressed:

RQ1: Which context factors (educational sector, feedback timing, learner control, rewards, and study setting) moderate digitally delivered instructional feedback and are most effective in improving learning performance in digital learning environments (DLEs)?

RQ2: Which content factors (form, focus, and function) moderate digitally delivered instructional feedback and are most effective in improving learning performance in DLEs?

RQ3: Which task factors (assessment developers, assessment type, discipline, and feedback display) moderate digitally delivered instructional feedback and are most effective in improving learning performance in DLEs?

RQ4: Which combination of context, content, and task factors moderates digitally delivered instructional feedback and contributes to learning performance in DLEs?

Methods

Search strategy

Peer-reviewed articles written in English and published between January 2004 and January 2019 were obtained from the databases ERIC, PsycINFO, and SocINDEX. Relevant search terms for (a) feedback included ‘feedback’, ‘feedback-seeking’, ‘help-seeking’, and ‘support’, and for (b) digital learning environments (DLEs) included ‘digital (learning) environment’, ‘digital education’, ‘computer learning’, ‘learning environment’, ‘intelligent tutoring’, and ‘hypermedia’.

Inclusion criteria

In order to be included in the meta-analysis, studies had to meet the following criteria:

  1. Feedback was implemented in a digital learning environment (as a tool or as a provider) involving an individual learning task.

  2. Quantitative outcomes were reported for individual performance (these could focus on the academic, social, or psychomotor domain).

  3. A control group or comparison group should be included to enable interpretation of the results of the experimental group. Furthermore, a pretest had to be included to correct for initial differences between the experimental and control group(s), or random assignment to conditions was part of the procedure.

  4. The participants needed to be aged 5 years or older.

The following exclusion criteria were used to narrow the scope:

  1. Studies involving special needs education (e.g. dyslexia, deaf/blindness, and/or attention deficits).

  2. Meta-analyses or systematic/literature reviews.

  3. Studies primarily focused on user or evaluation feedback.

  4. Studies primarily focused on feedback towards teachers.

  5. Studies primarily focused on the development of a digital learning environment.

  6. Studies focused on workplace learning.

  7. Studies in which researchers were the participants.

The search yielded 4855 hits. During the first phase, all abstracts were screened against the inclusion/exclusion criteria. Abstracts that referred to feedback, either by mentioning the word feedback or by using a similar description (e.g. help or support), were scanned more thoroughly. The authors of four inaccessible articles were e-mailed with a request for a copy of their article, of whom one complied. After removing duplicates from the sample and applying the inclusion/exclusion criteria, 81 articles were included for further analysis (see Fig. 1).

Fig. 1 Selection process for studies in the present meta-analysis. CG control group. EG experimental group. k = number of excluded articles

Coding of variables

The statistical quality of the articles to be included in the meta-analysis was determined through the selection process displayed in Fig. 1, extended with the criteria for exclusion. A final decision about inclusion or exclusion owing to a lack of statistical quality was always made after discussion between the two coders (i.e. the first author and a research assistant). In total, 46 articles met the inclusion criteria and were of sufficient statistical quality. Several articles reported multiple interventions; as a result, the results of 116 interventions could be included in the meta-analysis.

The coding of the characteristics of the feedback message started with mapping out when and how it could influence the processing of feedback, marking this as an inductive approach. A deductive approach was used to complement the information gathered by taking individual learners—and their processing of feedback—as a reference point (e.g. see Timms et al., 2016). This approach, in combination with existing research, served as a guideline for the distinction into three clusters in this meta-analysis: context factors, content factors, and task factors (cf. Butler et al., 2013; Gordijn & Nijhof, 2002; Hattie & Timperley, 2007; Kluger & DeNisi, 1996; see also Supplementary Materials Part 1 for an elaborated description of the feedback clusters).

Interrater reliability (IRR) was first determined for the general information (authors, title, publication years, sample size for the control group, and sample sizes for the experimental groups), followed by the reliability for the clusters of feedback factors. Krippendorff’s alpha for the coding of the general information (α = 0.94) and the clusters of feedback factors (α = 0.94) was satisfactory (McHugh, 2012).
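To make the agreement logic concrete, the following sketch computes a simplified two-coder agreement statistic (Cohen’s kappa rather than the Krippendorff’s alpha reported here, which additionally handles multiple coders, missing data, and other measurement levels); the coded categories and values below are hypothetical.

```python
# Simplified two-coder agreement sketch (Cohen's kappa); hypothetical codes.
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    n = len(coder_a)
    # Observed proportion of agreement between the two coders
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, based on each coder's marginal frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by the two coders to eight interventions
print(cohen_kappa(["KR", "KCR", "KR", "EF", "KR", "KCR", "EF", "KR"],
                  ["KR", "KCR", "KR", "KR", "KR", "KCR", "EF", "KR"]))
```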

Analysis

Comprehensive Meta-Analysis software package (CMA, version two; Biostat; see: www.meta-analysis.com) and Hierarchical Linear Modelling (HLM, version six; Raudenbush et al., 2004) were used for the data analysis. First, effect sizes and variances for each intervention were calculated. This information was then combined to compute a weighted summary effect and to perform moderator analyses to examine which factors influence the summary effect.

Effect size for interventions

CMA computed an effect size and variance for each intervention, based on the statistical information provided in the articles. Hedges’ g was selected as the effect size parameter given concerns about bias and corrections of effect sizes (Cumming, 2012). CMA requires a pre–post correlation when effect sizes are calculated from pre–post designs; because this information was lacking in the articles, an estimated pre–post correlation of 0.7 was used (similar to the pre–post correlations on achievement tests reported by Cole et al., 2011). A correction for dependency in the data was applied to articles in which multiple experimental groups were compared with the same control group and to within-person designs; without this correction, these control groups would be over-weighted in the meta-analysis (Lipsey & Wilson, 2001). Furthermore, for some interventions, multiple measures of performance were available. In these cases, all measures were used, and CMA computed a mean effect size based on these measures. The variance of this mean effect size was adjusted such that it decreased slightly when multiple outcome measures were reported.
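As an illustration of the effect size parameter, the sketch below computes Hedges’ g and its variance for a single intervention with two independent groups, using the standard small-sample correction; it is not the CMA implementation, and the group means, standard deviations, and sample sizes are hypothetical. (For pre–post designs, the assumed pre–post correlation of 0.7 would additionally enter the variance computation, which is not shown here.)

```python
# Minimal sketch: Hedges' g and its variance for two independent groups.
import math

def hedges_g(mean_e, sd_e, n_e, mean_c, sd_c, n_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    # Pooled standard deviation across experimental and control groups
    sd_pooled = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2)
                          / (n_e + n_c - 2))
    d = (mean_e - mean_c) / sd_pooled          # Cohen's d
    j = 1 - 3 / (4 * (n_e + n_c) - 9)          # small-sample correction factor J
    g = j * d                                  # Hedges' g
    # Variance of g; its inverse is later used as the intervention's weight
    var_d = (n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c))
    var_g = j**2 * var_d
    return g, var_g

# Example with hypothetical values
g, var_g = hedges_g(mean_e=75, sd_e=10, n_e=30, mean_c=70, sd_c=11, n_c=30)
print(round(g, 2), round(var_g, 3))
```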

Combining the effects

CMA and HLM make use of weighted analyses. The weight attached to each intervention is the inverse of its variance. Thus, interventions with large sample sizes (and therefore a small variance) have more influence on the computed summary effect than interventions with small sample sizes. The interventions in the meta-analysis differed in many respects; therefore, a random effects model was used to compute the summary effect and a mixed effects model for the moderator analyses. CMA was used for all analyses, except for the meta-regression analyses with multiple moderators in the same model, for which HLM was used.
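The following sketch illustrates this weighting logic with a DerSimonian-Laird random-effects estimator; CMA’s implementation may differ in detail, and the effect sizes and variances below are hypothetical.

```python
# Minimal sketch of inverse-variance weighting under a random-effects model.
import math

def random_effects_summary(effects, variances):
    # Fixed-effect (inverse-variance) weights and homogeneity statistic Q
    w = [1 / v for v in variances]
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    df = len(effects) - 1
    # Between-study variance tau^2 (DerSimonian-Laird estimator)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights: precise studies still count more, but extreme
    # weights are tempered by adding tau^2 to each variance
    w_star = [1 / (v + tau2) for v in variances]
    summary = sum(wi * gi for wi, gi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return summary, se, q, tau2, i2

# Hypothetical effect sizes (Hedges' g) and variances for three interventions
print(random_effects_summary([0.2, 0.5, 0.8], [0.04, 0.02, 0.05]))
```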

Publication bias

A selection bias arises when the included articles do not represent all the research conducted on the subject matter (Ellis, 2010; Rothstein et al., 2005). A funnel plot was used to detect publication bias. When a bias is present, the plot is skewed and asymmetrical rather than evenly distributed; however, the funnel plot must be interpreted carefully because asymmetry is not always the result of bias, and it only indicates the presence of a bias, not which bias is present (Ellis, 2010). Duval and Tweedie’s trim-and-fill procedure was used to determine whether the sample missed interventions that would provide a more symmetrical funnel plot (Duval & Tweedie, 2000).
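As a simple illustration of what a funnel plot displays, the sketch below plots hypothetical effect sizes against their standard errors; the trim-and-fill estimation itself was performed in CMA and is not reproduced here.

```python
# Minimal funnel plot sketch: Hedges' g against standard error.
import matplotlib.pyplot as plt

effects = [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]     # hypothetical Hedges' g values
ses = [0.25, 0.10, 0.05, 0.15, 0.20, 0.30]   # corresponding standard errors

plt.scatter(effects, ses)
plt.axvline(0.41, linestyle="--", label="summary effect")  # reported summary effect
plt.gca().invert_yaxis()          # precise (low-SE) studies appear at the top
plt.xlabel("Hedges' g")
plt.ylabel("Standard error")
plt.legend()
plt.savefig("funnel_plot.png")
```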

Tower of Babel bias

The Tower of Babel bias is presumably present in this meta-analysis due to the focus on articles published in English. The severity of this bias was examined by coding in which country the intervention was conducted, and in which language the intervention materials were presented.

Type-I error correction

A series of significance tests were performed, increasing the likelihood of a Type-I error: finding an effect by chance when there is actually no effect. The recommendation of Polanin (2013), namely applying the ‘False Discovery Rate’ (FDR) method of Benjamini and Hochberg (1995) within each group of tests (‘timeline of significance testing’), was followed to correct for Type-I errors. The FDR method balances the chance of incorrectly rejecting the null hypothesis (Type-I error) against the chance of falsely retaining the null hypothesis (Type-II error). In accordance with the FDR procedure, the p values within each group of tests (summary effect and categorical moderator analyses) were ordered from low to high and, starting from the largest value, the largest p value was determined for which

$$p_{i} \le \frac{i}{m} \times \alpha$$

where i is the rank of the ordered p value, m is the number of significance tests, and α is the chosen level of control. The null hypotheses corresponding to pi and all smaller p values were rejected, with α = 0.05.
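A minimal sketch of this procedure, applied to a hypothetical group of p values, could look as follows.

```python
# Minimal sketch of the Benjamini-Hochberg False Discovery Rate procedure.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/retain decision for each p value (in the original order)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])  # ascending p values
    # Find the largest rank i (1-based) with p_(i) <= (i / m) * alpha
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            k = rank
    # Reject the null hypotheses for the k smallest p values
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k
    return reject

# Hypothetical p values from one 'timeline' of moderator tests
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.20]))
```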

Outliers and model fit

The Tukey method was used to recode outliers into respective inner fence values beyond the 25th and 75th percentiles (Lipsey, 2009). Five interventions with extremely high effect sizes were recoded to the 75th percentile to prevent an excessive influence of these interventions on the results.
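One reading of this recoding procedure, clipping effect sizes that fall beyond Tukey’s inner fences back to the fence values, is sketched below with hypothetical effect sizes.

```python
# Minimal sketch: recode extreme effect sizes to Tukey's inner fences
# (1.5 * IQR beyond the 25th and 75th percentiles).
import numpy as np

def recode_to_inner_fences(effect_sizes):
    es = np.asarray(effect_sizes, dtype=float)
    q1, q3 = np.percentile(es, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values beyond the fences are recoded to the fence values
    return np.clip(es, lower, upper)

# Hypothetical effect sizes with one extreme value
print(recode_to_inner_fences([0.1, 0.3, 0.4, 0.5, 2.8]))
```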

The model fit of the meta-regression was determined according to the Akaike Information Criterion (AIC), with lower values indicating a better model fit (Harrer et al., 2019).

Results

Effects of feedback on learning performance

The summary effect of the 116 interventions was moderate: Hedges’ g = 0.41 (SE = 0.05; 95% CI 0.32–0.50; p < 0.001). The homogeneity statistic indicated that the variation in effects is statistically detectable (Q = 295.03; df = 115; p < 0.001), meaning that the feedback interventions do not share the same true effect size. The variance of the true effect sizes was estimated at τ² = 0.13, and a moderate proportion of the observed variation reflected real differences in effect sizes (I² = 61.02%). Supplementary Materials Part 2 summarizes the effect sizes and characteristics of the included interventions.
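For reference, the reported I² value follows directly from the homogeneity statistic and its degrees of freedom:

$$I^{2} = \frac{Q - df}{Q} \times 100\% = \frac{295.03 - 115}{295.03} \times 100\% \approx 61.02\%$$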

Publication bias

Duval and Tweedie’s trim-and-fill method (Duval & Tweedie, 2000) revealed a publication bias on the left side of the mean in the present sample (see Fig. 2). The black dots represent the estimated missing intervention studies due to publication bias (the white dots are the observed interventions). The white diamond represents the observed summary effect, and the black diamond represents the summary effect with adjustment for publication bias.

Fig. 2 Funnel plot of Hedges’ g and standard error (with outlier correction)

Without the publication bias, the estimated Hedges’ g would have been lower (Hedges’ g = 0.23, with 25 trimmed studies) than the observed Hedges’ g of 0.41 (Cohen, 1988). However, if the extremely high effect sizes of some intervention studies had not been reduced (outlier correction), the analysis of publication bias would have turned out very differently, with no estimated missing intervention studies. Figure 3 displays the corresponding funnel plot.

Fig. 3 Funnel plot of standard error by Hedges’ g (without outlier correction)

Tower of Babel bias

For each intervention, the country in which the intervention took place and the language in which the intervention materials were presented were coded. Roughly 60% of the included interventions were conducted in a native English-speaking country. The remaining 40% comprised interventions with materials in German, Taiwanese, Spanish, and Dutch (see Table 1). The Tower of Babel bias therefore did not appear to be as severe as previously assumed.

Table 1 Language of intervention materials

Moderator analyses

Many aspects of the intervention studies were coded. However, not all information was provided in the articles describing the interventions, resulting in missing data. Interventions with missing data on a given factor were excluded from the corresponding moderator analysis, as missing data provide no additional value in these analyses.

Context factors

The effect sizes of the context factors and their significance are displayed in Table 2. The between-group difference was statistically significant for educational sector (Q-between = 33.55; df = 4; p ≤ 0.001), feedback timing (Q-between = 23.62; df = 2; p ≤ 0.001), learner control (Q-between = 57.39; df = 3; p < 0.001), and study setting (Q-between = 15.75; df = 2; p ≤ 0.001). A significant between-group difference means that the variation in effects between the various categories is statistically detectable. Table 2 shows that learning performance seems unaffected by whether participants received a reward or not (Q-between = 1.54; df = 1; p = 0.21).

Table 2 Parameter estimates of included context factors
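To illustrate how a between-group (Q-between) statistic for a categorical moderator can be obtained, the sketch below compares weighted subgroup means against the overall weighted mean; it uses hypothetical data and a simplified fixed-effect weighting rather than CMA’s mixed-effects procedure.

```python
# Minimal sketch of a Q-between test for a categorical moderator.
def q_between(subgroups):
    """subgroups: dict mapping a category label to a list of (effect, variance) pairs."""
    def weighted_mean(pairs):
        w = [1 / v for _, v in pairs]
        return sum(wi * g for wi, (g, _) in zip(w, pairs)) / sum(w), sum(w)

    all_pairs = [p for pairs in subgroups.values() for p in pairs]
    grand_mean, _ = weighted_mean(all_pairs)
    q_b = 0.0
    for pairs in subgroups.values():
        mean_j, w_j = weighted_mean(pairs)
        q_b += w_j * (mean_j - grand_mean) ** 2
    return q_b  # compare against chi-square with (number of subgroups - 1) df

# Hypothetical effect sizes and variances for two feedback-timing subgroups
print(q_between({
    "immediate": [(0.5, 0.02), (0.4, 0.03)],
    "delayed": [(0.7, 0.04), (0.8, 0.05)],
}))
```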

Content factors

The effect sizes of the content factors of feedback and their significance are displayed in Table 3. Learning performance was affected by feedback form (Q-between = 14.83; df = 2; p ≤ 0.001), feedback focus (Q-between = 51.97; df = 5; p ≤ 0.001), and feedback function (Q-between = 17.97; df = 5; p ≤ 0.001), indicating that the variation in effects is statistically detectable. Within the studies involving a feedback focus on the process, surface processing strategies yielded a Hedges’ g of 0.53 (SE = 0.22; 95% CI 0.10 to 0.95; p = 0.02; k = 9) and deep processing strategies a Hedges’ g of 0.09 (SE = 0.06; 95% CI − 0.04 to 0.22; p = 0.17; k = 29). The difference between strategies was not significant (Q-between = 3.63; df = 1; p = 0.06).

Table 3 Parameter estimates of included content factors

Task factors

The effect sizes of the task factors are displayed in Table 4. All factors displayed statistically detectable variation between the groups: assessment developers (Q-between = 12.68; df = 3; p = 0.01), assessment type (Q-between = 16.98; df = 6; p = 0.01), discipline (Q-between = 17.97; df = 5; p ≤ 0.001), and feedback display (Q-between = 6.73; df = 2; p = 0.04).

Table 4 Parameter estimates of included task factors

Control condition

The control condition within the studies received either no feedback or some feedback (regardless of what the feedback encompassed). Hedges’ g was 0.61 (SE = 0.08; 95% CI 0.45–0.78; k = 50) for interventions compared with control conditions receiving no feedback, and Hedges’ g was 0.27 (SE = 0.05; 95% CI 0.16–0.37; k = 66) for interventions compared with control conditions receiving some feedback. The difference was significant (Q-between = 12.88; df = 1; p < 0.001).

Type-I correction

A series of significance tests were performed, resulting in a significant summary effect and significant between-group differences (p < 0.05). After applying the FDR method, relevant factors within clusters, i.e. learner control, educational sector, study setting, feedback focus, feedback form, feedback function, assessment developers, assessment type, discipline, and feedback display, were likely not false discoveries and reflected real differences.

Combining context, content, and task factors

A meta-regression was employed to examine the effects of multiple moderators simultaneously. According to Borenstein et al. (2009), a meta-regression model can include approximately one variable per ten interventions; following this guideline, the model could include a maximum of ten moderators. Meta-regressions were first employed separately for the context, content, and task factors (see Part 3 in the Supplementary Materials). However, the number of dummy variables included in these analyses exceeded the maximum of ten moderators, so the reported results should be interpreted with caution: due to the high number of variables, the power of the meta-regression remains low and the resulting models are overly full (see Part 3 in the Supplementary Materials). As a result, the number of variables in the model was minimized by implementing a forward regression focused on factors that practitioners can adapt. This practical approach was adopted to provide practitioners with specific aspects to adjust in instruction and/or instructional feedback beyond the education provided in a DLE. The factors learner control and feedback timing (context factors), feedback form, feedback focus, and feedback function (content factors), and feedback display (task factor) were added stepwise to a basic model with, as a covariate, the variable indicating whether the control group did or did not receive some feedback. This covariate was essential to ensure sufficient quality of the data, because the inclusion of a control group allows comparisons between different groups within a study. In the first step of the forward regression, each candidate model consisted of the covariate and one of the factors; the model with the covariate and feedback focus showed the best improvement in model fit (χ² = 24; df = 6; p ≤ 0.001). In the next step, adding one of the remaining factors (learner control, feedback timing, feedback form, feedback function, or feedback display) did not improve the model fit, leaving the best fit for the model that includes feedback focus alongside the covariate indicating whether the control group did or did not receive some feedback (see Table 5).

Table 5 Results of forward meta-regression model with factors under practitioners control
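A rough sketch of this forward-selection logic is given below using a weighted least-squares approximation of the meta-regression; the reported analyses were run in HLM, and the data frame, column names, and values here are hypothetical.

```python
# Minimal sketch of forward selection for a weighted meta-regression:
# starting from a base model with the control-condition covariate, moderators
# are added one at a time and kept only if they improve model fit (lower AIC).
import pandas as pd
import statsmodels.api as sm

def fit_wls(df, predictors):
    # Dummy-code categorical moderators and weight by inverse variance
    X = pd.get_dummies(df[predictors], drop_first=True).astype(float)
    X = sm.add_constant(X, has_constant="add")
    return sm.WLS(df["g"], X, weights=1.0 / df["variance"]).fit()

def forward_selection(df, candidates, base=("control_received_feedback",)):
    selected = list(base)
    best_aic = fit_wls(df, selected).aic
    remaining = list(candidates)
    improved = True
    while improved and remaining:
        improved = False
        aics = {c: fit_wls(df, selected + [c]).aic for c in remaining}
        best_candidate = min(aics, key=aics.get)
        if aics[best_candidate] < best_aic:   # keep the factor only if fit improves
            selected.append(best_candidate)
            remaining.remove(best_candidate)
            best_aic = aics[best_candidate]
            improved = True
    return selected, best_aic

# Hypothetical intervention-level data
df = pd.DataFrame({
    "g": [0.2, 0.5, 0.8, 0.3, 0.6, 0.1, 0.9, 0.4],
    "variance": [0.04, 0.02, 0.05, 0.03, 0.04, 0.02, 0.06, 0.03],
    "control_received_feedback": [1, 0, 0, 1, 0, 1, 0, 1],
    "feedback_focus": ["task", "process", "process", "task",
                       "self", "task", "process", "self"],
    "feedback_timing": ["immediate", "delayed", "delayed", "immediate",
                        "immediate", "delayed", "delayed", "immediate"],
})
print(forward_selection(df, ["feedback_focus", "feedback_timing"]))
```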

Discussion

The increasing use of DLEs in education calls for a re-examination of existing literature about digitally delivered instructional feedback. Existing practices might apply guidelines derived from feedback in face-to-face settings, despite the different interactions between relevant clusters of factors (i.e. context, content, and task factors) as well as psychological and logistical barriers regarding digitally delivered instructional feedback (see Winstone et al., 2017b; Wu et al., 2019). The present meta-analysis identified the most effective context, content, and task factors of digitally provided feedback to improve learning performance in DLEs from studies published between January 2004 and January 2019. The summary effect of 0.41 (SE = 0.05) for the 116 interventions from 46 articles was moderate.

Context factors

The first focus was on context factors, such as the educational sector, feedback timing, learner control, receiving a reward, and study setting. The context factors feedback timing, learner control, educational sector, and study setting significantly improved learning performance, but this was not the case for receiving a reward.

With regard to feedback timing, both immediate and delayed feedback had significant and strong effects on improving learning performance, with delayed feedback being slightly more effective than immediate feedback. A combination of feedback timing approaches was ineffective. These findings indicate that clarity and consistency, that is, whether participants consistently receive immediate or delayed feedback, are more essential than the actual timing of the feedback. These findings are in line with previous research on immediate feedback (Dihoff et al., 2003) and delayed feedback (Butler et al., 2007).

In general, many activities account for learner control, ranging from basic decisions about the pace of information presentation found in many DLEs (pacing) to determining the order of information units (sequencing), the contents to be processed (selection or content control), or how the contents should be displayed (representation control; see Scheiter & Gerjets, 2007). However, all of these control decisions, if available in the DLE, rely on (a) whether and how the learner is able to decide, and (b) whether (s)he is able to act upon his/her decisions. With respect to learner control, most DLEs in the present meta-analysis have features that automatically deliver instructional feedback to the learners without actions from the teacher. The results of the present meta-analysis do not favour self-paced feedback over pacing decided by others (e.g. teacher and peer) or automatic feedback (which showed a slightly lower effect size). Although Recker and Pirolli (1992) found that learners with better self-regulated learning strategies perform better in DLEs with options for more learner control than learners who are less proficient in implementing self-regulated learning strategies, more learner control (i.e. self-paced feedback) does not always directly lead to better learning performance. The effect sizes for self-paced feedback and pacing decided by others (e.g. teacher and peers) were similar. However, combining learner control options did not positively contribute to learning (performance) and would be advised against based on these findings.

The effects of educational sector were significant for middle school (11–13 years), high school (14–18 years), and higher education (18 years and older, e.g. university [of applied sciences]). Interventions in primary education were not significant (i.e. p = 0.06). Although caution is warranted, the findings generally show that digitally delivered instructional feedback is more effective when learners are older. First, primary and secondary education are considered more supportive contexts for students in terms of responsibility, structure, and guidance than higher education (Rodway-Dyer et al., 2009). Second, metacognition is a skill that develops with age (Efklides, 2008); thus, younger learners (i.e. in primary and secondary education) might rely more heavily and/or solely on external feedback, as opposed to older learners, who are assumed to have better monitoring skills. Older learners might rely more on internal feedback or be better able to process the digitally delivered instructional feedback, whether requested by them or delivered to them. Third, the curriculum in primary and secondary education is designed to teach (comparatively) basic skills, whereas in higher education, the learner oftentimes selects a focused set of courses, sometimes described as a specialization that fits a discipline. Fourth, research has shown that motivation is universally lacking for a range of subjects when learners progress from primary to secondary education (Graham et al., 2016).

Regarding the setting in which the study was conducted, a significant difference was observed between school-based studies and laboratory-based studies, with laboratory-based studies displaying a higher effect size than school-based studies.

Content factors

The second focus of this meta-analysis was on content factors, such as feedback form, feedback focus, and feedback function. Concerning the feedback form, differences were found between simple, more informative, and most informative forms of digitally delivered instructional feedback with respect to learning (performance). All forms showed small (Hedges’ g = 0.21) to high (Hedges’ g = 0.64) effect sizes, with the highest effect for simple feedback (e.g. Knowledge of Result; KR) and the lowest effect for the most elaborate form (e.g. Knowledge of Correct Result, KCR, with an explanation). Some of these effect sizes contrast with previous research. Van der Kleij et al. (2015) found that simple feedback was ineffective, whereas the present sample showed a relatively high effect of simple feedback on learning performance. This difference might be explained by the type of learning outcomes considered: whereas Van der Kleij and colleagues (2015) differentiated between lower and higher order outcomes, such a distinction was avoided in the present meta-analysis to ensure the inclusion of other learning domains, such as social and psychomotor outcomes. In a similar vein, Jaehnig and Miller (2007) concluded that elaborated feedback (similar to the most elaborated feedback category in the present meta-analysis) was more effective than knowledge of correct results (similar to the more elaborated feedback category in the present meta-analysis). In the present sample, by contrast, more elaborated feedback was more effective than the most elaborated feedback, and simple feedback was more effective than more elaborated feedback. As elaboration appeals to the degree of informativeness of the feedback, learners might have more information to process at the same time. As a result, processing the feedback might require more effort and time from the learner and might leave learners unable to process all feedback correctly and in a timely manner (Crews & Wilkinson, 2010). Finally, affective components might also play a role (Goetz et al., 2018), such as an emotional backwash that could inhibit further processing of the feedback (Pitt & Norton, 2017). These emotions might differ between DLEs and face-to-face settings, and, as a result, different types of feedback might evoke different emotions (Saplacan et al., 2018).

In the present sample, the feedback focus was mostly directed towards the task. Foci on the process, self-regulation, and the self were used less often; however, a focus on the process resulted in a higher effect size. Although the number of interventions underlying this effect size was quite limited, process-focused feedback (i.e. feedback focused on the main process needed to understand and/or complete a learning task) seems more beneficial for learning than task-focused feedback (i.e. feedback focused on understanding and/or completing the learning task) and than the combination of task- and process-focused feedback (Hattie & Timperley, 2007; Sadler, 2010). Again, caution is warranted because the effectiveness of the feedback focus depends on the task, stressing the interaction between factors. Task-focused feedback is frequently highly specific and only applicable in nearly identical situations (Hattie & Timperley, 2007); therefore, transfer to other tasks often fails to occur when feedback is about the task (Brookhart, 2008). Despite the high effect size, the effect of self-focused feedback remains inconclusive (see Hattie & Timperley, 2007). From a motivational perspective, the high effect size of self-focused feedback might be explained by the fact that learners adopt defensive or assertive attitudes towards external feedback (i.e. from a teacher, peers, or DLE) to protect a (positive) self-image, and that they become more socially aware, which leads them to adapt their feedback-seeking and feedback-processing behaviours (Linderbaum & Levy, 2010; Tuckey et al., 2002). If the threat of feedback to one’s self-esteem is considered low, it positively influences the learning outcome (Kluger & DeNisi, 1996).

All feedback functions and their combinations had a significant effect on learning performance, ranging from moderate to high effect sizes. The highest effect size (Hedges’ g = 0.88) was found for digitally delivered instructional feedback targeting metacognition, but it was based on a relatively low number of interventions (k = 7). Clustering the feedback functions into the overarching function to which the feedback predominantly belonged (cognition, metacognition, and motivation; in line with Narciss, 2013) was intended to provide a starting point for practitioners to deliver (digitally delivered) instructional feedback to learners. Nevertheless, feedback coded as cognition may also contain aspects of, for example, metacognition, and as a result, the distinctions are not as well-established as intended. In fact, a lack of knowledge and skills in one cluster can be compensated by knowledge and skills in another cluster; for example, better metacognitive skills can compensate for a lack of prior knowledge (see Veenman et al., 2006).

Task factors

The third focus comprised task factors, such as the assessment developers, assessment type, discipline, and feedback display. All task factors showed significant between-group differences (i.e. differences between subcategories). Regarding the factor assessment developers, these differences ranged from moderate (Hedges’ g = 0.48 and 0.54) to high (Hedges’ g = 1.00) effect sizes. The contrast between standardized tests and assessments designed by teachers, albeit interpreted with care due to the low number of articles, is of particular interest, because assessment is one aspect that teachers can influence or adapt. In addition, combining assessments from different developers (i.e. researcher-made and teacher-made) yielded no significant effect on learning (performance), which might be due to differences in the quality of the assessments or the goals for which they were designed.

Regarding the assessment type, oral assignments yielded the highest effect size and a combination of assignment types the lowest. However, given the relatively low number of studies (k = 6), the effect size for oral assignments can be considered an extreme value (Hedges’ g = 0.99). More frequently occurring assessment types (e.g. linguistic questions, calculations, and writing assignments) showed effect sizes ranging from Hedges’ g = 0.25 for a combination of assignments up to Hedges’ g = 0.76 for calculations. The latter effect size for calculations can be explained by the relatively fixed sets of answers that are required, in contrast with the diversity of answers found in writing assignments (e.g. essays). In writing assignments, learners need to understand and critically review information in a particular domain, and they need to utilize language to communicate this to others (Read et al., 2001). Moreover, the level of writing skills might differ as a function of discipline and method of assessment (Bacha, 2001). Nevertheless, for practitioners, it is promising that (digitally delivered) instructional feedback appears effective across a diversity of assignments. Concerning the feedback display, a visualization yielded the highest effect size (Hedges’ g = 0.59), and a combination of a written and visual display yielded the lowest effect size (Hedges’ g = 0.25). Yet, when considered separately, both feedback displays had a significant and positive effect on learning performance.

Besides assessment types, research has shown that there are disciplinary differences that might affect the effectiveness of digitally delivered instructional feedback. The present sample revealed similar effect sizes for language and arts and for science education, but a nonsignificant, near-zero effect size for social sciences. The nature of the tasks in science education, such as calculations and working with formulas, may require different feedback compared to tasks in the social sciences, where the range of valid answer options might be broader, making it more difficult to provide feedback and illustrating the crucial role of acknowledging interactions between factors.

Control condition

In interventions with a control condition, receiving some feedback (irrespective of what the feedback entailed) was more effective than receiving no feedback. However, research has also shown that some feedback is not always preferable to no feedback. For example, Fyfe et al. (2012) found that learners with little prior knowledge of the correct solution strategy benefitted from some feedback, whereas learners with some prior knowledge of the correct solution strategy benefitted from a learning situation without feedback. Yet, viewing feedback effects in a black-and-white fashion (e.g. ‘some’ feedback is better than no feedback, or feedback works vs. it does not work) paints a simplistic and unidimensional picture of the effectiveness of feedback. Practitioners might incorrectly perceive such a view as a one-size-fits-all approach, complicating and possibly thwarting positive effects of (digitally delivered) instructional feedback on learning performance.

Combining context, content, and task factors

A meta-regression was used to examine whether the combination of context, content, and task factors moderates digitally delivered instructional feedback and contributes to learning performance. The meta-regression showed that a feedback situation consisting of process-focused digitally delivered instructional feedback contributed significantly to learning performance. The other factors included in the meta-regression—i.e. learner control, feedback timing, feedback form, feedback function, and feedback display—did not contribute to the effect on learning performance. One explanation might be that the many combinations of context, content, and task factors in the feedback situation do not provide straightforward results. In other words, given that many combinations of factors potentially (positively) contribute to or hinder learning performance, there is no one-size-fits-all solution to (digitally delivered) instructional feedback, and the contribution to learning (performance) is not linear (see Timmers et al., 2013).

Limitations

Despite the exclusion of articles with an overall poor statistical quality, some articles with a sufficient overall quality still had some unclear methodological specifications (e.g. assessment developers). Furthermore, the number of interventions per factor was limited, and it should be noted that the feedback factors included in this study are still a subset of possibly relevant factors; thus, they do not fully reflect the complexity of feedback situations.

Certain feedback factors may also occur more often in combination with specific domains. For example, visual displays (e.g. graphs or diagrams) could appear more often in mathematics compared to languages. Although this might have influenced the effect sizes, the results suggest that the set of feedback factors included in the analyses contained multiple key contributors to learning performance. Despite the appeal of clusters of factors, (digitally delivered) instructional feedback cannot be easily assigned to one single cluster. In fact, feedback factors often cover multiple clusters, such as the assessment type at hand (task factor) which is nested in a domain or discipline within an educational sector (context level).

Finally, the (digital) delivery of instructional feedback neither implies the processing of feedback nor does it lead to enhanced learning performance per se. Most researchers assume in their studies that learners pay attention to the provided feedback, but whether they indeed do so cannot be inferred by relying solely on improvements in performance. Learners are not always skillful in processing feedback, and learning and performance improvement are not automatically guaranteed by the mere provision of feedback (Bevan et al., 2008; Weaver, 2006). Moreover, researchers argue that feedback training is essential to improve learners’ use of feedback (Dressler et al., 2019; Falchikov & Goldfinch, 2000). As a result, the incongruence between researchers’ hopeful assumptions about idealized feedback use and learners’ actual practice (in either laboratory or school settings) calls for critically re-examining instructional feedback practices (Molloy & Boud, 2013), particularly in primary and secondary education, because relatively few interventions from these educational sectors were included in the present meta-analysis.

Future prospects

This meta-analysis showed that the effects of digitally delivered instructional feedback on learning performance are inextricably interwoven with context-, content-, and task-related factors. The results suggest that digitally delivered instructional feedback should focus on the learning process and that situations in which learners have to deal with many combinations of factors should be avoided. The results also prompt researchers to expand research on digitally delivered instructional feedback in primary and secondary education, as these sectors differ from higher education in terms of subjects, degree of learner control, and assessment types. Likewise, more studies are welcomed that elaborate on existing guidelines about effective (digitally delivered) instructional feedback by clarifying the contexts in which these guidelines can and should be applied and which factors contribute to learning performance. Finally, future research can elaborate on how guidelines for instructional feedback, derived from both face-to-face and digital settings, complement each other, so as to offer more contextualized guidelines to practitioners for implementing (digitally delivered) instructional feedback in their classrooms.