1 Introduction

Vegas et al. (2016) reported that crossover designs are popular in software engineering experiments. In their review, they identified 82 papers, of which 33 (i.e., 40.2%) used crossover designs. Furthermore, those 82 papers reported 124 experiments, of which 68 (i.e., 54.8%) used crossover designs. However, they reported that “crossover designs are often not properly designed and/or analysed, limiting the validity of the results”. They also warned against the use of meta-analysis in the context of crossover-style experiments.

As a result of that study, two of us undertook a detailed study of parametric effect sizes from AB/BA crossover studies (see Madeyski and Kitchenham 2018a, b and Kitchenham et al. 2018). We identified the need to consider two mean difference effect sizes and reported the small sample effect size variances and their normal approximations.

As we were undertaking this systematic review,Footnote 1 we found that Santos et al. (2018) had already performed a mapping study of families of experiments. They reported that although the most favoured means of aggregating results was Narrative synthesis (used by 18 papers), Aggregated Data meta-analysis (by which they mean aggregation of experiment effect sizes) was used by 15 studies.

Using Vegas et al. (2016), Madeyski and Kitchenham (2018b) and Santos et al. (2018) as a starting point, we decided to investigate the validity and reproducibility of effect size meta-analysis for families of experiments (Madeyski and Kitchenham 2017). Our goals are to:

  • Identify the effect sizes used and how they were calculated and aggregated.

  • Use the descriptive statistics reported in each study to attempt to reproduce the reported results.Footnote 2

  • In the event that we were unable to reproduce the results, investigate the underlying reason for the lack of reproducibility.

We concentrated on families of experiments as our form of primary studies. We did this (rather than looking at papers that report a meta-analysis after performing a systematic review) because papers reporting a family of experiments are likely to have published sufficient details about the individual studies and their meta-analysis process for us to attempt to validate and reproduce their effect size calculations and meta-analysis. In addition, Santos’s mapping study confirmed the popularity of families of experiments, and emphasized that more families needed to aggregate their results. These two factors indicate the importance of adopting valid meta-analysis processes in the context of families of experiments. Nonetheless, our reproducibility analysis method, based on aggregating descriptive statistics, is the same as would be used to meta-analyse data from experiments found by a systematic review. Thus, the results from this study are likely to be of value for any meta-analysis of software engineering data.

We concentrated on high quality journals not only because such papers usually present reasonably complete descriptions of their results and methods, but also because they attract papers from experienced researchers, which are reviewed by other experienced researchers. Thus, readers of papers in such journals expect the published results to be correct. Invalid results in such papers are therefore likely to have a more serious impact than mistakes in papers published in less prestigious journals or conferences. For example, practitioners may base decisions on invalid outcomes, and novice researchers may adopt incorrect methods.

We present our research questions in Section 2 and our systematic review methods in Section 3. A summary of the primary studies included in our review, a discussion of the validity of the meta-analysis methods used in each study and our reproducibility assessment are in Sections 4, 5 and 6, respectively. We discuss the results of our study in Section 7 and present the contributions of this paper and our conclusions in Section 8.

We also include an Appendix that reports details of our statistical analysis and analysis results not needed to support our main arguments. The Appendix also discusses reproducibility aspects of our study.

2 Research Questions

The research questions (RQs) relating to our systematic review are:

  1. RQ1:

    Which studies that undertook families of experiments have also undertaken effect size meta-analysis?

  2. RQ2:

    What are the characteristics of these studies in terms of methods used for experimental design and analysis?

  3. RQ3:

    What meta-analysis methods were used and were they valid?

  4. RQ4:

    If the meta-analysis methods were valid, can the results be successfully reproduced?

RQ1, RQ2, and the reporting aspects of RQ3 could be addressed directly from information reported in each primary study. To address the validity aspect of RQ3 and RQ4, we reviewed the meta-analysis processes described by each study and then attempted to reproduce first the effect sizes and then the meta-analysis in each primary study. Finally, we compared our results with the reported results. We assumed that it would be possible to conduct a meta-analysis based on the descriptive data and the effect size chosen by the primary study authors, since this is the normal method of performing meta-analysis.

3 Systematic Review Methods

We performed our systematic review (SR) according to the guidelines proposed by Kitchenham et al. (2015). The processes we adopted are specified in the following sections.

3.1 Protocol Development

Our protocol defines the procedures we intended to use for the systematic review, including the search process, the primary study selection process, the data extraction process and the data analysis process. It also identified the main tasks of all the co-authors. The protocol was initially drafted by the first author and reviewed by all the authors. After trialling the specified processes, the final version of the protocol was agreed by all the authors and registered as report W08/2017/P-045 at Wroclaw University of Science and Technology. The following sections are based on the processes defined in the protocol. Any divergences report our actual processes, as opposed to the planned processes described in the protocol. The major deviation between the protocol and the results reported in this paper is that we originally assumed it would be appropriate to concentrate on reproducibility, but as our investigation progressed we realized that we needed to consider the reasons for any lack of reproducibility, that is, to consider in more detail the validity of the meta-analysis process. Furthermore, validity is the key issue, because it is not useful to reproduce an invalid result.

3.2 Search Strategy

In order to address our research questions, we needed to identify papers that reported the use of meta-analysis to aggregate individual studies, reported the results of the individual studies in detail, and were published in high quality journals.

To achieve our search process strategy, we decided to limit our search for families of experiments to the following five journals:

  • IEEE Transactions on Software Engineering (TSE).

  • Empirical Software Engineering (EMSE).

  • Journal of Systems and Software (JSS).

  • Information and Software Technology (IST).

  • ACM Transactions on Software Engineering Methodology (TOSEM).

We restricted ourselves to these journals because they all publish papers on empirical software engineering, and all have relatively high impact factors (among SE journals). These are, therefore, highly respected journals, and we should expect the quality of papers they publish to be correspondingly high.

3.3 SR Inclusions and Exclusions

In this section we present our inclusion and exclusion criteria. Details of the search and selection process, the validation of the search and selection process, and the data extraction process can be found in the supplementary material (Kitchenham et al. 2019b).

Given our research questions, papers to be included in our SR were identified using the following inclusion criteria:

  1. The paper should report a family of three or more experiments. This is the criterion adopted by Santos et al. (2018), and there is more opportunity to detect heterogeneity with three or more studies.

  2. The experiments reported in the paper should relate to human-centric experiments or quasi-experiments that compare SE methods or procedures rather than report observational (correlation) studies with no clear comparisons.Footnote 3

  3. The paper should have been published by one of the five journals identified by our search strategy, see Section 3.2.

  4. The paper should use some form of meta-analysis to aggregate results from the individual studies using standardized effect sizes, i.e., standardized mean difference or point-biserial correlation coefficient (rpb).Footnote 4 These effect sizes are commonly used in software engineering meta-analyses.

The following exclusion criteria were also defined:

  1. The paper was an editorial.Footnote 5

  2. The paper was published before 1999, when Basili et al. (1999) first discussed families of experiments.

3.4 Data Analysis

The results extracted from each primary study allowed us to answer questions RQ1, RQ2 and the methodology element of RQ3. To address the validity element of RQ3 and RQ4 for each primary study, we reviewed carefully the meta-analysis methods reported by the study authors and attempted to reproduce the effect size values and meta-analysis results using the reported descriptive data.

Many of the studies reported multiple metrics and hypothesis tests for each experiment. In all cases, we first attempted to reproduce the effect sizes reported by the authors and then the meta-analysis. We analyzed only the first outcome metric, because we assumed that if the individual effect sizes and the result of meta-analyzing them could be reproduced for that metric, this would establish whether or not the meta-analysis was reproducible without checking the results for every metric. Our assumption (that in our case it is enough to analyze the first outcome metric) was based on the fact that none of the primary studies reported using different methods to calculate effect sizes or to perform meta-analysis for different outcome metrics. In addition, the tables of descriptive statistics and effect sizes were similar for all outcome metrics. There is only one situation where there might be a difference between outcomes for different metrics: if the authors did not maintain the direction as well as the magnitude of the effect size. In that case, if one metric had effect sizes with different directions and another did not, we would agree with the authors where all directions were the same and disagree where they were not. This happened in the case of Study 9 (see Section 6.11).

For each primary study, we compared the effect sizes for each experiment and the overall meta-analysis mean effect size with the results of our calculations. However, we needed some method of deciding whether effect sizes or meta-analysis results had been reproduced, because we did not expect to obtain exactly the same effect size values: our values were obtained from summary statistics, whereas the study authors might have derived their effect sizes from calculations on the raw data. We chose to use a difference of 0.05 between our calculated effect size or meta-analysis mean and the equivalent reported statistic as the criterion for deciding whether there was a reproducibility problem. Our basis for choosing 0.05 was that:

  1. A relative value would unfairly penalize small effect sizes. For example, if a study reported an effect size of 0.01 and we reported an effect size of 0.02, we would have a relative difference of 50% for a difference that could be the result of rounding applied to reported mean values.

  2. Most studies reported descriptive data on metrics, in the range 0 to 1, to two decimal places, so we thought an absolute value of 0.05 might be sufficiently large to allow for differences due to rounding effects arising because our reproducibility statistics were derived from the reported means and variances.

  3. Most studies did not state explicitly whether or not they applied the small sample size adjustment to their standardized effect sizes. For example, for a medium effect size of 0.5 and a sample size of 23 (the median experiment size), the effect of applying the small sample adjustment is to reduce the standardized effect size to 0.48, as illustrated in the sketch below.
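To make the third point concrete, the following is a minimal R sketch of the small sample size adjustment. It assumes the widely used approximation J = 1 - 3/(4 df - 1) with df = N - 2, as for an independent groups comparison; the exact gamma-function form of J gives essentially the same value at this sample size.

    # Small sample size adjustment (assumed approximation J = 1 - 3/(4*df - 1),
    # with df = N - 2 as for an independent groups design).
    small_sample_J <- function(df) 1 - 3 / (4 * df - 1)

    d <- 0.5                       # medium standardized effect size
    N <- 23                        # median experiment size
    g <- small_sample_J(N - 2) * d
    round(g, 2)                    # 0.48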

4 An Overview of the Primary Studies (RQ1 and RQ2)

In this section, we address RQ1 and RQ2 and present an overview of the primary studies included in our systematic review.

4.1 Studies Reporting Meta-analysis of Families of Experiments (RQ1)

The 13 primary studies we included in our SR are shown in Table 1 ordered by inverse publication date.Footnote 6 The table reports the number of experiments in each family and the number of participants in each experiment. We report on the studies in this order throughout this section.

Table 1 Primary studies

Table 2 provides an overview of the goals of each of the studies and the specific techniques they investigated. The technique in boldface (e.g., PBR in study S13) is the treatment technique and the other technique (e.g., CBR) is the control technique. Later in this paper, effect sizes are reported relative to the treatment technique, so positive values indicate that the treatment technique outperforms the control technique and negative values indicate that the control technique outperforms the treatment technique. There are some trends observable in Table 2:

  • Six studies investigated the impact of different UML documentation options (see rows where the techniques are labelled DO to signify Documentation Options).

  • Four studies investigated procedures in the context of maintainability.

  • Four studies investigated requirements issues, three compared specification languages and one investigated proposals for verifying non-functional requirements.

Table 2 Primary study data

4.2 Experimental Methods Used by the Primary Studies (RQ2)

Table 3 presents some information about the individual experiments discussed in each primary study. During data extraction, it became clear that many of our 13 primary studies included experiments with crossover designs. Vegas et al. (2016) warned that the terminology used to describe crossover designs was not used consistently, and we found exactly the same problem with our primary studies (Kitchenham et al. 2019a). Therefore, we used the description of the experimental design provided by the authors to derive our own classification. Understanding the specific experimental design is important in the context of meta-analysis, because the variance of the standardized effect size is different for different designs, see Morris and DeShon (2002) and Madeyski and Kitchenham (2018a, b). In all cases the description was sufficient for us to identify the individual experimental designs. Like Vegas et al., we found that the primary study authors did not adopt our terminology, nor did they use the same terminology as other primary study authors who adopted the same design.

Table 3 Primary study experiment data

The primary studies used only four basic experimental designs, which we discuss in Appendix A.1. To understand the notation used in the rest of the paper, it is important to note that all crossover-style designs have two different types of standardized mean difference effect size (see Morris and DeShon 2002 and Madeyski and Kitchenham 2018b):

  1. An effect size that measures the personal improvement (of an individual or team) performing a task using one method compared with performing the same taskFootnote 7 using another method. We refer to this as the repeated measures standardized effect size, δRM, with an estimate dRM.

  2. An effect size that is equivalent to the standardized mean effect size obtained from an independent groups design (also known as a between participants design). We refer to this independent groups effect size as δIG, with an estimate dIG.

For balanced crossovers (where each sequence group has the same number of participants), effect sizes are calculated as follows (Morris and DeShon 2002; Madeyski and Kitchenham 2018b):

$$ d_{RM}=\frac{\bar{x}_{A}-\bar{x}_{B}}{s_{e}} $$
(1)

where \(\bar{x}_{A}\) is the mean value of the treatment technique observations, \(\bar{x}_{B}\) is the mean value of the control technique observations, and se is the within-participants standard deviation.

$$ d_{IG}=\frac{\bar{x}_{A}-\bar{x}_{B}}{s_{IG}} $$
(2)

where sIG is equivalent to the pooled within groups standard deviation of an independent groups study.

In addition, there is a relationship between the two standard deviations (Madeyski and Kitchenham 2018b):

$$ s_{e}= \sqrt{(1-r)}s_{IG} $$
(3)

where r is the Pearson correlation between the repeated measures. Thus, the effect sizes are also related:

$$ d_{RM}= \frac{d_{IG}}{\sqrt{(1-r)}} $$
(4)
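As an illustration of (1) to (4), the following minimal R sketch computes dIG and dRM from descriptive statistics; the means, standard deviations, group sizes and the correlation r are illustrative values, not data from any primary study.

    # Illustrative descriptive statistics for a balanced crossover experiment.
    mean_A <- 0.70; sd_A <- 0.20; n_A <- 12   # treatment technique
    mean_B <- 0.60; sd_B <- 0.22; n_B <- 12   # control technique
    r      <- 0.5                             # correlation between repeated measures (assumed)

    # Pooled within-groups standard deviation (the independent groups analogue).
    s_IG <- sqrt(((n_A - 1) * sd_A^2 + (n_B - 1) * sd_B^2) / (n_A + n_B - 2))
    d_IG <- (mean_A - mean_B) / s_IG          # equation (2)

    # Within-participants standard deviation and repeated measures effect size.
    s_e  <- sqrt(1 - r) * s_IG                # equation (3)
    d_RM <- (mean_A - mean_B) / s_e           # equation (1), i.e., d_IG / sqrt(1 - r)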

For small sample sizes, Hedges and Olkin (1985) recommend applying a correction to dRM and dIG. We refer to the small sample size corrected effect sizes as gRM and gIG respectively. We prefer not to give these terms generic labels, such as Hedges’ g, because, as Cumming (2012) points out (see page 295), meta-analysis terminology is inconsistent. In terms of names given to standardized effect sizes, dIG is referred to as d by Borenstein et al. (2009) and as g by Hedges and Olkin (1985); gIG is referred to as g by Borenstein et al. (2009) and as d by Hedges and Olkin (1985). In our primary studies, most papers used the term Hedges’ g and one used Cohen’s d, but the papers did not specify whether or not they used the small sample size adjustment. Only Study 13 explicitly defined Hedges’ g to be what we refer to as dIG and used the term d for what we refer to as gRM.

In Table 3, we also report whether the data were analyzed using parametric (P) or non-parametric (NP) tests for the individual experiments. Four of the studies used non-parametric tests or parametric tests depending on the outcome of tests for normality. Study 13 and Study 14 performed both non-parametric and parametric tests, but only reported the results of the parametric tests since the outcomes of both tests were consistent. It is important to note that many of the crossover studies did not analyze their data correctly: they used independent groups tests rather than repeated measures tests. We annotated three studies as partly valid because they used tests that catered for repeated measures, but may have delivered slightly biased results if time period effects or material effects were significant (see Appendix A.1.3).

5 The Validity of Meta-analysis Procedures Used by the Primary Studies (RQ3)

In this section, we discuss the methods used by the primary study authors. In Table 4, we summarize issues related to meta-analysis, including the effect size names used by the authors, our assessment of the effect size the authors aggregated, which meta-analysis tools were used and whether heterogeneity was investigated. We discuss these results in this section. However, the main focus of this section is to assess the validity of the meta-analysis procedures used in each primary study. This validity assessment was made from reading the report of the meta-analysis processes and the meta-analysis results reported in each primary study. It was intended to identify incorrect or incomplete reporting of the meta-analysis process and any obvious violations of meta-analysis principles. In Section 5.1, we explain the recommended methods for analyzing standardized mean difference effect sizes; then, in Section 5.2, we discuss the methods used by the primary study authors and highlight any potential validity problems with their meta-analysis method.

Table 4 Meta-analysis methods

5.1 Standard Procedures for Meta-analysis

The usual method for aggregating standardized mean effect sizes, such as Hedges’ g, is to construct a weighted average using the inverse of the effect size variance (see, for example, Hedges and Olkin 1985; Lipsey and Wilson 2001; Borenstein et al. 2009):

$$ \overline{ES}=\frac{{\sum}^{k}_{i=1}w_{i}ES_{i}}{{\sum}^{k}_{i=1}w_{i}} $$
(5)

where ESi is the calculated effect size of the i-th experiment, k is the number of experiments, \(\overline{ES}\) is the mean effect size, and wi is an appropriate weight. It is customary to use the inverse of the effect size variance as the weight, i.e., wi = 1/(var(ES)i), where the formula for var(ES)i depends both on the study design (Morris and DeShon 2002; Madeyski and Kitchenham 2018b) and on the specific effect size. However, Hedges and Olkin (1985) make it clear that the use of the variance is based on large sample theory. In practice, using the estimate of ESi in the equation for its variance when sample sizes are small leads to biased weights and a biased estimate of \(\overline{ES}\). They point out that a weight based on the number of observationsFootnote 8 would lead to a pooled estimate that is unbiased but less precise. Such weights are close to optimal when the population mean is close to zero and the number of observations is large.

Equation (5) assumes a fixed effects meta-analysis but a random effects analysis is also usually based on the effect size variance. Also, in the case of a fixed effect analysis, the variance of \(\overline {ES}\) is obtained from the equation:

$$ var(\overline{ES})=\frac{1}{{\sum}^{k}_{i=1}w_{i}} $$
(6)

Equation (5) is also used for aggregating the unstandardized effect size (UES), although in this case var(UES)i is the square of the standard error of the mean difference.
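As a concrete illustration of (5) and (6), the following base R sketch aggregates three effect sizes using inverse-variance weights; the effect sizes and variances are illustrative values, not taken from any primary study.

    # Fixed effects aggregation following (5) and (6); ES and v are illustrative.
    ES <- c(0.45, 0.30, 0.62)        # per-experiment effect sizes
    v  <- c(0.08, 0.10, 0.12)        # per-experiment effect size variances
    w  <- 1 / v                      # inverse-variance weights

    ES_bar  <- sum(w * ES) / sum(w)  # equation (5)
    var_bar <- 1 / sum(w)            # equation (6)
    c(mean = ES_bar, se = sqrt(var_bar))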

There are two main meta-analysis models: a fixed effects model and a random effects model. Equations (5) and (6) are appropriate for a fixed effects model, where we assume that the data from each experiment arise from the same population.

A random effects model assumes that data from individual experiments arise from different populations, each of which has its own population mean and variance. A random effects analysis estimates the excess variance due to the different populations by comparing the variance between experiment means with the within experiment variance. In practice, a random effects analysis replaces var(ES)i with a larger revised variance that includes both the within experiment variance and the between experiment variance. In the case of a family of experiments, we would expect a priori that the experiments were closely controlled replications and that a fixed effects analysis would be appropriate. However, a random effects analysis will give the same results as a fixed effects analysis in the event that the effect sizes are homogeneous, so we would recommend defaulting to a random effects method. Such an approach would address the common issue, also mentioned by Santos et al. (2018), of using fixed effects models when, due to the heterogeneity of effects, random effects models would be preferred.
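The point that a random effects analysis reduces to the fixed effects result when no between-experiment variance is detected can be checked with the R metafor package. The sketch below reuses the illustrative values from the previous sketch and the standard rma() interface; it is not an analysis of any primary study.

    library(metafor)

    yi <- c(0.45, 0.30, 0.62)           # illustrative effect sizes
    vi <- c(0.08, 0.10, 0.12)           # illustrative variances

    fe <- rma(yi, vi, method = "FE")    # fixed effects model, equivalent to (5) and (6)
    re <- rma(yi, vi, method = "REML")  # random effects model (REML estimate of tau^2)

    # If the REML estimate of the between-experiment variance (tau^2) is zero,
    # the two pooled estimates and their standard errors coincide.
    c(fixed = coef(fe), random = coef(re))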

5.2 Meta-analysis Methods Used by the Primary Studies

None of the primary studies aggregated the unstandardized effect size. However, twelve studies reported effect sizes they referred to either as Hedges’ g or as a related standardized effect size (Cohen’s d, γ and d). Apart from Study 13, none of the papers that used crossover-style experiments mentioned the possibility of two different effect sizes, so we assume that they all attempted to aggregate the effect size equivalent to an independent groups study (i.e., dIG or gIG).

Study 1 and Study 4 both reported calculating Hedges’ g, but their description did not mention applying the small sample size adjustment, so we assume they reported what we refer to as dIG. They also reported converting to a correlation-based effect size (usually referred to as the point-biserial correlation, rpb; Rosenthal 1991). This can easily be calculated from the standardized effect size using the following formula (see Borenstein et al. 2009; Lipsey and Wilson 2001):

$$ r_{pb}=\frac{d_{IG}}{\sqrt{d^{2}_{IG}+a}} $$
(7)

where a = 4 for a balanced experiment. After constructing rpb, it is necessary to apply Fisher’s normalizing transformation (Fisher 1921). The resulting transformed variable for experiment i is referred to as zi, and the set of zi values can be aggregated using the following equation (which is equivalent to (5)):

$$ \bar{Z}=\frac{{\sum}_{i} w_{i}z_{i}}{{\sum}_{i} w_{i}} $$
(8)

The only mistake Study 1 and Study 4 made in the description of their meta-analysis was that the authors reported using a weight wi = 1/(N − 3), where wi is the weight for the i-th experiment. In fact, the variance of rpb, after applying the Fisher normalizing transformation, is vi = 1/(N − 3) and the weight is wi = 1/vi = (N − 3), which ensures that the largest studies are given most weight in the aggregation process (Lipsey and Wilson 2001). In addition, the authors of Study 4 reported using a t-test for independent groups, so they may have used the number of observations rather than the sample size to calculate weights (and the overall variance).
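The following base R sketch shows the correct sequence of transformation, weighting (with wi = N − 3, not 1/(N − 3)) and back-transformation; the effect sizes and sample sizes are illustrative values.

    # Aggregation via r_pb and Fisher's transformation, following (7) and (8).
    d <- c(0.45, 0.30, 0.62)          # illustrative d_IG values (balanced designs)
    N <- c(24, 20, 30)                # number of participants per experiment

    r_pb <- d / sqrt(d^2 + 4)         # equation (7) with a = 4
    z    <- atanh(r_pb)               # Fisher's normalizing transformation
    w    <- N - 3                     # correct weight: the inverse of var(z) = 1/(N - 3)

    z_bar <- sum(w * z) / sum(w)      # equation (8)
    var_z <- 1 / sum(w)               # variance of the weighted mean
    r_bar <- tanh(z_bar)              # back-transform to the correlation scale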

In principle, transformation to rpb is a valid analysis method, since it avoids the probable bias in calculating the variance of dIG for small sample sizes. For this reason, we used it as the basis of our reproducibility analysis, and we report the method in detail in Appendix A.2.

An important implication of using the normalizing transformation of rpb is that the variance of each transformed value is var(ri) = 1/(ni − 3), so, using (6):

$$ var(\overline{r_{pb}})=\frac{1}{{\sum}^{k}_{i=1}{(n_{i}-3)}} $$
(9)

This means that if researchers mistakenly believe the variance is based on the number of observations rather than the number of participants, they will assume that the variance of each rpb is 1/(2ni − 3) after transformation, and will substantially underestimate the variance of the average effect size \(\overline {r_{pb}}\).
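The size of this underestimate is easy to quantify, as the following sketch shows for the median experiment size used in Section 3.4.

    # Effect of basing the variance on observations rather than participants,
    # for a crossover experiment with n participants (and hence 2n observations).
    n <- 23                              # median experiment size (Section 3.4)
    var_participants <- 1 / (n - 3)      # correct variance of the transformed r_pb
    var_observations <- 1 / (2 * n - 3)  # mistaken variance
    sqrt(var_observations / var_participants)  # about 0.68: standard errors roughly 30% too small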

Four studies (i.e., Study 2, Study 5, Study 9 and Study 10) reported an effect size that they referred to as Hedges’ g. They also reported an aggregation method that, like Study 1 and Study 4, used (8), and they also made the same mistake with their description of the weight. However, they did not explicitly confirm that they transformed their effect size to a correlation, so we cannot be sure whether these studies aggregated the standardized effect sizes directly but mistakenly assumed that the variance of each effect size was 1/(ni − 3), or omitted to mention that they used the rpb transformation. Of these four studies, only Study 2 used an analysis that considered repeated values, so the other studies might have used a variance based on 1/(2ni − 3).

Study 3, Study 7 and Study 11 all made a mistake with their basic meta-analysis. They all used an AB/BA crossover design (although Study 3 also used an independent groups design for one of its 5 experiments). In each crossover study they estimated a standardized effect size for each time period separately. So for each AB/BA experiment they calculated two different estimates of dIG, one for time period 1 and the other for time period 2. It is incorrect to aggregate such effect sizes because the same participants contributed to each estimate of dIG, and, hence, the two effect sizes from the same experiment were not independent. This violates one of the basic assumptions of meta-analysis that each effect size comes from an independent experiment. The effect of this error is to increase the degrees of freedom attributed to tests of significance associated with the average effect size.

Study 6 reported using Cohen’s d and aggregating their values using a weighted mean and the META 5.3 tool. They referenced Hedges and Olkin (1985), which did not report methods for meta-analysing crossover designs, so we assume that the authors aggregated dIG but do not know how they calculated their weights.

Study 8 reported and aggregated rpb but used a different method from that used by Study 1 and Study 4. We describe the method they used in Appendix A.3. From the viewpoint of validity, a critical issue is that they derived rpb from the one-sided p-value of their statistical tests. For each experiment in the family and for each metric, they used either the Mann-Whitney-Wilcoxon (MWW) test or the t-test depending on the outcome of a normality test. However, Study 8 used statistical tests appropriate for independent groups studies, although the family used 4-group crossover experiments, so the resulting p-values are likely to be invalid. Nonetheless, the study authors were attempting to use a meta-analysis process that would allow them to aggregate their parametric and non-parametric results. The authors reported the heterogeneity of their experiments, but as pointed out in Appendix A.3, the heterogeneity was probably over-estimated.
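The exact formula Study 8 used to convert p-values to rpb is (33) in Appendix A.3 and is not reproduced here. The sketch below uses the common Rosenthal-style conversion rpb = Z/√N as a stand-in assumption, which may differ in detail from the authors' formula.

    # Converting a one-sided p-value to a point-biserial correlation.
    # Assumption: the common conversion r = Z / sqrt(N); the formula actually
    # used by Study 8 is (33) in Appendix A.3 and may differ in detail.
    p_one_sided <- 0.01              # illustrative one-sided p-value
    N           <- 24                # illustrative number of participants
    Z    <- qnorm(1 - p_one_sided)   # standard normal deviate corresponding to p
    r_pb <- Z / sqrt(N)              # about 0.47 for these values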

Study 13 reported a standardized effect size based on team improvement, which we refer to as gRM. The authors also reported dIG for each experiment, which they referred to as Hedges’ g, but they did not aggregate it. They estimated the variance of dRM but did not cite the origin of the formula they used. They used Hedges’ Q statistic (see (19)) to test for heterogeneity. The test failed to reject the null hypothesis (i.e., their p-value was greater than 0.05), and they reported what appears to be the unweighted mean of the effect sizes.

Study 14 referred to their effect size as γ for 4 separate hypotheses. However, the hypothesis we believe to be most relevant to investigating the difference between the techniques was based on the difference between the personal improvement observed among participants in one treatment group and the personal improvement among participants in the other group. This is a difference of differences analysis for which it is correct to use the independent groups t-test. However, γ cannot be easily equated to either dRM or dIG. For purposes of analysis, the difference data can be analysed as an independent groups study, but for purposes of interpretation, the mean difference measures the average individual improvement after the effect of skill differences is removed. They report both the weighted and unweighted overall mean. As explained in Appendix A.1.1, the weight was based on the inverse of the variance of γ and was calculated using the formula for the moderate sample-size approximation of the variance of gIG. They also tested for heterogeneity using the Q statistic proposed by Hedges and Olkin (1985), which depends on the effect size variance.

Both Study 13 and Study 14 also aggregated one-sided p-values, as described in Appendix A.4, in order to test the null hypothesis of no difference between techniques.
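The aggregation procedure itself is given in Appendix A.4. A combination rule based on the sum of the natural logarithms of independent one-sided p-values is Fisher's method, sketched below under the assumption that this is the rule intended.

    # Fisher's method for combining k independent one-sided p-values
    # (assumed to correspond to the procedure described in Appendix A.4).
    p  <- c(0.04, 0.10, 0.02)        # illustrative one-sided p-values
    X2 <- -2 * sum(log(p))           # chi-square statistic with 2k degrees of freedom
    p_combined <- pchisq(X2, df = 2 * length(p), lower.tail = FALSE)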

The majority of primary study authors used the Meta-Analysis v2 tool (BioStat 2006) for aggregation, although Meta-Analysis v2 does not support aggregation of results from crossover design studies.

As mentioned by Santos et al. (2018), although many researchers used non-parametric methods for at least some of their individual experiments (see Table 3), they subsequently used parametric effect sizes. This is somewhat inconsistent but not necessarily invalid. It would certainly be inappropriate for studies that used both parametric and non-parametric methods to aggregate non-parametric effect sizes and parametric effect sizes in the same meta-analysis, so some consistent effect size metric is necessary.

The advantage of using the standardized mean difference is that the central limit theorem confirms that mean differences are approximately normally distributed irrespective of the underlying distribution of the data. The problem with standardized effect sizes is that the estimate of the variance of the data within each experiment, which is used to calculate the standardized effect size, may be biased for small sample sizes. However, the variance of the mean effect sizes for each experiment calculated as part of any random effects meta-analysis puts an upper limit on the variance of the overall mean effect size. In addition, aggregating non-parametric effect sizes is currently not feasible: there are no well-defined guidelines identifying which non-parametric effect sizes to use, nor how they might be aggregated.

Only three of the primary studies considered heterogeneity. Study 8 and Study 13 reported non-significant heterogeneity. Study 14 reported significant heterogeneity and reported both a weighted and an unweighted mean. Only Study 2 explicitly mentioned using a fixed effects meta-analysis. Since the other studies made no mention of heterogeneity or of using any specific meta-analysis model, we assume that they also undertook fixed effects meta-analysis.

6 The Reproducibility and Validity of the Primary Study Meta-analyses (RQ4)

This section reports our reproducibility assessment and incorporates it with the validity analysis reported in Section 5, since it makes little sense to investigate the reproducibility of invalid meta-analyses. In turn, our reproducibility assessment allowed us to investigate further the validity of the meta-analysis processes adopted in each paper, from the viewpoint of whether processes that were valid in principle were also applied correctly in practice. In Section 6.1, we describe the method we used for our reproducibility assessment. In Section 6.2, we report the overall results of the reproducibility assessment, and in the following sections, we discuss the reproducibility results for each study in the context of the validity assessment reported in Section 5.2.

6.1 Reproducibility Assessment Process

For reproducibility, as far as possible, we used the same method for each study. To construct the effect size, we used the following process:

  1. From the descriptive statistics reported in the study, we used (2) to calculate the standardized effect size appropriate for independent groups, dIG. Our estimate of \(s^{2}_{IG}\) was usually based on the pooled within-technique variance. However, in the case of Study 3, Study 7 and Study 11, \(s^{2}_{IG}\) was based on the pooled within-cell variance, where a cell is defined as a set of observations that were obtained under exactly the same experimental conditions (see Appendix A.1.2).

  2. We applied the exact small sample size adjustment J (see (14)) to calculate the effect size gIG.

This is the standard starting point for any meta-analysis when raw data is not available. To aggregate the effect sizes:

  1. We transformed the gIG values to rpb and applied Fisher’s normalizing transformation (Fisher 1921).

  2. We used the R metafor package (Viechtbauer 2010) to fit a random effects model using its default method, which is restricted maximum-likelihood estimation (REML).

  3. We back-transformed our meta-analysis results to the standardized mean difference.

This approach is described in more detail in Appendix A.2. It was the same as that undertaken by Abrahão et al. (2011), which has the advantage of being appropriate for all the experimental designs used in our primary studies and does not rely on information, such as the variances of standardized effect sizes, that was not well known to SE researchers.
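As an illustration of this pipeline, the following R sketch uses the metafor package (Viechtbauer 2010); the descriptive statistics are illustrative values, not data from any primary study, and the variable names are our own.

    library(metafor)

    # Illustrative descriptive statistics for a family of three experiments
    # (means, standard deviations and group sizes for treatment A and control B).
    descr <- data.frame(
      mA = c(0.72, 0.68, 0.75), sA = c(0.18, 0.20, 0.16), nA = c(12, 10, 15),
      mB = c(0.63, 0.61, 0.66), sB = c(0.20, 0.22, 0.17), nB = c(12, 10, 15)
    )

    # Step 1: small sample size adjusted standardized mean difference per experiment.
    es <- escalc(measure = "SMD",
                 m1i = mA, sd1i = sA, n1i = nA,
                 m2i = mB, sd2i = sB, n2i = nB, data = descr)

    # Step 2: transform to r_pb and apply Fisher's normalizing transformation.
    g    <- as.numeric(es$yi)
    N    <- descr$nA + descr$nB
    r_pb <- g / sqrt(g^2 + 4)            # equation (7), balanced groups
    z    <- atanh(r_pb)
    vz   <- 1 / (N - 3)                  # variance of each transformed value

    # Step 3: random effects meta-analysis using metafor's REML default.
    res <- rma(yi = z, vi = vz, method = "REML")

    # Step 4: back-transform the pooled estimate to r_pb and then to the
    # standardized mean difference scale.
    r_bar <- tanh(as.numeric(coef(res)))
    d_bar <- 2 * r_bar / sqrt(1 - r_bar^2)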

The three main deviations from this method were:

  1. For Study 8, we reported our results in terms of the point-biserial correlation (i.e., rpb) because Study 8 reported and aggregated rpb.

  2. For Study 13, descriptive statistics were not reported explicitly and we estimated the mean difference and standard deviations from the reported graphics. In addition, Study 13 explicitly reported the statistics we refer to as gRM and dIG, so we reported both effect sizes and, like the study authors, aggregated the gRM values.

  3. In Study 14, the authors reported the personal improvement results for each participant, which is equivalent to dRM. So, to report comparable effect sizes, we calculated the descriptive statistics from the reported descriptive difference data (i.e., the post-training results minus the pre-training results).

Assuming the descriptive data was reported correctly, our meta-analyses should provide more trustworthy results for studies that used an invalid meta-analysis process (in particular, Study 3, Study 7 and Study 11). However, as explained in Appendix A.1.2, if materials or time period effects are significant, our estimates of \(s^{2}_{IG}\) will be inflated, which would lead to underestimates of dIG. Also, if there were significant interactions between either time period or materials and technique, such effects would also inflate \(s^{2}_{IG}\).

We defined results to be reproducible if the difference between the individual experiment effect sizes and the overall effect size reported in the primary study and those we calculated from the descriptive statistics was less than 0.05, as discussed in Section 3.4. We also compared the probability levels for the overall effect sizes. We expected that primary studies that did not appreciate the impact of repeated measures would report smaller p-values than ours. As discussed in Section 3.4, we only analyzed one measure per primary study.

6.2 Reproducibility Assessment Results

Table 5 displays the calculated effect sizes and reported effect sizes for each experiment and each effect size reported in each study. The variable Type refers to the effect size reported in the row. None of the studies apart from Study 7, Study 11 and Study 13 mentioned the small sample adjustment factor, so we assume that the standardized mean difference effect size reported by the authors is dRM. Study 13 reported both dIG and gRM, but aggregated gRM and the one-sided p-value. Study 7 and Study 11 reported two values that they called Hedges’ g. The value in their main tables was the small sample size adjusted standardized mean difference effect size, but they aggregated the non-adjusted effect size. The final column labelled RR (i.e., Results Reproduced) reports the number of times the absolute difference between the reported and calculated effect sizes was less than 0.05 for all relevant entries. The studies for which all standardized effect sizes were reproduced are highlighted. We were only able to reproduce all standardized effect sizes for Study 2, Study 5 and Study 6, although for Study 14, we also reproduced the authors’ aggregation of p-values.

Table 5 Calculated and reported effect sizes

Table 6 displays the calculated and reported overall mean values for the effect sizes plus (if available) the p-value of the mean, the upper and lower confidence interval bounds (UB and LB), QE, which is the heterogeneity test statistic, and QEp, which is the p-value of the heterogeneity statistic. The column RR identifies whether the difference between the calculated overall mean and the reported overall mean was greater than 0.05 (the studies for which this is the case are highlighted). The mean of the standardized effect sizes was reproduced for seven studies: Study 2, Study 5, Study 6, Study 8, Study 10, Study 11, and Study 13. However, Study 8 and Study 11 must be discounted because of validity problems.

Table 6 Overall mean values of effect sizes reported and calculated

The reproducibility results are collated with the validity assessment for each study, and are discussed in the following sections. In each section, the validity problems identified in Section 5 are identified in the paragraphs labelled “Meta-Analysis Validity Issue”. Critical issues that invalidate the aggregation performed by the authors are identified. If the reproducibility failed or was otherwise deemed invalid, we include a “Cause of Problem” paragraph. Validity issues identified as a result of our reproducibility assessment are identified as meta-analysis process implementation errors in the “Cause of Problem” paragraph.

6.3 Study 1 Validity and Reproducibility

Meta-Analysis Method Validity Issues: None.

Author’s Aggregation Method: Weighted mean of dRM based on transforming to and from rpb.

Our Aggregation Method: Weighted mean of gIG based on transforming to and from rpb, as described in Appendix A.2.

Individual Effect Size Reproducibility: Failed.

Mean Effect Size Reproducibility: Failed.

Cause of Problem: Meta-analysis process implementation error - Incorrect use of meta-analysis tool.

Comments: Although we could not detect any validity problems with Study 1, and we based our meta-analysis on rpb derived from gIG, we could not reproduce the effect sizes nor the meta-analysis results. The study reported substantially smaller effect sizes, both for individual experiments and overall, than the ones we calculated. We contacted Prof. Abrahão, who was the first author of this paper. She very kindly provided us with the raw data used in Study 1. Using Prof. Abrahão’s raw data, we recalculated gIG for each study and aggregated the data after transforming to rpb and following the process described in Appendix A.5. Prof. Abrahão agreed with our analysis of her raw data. She also confirmed that she was attempting to calculate the matched pairs effect size (i.e., gRM).

The low values she obtained were due to several different factors. The most significant issue was that she used the Meta-Analysis-V2 tool (BioStat 2006), which does not support crossover designs, although it does support matched pairs studies. The tool attempts to calculate gIG, not gRM.Footnote 9

6.4 Study 2 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: It is unclear whether the paper aggregated the standardized effect size dIG directly or used the transformation to rpb.

Meta-Analysis Method Validity Issue 2: The weights and variances may have been based on the number of observations rather than the number of participants.

Author’s Aggregation Method: Unclear. Either the weighted mean of dIG based on transforming to and from rpb or the weighted mean of dIG with weight = N-3.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Succeeded.

Mean Effect Size Reproducibility: Succeeded.

Comments: According to our criteria, Study 2 was fully reproduced with respect to the individual effect sizes and the weighted mean of the effect sizes. However, there is a difference with respect to the p-values for the overall mean that is consistent with using the number of observations rather than the number of participants when calculating the variance of the effect size.

6.5 Study 3 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: Critical validity issue - Incorrect meta-analysis of non-independent effect sizes.

Meta-Analysis Method Validity Issue 2: Unclear whether the authors aggregated dIG or rpb.

Meta-Analysis Method Validity Issue 3: The weights and variances may have been based on the number of observations rather than the number of participants for AB/BA crossover experiments.

Author’s Aggregation Method: Unclear. Either the weighted mean of dIG based on transforming to and from rpb or the weighted mean of dIG with weight = N-3.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Failed (4), Succeeded (1).

Mean Effect Size Reproducibility: Failed.

Cause of Problem: Critical validity issue.

Comments: Study 3 used different experiment designs. Four experiments were AB/BA crossover experiments; the fifth experiment was an independent groups study. We were able to reproduce the effect size for the fifth experiment.

It is important to note that even though Study 3 used two different experimental designs, once comparable effect sizes are constructed, in this case gIG, results from all experiments can be aggregated. Thus, we provide corrected effect sizes and an overall meta-analysis, using the reported descriptive statistics to calculate gIG for each experiment, followed by aggregation of normalized rpb values.

6.6 Study 4 Validity and Reproducibility

Meta-Analysis Method Validity Issues: The study might have based weights and variances on the number of observations rather than the number of participants.

Author’s Aggregation Method: Weighted mean of dIG based on transforming to and from rpb.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Failed.

Mean Effect Size Reproducibility: Failed.

Cause of Problem: Meta-analysis process implementation error - Incorrect use of meta-analysis tool.

Comments: Like Study 1, Study 4 reported transforming its standardized effect size to rpb but could not be reproduced. Like Study 1, it reported significantly smaller effect sizes, both for individual experiments and overall, than the ones we calculated. Prof. Abrahão was a co-author of this paper, but she informed us that the raw data for Study 4 were no longer available. However, since the pattern of results was similar to Study 1 (i.e., the experiment effect sizes were smaller than the ones we calculated), it is likely that the analysis suffered from the same problems.

6.7 Study 5 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: The study might have based weights and variances on the number of observations rather than the number of participants.

Meta-Analysis Method Validity Issue 2: Unclear whether the authors aggregated dIG or rpb.

Author’s Aggregation Method: Unclear. Either the weighted mean of dIG based on transforming to and from rpb or the weighted mean of dIG with weight = N-3.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Succeeded.

Mean Effect Size Reproducibility: Succeeded.

Comments: Despite uncertainty about which effect size was aggregated, Study 5 was successfully reproduced both at the individual experiment level and at the overall meta-analysis level. The largest discrepancy occurred for the first experiment results. This was due to a probable rounding error. The mean values of Ueffec for the first experiment (E-UL) in Table 7 of Fernández-Sáez et al. (2016) are 0.76 for Low LoD and 0.76 for High LoD, so we calculated the mean difference (and the effect size) to be zero. In fact, Study 5 reports a standardized effect size of − 0.046 (see Fernández-Sáez et al. 2016, Fig. 4).

Study 5 did not explicitly report the confidence intervals on the mean standardized effect size, but visual inspection of their forest plot (Fernández-Sáez et al. 2016, Fig. 4) suggests an interval of approximately [− 0.25, 0.4], which is narrower than the interval we calculated, [− 0.343, 0.612]. So, Study 5 might have underestimated the standard error of the mean standardized effect size.

6.8 Study 6 Validity and Reproducibility

Meta-Analysis Method Validity Issue: The study might have based weights and variances on the number of observations rather than the number of participants.

Aggregation Method: Based on dIG but not specified in detail.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Succeeded.

Mean Effect Size Reproducibility: Succeeded.

Comments: Study 6 was successfully reproduced both for individual effect sizes and for the overall mean effect sizes. All discrepancies appear to have occurred because we calculated the small sample size adjusted values. The non-adjusted values for the three experiments are Exp1 = 0.579, Exp2 = 0.3517 and Exp3 = 0.5793, which are very close to the reported values.

6.9 Study 7 Validity and Reproducibility

Meta-Analysis Method Validity Issue: Critical validity issue - Incorrect meta-analysis of non-independent effect sizes.

Author’s Aggregation Method: Weighted mean of dIG for each time period.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Failed.

Mean Effect Size Reproducibility: Failed.

Cause of Problem: Critical validity issue.

Comments: Like Study 3, Study 7 calculated standardized effect sizes separately for each time period. Since the meta-analysis aggregation was invalid, we report our estimates of the effect sizes for each experiment and their overall mean.

We note, however, that the first time period analysis the authors performed is a valid independent groups analysis (see Senn 2002, Section 3.1.2), so a meta-analysis based on the first time period data from all participants provides a valid estimate of dIG and its variance. Compared with an analysis of data from both time periods, the analysis is based on one set of materials rather than two, and the estimate of dIG may be biased if the randomization to groups was not sufficient to balance out skill differences. However, it is not affected by any technique by time period or technique by order interactions.

6.10 Study 8 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: Wrongly used p-values from independent groups tests to calculate rpb.

Meta-Analysis Method Validity Issue 2: Used the number of observations in their heterogeneity assessment instead of the number of participants.

Author’s Aggregation Method: Weighted mean of rpb based on the Hunter-Schmidt method (Hunter and Schmidt 1990).

Our Aggregation Method: Aggregation of rpb derived from gIG.

Individual Effect Size Reproducibility: Failed.

Mean Effect Size Reproducibility: Succeeded due to accidental correctness.

Cause of Problem: Meta-analysis process implementation error - Inconsistency between reported p-values and calculated effect sizes.

Comments: Study 8 was reproduced for three of the four effect sizes and the overall mean. The largest discrepancy was found for the first experiment.

We based our estimate of rpb on gIG, whereas the authors used (33), so discrepancies might have been due to the different methods of calculating rpb. Table 7 summarises our attempt to reproduce the effect size calculations used by the authors from the initial p-values. The p-values reported by the authors are shown in the first row, with their equivalent Z-values in row 2. The first issue is that the p-value for the first experiment is large while the other p-values are small, which leads to both positive and negative Z-values. The published box plots all had medians for the control that were smaller than the medians for the technique treatment, so we would expect all the studies to have small p-values for tests (assuming the authors calculated the probability that the control group exhibited larger values than the treatment group). Thus, it appears that the value for p(Exp1) is anomalous and could be a typographical error. Furthermore, applying their procedure to the p-values, we did not obtain values of rpb any closer to their reported values than the values we obtained starting from our estimates of gIG, whether we used the number of observations (see row 4, rpb(NO)) or the number of participants (see row 5, rpb(NP)) in Table 7.

Table 7 Calculating rPB effect size from probabilities

Thus, although the overall mean rpb value we obtained is very close to the overall mean reported by the authors, the process used to derive the individual effect sizes could not be reproduced.

6.11 Study 9 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: Unclear whether the authors aggregated dIG or rpb.

Meta-Analysis Method Validity Issue 2: The study might have based weights and variances on the number of observations rather than the number of participants.

Author’s Aggregation Method: Unclear. Either the weighted mean of dIG based on transforming to and from rpb or the weighted mean of dIG with weight = N-3.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Failed.

Mean Effect Size Reproducibility: Failed.

Cause of Problem: Meta-analysis process implementation error - Authors ignored effect size direction.

Comments: Study 9 was not reproduced either in terms of individual effect sizes or in terms of the overall mean. Looking at the effect sizes, it is clear that the authors of Study 9 aggregated the absolute mean effect sizes for each experiment, and so overestimated the overall effect size.

This is the only case in which it is possible for the results of a meta-analysis process using one metric to differ, with respect to reproducibility, from the results obtained using another metric. If all effect sizes of the other metric were in the same direction, using the absolute effect size would not cause a reproducibility problem. This is in fact the case for the other metric used in this study.

6.12 Study 10 Validity and Reproducibility

Meta-Analysis Method Validity Issue 1: Unclear whether the authors aggregated dIG or rpb.

Meta-Analysis Method Validity Issue 2: The study might have based weights and variances on the number of observations rather than the number of participants.

Author’s Aggregation Method: Unclear. Either the weighted mean of dIG based on transforming to and from rpb or the weighted mean of dIG with weight = N-3.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Not reported.

Mean Effect Size Reproducibility: Succeeded.

Comments: Study 10 did not report individual experiment effect sizes, nor any p-values for the meta-analysis, but it did report an overall effect size very close to our calculation.

6.13 Study 11 Validity and Reproducibility

Meta-Analysis Method Validity Issue: Critical validity issue - Incorrect meta-analysis of non-independent effect sizes.

Author’s Aggregation Method: Weighted mean of dIG for each time period.

Our Aggregation Method: As for Study 1.

Individual Effect Size Reproducibility: Not reported.

Mean Effect Size Reproducibility: Succeeded due to accidental correctness.

Cause of Problem: Critical validity issue.

Comments: Like Study 3 and Study 7, Study 11 calculated standardized effect sizes separately for each time period. In this case, however, we found an example of accidental correctness. The Study 11 mean effect size was reproduced because the effect sizes were extremely close for the two time periods, so constructing an average effect size for each experiment gave very similar results to treating the results of each time period as separate experiments. What is noticeable is that the reported p-value was considerably lower than the one we calculated. This was because the authors believed they had six effect sizes in their meta-analysis rather than three.

As for Study 7, the first time period meta-analysis reported by Study 11 provides a valid estimate of dIG and its variance.

6.14 Study 13 Validity and Reproducibility

Meta-Analysis Method Validity Issue: None.

Author’s Aggregation Method: Unweighted mean of gRM and sum of the natural logarithm of the one-sided p-values.

Our Aggregation Method: Weighted mean of gRM based on transformation to and from rpb and sum of the natural logarithm of the one-sided p-values.

Individual Effect Size Reproducibility: Failed due to extracting basic data from graphics.

Mean Effect Size Reproducibility: Succeeded.

Comments: Study 13 did not report the mean and standard deviation of the technique groups. Instead, the authors presented the descriptive statistics in graphical form. However, in contrast to the other studies, Study 13 reported both dIG (which they referred to as Hedges’ g) and gRM (which they referred to as d), using a valid formula to estimate the standard deviation of the latter.

Since the values we used to reproduce the effect sizes were estimated from a diagram, we expected the differences between our results and the reported results to be slightly larger than our 0.05 level; in fact, all the differences were less than 0.08.

Study 13 aggregated both the one-sided p-values and the individual gRM effect sizes. The overall mean gRM was validated by our difference criterion. The reported aggregated probability, P, was close to the value we calculated,Footnote 10 and overall we conclude that Study 13 has been successfully reproduced.

6.15 Study 14 Validity and Reproducibility

Meta-Analysis Method Validity Issue: None.

Author’s Aggregation Method: Weighted and unweighted mean of gRM and sum of the natural logarithm of the one-sided p-values.

Our Aggregation Method: Weighted mean of gRM based on transformation to and from rpb and sum of the natural logarithm of the one-sided p-values.

Individual Effect Size Reproducibility: For gIG, failed due to rounding errors; for p, succeeded.

Mean Effect Size Reproducibility: Failed due to rounding errors.

Comments: Study 14 used an interesting design that avoids some of the problems associated with repeated measures by analyzing the differences in differences (see Appendix A.1.4). Study 14 actually performed four statistical tests for each of four different variables, including comparing the pretest results for each group, comparing the posttest results for each group, comparing the posttest with the pretest values for each group, as well as comparing the mean difference of the difference between pretest and posttest results for each group (which they call the performance improvement). However, for the purpose of comparing the two treatments, the relative performance improvement is the most appropriate measure to test:

$$ ProcessImprovement = \frac{\sum_{i=1}^{n_{A}} (x_{Ai2}-x_{Ai1})}{n_{A}} - \frac{\sum_{i=1}^{n_{B}} (x_{Bi2}-x_{Bi1})}{n_{B}} \qquad (10) $$

where xAi2 is the posttest value of metric x for participant i in Group A and xAi1 is the corresponding pretest value; xBi2 and xBi1 are the equivalent values for participants in Group B, and nA and nB are the number of participants in each group. As in an independent groups analysis, the variance of the difference values is the pooled within group variance (see (12)).
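As a minimal illustration of (10), the sketch below computes the performance-improvement contrast and its pooled within-group variance from pretest and posttest scores. The data values and function name are invented for the example and are not taken from Study 14.

```python
# A minimal sketch of the difference-in-differences contrast in (10) and
# its pooled within-group variance; the data are illustrative values only.
import numpy as np

def performance_improvement(pre_a, post_a, pre_b, post_b):
    """Difference between the mean pretest-to-posttest gains of groups A and B."""
    gain_a = np.asarray(post_a) - np.asarray(pre_a)
    gain_b = np.asarray(post_b) - np.asarray(pre_b)
    effect = gain_a.mean() - gain_b.mean()
    # Pooled within-group variance of the gain scores, as in an
    # independent groups analysis of the difference values.
    n_a, n_b = len(gain_a), len(gain_b)
    pooled_var = ((n_a - 1) * gain_a.var(ddof=1) +
                  (n_b - 1) * gain_b.var(ddof=1)) / (n_a + n_b - 2)
    return effect, pooled_var

effect, pooled_var = performance_improvement(
    pre_a=[0.40, 0.35, 0.50, 0.45], post_a=[0.60, 0.55, 0.70, 0.62],
    pre_b=[0.42, 0.38, 0.48, 0.44], post_b=[0.50, 0.47, 0.58, 0.52],
)
print(effect, effect / np.sqrt(pooled_var))   # raw and standardized effect
```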

We were able to reproduce only one of the standardized mean effect sizes for the individual experiments. In addition, we could not reproduce the overall mean effect size. All the data are reported to two significant digits, and it appears that, because the raw data values are quite small, this has led to potentially large rounding errors.Footnote 11 However, we obtained t-test p-values that were similar to the reported values, and our aggregated p-values were also close.

7 Discussion

This section discusses issues arising from our systematic review and validity and reproducibility studies.

7.1 Summary of Results

We found 13 primary studies that conformed with our inclusion criteria in the sources we searched. All primary studies reported their experimental designs in sufficient detail for us to classify their individual experiments into four distinct design types: the 4-group duplicated AB/BA crossover design, the AB/BA crossover design, the independent groups design, and the pretest posttest control design.

All 13 primary studies also provided sufficient information for us to attempt to reproduce their meta-analysis results, but, in most cases, only for effect sizes comparable to independent groups designs (i.e., dIG and gIG). Of the crossover designs, only Study 13 reported the improvement effect sizes (gRM). The other crossover design studies did not provide the summary information needed to calculate the personal improvement effect size.

We identified four primary studies that exhibited validity problems sufficient to call into question the reported meta-analysis results, and another six studies where we were unsure about the validity of the meta-analysis. In those six cases, we expected the effect sizes to be slightly biased and the effect size variances to be underestimated (see Appendix A.5 for a more detailed explanation).

Of the 12 studies that reported individual experiment effect sizes, we were able to fully reproduce five primary studies. In addition, we also reproduced six of the 12 reported overall effect sizes. In the case of Study 10, which did not report individual experiment effect sizes, we were able to reproduce its overall effect size.

7.2 Experimental Designs Used by Primary Studies

Six studies used the 4-group duplicated AB/BA crossover design and four studies used the AB/BA crossover design. Study 3 used two different designs, with four experiments using a 4-group duplicated AB/BA crossover and one experiment using an independent groups design. The two remaining studies used an independent groups design and a pretest posttest control design. Thus, 12 of the 13 primary studies used repeated measures methods.

Only one family used an independent groups design for all its experiments, even though outcomes of this design are the most straightforward to analyse and meta-analyse. Using more complex designs makes the analysis of individual experiments and their subsequent meta-analysis more difficult, and only four of the 12 repeated measures studies used analysis methods appropriate for repeated measures data. Using analysis methods appropriate for independent groups studies has knock-on effects for any subsequent meta-analysis that can lead to invalid effect sizes or invalid effect size variances.

The main reason for using repeated measures designs is to be able to account for the individual skill differences among participants. However, the crossover design is not the only way to do this. In particular, the pretest posttest control group experimental design (see Appendix A.1.4) has some desirable properties. It allows the effect of individual differences to be catered for by the analysis, but avoids the problem of a technique-by-period interaction, which is a potential risk when using a crossover design. For example, there were many studies evaluating the perspective-based code reading (PBR) methods (see Ciolkowski 2009), some of which used the undefined current method as a control while others used the checklist-based reading (CBR) method as a control. Using a pretest posttest control group design, the current method would be used to establish a pretest baseline, groups would then be randomly assigned to training in CBR or PBR, and the posttest differences would be used to assess whether PBR or CBR most enhanced defect detection.

7.3 Meta-analysis Reporting

Primary study authors did not always describe their meta-analysis processes fully and consistently. Few studies reported any information related to the standard error of the average effect size or its confidence intervals. The p-values for the overall effect sizes were reported nine times, and in only three cases were the reported and calculated p-values of the same order of magnitude. Two papers (Study 7 and Study 11) reported confidence interval bounds, but we disagreed with their aggregation process.Footnote 12

We also noticed some more general reporting issues:

  • Studies often reported a name such as Hedges’ g for their standardised mean effect sizes, but did not usually specify how this was calculated. For reproducibility it is important to know both the formula for the standard deviation used to standardise the mean difference and whether or not the small sample size adjustment factor was applied.

  • Many studies used metrics that corresponded to the fraction of correct responses, which they reported on a [0, 1] scale. This can lead to rounding errors when reproducing results if descriptive statistics are only reported to two decimal places. It is preferable to represent such numbers as percentages rather than fractions; reporting percentages to two decimal places is appropriate for both means and standard deviations (see the sketch after this list).

  • Authors using a repeated measures design sometimes failed to report the number of participants in each sequence group. However, this is important for meta-analysis purposes if the individual experiments are unbalanced in any way.
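The rounding issue raised in the second bullet can be illustrated with a small numeric sketch. The summary statistics below are invented: with proportions rounded to two decimal places, the reproduced standardized effect size differs from the exact value by more than our 0.05 criterion, whereas percentages reported to two decimal places preserve it.

```python
# A minimal sketch (with invented summary statistics) of how reporting
# proportions to two decimal places can distort a reproduced effect size,
# while percentages to two decimal places preserve it.
mean_a, mean_b, pooled_sd = 0.567, 0.534, 0.114      # unrounded summary statistics

exact_d = (mean_a - mean_b) / pooled_sd                                   # ~0.289
rounded_d = (round(mean_a, 2) - round(mean_b, 2)) / round(pooled_sd, 2)   # (0.57-0.53)/0.11 ~ 0.364
percent_d = (round(100 * mean_a, 2) - round(100 * mean_b, 2)) / round(100 * pooled_sd, 2)  # ~0.289

print(exact_d, rounded_d, percent_d)
```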

We collate our observations and formulate guidelines about reporting and conduct of meta-analysis in Appendix A.6.

7.4 Meta-analysis Tools

Eleven of the 13 studies mentioned using a meta-analysis tool, and seven of those 11 studies exhibited reproducibility problems. It is difficult for researchers to assess whether they have used tools correctly unless there is some way of validating the tool outcomes. This study has shown that attempting to reproduce the results from descriptive data is a useful means of checking the output from tools. Comparing the results of analyzing the raw data as opposed to the descriptive statistics (as reported in Appendix A.5) shows that results based on descriptive statistics may be biased, but they should still provide results of the same order of magnitude, providing a sanity check on the tool outputs.

7.5 Meta-analysis Methods

In this section we discuss the implications of our study on the use of meta-analysis methods to aggregate data from families of experiments.

7.5.1 Testing for Heterogeneity

Only three primary studies (Studies 8, 13 and 14) reported the results of testing for heterogeneity among the experiments in a family. It might be expected that a family of experiments is by definition homogeneous. However, some studies, such as Study 1 and Study 3, reported families that had considerable differences between the individual experiments (see the supplementary material, Kitchenham et al. 2019b). It is certainly worth checking for heterogeneity in such cases. In the case of Study 1, our meta-analysis found a heterogeneity value of 4.01 with an associated p-value of 0.45, suggesting that heterogeneity was limited and the fixed effects analysis undertaken by the authors was appropriate. In the case of Study 3, the heterogeneity value was 8.46 with p = 0.0761. Since heterogeneity tests are not very powerful (see Higgins and Thompson 2002), we suggest that a p-value less than 0.1 should be accepted as an indication that a random effects analysis might be preferable to a fixed effects analysis.
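A heterogeneity statistic of this kind (for example, Cochran’s Q) can be computed directly from the experiment effect sizes and their variances. The sketch below is a minimal illustration; the effect sizes and variances in it are invented values, not data from Study 1 or Study 3.

```python
# A minimal sketch of Cochran's Q heterogeneity test for a family of
# experiment effect sizes; the inputs below are illustrative only.
import numpy as np
from scipy import stats

def cochran_q(effects, variances):
    """Return Cochran's Q, its degrees of freedom, and the chi-square p-value."""
    w = 1.0 / np.asarray(variances)            # inverse-variance weights
    theta = np.asarray(effects)
    theta_bar = np.sum(w * theta) / np.sum(w)  # fixed-effects pooled estimate
    q = np.sum(w * (theta - theta_bar) ** 2)
    df = len(theta) - 1
    return q, df, stats.chi2.sf(q, df)

# Hypothetical family of four experiments
q, df, p = cochran_q([0.4, 0.7, 0.2, 0.9], [0.10, 0.12, 0.09, 0.15])
print(f"Q = {q:.2f}, df = {df}, p = {p:.3f}")
```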

7.5.2 Meta-analysis Choices

One of the major problems with meta-analysis is that there are many different effect sizes and methods that can be used to aggregate results. The meta-analysis methods used in the primary studies were not always clearly reported, but most studies reported standardized mean effect sizes for the individual experiments and for the overall mean effect size. Study 8 reported the point bi-serial correlation coefficient. In addition, Studies 13 and 14 used the method of combining p-values, which is now known to have severe limitations (see Appendix A.4).

Many textbooks recommend aggregating standardised mean difference effect sizes (see, for example, Borenstein et al. 2009 or Lipsey and Wilson 2001), but this depends on obtaining the correct effect size variance.Footnote 13 This is fairly straightforward if the individual experiments have medium to large sample sizes, but is more complicated if experiments have very small sample sizes (Hedges and Olkin 1985), and it also depends on the specific experimental design, as can be seen in Madeyski and Kitchenham (2018b) and Morris and DeShon (2002).

It would seem to be easier to convert to rpb for aggregation, as we did in our reproducibility assessment. This procedure avoids the need to obtain estimates of the standardized effect size variance. However, it must be recognised that the problem with the standardised effect size and its variance is that, for small sample sizes, the estimate of the variance which is used to calculate the standardised effect size is likely to be inaccurate. Converting to rpb does not overcome this problem since the point bi-serial correlation is itself calculated as the ratio of two variance estimates.
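As one illustration of this kind of conversion, the sketch below converts d-type effect sizes to rpb, pools them via Fisher’s z-transformation, and converts the pooled correlation back to a d-type effect size. It assumes equal group sizes for the back-conversion and uses n − 3 weights for z, so it illustrates the general idea rather than the exact procedure we used in our reproducibility assessment.

```python
# A minimal sketch (not the exact procedure used in our assessment) of
# aggregating d-type effect sizes by converting to point-biserial
# correlations, pooling via Fisher's z, and converting back.
import numpy as np

def d_to_r(d, n1, n2):
    """Convert a standardized mean difference to a point-biserial correlation."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / np.sqrt(d ** 2 + a)

def r_to_d(r):
    """Convert a correlation back to d, assuming equal group sizes (a = 4)."""
    return 2.0 * r / np.sqrt(1.0 - r ** 2)

def pooled_d(ds, n1s, n2s):
    ds, n1s, n2s = map(np.asarray, (ds, n1s, n2s))
    z = np.arctanh(d_to_r(ds, n1s, n2s))   # Fisher z-transform of each r_pb
    w = n1s + n2s - 3                      # approximate inverse-variance weights for z
    r_bar = np.tanh(np.sum(w * z) / np.sum(w))
    return r_to_d(r_bar)

# Hypothetical family of three experiments
print(pooled_d([0.5, 0.3, 0.8], [12, 15, 10], [12, 14, 11]))
```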

In practice, as proposed by Santos et al. (2018), an option for homogeneous families (i.e., families that use the same materials and the same output measures) would be to analyze the data from the family as one large experiment, using what they call an Independent Participant Data (IPD) stratified method. This analyzes the data from all the individual experiments together as a single data set, using the individual experiment identifier as a blocking factor. It would lead to an estimate of the overall mean difference and of the residual variance based on all the participants, so an estimate of the effect size of the family and its standard error would be more likely to be reliable.
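A minimal sketch of such a stratified analysis, assuming the family’s raw data are available in long format with one row per participant, is shown below. The column names and values are hypothetical and are not taken from any primary study.

```python
# A minimal sketch of the stratified "one large experiment" analysis,
# with the experiment identifier included as a blocking factor.
# Column names and data values are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "score":      [62, 70, 55, 68, 74, 60, 58, 66, 72, 64, 69, 61],
    "treatment":  ["A", "B"] * 6,
    "experiment": ["E1"] * 4 + ["E2"] * 4 + ["E3"] * 4,
})

# Treatment effect estimated across the whole family, blocking on experiment.
model = smf.ols("score ~ C(treatment) + C(experiment)", data=data).fit()
print(model.params["C(treatment)[T.B]"])   # overall mean difference (B - A)
print(model.bse["C(treatment)[T.B]"])      # its standard error
```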

It is also possible that using non-parametric effect sizes would avoid some of the problems inherent in using parametric effect sizes. However, although it is possible to calculate a number of different non-parametric effect sizes, it is not clear which non-parametric effect sizes should be used, nor how to aggregate results from individual experiments into an overall effect size.

7.6 Limitations

It should be noted that all primary studies using crossover designs (except Study 7 and Study 11) based their analysis on the pooled within treatment standard deviation, rather than the pooled within cell standard deviation. Both variances are calculated using a formula similar to that shown in (12), but the pooled within treatment variance is based on pooling the variances of the observations in each of the two treatment groups. In contrast, the pooled within cell standard deviation is based on pooling the variances calculated from the observations in each of the experimental conditions shown in Table 8 for AB/BA crossover designs and Table 9 for 4-group crossover designs. This means the within treatment standard deviation will be biased (in fact, larger than it should be) unless the system and period effects are negligible. Furthermore, any bias in the standard deviation will impact the estimate of the standardized effect size, making it smaller than it should be.

Table 8 AB/BA crossover design
Table 9 Duplicated AB/BA crossover design
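To make the distinction concrete, the sketch below (with invented observations) pools the variances within treatments and within cells for an AB/BA crossover. When period effects are present, the within-treatment pooling mixes cells with different means and therefore gives a larger standard deviation.

```python
# A minimal sketch contrasting the pooled within-treatment standard
# deviation with the pooled within-cell standard deviation for an
# AB/BA crossover; each cell is a (sequence group, period) combination.
# The observations are illustrative values only.
import numpy as np

def pooled_sd(groups):
    """Pool sample variances across groups, weighting by degrees of freedom."""
    ss = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return np.sqrt(ss / df)

# Hypothetical AB/BA data: sequence AB receives treatment A in period 1
# and B in period 2; sequence BA is the reverse.
ab_p1, ab_p2 = [60, 64, 58, 62], [72, 75, 70, 74]   # sequence AB
ba_p1, ba_p2 = [68, 71, 66, 70], [55, 59, 54, 58]   # sequence BA

# Within-treatment pooling ignores periods: all A observations together,
# all B observations together, so any period effect inflates the variance.
sd_treatment = pooled_sd([ab_p1 + ba_p2, ab_p2 + ba_p1])

# Within-cell pooling keeps each sequence-by-period cell separate.
sd_cell = pooled_sd([ab_p1, ab_p2, ba_p1, ba_p2])

print(sd_treatment, sd_cell)   # sd_treatment > sd_cell when period effects exist
```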

We claimed to have found a reproducibility problem if the difference between the effect size estimates reported by the authors and the ones we calculated was greater than 0.05. The choice of 0.05 was based on convenience and can be criticized. In practice, the value we chose seemed to work reasonably well as a means of drawing our attention to possible reproducibility problems. However, it incorrectly highlighted some differences that we believed to be due to rounding errors, and we also observed two examples of accidental correctness. So, it was critical to review the actual meta-analysis process reported by the authors, as well as the difference between reported and calculated effect sizes, to confirm whether there were validity or reproducibility problems.

8 Conclusions and Contributions

Our systematic review identified 13 primary studies from five high quality journals. In seven cases we identified validity or reproducibility problems. Even where we reproduced the average standardized effect size, in four cases we were unsure of the accuracy of the statistical tests of significance and p-values. We conclude that meta-analysis is not well understood by software engineering researchers.

Our systematic review process, reported in Section 3, has ensured that the problems we identified were found in papers published in high quality software engineering journals with stringent peer review processes. It is, therefore, important to report such problems and to provide guidelines and procedures that help to avoid them in the future. The answers to RQ1 and RQ2, reported in Section 4, provide traceability to the individual primary studies and contextual details of the experimental methods used to analyse each experiment. This confirms that we have not been biased in our selection of primary studies. The answers to RQ3 and RQ4 provide traceability to the individual meta-analysis problems and confirmation that most problems are found in more than one primary study, so are more than just one-off mistakes.

The major contributions of our study arise from our efforts to address the meta-analysis problems found by the validity and reproducibility assessments reported in Sections 5 and 6. They are:

  1. To provide evidence that meta-analysis methods are not well-understood by software engineering researchers (see Sections 5 and 6).

  2. To identify specific meta-analysis validity and reproducibility errors (see Sections 5 and 6).

  3. To provide guidelines for reporting and undertaking meta-analysis that could help to avoid meta-analysis errors (see Appendix A.6).

  4. To describe the model underlying the 4-group crossover experimental design (see Appendix A.1.3), since although the design is popular in software engineering research, it has not previously been specified in any detail.

  5. To provide a worked example of analyzing and meta-analyzing results from a family of studies that used a 4-group crossover design (see Appendix A.5).

Although we have provided meta-analysis reporting and conduct guidelines, it must be recognized that we lack the simulation studies needed to address questions such as:

  • Whether there is an optimum (or minimum viable) number of experiments in a family.

  • Whether the conversion to rpb is preferable to aggregating gIG directly, given the small sample sizes and numbers of independent experiments in SE families.

  • Whether we should use non-parametric methods for analysis and meta-analysis.

We are currently undertaking research addressing these issues.

Finally, whenever possible, we would ask researchers to make their data sets publicly available. Such data sets allow reviewers to check the validity of results before publication, provide a valuable resource for novice researchers, and allow data to be re-analyzed if new analysis methods become available.