Meta-analysis for families of experiments in software engineering: a systematic review and reproducibility and validity assessment
Abstract
Context
Previous studies have raised concerns about the analysis and meta-analysis of crossover experiments, and we were aware of several families of experiments that used crossover designs and meta-analysis.
Objective
To identify families of experiments that used meta-analysis, to investigate their methods for effect size construction and aggregation, and to assess the reproducibility and validity of their results.
Method
We performed a systematic review (SR) of papers reporting families of experiments in high quality software engineering journals that attempted to apply meta-analysis. We attempted to reproduce the reported meta-analysis results using the descriptive statistics and also investigated the validity of the meta-analysis process.
Results
Out of the 13 identified primary studies, we reproduced only five. Seven studies could not be reproduced, and one study, although correctly analyzed, could not be reproduced due to rounding errors. Where we were unable to reproduce results, we provide revised meta-analysis results. To support the reproducibility of the analyses presented in our paper, it is complemented by the reproducer R package.
Conclusions
Meta-analysis is not well understood by software engineering researchers. To support novice researchers, we present recommendations for reporting and meta-analyzing families of experiments and a detailed example of how to analyze a family of 4-group crossover experiments.
Keywords
Evidence-based software engineering · Systematic review · Meta-analysis · Effect size · Families of experiments · Reproducible research
1 Introduction
Vegas et al. (2016) reported that crossover designs are a popular design for software engineering experiments. In their review, they identified 82 papers, of which 33 (i.e., 40.2%) used crossover designs. Furthermore, those 82 papers reported 124 experiments, of which 68 (i.e., 54.8%) used crossover designs. However, they reported that “crossover designs are often not properly designed and/or analysed, limiting the validity of the results”. They also warned against the use of meta-analysis in the context of crossover-style experiments.
As a result of that study, two of us undertook a detailed study of parametric effect sizes from AB/BA crossover studies (see Madeyski and Kitchenham 2018a, b and Kitchenham et al. 2018). We identified the need to consider two mean difference effect sizes and reported the small sample effect size variances and their normal approximations.
As we were undertaking this systematic review,^{1} we found that Santos et al. (2018) had already performed a mapping study of families of experiments. They reported that although the most favoured means of aggregating results was Narrative synthesis (used by 18 papers), Aggregated Data meta-analysis (by which they mean aggregation of experiment effect sizes) was used by 15 studies.
In this context, the goals of our study were to:

Identify the effect sizes used and how they were calculated and aggregated.

Using the descriptive statistics reported in the study, attempt to reproduce the reported results.^{2}

In the event that we were unable to reproduce the results, investigate the underlying reason for the lack of reproducibility.
We concentrated on families of experiments as our form of primary studies. We did this (rather than looking at papers that report a meta-analysis after performing a systematic review) because papers reporting a family of experiments are likely to have published sufficient details about the individual studies and their meta-analysis process for us to attempt to validate and reproduce their effect size calculations and meta-analysis. In addition, Santos’s mapping study confirmed the popularity of families of experiments, and emphasized that more families needed to aggregate their results. These two factors indicate the importance of adopting valid meta-analysis processes in the context of families of experiments. Nonetheless, our reproducibility analysis method, based on aggregating descriptive statistics, is the same as would be used to meta-analyse data from experiments found by a systematic review. Thus, the results from this study are likely to be of value for any meta-analysis of software engineering data.
We concentrated on high quality journals not only because such papers usually present reasonably complete descriptions of their results and methods, but also because they attract papers from experienced researchers, which are reviewed by other experienced researchers. Thus, readers of papers in such journals expect the published results to be correct. Invalid results in such papers are therefore likely to have a more serious impact than mistakes in papers published in less prestigious journals or conferences. For example, practitioners may base decisions on invalid outcomes, and novice researchers may adopt incorrect methods.
We present our research questions in Section 2 and our systematic review methods in Section 3. A summary of the primary studies included in our review, a discussion of the validity of the meta-analysis methods used in each study, and our reproducibility assessment are in Sections 4, 5 and 6, respectively. We discuss the results of our study in Section 7 and present the contributions of this paper and our conclusions in Section 8.
We also include an Appendix that reports details of our statistical analysis and analysis results not needed to support our main arguments. The Appendix also discusses reproducibility aspects of our study.
2 Research Questions
 RQ1:
Which studies that undertook families of experiments also undertook effect size meta-analysis?
 RQ2:
What are the characteristics of these studies in terms of methods used for experimental design and analysis?
 RQ3:
What meta-analysis methods were used and were they valid?
 RQ4:
If the meta-analysis methods were valid, can the results be successfully reproduced?
RQ1, RQ2, and the reporting aspects of RQ3 could be addressed directly from information reported in each primary study. To address the validity aspect of RQ3 and RQ4, we reviewed the meta-analysis processes described by each study and then attempted to reproduce first the effect sizes and then the meta-analysis in each primary study. Finally, we compared our results with the reported results. We assumed that it would be possible to conduct a meta-analysis based on the descriptive data and the effect size chosen by the primary study authors, since this is the normal method of performing meta-analysis.
3 Systematic Review Methods
We performed our systematic review (SR) according to the guidelines proposed by Kitchenham et al. (2015). The processes we adopted are specified in the following sections.
3.1 Protocol Development
Our protocol defines the procedures we intended to use for the systematic review, including the search process, the primary study selection process, the data extraction process and the data analysis process. It also identified the main tasks of all the coauthors. The protocol was initially drafted by the first author and reviewed by all the authors. After trialling the specified processes, the final version of the protocol was agreed by all the authors and registered as report W08/2017/P045 at Wroclaw University of Science and Technology. The following sections are based on the processes defined in the protocol. Any divergences report our actual processes, as opposed to the planned processes described in the protocol. The major deviation between the protocol and the results reported in this paper is that we had originally assumed it would be appropriate to concentrate on reproducibility, but as our investigation progressed we realized that we needed to consider the reasons for lack of reproducibility, that is, to consider in more detail the validity of the meta-analysis process. Furthermore, validity is the key issue, because it is not useful to reproduce an invalid result.
3.2 Search Strategy
In order to address our research questions, we needed to identify papers that reported the use of meta-analysis to aggregate individual studies, reported the results of the individual studies in detail, and were published in high quality journals. We therefore restricted our search to the following five journals:

IEEE Transactions on Software Engineering (TSE).

Empirical Software Engineering (EMSE).

Journal of Systems and Software (JSS).

Information and Software Technology (IST).

ACM Transactions on Software Engineering and Methodology (TOSEM).
3.3 SR Inclusions and Exclusions
In this section we present our inclusion and exclusion criteria. Details of the search and selection process, the validation of the search and selection process, and the data extraction process can be found in the supplementary material (Kitchenham et al. 2019b).
Our inclusion criteria were as follows:
 1.
The paper should report a family of three or more experiments, because this is the criterion adopted by Santos et al. (2018) and there is more opportunity to detect heterogeneity with three or more studies.
 2.
The experiments reported in the paper should relate to human-centric experiments or quasi-experiments that compare SE methods or procedures rather than report observational (correlation) studies with no clear comparisons.^{3}
 3.
The paper should have been published by one of the five journals identified by our search strategy, see Section 3.2.
 4.
The paper should use some form of meta-analysis to aggregate results from the individual studies using standardized effect sizes, i.e., the standardized mean difference or the point-biserial correlation coefficient (r_{pb}).^{4} These effect sizes are commonly used in software engineering meta-analyses.
Our exclusion criteria were as follows:
 1.
The paper was an editorial.^{5}
 2.
The paper was published before 1999, when Basili et al. (1999) first discussed families of experiments.
3.4 Data Analysis
The results extracted from each primary study allowed us to answer questions RQ1, RQ2 and the methodology element of RQ3. To address the validity element of RQ3 and RQ4 for each primary study, we carefully reviewed the meta-analysis methods reported by the study authors and attempted to reproduce the effect size values and meta-analysis results using the reported descriptive data.
Many of the studies reported multiple metrics and hypothesis tests for each experiment. In all cases, we first attempted to reproduce the effect sizes reported by the authors and then the meta-analysis. We analyzed only the first outcome metric, because we assumed that if the individual effect sizes and the results of meta-analyzing them were reproduced, this would confirm whether or not the meta-analysis was reproducible without checking the results for every metric. Our assumption (that in our case it is enough to analyze the first outcome metric) was based on the fact that none of the primary studies reported using different methods to calculate effect sizes or perform meta-analysis for different outcome metrics. In addition, the tables of descriptive statistics and effect sizes were similar for all outcome metrics. There is only one situation where there might be a difference between outcomes for different metrics: if the authors did not maintain the direction as well as the magnitude of the effect size. Then, if one metric had effect sizes with different directions and another did not, we would agree with the authors in the case where all directions were the same and disagree where they were not. This happened in the case of Study 9 (see Section 6.11). We used an absolute difference of 0.05 between our effect size values and the reported values as our criterion for successful reproduction, for the following reasons:
 1.
A relative value would unfairly penalize small effect sizes. For example, if a study reported an effect size of 0.01 and we reported an effect size of 0.02, we would have a relative difference of 50% for a difference that could be the result of rounding applied to reported mean values.
 2.
Most studies reported descriptive data on metrics, in the range 0 to 1, to two decimal places, so we thought an absolute value of 0.05 might be sufficiently large to allow for differences due to rounding, because our reproducibility statistics were derived from the reported means and variances.
 3.
Most studies did not state explicitly whether or not they applied the small sample size adjustment to their standardized effect sizes. For example, for a medium effect size of 0.5 and a sample size of 23 (the median experiment size), applying the small sample adjustment reduces the standardized effect size to 0.48.
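The arithmetic of this example can be checked directly. The following minimal sketch (in Python, used here purely for illustration; our actual analyses used R) applies the approximate correction factor J = 1 − 3/(4·df − 1) from Hedges and Olkin (1985), assuming df = N − 2 for an independent groups comparison:

```python
def hedges_j(df):
    # Approximate small sample correction factor (Hedges and Olkin 1985):
    # J = 1 - 3 / (4 * df - 1)
    return 1.0 - 3.0 / (4.0 * df - 1.0)

d = 0.5            # a medium standardized effect size
n_total = 23       # the median experiment size in our primary studies
df = n_total - 2   # degrees of freedom, assuming two independent groups
g = hedges_j(df) * d
print(round(g, 2))  # 0.48
```

The correction shrinks the effect size towards zero, and its impact diminishes rapidly as the sample size grows.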
4 An Overview of the Primary Studies (RQ1 and RQ2)
In this section, we address RQ1 and RQ2 and present an overview of the primary studies included in our systematic review.
4.1 Studies Reporting Meta-analysis of Families of Experiments (RQ1)
Primary studies
ID  Year  Citation  Exps  Participants per group  Total participants
Study 11  2016  Morales et al. (2016)  3  230,25,13  69
Study 5  2016  Fernández-Sáez et al. (2016)  4  11,16,32,22  81
Study 8  2015  Gonzalez-Huerta et al. (2015)  4  28,16,36,12  92
Study 9  2015  Fernández-Sáez et al. (2015)  3  40,51,78  169
Study 2  2014  Scanniello et al. (2014)  4  24,22,22,18  86
Study 1  2013  Abrahao et al. (2013)  5  24,24,28,20,16  112
Study 4  2013  Fernandez et al. (2013)  3  12,32,20  64
Study 6  2013  Hadar et al. (2013)  3  19,31,39  79
Study 7  2012  Teruel et al. (2012)  3  30,42,9  81
Study 10  2011  Cruz-Lemus et al. (2011)  3  69,25,30  124
Study 3  2009  Cruz-Lemus et al. (2009)  5  55,178,14,13,24  284
Study 14  2004  Pfahl et al. (2004)  3  9,10,10  34
Study 13  2001  Laitenberger et al. (2001)  3  2,12,13  29

Six studies investigated the impact of different UML documentation options (see rows where the techniques are labelled DO to signify Documentation Options).

Four studies investigated procedures in the context of maintainability.

Four studies investigated requirements issues, three compared specification languages and one investigated proposals for verifying non-functional requirements.
Primary study data
ID  Main goal  Techniques  Task or activity
S11  Compare two requirements specification languages for teleo-reactive systems  i* v. TRiStar  Requirements Understandability
S5  Assess level of detail (LoD) of UML diagrams needed for maintenance  DO: LoD Low v. LoD High  Maintainability
S8  Assess the QuaDAI strategy for verifying non-functional requirements  ATAM v. QuaDAI  Non-Functional Requirements Achievement
S9  Compare forward-designed with reverse-engineered UML diagrams for maintenance  DO: UML diagrams FD v. RE  Maintainability
S2  Assess UML requirement diagrams for code maintainability  DO: Source code alone v. Source code with UML analysis model  Maintainability
S1  Assess UML sequence diagrams (SDs) impact on understandability  DO: Without SDs v. with SDs  Requirements Understandability
S4  Assess two web usability assessment methods  Heuristic Evaluation v. Web Usability Evaluation Process  Evaluation of Web site usability
S6  Compare understandability of requirements expressed in different visual languages  Use Cases v. Tropos  Requirements Understandability
S7  Compare two requirements languages  i* v. CSRML  Requirements Understandability
S10  Assess if stereotypes improve UML sequence diagram comprehension  DO: Without stereotypes v. with stereotypes  Maintainability
S3  Assess if composite state diagrams (CSDs) help maintenance  DO: without CSDs v. with CSDs  Model Understandability
S14  Compare two SE automated training programs  COCOMO-based training v. Systems Dynamics training  SE knowledge test
S13  Compare defect detection of perspective-based reading with checklist-based reading  CBR v. PBR  Defect detection
4.2 Experimental Methods Used by the Primary Studies (RQ2)
Primary study experiment data
ID  Design  Tests used  Main hypothesis tests for each experiment  Valid analysis
Study 1  4-group crossover  NP  Wilcoxon (paired analysis)  Partly
Study 2  4-group crossover  NP or P  Unpaired t-test or Mann-Whitney-Wilcoxon  No
Study 3  AB/BA crossover  P  ANOVA 2 × 2 factorial  No
Study 3  Independent groups (1)  P  One-way ANOVA  Yes
Study 4  4-group crossover  NP or P  One-tailed t-test for independent groups or Mann-Whitney  No
Study 5  4-group crossover  NP  Wilcoxon for paired samples  Partly
Study 6  4-group crossover  NP  Mann-Whitney  No
Study 7  AB/BA crossover  P  ANOVA 2 × 2 factorial  No
Study 8  4-group crossover  NP or P  One-tailed t-test for independent samples or Mann-Whitney  No
Study 9  Independent groups  NP or P  ANOVA or Mann-Whitney  Yes
Study 10  4-group crossover  NP  Kruskal-Wallis  No
Study 11  AB/BA crossover  P  ANOVA 2 × 2 factorial  No
Study 13  AB/BA crossover  NP and P  Matched pairs t-test and Wilcoxon signed ranks test  Partly
Study 14  Pretest and posttest control  NP and P  One-way paired t-test and Mann-Whitney  Yes
 1.
An effect size that measures the personal improvement (of an individual or team) performing a task using one method compared with performing the same task^{7} using another method. We refer to this as the repeated measures standardized effect size, δ_{RM}, with an estimate d_{RM}.
 2.
An effect size that is equivalent to the standardized mean effect size obtained from an independent groups design (also known as a between participants design). We refer to this independent groups effect size as δ_{IG}, with an estimate d_{IG}.
For balanced crossovers (where each sequence group has the same number of participants), effect sizes are calculated as follows (Morris and DeShon 2002; Madeyski and Kitchenham 2018b):
In addition, there is a relationship between the two standard deviations (Madeyski and Kitchenham 2018b):
For small sample sizes, Hedges and Olkin (1985) recommend applying a correction to d_{RM} and d_{IG}. We refer to the small sample size corrected effect sizes as g_{RM} and g_{IG}, respectively. We prefer not to give these terms generic labels, such as Hedges’ g, because, as Cumming (2012) points out (see page 295), meta-analysis terminology is inconsistent. In terms of names given to standardized effect sizes, d_{IG} is referred to as d by Borenstein et al. (2009) and as g by Hedges and Olkin (1985), while g_{IG} is referred to as g by Borenstein et al. (2009) and d by Hedges and Olkin (1985). In our primary studies, most papers used the term Hedges’ g and one used Cohen’s d, but the papers did not specify whether or not they used the small sample size adjustment. Only Study 13 explicitly defined Hedges’ g to be what we refer to as d_{IG} and used the term d for what we refer to as g_{RM}.
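To illustrate the distinction between the two effect sizes (this is not the paper’s exact derivation, which is given in Madeyski and Kitchenham 2018b), a common conversion from Morris and DeShon (2002) links them through the within-participant correlation r, since the standard deviation of the paired differences is s_{IG}·sqrt(2(1 − r)). A hypothetical Python sketch:

```python
import math

def d_ig(mean_diff, sd_ig):
    # Independent groups standardized effect size: mean difference / sd_IG
    return mean_diff / sd_ig

def d_rm(mean_diff, sd_ig, r):
    # Repeated measures effect size, standardizing by the sd of the
    # paired differences: sd_diff = sd_IG * sqrt(2 * (1 - r))
    # (Morris and DeShon 2002)
    return mean_diff / (sd_ig * math.sqrt(2.0 * (1.0 - r)))

# hypothetical values, purely for illustration
print(round(d_ig(5.0, 10.0), 2))       # 0.5
print(round(d_rm(5.0, 10.0, 0.5), 2))  # 0.5 (the two coincide when r = 0.5)
```

The sketch makes the practical point explicit: the two effect sizes are numerically different whenever r ≠ 0.5, which is why they must not be mixed in one meta-analysis.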
In Table 3, we also report whether the data were analyzed using parametric (P) or nonparametric (NP) tests for the individual experiments. Four of the studies used nonparametric tests or parametric tests depending on the outcome of tests for normality. Study 13 and Study 14 performed both nonparametric and parametric tests, but only reported the results of the parametric tests since the outcomes of both were consistent. It is important to note that many of the crossover studies did not analyze their data correctly, using independent groups tests rather than repeated measures tests. We annotated three studies as partly valid because they used tests that catered for repeated measures, but may have delivered slightly biased results if time period effects or material effects were significant (see Appendix A.1.3).
5 The Validity of Meta-analysis Procedures Used by the Primary Studies (RQ3)
Meta-analysis methods
ID  Effect size name  Effect size aggregated  Aggregation tool  Heterogeneity tested
Study 1  Hedges’ g  r_pb  Meta-Analysis v2  No
Study 2  Hedges’ g  r_pb or d_IG  Meta-Analysis v2  No
Study 3  Hedges’ g  r_pb or d_IG  Meta-Analysis v2  No
Study 4  Hedges’ g  r_pb  Meta-Analysis v2  No
Study 5  Hedges’ g  r_pb or d_IG  Meta-Analysis v2  No
Study 6  Cohen’s d  d_IG  Meta 5.3  No
Study 7  Hedges’ g  d_IG  Meta-Analysis v2  No
Study 8  r_pb  r_pb  Meta 5.3  Yes
Study 9  Hedges’ g  r_pb or d_IG  Meta-Analysis v2  No
Study 10  Hedges’ g  r_pb or d_IG  Meta-Analysis v2  No
Study 11  Hedges’ g  d_IG  Meta-Analysis v2  No
Study 13  g_RM  g_RM  None  Yes
Study 13  p  p  None  Yes
Study 14  γ  γ  None  Yes
Study 14  p  p  None  Yes
5.1 Standard Procedures for Meta-analysis
The usual method for aggregating standardized mean effect sizes, such as Hedges’ g, is to construct a weighted average using the inverse of the effect size variance as the weight (see, for example, Hedges and Olkin 1985; Lipsey and Wilson 2001; Borenstein et al. 2009):
\(\overline {ES} = {\sum }_{i} w_{i} ES_{i} / {\sum }_{i} w_{i}, \quad w_{i} = 1/var(ES)_{i}\)  (5)
Equation (5) assumes a fixed effects meta-analysis, but a random effects analysis is also usually based on the effect size variance. In the case of a fixed effects analysis, the variance of \(\overline {ES}\) is obtained from the equation:
\(var(\overline {ES}) = 1/{\sum }_{i} w_{i}\)  (6)
Equation (5) is also used for aggregating the unstandardized effect size (UES), although in this case var(UES)_{i} is the square of the standard error of the mean difference.
There are two main meta-analysis models: a fixed effects model and a random effects model. Equations (5) and (6) are appropriate for a fixed effects model, when we assume that the data from all the individual experiments arise from the same population.
A random effects model assumes that data from individual experiments arise from different populations, each of which has its own population mean and variance. A random effects analysis estimates the excess variance due to the different populations by comparing the variance between experiment means with the within experiment variance. In practice, a random effects analysis replaces var(ES)_{i} with a larger revised variance that includes both the within experiment variance and the between experiment variance. In the case of a family of experiments, we would expect a priori that the experiments were closely controlled replications and that a fixed effects analysis would be appropriate. However, a random effects analysis gives the same results as a fixed effects analysis in the event that the effect sizes are homogeneous, so we would recommend defaulting to a random effects method. Such an approach would address the common issue, also mentioned by Santos et al. (2018), of using fixed effects models when, due to the heterogeneity of effects, random effects models would be preferred.
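As a concrete sketch of the fixed effects calculation in (5) and (6), using hypothetical effect sizes and variances from a family of three experiments (Python, for illustration only):

```python
def fixed_effects_meta(effect_sizes, variances):
    # Inverse-variance weighted mean, per (5) and (6):
    #   ES_bar = sum(w_i * ES_i) / sum(w_i), with w_i = 1 / var(ES_i)
    #   var(ES_bar) = 1 / sum(w_i)
    weights = [1.0 / v for v in variances]
    es_bar = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
    return es_bar, 1.0 / sum(weights)

# hypothetical family of three experiments
es_bar, var_bar = fixed_effects_meta([0.4, 0.6, 0.5], [0.04, 0.08, 0.05])
print(round(es_bar, 3), round(var_bar, 3))  # 0.478 0.017
```

Note that the experiment with the smallest variance (here the first) dominates the weighted mean, which is the intended behaviour of inverse-variance weighting.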
5.2 Meta-analysis Methods Used by the Primary Studies
None of the primary studies aggregated the unstandardized effect size. However, twelve studies reported effect sizes that they referred to as either Hedges’ g or a related standardized effect size (Cohen’s d, γ and d). Apart from Study 13, none of the papers that used crossover-style experiments mentioned the possibility of two different effect sizes, so we assume that they all attempted to aggregate the effect size equivalent to an independent groups study (i.e., d_{IG} or g_{IG}).
Study 1 and Study 4 both reported calculating Hedges’ g, but their descriptions did not mention applying the small sample size adjustment, so we assume they reported what we refer to as d_{IG}. They also reported converting to a correlation-based effect size (usually referred to as the point-biserial correlation, r_{pb}; Rosenthal 1991). This can easily be calculated from the standardized effect size using the following formula (see Borenstein et al. 2009; Lipsey and Wilson 2001):
\(r_{pb} = d_{IG}/\sqrt {d_{IG}^{2}+a}, \quad a=(n_{1}+n_{2})^{2}/(n_{1}n_{2})\)  (7)
In principle, transformation to r_{pb} is a valid analysis method, since it avoids the probable bias in calculating the variance of d_{IG} for small sample sizes. For this reason, we used it as the basis of our reproducibility analysis, and we report the method in detail in Appendix A.2.
An important implication of using the normalizing transformation of r_{pb} is that the variance of r_{pb} is var(r_{i}) = 1/(n_{i} − 3) and, using (6), the variance of the weighted average effect size is given by (8):
\(var(\overline {ES}) = 1/{\sum }_{i}(n_{i}-3)\)  (8)
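The transformation chain can be sketched as follows (illustrative Python; the conversion constant a and the variance of the transformed correlation follow Borenstein et al. 2009 and Fisher 1921, respectively, and the input values are hypothetical):

```python
import math

def d_to_r_pb(d, n1, n2):
    # Point-biserial correlation from a standardized mean difference:
    # r = d / sqrt(d^2 + a), with a = (n1 + n2)^2 / (n1 * n2)
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)

def fisher_z(r):
    # Fisher's normalizing transformation; var(z) = 1 / (n - 3)
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

r = d_to_r_pb(0.5, 12, 12)  # equal groups, so a = 4
z = fisher_z(r)
var_z = 1.0 / (24 - 3)
print(round(r, 3), round(z, 3))  # 0.243 0.247
```

The key point is that the variance of the transformed effect size depends only on the sample size, not on the effect size estimate itself.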
Four studies (i.e., Study 2, Study 5, Study 9 and Study 10) reported an effect size that they referred to as Hedges’ g. They also reported an aggregation method that, like Study 1 and Study 4, used (8), and they also made the same mistake with their description of the weight. However, they did not explicitly confirm that they transformed their effect size to a correlation, so we cannot be sure whether these studies aggregated the standardized effect sizes directly but mistakenly assumed that the variance of each effect size was 1/(n_{i} − 3), or omitted to mention that they used the r_{pb} transformation. Of these four studies, only Study 2 used an analysis that considered repeated values, so the other studies might have used a variance based on 1/(2n_{i} − 3).
Study 3, Study 7 and Study 11 all made a mistake with their basic meta-analysis. They all used an AB/BA crossover design (although Study 3 also used an independent groups design for one of its 5 experiments). In each crossover study, they estimated a standardized effect size for each time period separately. So, for each AB/BA experiment, they calculated two different estimates of d_{IG}, one for time period 1 and the other for time period 2. It is incorrect to aggregate such effect sizes because the same participants contributed to each estimate of d_{IG} and, hence, the two effect sizes from the same experiment were not independent. This violates one of the basic assumptions of meta-analysis, namely that each effect size comes from an independent experiment. The effect of this error is to increase the degrees of freedom attributed to tests of significance associated with the average effect size.
Study 6 reported using Cohen’s d and aggregating their values using a weighted mean and the META 5.3 tool. They referenced Hedges and Olkin (1985), which did not report methods for meta-analysing crossover designs, so we assume that the authors aggregated d_{IG} but do not know how they calculated their weights.
Study 8 reported and aggregated r_{pb}, but used a different method to that used by Study 1 and Study 4. We describe the method they used in Appendix A.3. From the viewpoint of validity, a critical issue is that they derived r_{pb} from the one-sided p-value of their statistical tests. For each experiment in the family and for each metric, they used either the Mann-Whitney-Wilcoxon (MWW) test or the t-test, depending on the outcome of a normality test. However, Study 8 used statistical tests appropriate for independent groups studies, although the family used 4-group crossover experiments, so the resulting p-values are likely to be invalid. Nevertheless, the study authors were attempting to use a meta-analysis process that would allow them to aggregate their parametric and nonparametric results. The authors reported the heterogeneity of their experiments but, as pointed out in Appendix A.3, the heterogeneity was probably overestimated.
Study 13 reported a standardized effect size based on team improvement, which we refer to as g_{RM}. The authors also reported d_{IG} for each experiment, which they referred to as Hedges’ g, but they did not aggregate it. They estimated the variance of d_{RM} but did not cite the origin of the formula they used. They used Hedges’ Q statistic (see (19)) to test for heterogeneity. The test failed to reject the null hypothesis (i.e., their p-value was greater than 0.05), and they reported what appears to be the unweighted mean of the effect sizes.
Study 14 referred to their effect size as γ for 4 separate hypotheses. However, the hypothesis we believe to be most relevant to investigating the difference between the techniques was based on the difference between the personal improvement observed among participants in one treatment group and the personal improvement among participants in the other group. This is a difference of differences analysis, for which it is correct to use the independent groups t-test. However, γ cannot be easily equated to either d_{RM} or d_{IG}. For purposes of analysis, the difference data can be analysed as an independent groups study, but for purposes of interpretation, the mean difference measures the average individual improvement after the effect of skill differences is removed. They report both the weighted and unweighted overall mean. As explained in Appendix A.1.1, the weight was based on the inverse of the variance of γ and was calculated using the formula for the moderate sample-size approximation of the variance of g_{IG}. They also tested for heterogeneity using the Q statistic proposed by Hedges and Olkin (1985), which depends on the effect size variance.
Both Study 13 and Study 14 also aggregated one-sided p-values, as described in Appendix A.4, in order to test the null hypothesis of no significant difference between techniques.
The majority of primary study authors used the Meta-Analysis v2 tool (BioStat 2006) for aggregation, although Meta-Analysis v2 does not support aggregating results from crossover design studies.
As mentioned by Santos et al. (2018), although many researchers used nonparametric methods for at least some of their individual experiments (see Table 3), they subsequently used parametric effect sizes. This is somewhat inconsistent but not necessarily invalid. It would certainly be inappropriate for studies that used both parametric and nonparametric methods to aggregate nonparametric effect sizes and parametric effect sizes in the same metaanalysis, so some consistent effect size metric is necessary.
The advantage of using the standardized mean difference is that the central limit theorem confirms that mean differences are normal irrespective of the underlying distribution of the data. The problem with standardized effect sizes is that the estimate of the variance of the data within each experiment, which is used to calculate the standardized effect size, may be biased for small sample sizes. However, the variance of the mean effect sizes for each experiment calculated as part of any random effects meta-analysis puts an upper limit on the variance of the overall mean effect size. In addition, aggregating nonparametric effect sizes is currently not feasible: there are no well-defined guidelines identifying which nonparametric effect sizes to use, nor how they might be aggregated.
Only three of the primary studies considered heterogeneity. Study 8 and Study 13 reported non-significant heterogeneity. Study 14 reported significant heterogeneity and reported both a weighted and an unweighted mean. Only Study 2 explicitly mentioned using a fixed effects meta-analysis. Since the other studies made no mention of heterogeneity or of using any specific meta-analysis model, we assume that they also undertook fixed effects meta-analyses.
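For reference, the heterogeneity test used by these studies is based on Hedges’ Q statistic, which compares each effect size with the inverse-variance weighted mean; under the null hypothesis of homogeneity, Q follows a chi-square distribution with k − 1 degrees of freedom. A sketch with hypothetical values (Python, for illustration only):

```python
def q_statistic(effect_sizes, variances):
    # Hedges' Q: Q = sum(w_i * (ES_i - ES_bar)^2), with w_i = 1 / var(ES_i).
    # Under homogeneity, Q ~ chi-square with k - 1 degrees of freedom.
    weights = [1.0 / v for v in variances]
    es_bar = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
    return sum(w * (es - es_bar) ** 2 for w, es in zip(weights, effect_sizes))

# hypothetical family of three experiments (k = 3, so 2 degrees of freedom)
q = q_statistic([0.4, 0.6, 0.5], [0.04, 0.08, 0.05])
print(round(q, 3))  # 0.348
```

A small Q relative to the chi-square quantile indicates no evidence of heterogeneity, in which case fixed and random effects analyses give essentially the same result.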
6 The Reproducibility and Validity of the Primary Study Meta-analyses (RQ4)
This section reports our reproducibility assessment and incorporates it with the validity analysis reported in Section 5, since it makes little sense to investigate the reproducibility of invalid meta-analyses. In turn, our reproducibility assessment allowed us to investigate further the validity of the meta-analysis processes adopted in each paper, from the viewpoint of whether processes that were valid in principle were also applied correctly in practice. In Section 6.1, we describe the method we used for our reproducibility assessment. In Section 6.2, we report the overall results of the reproducibility assessment, and in the following sections, we discuss the reproducibility results for each study in the context of the validity assessment reported in Section 5.2.
6.1 Reproducibility Assessment Process
 1.
From the descriptive statistics reported in the study, we used (2) to calculate the standardized effect size appropriate for independent groups, d_{IG}. Our estimate of \(s^{2}_{IG}\) was usually based on the pooled within-technique variance. However, in the case of Study 3, Study 7 and Study 11, \(s^{2}_{IG}\) was based on the pooled within-cell variance, where a cell is defined as a set of observations that were obtained under exactly the same experimental conditions (see Appendix A.1.2).
 2.
We applied the exact small sample size adjustment J (see (14)) to calculate the effect size g_{IG}.
 1.
We transformed the g_{IG} values to r_{pb} and applied Fisher’s normalizing transformation (Fisher 1921).
 2.
We used the R metafor package (Viechtbauer 2010) to fit a random effects model using its default method, restricted maximum-likelihood (REML) estimation.
 3.
We backtransformed our metaanalysis results to the standardized mean difference.
This approach is described in more detail in Appendix A.2. It is the same as that undertaken by Abrahão et al. (2011), and has the advantage of being appropriate for all experimental designs used in our primary studies; it also does not rely on information, such as the variances of standardized effect sizes, that was not well known to SE researchers.
 1.
For Study 8, we reported our results in terms of the point biserial correlation (i.e., r_{pb}) because Study 8 reported and aggregated r_{pb}.
 2.
For Study 13, descriptive statistics were not reported explicitly and we estimated the mean difference and standard deviations from the reported graphics. In addition, Study 13 explicitly reported the statistics we refer to as g_{RM} and d_{IG}, so we reported both effect sizes and, like the study authors, aggregated the g_{RM} values.
 3.
In Study 14, the authors reported the personal improvement results for each participant, which is equivalent to d_{RM}. So, to report comparable effect sizes, we calculated the descriptive statistics from the reported descriptive difference data (i.e., the posttraining results minus the pretraining results).
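The effect size construction and aggregation steps described above can be sketched in code. The following is an illustrative Python translation, not the analysis we actually ran (which used the R metafor package with REML estimation): the descriptive statistics are hypothetical, a DerSimonian-Laird estimator stands in for REML, and the final back-transformation from r_{pb} to d assumes equal group sizes.

```python
import math

def hedges_j(df):
    """Exact small-sample adjustment factor J (Hedges and Olkin 1985)."""
    return math.exp(math.lgamma(df / 2) - 0.5 * math.log(df / 2)
                    - math.lgamma((df - 1) / 2))

def g_ig(m1, s1, n1, m2, s2, n2):
    """Small-sample-adjusted standardized mean difference from descriptive stats."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    return hedges_j(df) * (m1 - m2) / s_pooled

def g_to_rpb(g, n1, n2):
    """Convert g to the point biserial correlation: r = g / sqrt(g^2 + a)."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return g / math.sqrt(g**2 + a)

def aggregate(studies):
    """studies: list of (g, n1, n2) tuples. Aggregates on Fisher's z scale
    with variance 1/(N - 3), N = participants, using a DerSimonian-Laird
    random effects estimate (a stand-in for metafor's REML)."""
    zs, vs = [], []
    for g, n1, n2 in studies:
        zs.append(math.atanh(g_to_rpb(g, n1, n2)))  # Fisher's transform
        vs.append(1.0 / (n1 + n2 - 3))              # variance from participants
    w = [1.0 / v for v in vs]
    zbar = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, zs))
    tau2 = max(0.0, (q - (len(zs) - 1)) /
               (sum(w) - sum(wi**2 for wi in w) / sum(w)))
    w_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    r_mean = math.tanh(z_re)                         # back-transform z -> r
    return 2 * r_mean / math.sqrt(1 - r_mean**2)     # r -> d (equal groups)
```

For example, with hypothetical means 10 and 8, standard deviations of 2, and 10 participants per group, the raw standardized mean difference is 1.0 and `g_ig` returns the adjusted value of about 0.958.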
Assuming the descriptive data was reported correctly, our metaanalyses should provide more trustworthy results for studies that used an invalid metaanalysis process (in particular, Study 3, Study 7 and Study 11). However, as explained in Appendix A.1.2, if materials or time period effects are significant, our estimates of \(s^{2}_{IG}\) will be inflated, which would lead to underestimates of d_{IG}. Also, if there were significant interactions between technique and either time period or materials, such effects would also inflate \(s^{2}_{IG}\).
We defined results to be reproducible if the difference between the individual experiment effect sizes and the overall effect size reported in the primary study and those we calculated from the descriptive statistics was less than 0.05, as discussed in Section 3.4. We also compared the probability levels for the overall effect sizes. We expected primary studies that did not appreciate the impact of repeated measures to report smaller p-values than ours. As discussed in Section 3.4, we only analyzed one measure per primary study.
6.2 Reproducibility Assessment Results
Calculated and reported effect sizes
Overall mean values of effect sizes reported and calculated
The reproducibility results are collated with the validity assessment for each study and are discussed in the following sections. In each section, the validity problems from Section 5 appear in paragraphs labelled “MetaAnalysis Validity Issue”, with critical issues that invalidate the aggregation performed by the authors flagged as such. If reproduction failed or was otherwise deemed invalid, we include a “Cause of Problem” paragraph. Validity issues identified as a result of our reproducibility assessment are recorded as metaanalysis process implementation errors in the “Cause of Problem” paragraph.
6.3 Study 1 Validity and Reproducibility
 MetaAnalysis Method Validity Issues:

None.
 Author’s Aggregation Method:

Weighted mean of d_{RM} based on transforming to and from r_{pb}.
 Our Aggregation Method:

Weighted mean of g_{IG} based on transforming to and from r_{pb} as described in Appendix A.2.
 Individual Effect Size Reproducibility:

Failed.
 Mean Effect Size Reproducibility:

Failed.
 Cause of Problem:

Metaanalysis process implementation error: incorrect use of metaanalysis tool.
Comments: Although we could not detect any validity problems with Study 1, and we based our metaanalysis on r_{pb} derived from g_{IG}, we could not reproduce the effect sizes nor the metaanalysis results. The study reported substantially smaller effect sizes, both for individual experiments and overall, than the ones we calculated. We contacted Prof. Abrahão who was the first author of this paper. She very kindly provided us with the raw data used in Study 1. Using Prof. Abrahão’s raw data, we recalculated g_{IG} for each study and aggregated the data after transforming to r_{pb} and following the process described in the Appendix A.5. Prof. Abrahão agreed with our analysis of her raw data. She also confirmed that she was attempting to calculate the matched pairs effect size (i.e., g_{RM}).
The low values she obtained were due to several different factors. The most significant was that she used the MetaAnalysisV2 tool (BioStat 2006), which does not support crossover designs, although it does support matched pairs studies. The tool calculates g_{IG}, not g_{RM}.^{9}
6.4 Study 2 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

It is unclear whether the paper aggregated the standardized effect size d_{IG} directly or used the transformation to r_{pb}.
 MetaAnalysis Method Validity Issue 2:

The weights and variances may have been based on the number of observations rather than the number of participants.
 Author’s Aggregation Method:

Unclear. Either the weighted mean of d_{IG} based on transforming to and from r_{pb}, or the weighted mean of d_{IG} with weight = N − 3.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Succeeded.
 Mean Effect Size Reproducibility:

Succeeded.
Comments: According to our criteria, Study 2 was fully reproduced with respect to the individual effect sizes and the weighted mean of the effect sizes. However, there is a difference with respect to the p-values for the overall mean that is consistent with using the number of observations rather than the number of participants when calculating the variance of the effect size.
6.5 Study 3 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

Critical validity issue: incorrect metaanalysis of nonindependent effect sizes.
 MetaAnalysis Method Validity Issue 2:

Unclear whether the authors aggregated d_{IG} or r_{pb}.
 MetaAnalysis Method Validity Issue 3:

The weights and variances may have been based on the number of observations rather than the number of participants for AB/BA crossover experiments.
 Author’s Aggregation Method:

Unclear. Either the weighted mean of d_{IG} based on transforming to and from r_{pb}, or the weighted mean of d_{IG} with weight = N − 3.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Failed (4), Succeeded (1).
 Mean Effect Size Reproducibility:

Failed.
 Cause of Problem:

Critical validity issue.
Comments: Study 3 used two different experimental designs: four experiments were AB/BA crossover experiments, and the fifth was an independent groups study. We were able to reproduce the effect size for the fifth experiment.
It is important to note that even though Study 3 used two different experimental designs, once comparable effect sizes are constructed, in this case g_{IG}, results from all experiments can be aggregated. Thus, we provide corrected effect sizes and an overall metaanalysis, using the reported descriptive statistics to calculate g_{IG} for each experiment, followed by aggregation of normalized r_{pb} values.
6.6 Study 4 Validity and Reproducibility
 MetaAnalysis Method Validity Issues:

The study might have based weights and variances on the number of observations rather than the number of participants.
 Author’s Aggregation Method:

Weighted mean of d_{IG} based on transforming to and from r_{pb}.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Failed.
 Mean Effect Size Reproducibility:

Failed.
 Cause of Problem:

Metaanalysis process implementation error: incorrect use of metaanalysis tool.
Comments: Like Study 1, Study 4 reported transforming its standardized effect size to r_{pb} but could not be reproduced. Like Study 1, it reported substantially smaller effect sizes, both for individual experiments and overall, than the ones we calculated. Prof. Abrahão was a coauthor of this paper, but she informed us that the raw data for Study 4 were no longer available. However, since the pattern of results was similar to Study 1 (i.e., the experiment effect sizes were smaller than the ones we calculated), it is likely that the analysis suffered from the same problems.
6.7 Study 5 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

The study might have based weights and variances on the number of observations rather than the number of participants.
 MetaAnalysis Method Validity Issue 2:

Unclear whether the authors aggregated d_{IG} or r_{pb}.
 Author’s Aggregation Method:

Unclear. Either the weighted mean of d_{IG} based on transforming to and from r_{pb}, or the weighted mean of d_{IG} with weight = N − 3.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Succeeded.
 Mean Effect Size Reproducibility:

Succeeded.
Comments: Despite uncertainty about which effect size was aggregated, Study 5 was successfully reproduced both at the individual experiment level and at the overall metaanalysis level. The largest discrepancy occurred for the first experiment results. This was due to a probable rounding error. The mean values of Ueffec for the first experiment (EUL) in Table 7 of FernándezSáez et al. (2016) are 0.76 for Low LoD and 0.76 for High LoD, so we calculated the mean difference (and the effect size) to be zero. In fact, Study 5 reports a standardized effect size of − 0.046 (see FernándezSáez et al. 2016, Fig. 4).
Study 5 did not explicitly report the confidence interval on the mean standardized effect size, but visual inspection of their forest plot (FernándezSáez et al. 2016, Fig. 4) suggests an interval of approximately [− 0.25, 0.4], which is narrower than the interval we calculated, [− 0.343, 0.612]. So, Study 5 might have underestimated the standard error of the mean standardized effect size.
6.8 Study 6 Validity and Reproducibility
 MetaAnalysis Method Validity Issue:

The study might have based weights and variances on the number of observations rather than the number of participants.
 Aggregation Method:

Based on d_{IG} but not specified in detail.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Succeeded.
 Mean Effect Size Reproducibility:

Succeeded.
Study 6 was successfully reproduced both for the individual effect sizes and for the overall mean effect size. All discrepancies appear to have occurred because we calculated the small sample size adjusted values. The non-adjusted values for the three experiments are Exp1 = 0.579, Exp2 = 0.3517 and Exp3 = 0.5793, which are very close to the reported values.
6.9 Study 7 Validity and Reproducibility
 MetaAnalysis Method Validity Issue:

Critical validity issue: incorrect metaanalysis of nonindependent effect sizes.
 Author’s Aggregation Method:

Weighted mean of d_{IG} for each time period.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Failed
 Mean Effect Size Reproducibility:

Failed.
 Cause of Problem:

Critical validity issue.
Comments: Like Study 3, Study 7 calculated standardized effect sizes separately for each time period. Since the metaanalysis aggregation was invalid, we report our estimates of the effect sizes for each experiment and their overall mean.
We note, however, that the first time period analysis the authors performed is a valid independent groups analysis (see Senn 2002, Section 3.1.2), so a metaanalysis based on all participants provides a valid estimate of d_{IG} and its variance. Compared with an analysis of data from both time periods, the analysis is based on one set of materials rather than two, and the estimate of d_{IG} may be biased if the randomization to groups was not sufficient to balance out skill differences. However, it is not affected by any technique by time period or technique by order interactions.
6.10 Study 8 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

Wrongly used p −values from independent groups tests to calculate r_{pb}
 MetaAnalysis Method Validity Issue 2:

Used the number of observations in their heterogeneity assessment instead of the number of participants.
 Author’s Aggregation Method:

Weighted mean of r_{pb} based on the HunterSchmidt method (Hunter and Schmidt 1990).
 Our Aggregation Method:

Aggregation of r_{pb} derived from g_{IG}.
 Individual Effect Size Reproducibility:

Failed.
 Mean Effect Size Reproducibility:

Succeeded due to accidental correctness.
 Cause of Problem:

Metaanalysis process implementation error: inconsistency between reported p-values and calculated effect sizes.
Comments: Study 8 was reproduced for three of the four effect sizes and the overall mean. The largest discrepancy was found for the first experiment.
Calculating the r_{pb} effect size from probabilities

Statistic   Exp1    Exp2     Exp3     Exp4
p           0.906   0.036    0.003    0.008
Z           1.317   − 1.799  − 2.748  − 2.409
r_{pb}(NP)  0.249   − 0.450  − 0.458  − 0.695
r_{pb}(NO)  0.176   − 0.318  − 0.324  − 0.492
Thus, although the overall mean r_{pb} value we obtained is very close to the overall mean reported by the authors, the process used to derive the individual effect sizes could not be reproduced.
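The relationship between the reported p-values and the two r_{pb} rows is consistent with Rosenthal’s conversion r = Z/√N, where Z is the standard normal deviate of the one-sided p-value and N is either the number of participants (NP) or the number of observations (NO). A Python sketch; the participant and observation counts in the usage example are our own inference from the table, used purely for illustration:

```python
import math
from statistics import NormalDist

def rpb_from_p(p_one_sided, n):
    """Rosenthal's conversion: turn a one-sided p-value into a standard
    normal deviate Z, then r = Z / sqrt(N). The choice of N (participants
    vs. observations) changes the magnitude of r."""
    z = NormalDist().inv_cdf(p_one_sided)
    return z / math.sqrt(n)
```

For instance, taking the Exp2 row with a hypothetical 16 participants (32 observations), p = 0.036 gives Z ≈ − 1.80, hence r ≈ − 0.45 when dividing by √16 but only ≈ − 0.32 when dividing by √32, reproducing the gap between the NP and NO rows.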
6.11 Study 9 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

Unclear whether the authors aggregated d_{IG} or r_{pb}
 MetaAnalysis Method Validity Issue 2:

The study might have based weights and variances on the number of observations rather than the number of participants.
 Author’s Aggregation Method:

Unclear. Either the weighted mean of d_{IG} based on transforming to and from r_{pb}, or the weighted mean of d_{IG} with weight = N − 3.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Failed.
 Mean Effect Size Reproducibility:

Failed.
 Cause of Problem:

Metaanalysis process implementation error: authors ignored effect size direction.
Comments: Study 9 was not reproduced either in terms of individual effect sizes or in terms of the overall mean. Looking at the effect sizes, it is clear that the authors of Study 9 aggregated the absolute mean effect sizes for each experiment, and so overestimated the overall effect size.
This is the only case in which it is possible for the results of a metaanalysis process using one metric to differ, with respect to reproducibility, from the results obtained using another metric. If all the effect sizes for the other metric were in the same direction, using the absolute effect size would not cause a reproducibility problem. This is in fact the case for the other metric used in this study.
6.12 Study 10 Validity and Reproducibility
 MetaAnalysis Method Validity Issue 1:

Unclear whether the authors aggregated d_{IG} or r_{pb}
 MetaAnalysis Method Validity Issue 2:

The study might have based weights and variances on the number of observations rather than the number of participants.
 Author’s Aggregation Method:

Unclear. Either the weighted mean of d_{IG} based on transforming to and from r_{pb}, or the weighted mean of d_{IG} with weight = N − 3.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Not reported.
 Mean Effect Size Reproducibility:

Succeeded.
Comments: Study 10 did not report individual experiment effect sizes, nor any p-values for the metaanalysis, but did report an overall effect size very close to our calculation.
6.13 Study 11 Validity and Reproducibility
 MetaAnalysis Method Validity Issue:

Critical validity issue: incorrect metaanalysis of nonindependent effect sizes.
 Author’s Aggregation Method:

Weighted mean of d_{IG} for each time period.
 Our Aggregation Method:

As for Study 1.
 Individual Effect Size Reproducibility:

Not reported.
 Mean Effect Size Reproducibility:

Succeeded due to accidental correctness.
 Cause of Problem:

Critical validity issue.
Comments: Like Study 3 and Study 7, Study 11 calculated standardized effect sizes separately for each time period. In this case, however, we found an example of accidental correctness. The Study 11 mean effect size was reproduced because the analysis effects were extremely close for both time periods, so constructing an average effect size for each experiment gave very similar results to treating each time period as a separate experiment. What is noticeable is that the reported p-value was considerably lower than the one we calculated. This was because the authors believed they had six effect sizes in their metaanalysis rather than three.
Like Study 7, the first time period metaanalysis reported by Study 11 provides a valid estimate of d_{IG} and its variance.
6.14 Study 13 Validity and Reproducibility
 MetaAnalysis Method Validity Issue:

None
 Author’s Aggregation Method:

Unweighted mean of g_{RM} and sum of the natural logarithm of the onesided p −values.
 Our Aggregation Method:

Weighted mean of g_{RM} based on transformation to and from r_{pb} and sum of the natural logarithm of the onesided pvalues.
 Individual Effect Size Reproducibility:

Failed due to extracting basic data from graphics.
 Mean Effect Size Reproducibility:

Succeeded.
Comments: Study 13 did not report the mean and standard deviation of the technique groups. Instead, the authors presented the descriptive statistics in graphical form. However, in contrast to the other studies, Study 13 reported both the d_{IG} (which they referred to as Hedges’ g) and g_{RM} (which they referred to as d) using a valid formula to estimate its standard deviation.
Since the values we used to reproduce the effect sizes were estimated from a diagram, we expected the difference between our results and the reported results to be slightly larger than our 0.05 level; in fact, all the differences were less than 0.08.
Study 13 aggregated both the onesided pvalues and the individual g_{RM} effect sizes. The overall mean g_{RM} was validated by our difference criterion. The reported aggregated probability, P, was close to the value we calculated,^{10} and overall we conclude that Study 13 has been successfully reproduced.
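Summing the natural logarithms of one-sided p-values corresponds to Fisher’s combined probability test: X² = −2 Σ ln p_i is referred to a chi-square distribution with 2k degrees of freedom. A minimal Python sketch for illustration (for even degrees of freedom the chi-square survival function has a closed form, so no statistics library is needed):

```python
import math

def fisher_combined_p(pvals):
    """Fisher's method: X2 = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom under the null. For even
    df = 2k the survival function at X2 is
    exp(-X2/2) * sum_{i < k} (X2/2)^i / i!."""
    k = len(pvals)
    half_x2 = -sum(math.log(p) for p in pvals)  # = X2 / 2
    return math.exp(-half_x2) * sum(half_x2**i / math.factorial(i)
                                    for i in range(k))
```

With a single p-value the function returns that p-value unchanged, and combining several small one-sided p-values yields a smaller overall probability, which is the behaviour the aggregated P in Study 13 relies on.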
6.15 Study 14 Validity and Reproducibility
 MetaAnalysis Method Validity Issue:

None
 Author’s Aggregation Method:

Weighted and unweighted mean of g_{RM} and sum of the natural logarithm of the onesided p −values.
 Our Aggregation Method:

Weighted mean of g_{RM} based on transformation to and from r_{pb} and sum of the natural logarithm of the onesided pvalues.
 Individual Effect Size Reproducibility:

For g_{IG} failed due to rounding errors, for p succeeded.
 Mean Effect Size Reproducibility:

Failed due to rounding errors.
Comments: Study 14 used an interesting design that avoids some of the problems associated with repeated measures by analyzing the differences in differences (see Appendix A.1.4). Study 14 actually performed four statistical tests for each of four different variables: comparing the pretest results for each group, comparing the posttest results for each group, comparing the posttest with the pretest values for each group, and comparing the mean difference of the difference between pretest and posttest results for each group (which they call the performance improvement). However, for the purpose of comparing the two treatments, the relative performance improvement is the most appropriate measure to test.
We were able to reproduce only one of the standardized mean effect sizes for individual experiments. In addition, we could not reproduce the overall mean effect size. All the data are reported to two significant digits, and it appears that, because the raw data values are quite small, this has led to potentially large rounding errors.^{11} However, we obtained t-test p-values that were similar to the reported values, and our aggregated p-values were also close.
7 Discussion
This section discusses issues arising from our systematic review and validity and reproducibility studies.
7.1 Summary of Results
We found 13 primary studies that conformed with our inclusion criteria in the sources we searched. All primary studies reported their experimental designs in sufficient detail for us to classify their individual experiments into four distinct design types: the AB/BA crossover design, the 4group duplicated AB/BA crossover design, the independent groups design, and the pretest posttest control design.
All 13 primary studies also provided sufficient information for us to attempt to reproduce their metaanalysis results, but, in most cases, only for effect sizes comparable to independent groups designs (i.e., d_{IG} and g_{IG}). Of the crossover designs, only Study 13 reported the improvement effect sizes (g_{RM}). The other crossover design studies did not provide the summary information needed to calculate the personal improvement effect size.
We identified four primary studies that exhibited validity problems sufficient to call into question the reported metaanalysis results, and another six studies where we were unsure about the validity of the metaanalysis. In those six cases, we expected the effect sizes to be slightly biased and effect size variances to be underestimated, see Appendix A.5 for a more detailed explanation.
Of the 12 studies that reported individual experiment effect sizes, we were able to fully reproduce five. We also reproduced six of the 12 reported overall effect sizes. In the case of Study 10, which did not report individual experiment effect sizes, we were able to reproduce its overall effect size.
7.2 Experimental Designs Used by Primary Studies
Six studies used the 4group duplicated AB/BA crossover design and four studies used the AB/BA crossover design. Study 3 used two different designs, with 4 experiments using a 4group duplicated AB/BA crossover and one experiment using an independent groups design. The two remaining studies used an independent groups design and a pretest posttest control design. Thus, 12 of the 13 primary studies used repeated measures methods.
Only one family used an independent groups design for all its experiments, even though outcomes of this design are the most straightforward to analyse and metaanalyse. Using more complex designs makes the analysis of individual experiments and their subsequent metaanalysis more difficult, and only four of the 12 repeated measures studies used analysis methods appropriate for repeated measures data. Using analysis methods appropriate for independent groups studies has knock-on effects for any subsequent metaanalysis that can lead to invalid effect sizes or invalid effect size variances.
The main reason for using repeated measures designs is to be able to account for the individual skill differences among participants. However, the crossover design is not the only way to do this. In particular, the pretest posttest control group experimental design (see Appendix A.1.4) has some desirable properties. It allows the effects of individual differences to be catered for by the analysis, but avoids the problem of technique by period interaction, which is a potential risk when using a crossover design. For example, there were many studies evaluating perspective-based code reading (PBR) methods (see Ciolkowski 2009), some of which used the undefined current method as a control while others used the checklist-based reading (CBR) method as a control. Using a pretest posttest control group design, the current method would be used to establish a pretest baseline; groups could then be randomly assigned to training in CBR or PBR, and the posttest differences used to assess whether PBR or CBR most enhanced defect detection.
7.3 Metaanalysis Reporting
Primary study authors did not always describe their metaanalysis processes fully and consistently. Few studies reported any information related to the standard error of the average effect size or its confidence intervals. The p −values for the overall effect sizes were reported nine times. In only three cases were the reported and calculated p −values of the same order of magnitude. Two papers reported confidence interval bounds, but these were Study 7 and Study 11 and we disagreed with their aggregation process.^{12}

Studies often reported a name such as Hedges’ g for their standardised mean effect sizes, but did not usually specify how this was calculated. For reproducibility it is important to know both the formula for the standard deviation used to standardise the mean difference and whether or not the small sample size adjustment factor was applied.

Many studies used metrics that corresponded to the fraction of correct responses and which they reported on a [0, 1] scale. This can lead to rounding errors when reproducing results, if descriptive statistics are only reported to two decimal places. It is preferable to represent such numbers as percentages rather than fractions. Reporting percentages to two decimal places is appropriate both for means and standard deviations.

Authors using a repeated measures design sometimes failed to report the number of participants in each sequence group. However, this is important for metaanalysis purposes if the individual experiments are unbalanced in any way.
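The rounding point above can be seen with hypothetical fraction-scale means, loosely echoing the Study 5 situation: two means that differ in the third decimal place become indistinguishable when reported to two decimal places as fractions, but not when reported as percentages. The values below are purely illustrative:

```python
# Hypothetical means on a [0, 1] fraction scale (illustrative values only)
m1, m2 = 0.7642, 0.7608

# Reported as fractions to two decimal places, the difference vanishes:
frac_diff = round(m1, 2) - round(m2, 2)   # both round to 0.76

# Reported as percentages to two decimal places, the difference survives:
pct_diff = round(100 * m1, 2) - round(100 * m2, 2)  # 76.42 - 76.08
```

Any effect size computed from the two-decimal fraction summaries would therefore be exactly zero, while the percentage summaries preserve the small mean difference.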
7.4 Metaanalysis Tools
Eleven of the 13 studies mentioned using a metaanalysis tool. Of those 11 studies, seven exhibited reproducibility problems. It is difficult for researchers to assess whether they have used tools correctly unless there is some way of validating the tool outcomes. This study has shown that attempting to reproduce the results from descriptive data is a useful means of checking the output from tools. Comparing the results of analyzing the raw data as opposed to the descriptive statistics (as reported in Appendix A.5) shows that results based on descriptive statistics may be biased, but they should still provide results of the same order of magnitude, providing a sanity check on the tool outputs.
7.5 Metaanalysis Methods
In this section we discuss the implications of our study on the use of metaanalysis methods to aggregate data from families of experiments.
7.5.1 Testing for Heterogeneity
Only three primary studies (Studies 8, 13 and 14) reported the results of testing for heterogeneity among experiments in a family. It might be expected that a family of experiments is, by definition, homogeneous. However, some studies, such as Study 1 and Study 3, reported families that had considerable differences between the individual experiments (see the supplementary material (Kitchenham et al. 2019b)). It is certainly worth checking for heterogeneity in such cases. In the case of Study 1, our metaanalysis found a heterogeneity value of 4.01 with an associated p-value of 0.45, suggesting that heterogeneity was limited and the fixed effects analysis undertaken by the authors was appropriate. In the case of Study 3, the heterogeneity value was 8.46 with p = 0.0761. Since heterogeneity tests are not very powerful (see Higgins and Thompson 2002), we suggest that a heterogeneity p-value of less than 0.1 should be taken as an indication that a random effects analysis might be preferable to a fixed effects analysis.
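The heterogeneity statistic involved is Cochran’s Q, referred to a chi-square distribution with k − 1 degrees of freedom, often summarised alongside Higgins and Thompson’s I². A simplified Python sketch of the computation (not the metafor implementation):

```python
def cochran_q(effects, variances):
    """Cochran's Q and I^2 from per-experiment effect sizes and variances.
    Q is compared against chi-square with k - 1 degrees of freedom;
    I^2 = max(0, (Q - df) / Q) expresses heterogeneity as a proportion
    of total variation (Higgins and Thompson 2002)."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - mean) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2
```

When Q does not exceed its degrees of freedom, I² is zero and a fixed effects analysis is defensible; larger values point towards a random effects model.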
7.5.2 Metaanalysis Choices
One of the major problems with metaanalysis is that there are many different effect sizes and methods that can be used to aggregate results. The metaanalysis methods used in the primary studies were not always clearly reported, but most studies reported standardized mean effect sizes for individual experiments and for the overall mean effect size. Study 8 reported the point biserial correlation coefficient. In addition, Studies 13 and 14 used the method of combining p-values, which is now known to have severe limitations (see Appendix A.4).
Many text books recommend aggregating standardised mean difference effect sizes, see for example, Borenstein et al. (2009) or Lipsey and Wilson (2001), but it depends on obtaining the correct effect size variance.^{13} This is fairly straightforward if the individual experiments have medium to large sample sizes, but is more complicated if experiments have very small sample size (Hedges and Olkin 1985), and also depends on the specific experimental design, as can be seen in Madeyski and Kitchenham (2018b) and Morris and DeShon (2002).
It would seem to be easier to convert to r_{pb} for aggregation, as we did in our reproducibility assessment. This procedure avoids the need to obtain estimates of the standardized effect size variance. However, it must be recognised that the problem with the standardised effect size and its variance is that, for small sample sizes, the estimate of the variance which is used to calculate the standardised effect size is likely to be inaccurate. Converting to r_{pb} does not overcome this problem since the point biserial correlation is itself calculated as the ratio of two variance estimates.
In practice, as proposed by Santos et al. (2018), an option for homogeneous families (i.e., families that use the same materials and the same output measures) would be to analyze the data from the family as one large experiment, using what they call an Independent Participant Data (IPD) stratified method. This analyzes the data from all the individual experiments together as a single data set, using the individual experiment identifier as a blocking factor. This leads to an estimate of the overall mean difference and the residual variance based on all the participants. An estimate of the effect size of the family and its standard error would then be more likely to be reliable.
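The IPD stratified idea can be sketched as follows. Santos et al. fit a linear model with experiment as a blocking factor; for a balanced two-technique family, the block-adjusted treatment effect reduces to pooling within-experiment mean differences, which is what this simplified Python sketch computes. The function name and data layout are our own illustration, not their implementation:

```python
from collections import defaultdict

def ipd_stratified_effect(rows):
    """rows: (experiment_id, technique, outcome) triples.
    Blocks on experiment: computes the within-experiment mean difference
    between the two techniques, then pools the differences weighted by
    per-experiment sample size. A simplified stand-in for fitting
    outcome ~ technique + experiment as a linear model."""
    by_exp = defaultdict(lambda: defaultdict(list))
    for exp, tech, y in rows:
        by_exp[exp][tech].append(y)
    mean = lambda xs: sum(xs) / len(xs)
    diffs, weights = [], []
    for techs in by_exp.values():
        a, b = sorted(techs)  # assumes exactly two techniques per experiment
        diffs.append(mean(techs[a]) - mean(techs[b]))
        weights.append(sum(len(v) for v in techs.values()))
    return sum(d * w for d, w in zip(diffs, weights)) / sum(weights)
```

Dividing the pooled difference by the pooled residual standard deviation would then give a single family-level standardized effect size, with a standard error based on all participants.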
It is also possible that using nonparametric effect sizes would avoid some of the problems inherent in using parametric effect sizes. However, although it is possible to calculate a number of different nonparametric effect sizes, it is not clear which nonparametric effect sizes should be used, nor how to aggregate results from individual experiments into an overall effect size.
7.6 Limitations
AB/BA crossover design

Group  Period 1                  Period 2
A      Technique 1, Materials 1  Technique 2, Materials 2
B      Technique 2, Materials 1  Technique 1, Materials 2

Duplicated AB/BA crossover design

Group  Period 1                  Period 2
A      Technique 1, Materials 1  Technique 2, Materials 2
B      Technique 2, Materials 1  Technique 1, Materials 2
C      Technique 1, Materials 2  Technique 2, Materials 1
D      Technique 2, Materials 2  Technique 1, Materials 1
We claimed to have found a reproducibility problem if the difference between the effect size estimates reported by the authors and the ones we calculated was greater than 0.05. The choice of 0.05 was based on convenience and can be criticized. In practice, the value we chose seemed to work reasonably well as a means of drawing our attention to possible reproducibility problems. However, it incorrectly highlighted some differences that we believed to be due to rounding errors, and we also observed two examples of accidental correctness. So, it was critical to review the actual metaanalysis process reported by the authors, as well as the difference between reported and calculated effect sizes, to confirm whether there were validity or reproducibility problems.
8 Conclusions and Contributions
Our systematic review identified 13 primary studies from five high quality journals. In seven cases, we identified validity or reproducibility problems. Even where we reproduced the average standardized effect size, in four cases we are unsure of the accuracy of the statistical significance tests and p-values. We conclude that metaanalysis is not well understood by software engineering researchers.
Our systematic review process, reported in Section 3, ensured that the problems we identified were found in papers published in high quality software engineering journals with stringent peer review processes. It is, therefore, important to report such problems and to provide guidelines and procedures to help avoid them in the future. Answers to RQ1 and RQ2, reported in Section 4, provide traceability to the individual primary studies and contextual details of the experimental methods used to analyse each experiment. This confirms that we have not been biased in our selection of primary studies. Answers to RQ3 and RQ4 provide traceability to the individual metaanalysis problems and confirmation that most problems were found in more than one primary study, so they are more than just one-off mistakes.
The contributions of this paper are:
 1.
To provide evidence that meta-analysis methods are not well understood by software engineering researchers (see Sections 5 and 6).
 2.
 3.
To provide guidelines for reporting and undertaking meta-analysis that could help to avoid meta-analysis errors (see Appendix A.6).
 4.
To describe the model underlying the 4-group crossover experimental design (see Appendix A.1.3); although this design is popular in software engineering research, it has not previously been specified in any detail.
 5.
To provide a worked example of analyzing and meta-analyzing results from a family of studies that used a 4-group crossover design (see Appendix A.5).

Our review also raised issues that merit further research:

Whether there is an optimum (or minimum viable) number of experiments in a family.

Whether converting to r_{pb} is preferable to aggregating g_{IG} directly, given the small sample sizes and small numbers of independent experiments in SE families.

Whether we should use nonparametric methods for analysis and metaanalysis.
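The r_{pb} question above can be made concrete using the standard conversion from a standardized mean difference d to a point-biserial correlation given by Borenstein et al. (2009), r = d / sqrt(d^2 + a) with a = (n1 + n2)^2 / (n1 * n2). A minimal sketch (the function name is ours):

```python
import math

def d_to_r_pb(d, n1, n2):
    """Convert a standardized mean difference d to a point-biserial
    correlation, using a = (n1 + n2)^2 / (n1 * n2) (a = 4 for equal groups)."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)

# Balanced groups of the size typical in SE families:
print(round(d_to_r_pb(0.5, 10, 10), 3))  # 0.243
```

Whether this conversion behaves better than direct aggregation of g_{IG} at such small sample sizes is precisely the open question.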
Finally, whenever possible, we would ask researchers to make their data sets publicly available. Such data sets allow reviewers to check the validity of results before publication, provide a valuable resource for novice researchers, and allow data to be re-analyzed if new analysis methods become available.
Footnotes
 1.
 2.
Santos et al. (2018) reported that only 5 of the 39 papers they identified reported their raw data, so any reproducibility study we performed would need to be based primarily on summary statistics.
 3.
This criterion was amended after the protocol was completed because we identified the need to exclude correlation studies during data collection.
 4.
In our protocol we used the term correlation coefficient; however, after beginning data extraction, we realized we needed to define the correlation coefficient effect size more precisely as the point-biserial correlation.
 5.
Since we restricted ourselves to five international journals (see Section 3.2), we did not need to formally exclude extended abstracts or non-English papers.
 6.
Although Santos et al. (2018) found 15 families that used meta-analysis, three of the papers they found were excluded on the basis of our inclusion criteria, and we found one study they did not.
 7.
That is, the same conceptual task (e.g., fault detection or a comprehension questionnaire) but with different materials (e.g., a different specification, design, or code listing).
 8.
 9.
The tool is intended to help researchers aggregate experiments that use different design methods, and the between-groups design is the most commonly used design method.
 10.
In the case of the aggregated probability value there is no a priori value of P, so we can only make a subjective assessment of whether the calculated and reported values are close.
 11.
For example, for the metric Y.1 (Interest), the pre-test score for group B was 0.81 and the post-test score was 0.79, but the difference score was reported as −0.03 (not −0.02). This seems a minor issue, but the difference score for group A was 0.1 and the pooled within-group standard deviation of the difference scores was 0.09. A difference score of −0.03 for group B leads to an effect size of 1.444, while a difference score of −0.02 leads to an effect size of 1.333; after adjusting for the small sample sizes (n_{A} = 5 and n_{B} = 4), these become 1.279 and 1.181 respectively.
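The unadjusted arithmetic in this footnote can be checked directly from the quoted difference scores and pooled standard deviation; a minimal sketch:

```python
diff_A = 0.10      # group A difference score, as quoted in the footnote
sd_pooled = 0.09   # pooled within-group SD of the difference scores

def effect_size(diff_B):
    # Standardized mean difference between the two groups' difference scores.
    return (diff_A - diff_B) / sd_pooled

print(round(effect_size(-0.03), 3))  # 1.444 (reported difference score)
print(round(effect_size(-0.02), 3))  # 1.333 (recalculated difference score)
```

The small-sample adjustment mentioned afterwards is the multiplicative J correction discussed in the surrounding text (footnotes 16 and 17).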
 12.
Some papers reported forest plots with confidence bounds visible, but it is not possible to extract accurate values from such diagrams.
 13.
The standardized effect size variance is not the same as the sample variance. It is based on a formula that includes the number of participants in each experimental condition and the standardized effect size itself.
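For an independent-groups design, this is the standard large-sample approximation given by Hedges and Olkin (1985): with n_{A} and n_{B} participants per condition and standardized effect size g,

```latex
\operatorname{var}(g) \approx \frac{n_A + n_B}{n_A \, n_B} + \frac{g^2}{2\,(n_A + n_B)}
```

Note that crossover designs require the corrected variances discussed by Kitchenham et al. (2018); the formula above applies only to the independent-groups case.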
 14.
Some researchers recommend using the standard deviation of the control group or the population standard deviation if it is known. See Lakens (2013) for a discussion of various different options for the choice of the standard deviation.
 15.
Please be aware that Hedges and Olkin called the unadjusted estimate of the standardized mean effect size g and the adjusted estimate d. Therefore, it is best to confirm explicitly whether or not the standardized mean effect size has been adjusted for small samples, rather than rely on a possibly ambiguous label.
 16.
The following R code calculates J for numerical value x: sqrt(2/x)*gamma(x/2)/gamma((x-1)/2), and is easy to convert to a function.
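For example (a Python transliteration of the footnote's R expression; the function name is ours):

```python
import math

def small_sample_adjustment(x):
    """Exact small-sample correction factor J, transliterated from the
    footnote's R expression: sqrt(2/x) * gamma(x/2) / gamma((x - 1)/2)."""
    return math.sqrt(2 / x) * math.gamma(x / 2) / math.gamma((x - 1) / 2)

# For df = 7 the exact J is close to the common approximation 1 - 3/(4x - 1).
print(round(small_sample_adjustment(7), 3))  # 0.888
print(round(1 - 3 / (4 * 7 - 1), 3))         # 0.889
```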
 17.
In our reproducibility calculations we always used J(df).
 18.
This variance is not the same as the variance used to standardize the mean difference.
 19.
 20.
Heterogeneity is measured as an additional variance component τ^{2}, which is added to the initial (within-study) variance. The inverse of the revised variance is then used as the weight in the random-effects meta-analysis. If τ^{2} is small, the effect on the meta-analysis results will be small.
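The weighting scheme described here can be sketched as follows (a simplified illustration with a fixed heterogeneity variance; real analyses estimate it, e.g., via REML in the metafor package):

```python
def random_effects_estimate(effects, variances, tau2):
    """Pool effect sizes with inverse-variance weights 1/(v_i + tau2),
    where tau2 is the between-study (heterogeneity) variance."""
    weights = [1.0 / (v + tau2) for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

effects = [0.5, 0.3]
variances = [0.05, 0.20]
# With tau2 = 0 this reduces to a fixed-effect analysis; adding a small
# tau2 pulls the weights slightly towards equality, so the pooled value
# barely moves -- illustrating the point made in the footnote.
print(round(random_effects_estimate(effects, variances, 0.0), 3))
print(round(random_effects_estimate(effects, variances, 0.01), 3))
```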
 21.
Researchers wanting access to the data should contact Prof. Abrahão.
Notes
Acknowledgements
We thank Silvia Abrahão, Carmine Gravino, Emilio Insfran, Giuseppe Scanniello and Genoveffa Tortora for giving us access to their raw data. We are particularly grateful to Prof. Abrahão for providing us with details of her statistical analysis. We thank the reviewers for their helpful comments, particularly for pointing out the issue of validity and the problem of aggregating invalid data. Lech Madeyski was partially supported by the Polish Ministry of Science and Higher Education under Wroclaw University of Science and Technology Grant 0401/0201/18.
References
 Abrahão S, Insfrán E, Carsí JA, Genero M (2011) Evaluating requirements modeling methods based on user perceptions: a family of experiments. Inf Sci 181(16):3356–3378
 Abrahao S, Gravino C, Insfran Pelozo E, Scanniello G, Tortora G (2013) Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: results from a family of five experiments. IEEE Trans Softw Eng 39(3):327–342
 Basili VR, Shull F, Lanubile F (1999) Building knowledge through families of experiments. IEEE Trans Softw Eng 25(4):456–473
 BioStat (2006) Comprehensive meta-analysis (CMA) v2.0. https://www.meta-analysis.com/pages/v2download.php
 Borenstein M, Hedges LV, Higgins JPT, Rothstein HT (2009) Introduction to meta-analysis. Wiley, UK
 Chow S, Liu J (1992) Design and analysis of bioavailability and bioequivalence studies. Taylor & Francis, New York
 Ciolkowski M (2009) What do we know about perspective-based reading? An approach for quantitative aggregation in software engineering. In: Proceedings of the 2009 3rd international symposium on empirical software engineering and measurement (ESEM '09). IEEE Computer Society, Washington, DC, pp 133–144. https://doi.org/10.1109/ESEM.2009.5316026
 Cruz-Lemus JA, Genero M, Manso ME, Morasca S, Piattini M (2009) Assessing the understandability of UML statechart diagrams with composite states—a family of empirical studies. Empir Softw Eng 14(6):685–719
 Cruz-Lemus JA, Genero M, Caivano D, Abrahão S, Insfrán E, Carsí JA (2011) Assessing the influence of stereotypes on the comprehension of UML sequence diagrams: a family of experiments. Inf Softw Technol 53(12):1391–1403
 Cumming G (2012) Understanding the new statistics: effect sizes, confidence intervals and meta-analysis. Routledge, UK
 Dahl DB, Scott D, Roosen C, Magnusson A, Swinton J (2018) xtable: export tables to LaTeX or HTML. https://CRAN.R-project.org/package=xtable, R package version 1.8-3
 Fernandez A, Abrahão S, Insfran E (2013) Empirical validation of a usability inspection method for model-driven Web development. J Syst Softw 86(1):161–186
 Fernández-Sáez AM, Genero M, Chaudron MRV, Caivano D, Ramos I (2015) Are forward design or reverse-engineered UML diagrams more helpful for code maintenance?: a family of experiments. Inf Softw Technol 57:644–663
 Fernández-Sáez AM, Genero M, Caivano D, Chaudron MRV (2016) Does the level of detail of UML diagrams affect the maintainability of source code?: a family of experiments. Empir Softw Eng 21(1):212–259
 Fisher R (1921) On the probable error of a coefficient of correlation deduced from a small sample. Metron 1:1–32
 Gonzalez-Huerta J, Insfrán E, Abrahão SM, Scanniello G (2015) Validating a model-driven software architecture evaluation and improvement method: a family of experiments. Inf Softw Technol 57:405–429
 Hadar I, Reinhartz-Berger I, Kuflik T, Perini A, Ricca F, Susi A (2013) Comparing the comprehensibility of requirements models expressed in Use Case and Tropos: results from a family of experiments. Inf Softw Technol 55(10):1823–1843
 Hedges LV, Olkin I (1985) Statistical methods for meta-analysis. Academic Press, Orlando
 Higgins JPT, Thompson SG (2002) Quantifying heterogeneity in a meta-analysis. Stat Med 21(11):1539–1558
 Hunter J, Schmidt F (1990) Methods of meta-analysis: correcting error and bias in research findings. Sage, Newbury Park
 Johnson NL, Welch BL (1940) Applications of the noncentral t-distribution. Biometrika 31(3–4):362–389
 Jureczko M, Madeyski L (2015) Cross-project defect prediction with respect to code ownership model: an empirical study. e-Informatica Softw Eng J 9(1):21–35. https://doi.org/10.5277/eInf150102
 Kitchenham B, Budgen D, Brereton P (2015) Evidence-based software engineering and systematic reviews. CRC Press, Boca Raton
 Kitchenham B, Madeyski L, Budgen D, Keung J, Brereton P, Charters S, Gibbs S, Pohthong A (2017) Robust statistical methods for empirical software engineering. Empir Softw Eng 22(2):579–630
 Kitchenham B, Madeyski L, Curtin F (2018) Corrections to effect size variances for continuous outcomes of crossover clinical trials. Stat Med 37(2):320–323. https://doi.org/10.1002/sim.7379
 Kitchenham B, Madeyski L, Brereton P (2019a) Problems with statistical practice in human-centric software engineering experiments. In: Proceedings of the 2019 international conference on evaluation and assessment in software engineering (EASE), pp 134–143. https://doi.org/10.1145/3319008.3319009
 Kitchenham B, Madeyski L, Brereton P (2019b) Supplementary materials for the paper “Meta-analysis for families of experiments in software engineering: a systematic review and reproducibility and validity assessment”. http://madeyski.e-informatyka.pl/download/KitchenhamMadeyskiBrereton19Supplement.pdf
 Laitenberger O, Emam KE, Harbich TG (2001) An internally replicated quasi-experimental comparison of checklist and perspective-based reading of code documents. IEEE Trans Softw Eng 27(5):387–418
 Lakens D (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol 4(Article 863):1–12
 Lipsey MW, Wilson DB (2001) Practical meta-analysis. Sage Publications Inc., UK
 Madeyski L, Jureczko M (2015) Which process metrics can significantly improve defect prediction models? An empirical study. Softw Qual J 23(3):393–422. https://doi.org/10.1007/s11219-014-9241-7
 Madeyski L, Kitchenham B (2017) Would wider adoption of reproducible research be beneficial for empirical software engineering research? J Intell Fuzzy Syst 32(2):1509–1521. https://doi.org/10.3233/JIFS-169146
 Madeyski L, Kitchenham B (2018a) Effect sizes and their variance for AB/BA crossover design studies. In: Proceedings of the ACM/IEEE 40th international conference on software engineering (May 27–June 3, 2018). ACM, Gothenburg, p 420. https://doi.org/10.1145/3180155.3182556
 Madeyski L, Kitchenham BA (2018b) Effect sizes and their variance for AB/BA crossover design studies. Empir Softw Eng 23(4):1982–2017. https://doi.org/10.1007/s10664-017-9574-5
 Madeyski L, Kitchenham B (2019) reproducer: reproduce statistical analyses and meta-analyses. R package version 0.3.0. http://CRAN.R-project.org/package=reproducer
 Morales JM, Navarro E, Sánchez-Palma P, Alonso D (2016) A family of experiments to evaluate the understandability of TRiStar and i^{*} for modeling teleo-reactive systems. J Syst Softw 114:82–100
 Morris SB (2000) Distribution of the standardized mean change effect size for meta-analysis on repeated measures. Br J Math Stat Psychol 53:17–29
 Morris SB, DeShon RP (2002) Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychol Methods 7(1):105–125. https://doi.org/10.1037//1082-989X.7.1.105
 Pfahl D, Laitenberger O, Ruhe G, Dorsch J, Krivobokova T (2004) Evaluating the learning effectiveness of using simulations in software project management education: results from a twice replicated experiment. Inf Softw Technol 46(2):127–147
 Rosenthal R (1991) Meta-analytic procedures for social research. Sage, UK
 Santos A, Gómez OS, Juristo N (2018) Analyzing families of experiments in SE: a systematic mapping study. CoRR arXiv:1805.09009
 Scanniello G, Gravino C, Genero M, Cruz-Lemus JA, Tortora G (2014) On the impact of UML analysis models on source-code comprehensibility and modifiability. ACM Trans Softw Eng Methodol 23(2):13:1–13:26. https://doi.org/10.1145/2491912
 Senn S (2002) Cross-over trials in clinical research, 2nd edn. Wiley, UK
 Teruel MA, Navarro E, López-Jaquero V, Montero F, Jaen J, González P (2012) Analyzing the understandability of requirements engineering languages for CSCW systems: a family of experiments. Inf Softw Technol 54(11):1215–1228
 Vegas S, Apa C, Juristo N (2016) Crossover designs in software engineering experiments: benefits and perils. IEEE Trans Softw Eng 42(2):120–135. https://doi.org/10.1109/TSE.2015.2467378
 Viechtbauer W (2010) Conducting meta-analyses in R with the metafor package. J Stat Softw 36(3):1–48
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.