Effect sizes and their variance for AB/BA crossover design studies
Abstract
Vegas et al. IEEE Trans Softw Eng 42(2):120–135 (2016) raised concerns about the use of AB/BA crossover designs in empirical software engineering studies. This paper addresses issues related to calculating standardized effect sizes and their variances that were not addressed in Vegas et al.’s paper. In a repeated measures design such as an AB/BA crossover design, each participant uses each method. There are two major implications of this that have not been discussed in the software engineering literature. Firstly, there are potentially two different standardized mean difference effect sizes that can be calculated, depending on whether the mean difference is standardized by the pooled within-groups variance or the within-participants variance. Secondly, as for any estimated parameter and also for the purposes of undertaking meta-analysis, it is necessary to calculate the variance of the standardized mean difference effect sizes (which is not the same as the variance of the study). We present the model underlying the AB/BA crossover design and provide two examples to demonstrate how to construct the two standardized mean difference effect sizes and their variances, both from standard descriptive statistics and from the outputs of statistical software. Finally, we discuss the implications of these issues for reporting and planning software engineering experiments. In particular, we consider how researchers should choose between a crossover design and a between-groups design.
Keywords
Empirical software engineering; Crossover designs; Effect size estimation; Effect size variance estimation; Meta-analysis
1 Introduction
Vegas et al. (2016) reported that many software engineering experiments had used AB/BA crossover designs but the reports of the experiments did not use the correct terminology. In their literature review, they found a total of 82 papers that reported human-participant-based experiments. Of those, 33 papers used crossover designs in a total of 68 experiments. Only five of the papers employing a crossover design used the term crossover; the other papers used terms that were incorrect or not specific enough. Furthermore, 17 papers did not take account of participant variability in their analysis (which is the main rationale for using a repeated measures design such as a crossover).
In their paper, Vegas et al. explain both the terminology used to describe a crossover design, and how to analyze a crossover design correctly. However, except for warning readers to “Beware of effect size” and to only calculate effect sizes when the main factor is the only significant variable, Vegas et al. did not discuss effect sizes for crossover designs. In this paper we explain how to construct effect sizes and their variances for crossover designs. We provide an overview of the crossover design, as well as its advantages and limitations in Section 2.
Effect size is a name given to indicators that measure the strength of the investigated phenomenon, in other words, the magnitude of a treatment effect. Effect sizes are much less affected by sample size than statistical significance. Hence, they are better indicators of practical significance (Madeyski 2010; Urdan 2005; Stout and Ruble 1995). Effect sizes are also essential in meta-analyses (Kitchenham and Madeyski 2016), which in turn allow us to summarize results of empirical studies, even those with contradictory results, that address the same (or closely related) research questions. This paper has three goals:
 1. To present the formulas needed to calculate both nonstandardized mean difference effect sizes and standardized mean difference effect sizes^{1} for AB/BA crossover designs (see Sections 4 and 5).
 2. To present the formulas needed to estimate the variances of the nonstandardized and standardized effect sizes, which in the latter case need to be appropriate for the small to medium sample sizes commonly used in SE crossover designs (see Section 5).
 3. To explain how to calculate the effect sizes and their variances both from the descriptive statistics that should be reported from crossover experiments and from the raw data (see Section 6).
We discuss why these goals are important and how we address them in Section 3. In Section 7, we discuss the implications of the issues presented in this paper from the viewpoint of researchers trying to decide whether to undertake a crossover study or an independent groups study, particularly in the context of families of experiments. We present our conclusions in Section 8.
It is also worth mentioning that, in order to streamline the uptake of the research results of this paper, the reproducer R package (Madeyski 2017) complements this paper, as well as Kitchenham et al. (2017a), Madeyski and Jureczko (2015), and Jureczko and Madeyski (2015), with the aim of making our work easier for others to reproduce. We have embedded a number of the R functions (used to perform the statistical analyses and simulations in the paper) in the reproducer R package, which we developed and made available from CRAN (the official repository of R packages)^{2}. The use of the functions (R commands and outputs) is presented throughout the paper (see Outputs 3 and 4), as well as in Appendix A (see Outputs 5 and 6).
2 Background
A crossover design is a form of repeated measures design. A repeated measures design is one where an individual participant contributes more than a single outcome value.
Expected Outcome for Participants in AB/BA Crossover
Sequence group  Participant ID  Period 1  Period 2
\(SG_{1}\)  j  \(y_{1,1,j} = \mu_{j} + \tau_{A}\) (technique A)  \(y_{2,2,j} = \mu_{j} + \pi + \tau_{B} + \lambda_{A}\) (technique B)
\(SG_{2}\)  k  \(y_{2,1,k} = \mu_{k} + \tau_{B}\) (technique B)  \(y_{1,2,k} = \mu_{k} + \tau_{A} + \pi + \lambda_{B}\) (technique A)
The main advantages of a crossover design are:
 1. The effect of individual differences related to innate ability is removed (i.e., systematic between-participant variation is removed). Thus, the effect of different techniques is assessed in terms of the individual improvement for each participant.
 2. The removal of between-participant variation allows tests of significance to be based on smaller variances (i.e., significance tests are based on the within-participant variation).
 3. Since the variance used to test the technique difference is reduced, it is possible either to reduce sample sizes and maintain statistical power, or to maintain sample sizes and increase power.
Since sample sizes are often relatively low in Software Engineering (SE) experiments, crossover designs have the potential to be very useful. There are obviously disadvantages as well. The correct analysis of crossover data is more complicated than analysis of data from simple experiments where participants are randomly allocated to two different treatment groups^{4}.
Perhaps more importantly, although the crossover design can cope with time period effects that are consistent across all the participants, crossover designs are vulnerable to interaction effects including period by technique interaction, where the performance of participants is affected by which technique they used first. For example, if a technique involves providing additional materials to participants, it may be easier first to understand the task using less (or simpler) documentation, and then perform the subsequent task with the additional (more complex) information, rather than try to perform the first task with too much information. The crossover design is also vulnerable to participant by technique interaction where individual participants behave differently depending on which technique they used. For example, one technique might improve the performance of less able participants but have no effect on more able participants, which would reduce the repeated measures correlation. If researchers expect either of these conditions to hold, they should avoid using a crossover design.
3 Goals and Methodology
Our first goal is to present the formulas needed to estimate the effect sizes used in crossover designs. This goal is important because researchers in all empirical disciplines are increasingly being encouraged to adopt the use of effect sizes rather than just report the results of t or F tests (see APA 2010; Kampenes et al. 2007; Cumming and Finch 2001; Cumming 2012).
To address this goal, we begin by presenting a detailed discussion of the AB/BA crossover model in Section 4, from which the means and variances needed to calculate both standardized and nonstandardized effect sizes are derived.
In Section 5, we specify two different standardized effect sizes suitable for crossover designs depending on whether researchers are interested only in the personal improvement offered by a software engineering technique, or are more interested in the effect of the technique, and want an effect size comparable to that of a standard independent groups design.
Our second (but equally important) goal is to present formulas needed to calculate the variance of both nonstandardized and standardized effect sizes. This goal is important because without knowing the variance of effect sizes, it is impossible to derive their confidence intervals (CIs). Researchers are advised to report CIs (see APA 2010; Cumming and Finch 2001) because they provide a direct link to null hypothesis testing and support meta-analysis.
To obtain the variances of the two standardized effect sizes, we reviewed the literature and found one paper that proposed formulas for the standardized effect size variances (see Curtin et al. 2002). This paper proposed a formula suitable for small sample sizes and a simpler approximate formula suitable for larger sample sizes. However, we could not verify the proposed formulas. For this reason, we derived our equations from first principles based on the noncentral t distribution (Johnson and Welch 1940), as explained in Section 5.3.1. After discussions with Dr. Curtin, we have, together, agreed revised versions of his equations (see Kitchenham et al. 2017b).
Our third goal is to explain how to calculate the standardized and nonstandardized effect sizes and their variances both from the descriptive statistics that should be reported from crossover experiments and from the raw data. To address this goal, we present two examples in Section 6. This goal is important because researchers need to understand how the outputs of statistical analysis tools map to the parameters of the crossover model. Therefore, we include in Section 6.3 an explanation of how standardized effect sizes and their variances can be calculated from analyses undertaken using R (R Core Team 2016) with the linear mixed model lme4 package (Bates et al. 2015). In addition, researchers who replicate crossover studies and want to aggregate their results with previous studies may not have access to the raw data from previous studies. Therefore, they may need to estimate effect sizes and their variances from descriptive data. Furthermore, if appropriate descriptive statistics are reported in studies using a crossover design, even if the studies used an inappropriate analysis, the results could easily be reassessed, provided researchers know how the descriptive statistics map to the parameters of the crossover model.
4 The Non-Standardized Effect Sizes for Crossover Studies and their Variances
This section explains the AB/BA crossover model and how to calculate the nonstandardized effect sizes and their variances.
4.1 Non-Standardized Effect Sizes of the AB/BA Crossover Model

τ _{ A } is the effect of technique A.

τ _{ B } is the effect of technique B.

τ _{ A B } = τ _{ A } − τ _{ B } is the difference between the effect of technique A and technique B. It is the nonstandardized mean technique effect size.

τ _{ B A } is the difference between the effect of technique B and technique A where τ _{ B A } = −τ _{ A B }.

π is the period effect size which is the difference between the outcome of using a technique in the first time period and the second time period.

λ _{ A } is the period by technique interaction due to using technique B after using technique A^{5}.

λ _{ B } is the period by technique interaction due to using technique A after technique B.

λ _{ A B } = λ _{ A } − λ _{ B } = −λ _{ B A } is the mean period by technique interaction effect size.

μ _{ i } is the average outcome for participant i.

The group of participants that uses technique A first is called sequence group \(SG_{1}\); the group of participants that uses technique B first is called sequence group \(SG_{2}\).
Expected Differences and Sums for Participants in an AB/BA Crossover
Sequence group  Participant  Crossover difference  Period difference  Participant total
\(SG_{1}\)  j  \(\tau_{AB} - \pi - \lambda_{A}\)  \(\pi + \lambda_{A} - \tau_{AB}\)  \(2\mu_{j} + \pi + \tau_{A} + \tau_{B} + \lambda_{A}\)
\(SG_{2}\)  k  \(\tau_{AB} + \pi + \lambda_{B}\)  \(\tau_{AB} + \pi + \lambda_{B}\)  \(2\mu_{k} + \pi + \tau_{A} + \tau_{B} + \lambda_{B}\)
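The two tables above can be checked numerically with a minimal Python sketch. The class and function names (`CrossoverParams`, `expected_cells`) and the parameter values are hypothetical, chosen only for illustration; the sketch assumes the crossover difference is "technique A minus technique B" and the period difference is "period 2 minus period 1", as the table entries imply.

```python
from dataclasses import dataclass


@dataclass
class CrossoverParams:
    """Hypothetical container for the AB/BA model parameters."""
    tau_a: float     # effect of technique A
    tau_b: float     # effect of technique B
    pi: float        # period effect (period 2 relative to period 1)
    lambda_a: float  # interaction: using B after A
    lambda_b: float  # interaction: using A after B


def expected_cells(mu, p, group):
    """Expected differences and total for one participant with mean mu.

    Sequence group 1 uses A then B; sequence group 2 uses B then A.
    """
    if group == 1:
        y_a = mu + p.tau_a                      # period 1, technique A
        y_b = mu + p.pi + p.tau_b + p.lambda_a  # period 2, technique B
        period1, period2 = y_a, y_b
    else:
        y_b = mu + p.tau_b                      # period 1, technique B
        y_a = mu + p.tau_a + p.pi + p.lambda_b  # period 2, technique A
        period1, period2 = y_b, y_a
    return {
        "crossover_diff": y_a - y_b,       # A minus B
        "period_diff": period2 - period1,  # period 2 minus period 1
        "total": y_a + y_b,
    }


# Illustrative values only: tau_AB = 10, pi = 5, no interaction.
params = CrossoverParams(tau_a=10, tau_b=0, pi=5, lambda_a=0, lambda_b=0)
sg1 = expected_cells(50, params, group=1)
sg2 = expected_cells(50, params, group=2)
print(sg1)
print(sg2)
```

With no interaction, the crossover difference is τ_AB − π in sequence group 1 and τ_AB + π in sequence group 2, so averaging the two recovers τ_AB.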
Expected values of group means for the crossover design
Sequence group  Mean crossover difference  Mean period difference  Mean participant total
\(SG_{1}\)  \(\hat{\tau}_{AB} - \hat{\pi} - \hat{\lambda}_{A}\)  \(\hat{\pi} + \hat{\lambda}_{A} - \hat{\tau}_{AB}\)  \(2\hat{\mu} + \hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{A}\)
\(SG_{2}\)  \(\hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\)  \(\hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\)  \(2\hat{\mu} + \hat{\tau}_{AB} + \hat{\pi} + \hat{\lambda}_{B}\)
Thus, we assume that any effect caused by undertaking one technique followed by another is fully modeled by the period effect. We consider this issue further in Section 4.2.2.
Similar equations can be used to calculate the technique effect and the period effect using the mean period differences.
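As a sketch of what these estimators look like, assume there is no period by technique interaction (λ_A = λ_B = 0). Writing \(\bar{d}_{s}\) for the mean crossover difference and \(\bar{p}_{s}\) for the mean period difference in sequence group s (notation introduced here for illustration only; the paper's numbered equations may be written differently), the expected group means in the table above imply:

```latex
\begin{align*}
\hat{\tau}_{AB} &= \tfrac{1}{2}\left(\bar{d}_{SG_1} + \bar{d}_{SG_2}\right), &
\hat{\pi} &= \tfrac{1}{2}\left(\bar{d}_{SG_2} - \bar{d}_{SG_1}\right), \\
\hat{\tau}_{AB} &= \tfrac{1}{2}\left(\bar{p}_{SG_2} - \bar{p}_{SG_1}\right), &
\hat{\pi} &= \tfrac{1}{2}\left(\bar{p}_{SG_1} + \bar{p}_{SG_2}\right).
\end{align*}
```

The first row uses the crossover differences (expected values τ_AB − π and τ_AB + π); the second row uses the period differences (expected values π − τ_AB and π + τ_AB). Both routes give the same estimates.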
4.2 Non-Standardized Effect Size Variances and t-tests
In this section we explain how to calculate the variance of the nonstandardized effect sizes, and how these statistics relate to the t-test of the nonstandardized effect size. The relationship between effect sizes and t-variables is also important for estimating the variances of standardized effect sizes (see Section 5.3.1). We also discuss the problems introduced both by tests of the period by technique interaction effect, and by non-normally distributed data.
4.2.1 The technique effect size variance

β _{ s,i } which is the effect due to participant i in sequence group s where s = S G _{1} or s = S G _{2}.

ζ _{ i,s,t } which is the within participant error.
The expected value of β _{ s,i } is zero and the variance of β _{ s,i } is \({\sigma ^{2}_{b}}\). The expected value of ζ _{ i,s,t } is zero and the variance of ζ _{ i,s,t } is \({\sigma ^{2}_{w}}\). The simplest model assumes that β _{ s,i } and ζ _{ i,s,t } are independent, so their covariance is zero, although Senn points out that other models are possible. The simplest model also assumes that all ζ _{ i,s,t } are independent of each other.
It is important to note that \(s^{2}_{IG}\) should never be used as the basis for the standard error in a ttest because the repeated measures violate the assumption that all the individual values are independent.
In a simple independent groups study we are unable to separate the two components of \(\sigma ^{2}_{IG}\). In contrast, with a repeated measures design such as an AB/BA crossover we are able to estimate the separate components of variance. However, in order to estimate the variance components we need to consider the variance of the crossover difference scores, \(\sigma ^{2}_{diff}\).
Thus, the nonstandardized technique effect size for a crossover design is obtained from (18), while its variance is obtained from (31).
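The variance relationships used in this subsection can be sketched as follows, under the simplest model described above (independent participant effects with variance \(\sigma^{2}_{b}\) and independent within-participant errors with variance \(\sigma^{2}_{w}\)). Here \(y_{1}\) and \(y_{2}\) denote the two outcomes of a single participant in periods 1 and 2; this shorthand is introduced only for illustration, and the paper's numbered equations may be arranged differently. The last line additionally assumes that \(\hat{\tau}\) is the average of the two sequence-group mean crossover differences and that the two groups are independent.

```latex
\begin{align*}
\operatorname{Var}(y_{1}) = \operatorname{Var}(y_{2}) &= \sigma^{2}_{b} + \sigma^{2}_{w} = \sigma^{2}_{IG}, \\
\operatorname{Cov}(y_{1}, y_{2}) &= \sigma^{2}_{b}, \qquad
\rho = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{w}}, \\
\sigma^{2}_{diff} = \operatorname{Var}(y_{1} - y_{2}) &= 2\sigma^{2}_{w} = 2\sigma^{2}_{IG}(1 - \rho), \\
\operatorname{Var}(\hat{\tau}) &= \frac{\sigma^{2}_{diff}}{4}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)
  = \frac{\sigma^{2}_{w}\,(n_{1} + n_{2})}{2\,n_{1} n_{2}}.
\end{align*}
```

The participant effect cancels in the difference score, which is why the test of the technique effect depends only on the within-participant variance.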
4.2.2 The period by technique interaction effect
If we find a statistically significant period by technique interaction, it might seem tempting to use (16) to calculate the nonstandardized effect size by removing the estimate of \(0.5\hat {\lambda }_{AB}\) from the mean difference. This appears to be mathematically sound, but it is not statistically sound. The reason is that the variance of \( 0.5\hat {\lambda }_{AB}\) is \( 0.25s^{2}_{sum}\approx s^{2}_{IG}\). If the crossover design is used in order to reduce the variance for statistical tests, adjusting the estimate of τ by half the estimate of λ _{ A B } reintroduces the between participant variance into any statistical tests of the revised estimate. This negates any possible benefit of a crossover design compared with a standard between groups design.
The practical implication of these considerations is that a crossover design should not be used if a significant period by technique interaction is anticipated. Furthermore, if a period by technique interaction is not expected, there is no point testing for one^{11}. Thus, we do not include the period by technique interaction term (which corresponds to the sequence order) in our data analyses. However, as Vegas et al. point out, the possibility of an interaction remains a threat to the validity of the experiment. We return to the issue of what can be done to address the interaction problem in Section 7.
4.2.3 Handling nonstable variances and nonnormal data
Equation (24) for \(\sigma ^{2}_{IG}\) assumes that the within-subjects (participants) variance and the between-subjects (participants) variance are independent and not affected by the different techniques. This is not necessarily the case. For example, a new technique might improve the capability of less able participants, thus reducing the difference among individuals. Alternatively, a new technique might be more difficult to apply than other techniques and might improve the performance of the most able participants and reduce the performance of the less able participants, making the difference between individuals greater.

The least useful option is to base the estimate of \(s^{2}_{IG}\) solely on the n _{1} participants in the first period control condition. This is not really useful because for crossover designs n _{1} is likely to be relatively small, so the estimate is likely to be inaccurate.

Estimate \(s^{2}_{IG}\) and \(s^{2}_{diff}\) allowing the within cells variance to be different. However, the implications of this approach, such as the relationship between \(s^{2}_{diff}\), \(s^{2}_{IG}\) and \(\hat {\rho }\) are not clear.

Use a robust, rank-based analysis. This is the most straightforward option and also protects against non-normal data, such as skewed data and/or data with outliers.
A robust analysis compares the period differences. This is because the expected values of the period differences are π − τ for sequence group 1 and π + τ for sequence group 2. Thus any significant difference between the period differences in the two sequence groups is due to a significant technique effect.
Thus, using a rank-based analysis, if the rank sum of the period differences in sequence group 1 is significantly different from the rank sum of the period differences in sequence group 2, you can reject the hypothesis that the technique effect size is zero (see Senn 2002, Section 4.3.9). However, if you use the Wilcoxon-Mann-Whitney test it is essential to use the exact test, which is the default in R for small samples without ties. The probability of superiority^{12} can be used as a nonparametric effect size constructed from the Mann-Whitney U statistic (see Wilcox 2012; Kitchenham et al. 2017a).
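The rank-based comparison of period differences can be sketched in Python as follows. This is not the R code used for the paper: it is a minimal illustration that computes the Mann-Whitney U statistic, the probability of superiority (taken here as U divided by the number of pairs), and an exact two-sided p-value by enumerating every possible split of the pooled values. The function names and the period-difference values are hypothetical, and the enumeration is only practical for small samples without ties.

```python
from itertools import combinations


def u_statistic(x, y):
    """Number of (x_i, y_j) pairs with x_i > y_j; ties count one half."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def exact_rank_test(x, y):
    """Exact two-sided Wilcoxon-Mann-Whitney test by full enumeration.

    Returns (U, probability of superiority, p-value). The number of
    splits grows as C(n1 + n2, n1), so keep the samples small.
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    centre = n1 * n2 / 2           # expected U under the null hypothesis
    u_obs = u_statistic(x, y)
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n1 + n2) if i not in chosen]
        total += 1
        # Count splits at least as far from the null centre as observed.
        if abs(u_statistic(gx, gy) - centre) >= abs(u_obs - centre):
            extreme += 1
    return u_obs, u_obs / (n1 * n2), extreme / total


# Hypothetical period differences (period 2 minus period 1) per group.
sg1_period_diff = [-6.0, -2.5, -8.0, 1.5, -4.0]
sg2_period_diff = [12.0, 16.5, 9.0, 14.0, 18.5]
u, prob_superiority, p_value = exact_rank_test(sg2_period_diff, sg1_period_diff)
print(u, prob_superiority, p_value)
```

A probability of superiority near 0.5 indicates no technique effect; values near 0 or 1 indicate that the period differences in one sequence group are almost always larger than those in the other.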
5 Standardized Effect Sizes for Crossover Studies and their Variances
In this section we discuss standardized effect sizes that can be calculated for crossover designs and their variances.
5.1 Formulas for the Standardized Effect Sizes
For purposes of meta-analysis, it is important that standardized effect sizes from crossover designs are comparable with effect sizes obtained from other designs.
5.2 Choosing the Appropriate Standardized Effect Size
In the past, researchers have proposed standardizing repeated measures studies using the independent groups variance (see Becker 1988; Dunlap et al. 1996; Borenstein et al. 2009). The reason for this is to make the results of repeated measures studies comparable with the results of independent group studies. This is particularly important in the context of meta-analysis.
Repeated measures designs are intended to remove the potentially large variation between participants and test the difference between techniques based on the potentially much smaller within-subject (participant) variation. However, this means that independent groups experiments standardized by \(s^{2}_{IG}\) would have a smaller effect size than the repeated measures experiments standardized by \({s^{2}_{w}}\) even if the nonstandardized mean differences were the same.
Morris and DeShon (2002), however, make the point that the choice of effect size should depend on the goal of the meta-analysis. If the goal is to assess the likely improvement in individual performance then δ _{ R M } is appropriate. If the goal is to assess the difference between techniques then δ _{ I G } is likely to be more appropriate. Nonetheless, whichever goal a meta-analyst has, it should be clearly stated and the method for calculating the appropriate variance explained. The need for both effect sizes is also supported by Lakens (2013).
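The difference between the two effect sizes can be illustrated with a short Python sketch. It assumes, following the description in the abstract, that the repeated measures effect size divides the mean difference by the within-participant standard deviation and the independent groups effect size divides it by the pooled within-groups standard deviation. The function name and the input values are hypothetical, and no small-sample bias correction is applied.

```python
import math


def standardized_effect_sizes(tau_hat, s2_w, s2_ig):
    """Return (d_RM, d_IG) for a given mean difference.

    Assumed definitions: d_RM = tau_hat / s_w and d_IG = tau_hat / s_IG.
    """
    d_rm = tau_hat / math.sqrt(s2_w)
    d_ig = tau_hat / math.sqrt(s2_ig)
    return d_rm, d_ig


# Illustrative values: s2_w = s2_ig * (1 - rho) with rho = 0.75.
d_rm, d_ig = standardized_effect_sizes(tau_hat=10.0, s2_w=6.25, s2_ig=25.0)
print(d_rm, d_ig)
```

Because the within-participant variance equals the independent groups variance multiplied by (1 − ρ), the two effect sizes differ by a factor of the square root of (1 − ρ); with ρ = 0.75 the repeated measures effect size is twice the independent groups one.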
It should be noted that none of the above sources discuss effect sizes in the context of AB/BA crossover designs. Dunlap et al. (1996), Becker (1988) and Lakens (2013) were concerned solely with within-subjects before-after experiments. Morris and DeShon (2002) discuss effect sizes of independent groups and two repeated measures designs: the before-after design and the independent groups before-after design, which measures all participants using the same technique prior to splitting the participants into two groups and performing an independent groups experiment.
5.3 Standardized Effect Size Variances
For standardized effect sizes to be useful we need to calculate their variances. However, with the exception of Kitchenham and Madeyski (2016), we are not aware of any software engineering studies that have identified the need to estimate the variance of standardized mean difference effect sizes. In this section, we provide formulas to estimate the variance of δ _{ I G } and δ _{ R M } for small, moderate and large sample sizes.
5.3.1 The basic principle
It is important to appreciate that the value of the constant A defined in (46) depends on the study design type. For crossover designs \(A=\sqrt {\frac {2n_{1}n_{2}}{(n_{1}+n_{2})}}\). However, for repeated measures before-after designs \(A=\sqrt {n}\), while for independent groups designs \(A=\sqrt {\frac {n_{1}n_{2}}{(n_{1}+n_{2})}}\). Thus, the construction of mean difference effect sizes and their variances depends on the specific design type.
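The three expressions for the constant A can be collected in a small helper, sketched below in Python. The function name `design_constant` and the design labels are hypothetical; only the three formulas themselves come from the text.

```python
import math


def design_constant(design, n1, n2=None):
    """Constant A for a given design type.

    design: "crossover", "before_after" or "independent_groups".
    For "before_after", n1 is the number of participants and n2 is unused.
    """
    if design == "crossover":
        return math.sqrt(2 * n1 * n2 / (n1 + n2))
    if design == "before_after":
        return math.sqrt(n1)
    if design == "independent_groups":
        return math.sqrt(n1 * n2 / (n1 + n2))
    raise ValueError(f"unknown design type: {design!r}")


# Two groups of six participants, or twelve before-after participants.
a_crossover = design_constant("crossover", 6, 6)
a_independent = design_constant("independent_groups", 6, 6)
a_before_after = design_constant("before_after", 12)
print(a_crossover, a_independent, a_before_after)
```

For equal group sizes, the crossover constant is larger than the independent groups constant by a factor of the square root of two, reflecting that each participant contributes two observations.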
5.3.2 Formulas to estimate the medium sample size variance of standardized effect sizes
5.3.3 The approximate variance for large sample sizes
6 Calculating Effect Sizes and their Variances
In this section, we present two small examples illustrating how to calculate crossover study effect sizes and their variances. One example is based on real software engineering data to illustrate the complexity of software engineering data. The other is based on simulated data to illustrate how the AB/BA crossover model is intended to work given that all the basic assumptions underlying the model are true.
It is useful to know how to calculate effect sizes (both nonstandardized and standardized) and their variances both from descriptive data, as well as by using statistical packages to analyze the raw data. If authors report appropriate descriptive statistics, then other researchers (including reviewers and metaanalysts) can, without access to the raw data, construct the model parameters, effect sizes and their variances from the results reported. Therefore, we identify the descriptive statistics that are necessary to calculate the various crossover model parameters and effect sizes. In addition, we show how to analyze raw crossover data using the R language and the lme4 package, and explain how to extract the crossover model parameters from the outcomes of the R package.
We also demonstrate two graphical methods of presenting the results of crossover studies. We suggest that they provide a more accurate graphical representation than box plots of the technique outcomes. In particular, they provide visual indications both of the outcomes of the experiment, and of the extent to which the data conform to the crossover model.
This section, thus, provides some advice to authors about how to report the outcomes of their studies that should make their studies more useful to their readers. It also provides two worked examples that novice researchers can try out to help them better understand the crossover design.
6.1 Example 1: Scanniello’s Data
Scanniello crossover data (labeled as EUBASwideSelected in Output 1)
ID  Comp_Level.AM  Comp_Level.SC  Comp_Diff  Comp_Sum  SequenceGroup 

P3  0.82  0.77  0.05  1.59  SG1 
P4  0.60  0.70  − 0.10  1.30  SG2 
P7  0.80  0.93  − 0.13  1.73  SG1 
P8  0.93  0.90  0.03  1.83  SG2 
P11  0.70  0.83  − 0.13  1.53  SG1 
P12  0.90  0.96  − 0.06  1.86  SG2 
P15  0.67  0.83  − 0.16  1.50  SG1 
P16  0.77  0.66  0.11  1.43  SG2 
P19  0.80  0.70  0.10  1.50  SG1 
P20  1.00  0.85  0.15  1.85  SG2 
P23  0.76  0.57  0.19  1.33  SG1 
P24  0.87  0.66  0.21  1.53  SG2 
The study investigated the impact of UML analysis models on source code comprehensibility (measured with the Comp_Level metric) and modifiability. The two techniques being compared are AM (analysis model plus source code) and SC (source code only). The techniques were trialed on two software systems: S1 (a system to sell and manage CDs/DVDs in a music shop) and S2 (a software system to book and buy theater tickets). One feature from each system was used as the object of study. The data relate to two groups in the dataset from the EUBAS experiment, which itself was one of a family of four experiments. We chose that experiment rather than one of the others because, when we analyzed the EUBAS data, we found a nonzero repeated measures correlation, which is an important prerequisite for a crossover design to be of any value in decreasing the variance. It was also the experiment with the largest number of participants.
The full EUBAS experiment was a four-group crossover with Group 1 and Group 2 comprising one AB/BA crossover and Group 3 and Group 4 comprising another. The difference was that Group 1 and Group 2 used S1 and then S2, while Group 3 and Group 4 used S2 and then S1 (see Scanniello et al. 2014, Table II). We used the data from participants in Group 3 and Group 4 as an example of an AB/BA crossover, since we found an anomaly in the reported data for Group 2^{15}. We selected only a subset of Scanniello’s data because we wanted to explain the AB/BA crossover rather than discuss the more complicated four-group crossover, which can be analyzed as a pair of AB/BA crossovers. Thus, the small balanced dataset provides an example of how the two-group crossover experimental results can be reported and how the relevant analysis statistics are calculated.
Output 1
Panel (a) of Fig. 1 shows a box plot of the crossover difference for each sequence group. The box plots show that the median value of the differences is below zero for individuals in sequence group 1 and above zero for individuals in sequence group 2. However, a large part of each box spans zero, suggesting that there is no significant technique effect. Panel (b) shows the outcomes for each individual for each technique. Seven of the individuals performed better using AM compared with five who performed better using SC. Again, this gives no indication of any major difference between the techniques. An important issue to note is that participants who used AM before SC did not show the expected association between participant outcomes, i.e., participants who performed well using AM did not seem to perform well using SC and vice versa. In contrast, participants who performed well using SC first generally performed well when subsequently using AM. The lack of a correlation between individual participant outcomes in the S G _{1} group means that overall the correlation between participants may be quite low. The graphical display in panel (b) is useful for small data sets since the results of box plots based on very few observations may be misleading, but for larger data sets, the box plots in panel (a) are usually more helpful.
Descriptive statistics for the Scanniello data
Sequence group  Statistic  AM  SC  CODiff  Participant total
\(SG_{1}\)  Mean  0.7583  0.7717  − 0.0133  1.53
\(SG_{1}\)  Variance  0.0037  0.0155  0.0214  0.0171
\(SG_{1}\)  Num Obs  6  6  6  6
\(SG_{2}\)  Mean  0.845  0.7883  0.0567  1.6333
\(SG_{2}\)  Variance  0.0201  0.0173  0.0148  0.06
\(SG_{2}\)  Num Obs  6  6  6  6
Statistics calculated from Scanniello’s data
Statistic  Equation number  Value
\(\hat {\tau }\)  18  0.0217
\(\hat {\pi }\)  20  0.035
\(\hat {\lambda }_{AB}\)  23  − 0.1033
\(s^{2}_{IG}\)  25  0.0142
\(s^{2}_{diff}\)  26  0.0181
\({s^{2}_{w}}\)  27  0.009
\(\hat {\rho }\)  28  0.3613
\( var(\hat {\tau })\)  31  0.001508
\(se_{\hat {\tau }}\)  32  0.03884
t  33  0.5581
Table 6 reports that the t value for testing the significance of \(\hat {\tau }\) is 0.5581 which, at an alpha level of 0.05, is not significantly different from zero. This outcome is consistent with the inferences we drew from panel (a) in Fig. 1. The estimate of ρ is 0.3613. Thus, the correlation between repeated measures in the EUBAS data set is rather low compared with that reported by Dunlap et al. (1996) for test-retest measurements. The low correlation was indicated by the lack of correlation between individual values for participants in S G _{1} visible in panel (b) of Fig. 1. The correlation between an individual’s outcomes indicates the extent to which the crossover design has decreased the variance compared with a standard independent groups design. In extreme cases, if \(s^2_{IG} \approx s^2_w\), then ρ is assumed to be equal to zero and the crossover design has not reduced the variance at all.
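The values in Table 6 can be recomputed from the raw scores in the Scanniello data table with the Python sketch below. It is not the R code used for the paper. The formulas are an assumed reconstruction based on the crossover model (mean and half-difference of the group mean crossover differences, pooled variances, and \(s^{2}_{w} = s^{2}_{diff}/2\)); they are not copied from the paper's numbered equations, and all function and variable names are hypothetical.

```python
import math
from statistics import mean, variance

# (AM, SC) Comp_Level scores copied from the Scanniello data table.
SG1 = [(0.82, 0.77), (0.80, 0.93), (0.70, 0.83),
       (0.67, 0.83), (0.80, 0.70), (0.76, 0.57)]
SG2 = [(0.60, 0.70), (0.93, 0.90), (0.90, 0.96),
       (0.77, 0.66), (1.00, 0.85), (0.87, 0.66)]


def pooled_variance(samples):
    """Pool sample variances, weighting each by its degrees of freedom."""
    df = sum(len(s) - 1 for s in samples)
    return sum((len(s) - 1) * variance(s) for s in samples) / df


def crossover_statistics(sg1, sg2):
    """Assumed reconstruction of the statistics listed in Table 6."""
    n1, n2 = len(sg1), len(sg2)
    d1 = [a - b for a, b in sg1]    # crossover differences, AM minus SC
    d2 = [a - b for a, b in sg2]
    tot1 = [a + b for a, b in sg1]  # participant totals
    tot2 = [a + b for a, b in sg2]
    cells = [[a for a, _ in sg1], [b for _, b in sg1],
             [a for a, _ in sg2], [b for _, b in sg2]]

    tau = (mean(d1) + mean(d2)) / 2      # technique effect
    pi = (mean(d2) - mean(d1)) / 2       # period effect
    lam = mean(tot1) - mean(tot2)        # period by technique interaction
    s2_ig = pooled_variance(cells)       # pooled within-cell variance
    s2_diff = pooled_variance([d1, d2])  # variance of difference scores
    s2_w = s2_diff / 2                   # within-participant variance
    rho = 1 - s2_w / s2_ig               # repeated measures correlation
    var_tau = s2_diff * (n1 + n2) / (4 * n1 * n2)
    se_tau = math.sqrt(var_tau)
    return {"tau": tau, "pi": pi, "lambda_ab": lam, "s2_ig": s2_ig,
            "s2_diff": s2_diff, "s2_w": s2_w, "rho": rho,
            "var_tau": var_tau, "se_tau": se_tau, "t": tau / se_tau}


stats = crossover_statistics(SG1, SG2)
for name, value in stats.items():
    print(f"{name:>10}: {value:.4f}")
```

Under these assumptions the computed values agree with those reported in Table 6 up to rounding, which supports the reading that the low t value is driven by a small mean difference rather than by a large standard error.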
6.2 Simulated Data Example
It is often helpful to use simulated data to understand the behavior of statistical tests and graphical representations. It allows us to check the accuracy of model parameter estimates against known values. It can also be used to check how sample size affects the accuracy of estimates or how violations of model assumptions affect analysis results. Examples of the use of simulation in software engineering include Shepperd and Kadoda (2001), who used simulation to compare prediction techniques, Dieste et al. (2011), who investigated the use of the Q heterogeneity estimator for meta-analysis, and Foss et al. (2003), who investigated the properties of the MMRE statistic.
In this section we present a simulation study to illustrate the relationships between the graphical representations and descriptive statistics, in ideal circumstances (i.e., equal numbers of participants in each sequence group, stable variances, a large between participant correlation, no significant period by treatment interaction, and normal distributions). This dataset will also be used to allow the comparison of model parameter estimates with the known values of those parameters.

There are 15 participants in each sequence group.

The average outcome across different participants is μ = 50. We note that many of the papers used effectiveness measures based on a scale from 0 to 1 based on the proportion of questions answered correctly (see, for example, Scanniello et al. 2014; Abrahao et al. 2013). We chose a value of 50 which is equivalent to 50% of correct answers rather than a value between 0 and 1, so the effects would be clearer in the analysis.

Users of technique 1 achieve an average of 10 units more than users of technique 2, that is τ = 10. For a metric scale based on the number of correct answers to 10 questions, this would be equivalent to increasing the number of correct answers by one.

Users achieve an average of 5 units more in period 2 than in period 1, that is, π = 5.

There is no period by technique interaction effect built into the simulation (i.e., \(\lambda_{AB} = 0\)).

The variance among participants using a specific technique in a specific time period is \(\sigma^{2} = 25\). This means the variance is unaffected by period or technique.

The correlation between outcomes for an individual participant is ρ = 0.75. We chose the value 0.75 because Dunlap et al. (1996) reported that such values are to be expected for test-retest reliabilities of psychometrically sound measures. In the software engineering literature, Laitenberger et al. (2001) reported values of r varying from 0.78 to − 0.02^{16} for the correlation between outcomes from teams. However, it would be reasonable to expect correlations based on individuals to be greater than those based on teams.
We simulated data from two different bivariate normal distributions, corresponding to the two different sequence groups. The first set of simulated data, corresponding to sequence group \(SG_1\), came from a bivariate distribution with means \(\mu_1 = 60\), corresponding to the simulated participants using technique 1 in time period 1, and \(\mu_2 = 55\), corresponding to the simulated participants using technique 2 in time period 2. The covariance matrix was symmetric, with the variance of the simulated participants using a specific technique in a specific time period being set to 25 and the covariance being set to 25 × ρ = 18.75. Observations from sequence group \(SG_2\) were simulated from a bivariate normal distribution with the same variance-covariance matrix and means \(\mu_1 = 50\), corresponding to the simulated participants using technique 2 in time period 1, and \(\mu_2 = 65\), corresponding to the simulated participants using technique 1 in time period 2. After allowing for the common mean effect of 50, the simulated data come from a population where the effect of technique 1 is 10 units and the effect of technique 2 is zero units.
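The simulation set-up described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to produce the data reported here: the function name `simulate_sequence_group`, the seed, and the use of a hand-written Cholesky step are all assumptions made for the example.

```python
import math
import random


def simulate_sequence_group(n, mean_p1, mean_p2, sigma2, rho, rng):
    """Draw n (period 1, period 2) outcome pairs from a bivariate normal
    distribution with common variance sigma2 and correlation rho."""
    sd = math.sqrt(sigma2)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Cholesky factor of [[1, rho], [rho, 1]] turns two independent
        # standard normals into a correlated pair.
        y1 = mean_p1 + sd * z1
        y2 = mean_p2 + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((y1, y2))
    return pairs


# Parameter values taken from the description of the simulation.
MU, TAU, PI, SIGMA2, RHO, N = 50.0, 10.0, 5.0, 25.0, 0.75, 15
COVARIANCE = RHO * SIGMA2  # off-diagonal element of the covariance matrix

rng = random.Random(2016)  # arbitrary seed, chosen only for repeatability
# SG1: technique 1 in period 1 (mean 60), technique 2 in period 2 (mean 55).
sg1 = simulate_sequence_group(N, MU + TAU, MU + PI, SIGMA2, RHO, rng)
# SG2: technique 2 in period 1 (mean 50), technique 1 in period 2 (mean 65).
sg2 = simulate_sequence_group(N, MU, MU + TAU + PI, SIGMA2, RHO, rng)
```

Because the seed and random number generator differ from those used for the reported data set, this sketch will not reproduce the specific values shown in the tables of this section.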
Output 2
The first thing to notice is that, even with 15 data points in each sequence group, the box plots deviate from what we would expect from a normal distribution (i.e., the median for each sequence group is not in the center of the box). Looking at the box plots, we see that the difference between the medians of the boxes is approximately (13 − 4) = 9 units. In general, since the average of the difference values for participants in sequence \(SG_1\) corresponds to \(\hat {\tau } - \hat {\pi }\) and the average of the difference values for participants in sequence \(SG_2\) corresponds to \(\hat {\tau } + \hat {\pi }\), the difference between the medians will be approximately twice the period effect, which for our simulation was 5. Also, the sum of the medians (13 + 4) = 17 will be approximately equal to twice the technique effect, which for our simulation was 10.
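As a rough worked check, and assuming the medians read from the box plots (about 13 and about 4) can stand in for the mean crossover differences of the two sequence groups, halving the sum and the difference gives approximate technique and period effects:

```latex
\[
\hat{\tau} \approx \frac{13 + 4}{2} = 8.5,
\qquad
\hat{\pi} \approx \frac{13 - 4}{2} = 4.5
\]
```

Both values are close to the simulation parameters τ = 10 and π = 5, although medians of 15 observations are only a coarse guide.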
Looking at the raw data for each individual, we see that the simulated participants in sequence group \(SG_2\), who used technique 2 first and subsequently used technique 1, show a strong difference between their outcomes. This is because the impact of using technique 1 is increased by the period effect. The simulated participants in sequence group \(SG_1\), who used technique 1 first, showed a less clear advantage when they used technique 1 compared with their results using technique 2. The individual outcome using technique 1 (in period 1) was greater than the outcome using technique 2 (in period 2) for 13 of the 15 simulated participants, but the differences were quite small. This is because, in the second time period, individual results were increased by the positive impact of the period effect^{17}.
Table 7 Descriptive statistics for the simulated data crossover design
Sequence Group  Statistic  Technique 1  Technique 2  CODiff  Participant Total
\(SG_1\)  Mean  61.3772  57.8119  3.5653  119.1891
\(SG_1\)  Variance  12.4561  11.7601  7.7316  40.7007
\(SG_1\)  Num Obs  15  15  15  15
\(SG_2\)  Mean  65.2768  51.1486  14.1282  116.4254
\(SG_2\)  Variance  12.2649  26.4595  14.9214  62.5274
\(SG_2\)  Num Obs  15  15  15  15
These statistics can be used to calculate estimates of the model parameters. For example, in this case \(\hat {\tau }=\frac {3.5653 + 14.1282}{2}= 8.84675\), which is a reasonable estimate of τ = 10 and confirms that the relatively large period effect has been removed.
Table 8 Parameter and variance estimates for the simulated data
Parameter  Sample Estimate  Theoretical Value  Percent Relative Error
\(\hat {\tau }\)  8.8467  10  11.5325
\(\hat {\pi }\)  5.2814  5  − 5.6284
\(s^{2}_{IG}\)  15.7351  25  37.0595
\(s^{2}_{diff}\)  11.3265  12.5  9.388
\({s^{2}_{w}}\)  5.6632  6.25  9.388
\(\hat {\rho }\)  0.6401  0.75  14.6548
\( var(\hat {\tau })\)  0.3775  0.4167  9.4072
\(se_{\hat {\tau }}\)  0.6145  0.6455  3.1046
t  14.3978  15.4919  7.5992
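The sample estimates above can be recovered from the descriptive statistics of the two sequence groups. The following Python sketch shows one way to do this. The dictionary keys and the function name `crossover_estimates` are hypothetical, and the pooling formulas are written as we understand them from the model; they are offered as an illustration that reproduces the tabulated sample estimates rather than as a definitive implementation.

```python
import math

# Descriptive statistics for the simulated data, one record per sequence group.
sg1 = {"n": 15, "var_t1": 12.4561, "var_t2": 11.7601,
       "mean_diff": 3.5653, "var_diff": 7.7316}
sg2 = {"n": 15, "var_t1": 12.2649, "var_t2": 26.4595,
       "mean_diff": 14.1282, "var_diff": 14.9214}


def crossover_estimates(g1, g2):
    """Estimate AB/BA crossover parameters from per-group summary statistics."""
    n1, n2 = g1["n"], g2["n"]
    df = n1 + n2 - 2
    # Technique and period effects from the mean crossover differences.
    tau = (g1["mean_diff"] + g2["mean_diff"]) / 2
    pi = (g2["mean_diff"] - g1["mean_diff"]) / 2
    # Pooled variance of the crossover differences and within-participant variance.
    s2_diff = ((n1 - 1) * g1["var_diff"] + (n2 - 1) * g2["var_diff"]) / df
    s2_w = s2_diff / 2
    # Pooled variance over the four technique-by-period cells.
    s2_ig = ((n1 - 1) * (g1["var_t1"] + g1["var_t2"])
             + (n2 - 1) * (g2["var_t1"] + g2["var_t2"])) / (2 * df)
    rho = 1 - s2_w / s2_ig
    var_tau = (s2_diff / 4) * (1 / n1 + 1 / n2)
    se_tau = math.sqrt(var_tau)
    return {"tau": tau, "pi": pi, "s2_ig": s2_ig, "s2_diff": s2_diff,
            "s2_w": s2_w, "rho": rho, "var_tau": var_tau,
            "se_tau": se_tau, "t": tau / se_tau}


est = crossover_estimates(sg1, sg2)
for name, value in est.items():
    print(f"{name}: {value:.4f}")
```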
6.3 Using R to Calculate Non-Standardized Effect Sizes and their Variances
Vegas et al. (2016) proposed the use of linear mixed models to analyze crossover data. They did not recommend a specific statistical package, but in this study we used the R linear modeling package lme4, which can correctly analyze crossover designs with unequal sample sizes. This package assumes that the data are in the long format, i.e., there are two rows for each participant, each identifying the participant, period, treatment, and result.^{19}
6.3.1 Analyzing Scanniello’s data
Table 9 Example of the long data format using the Scanniello data
ID  TimePeriod  Technique  Comp_Level
P3  R1  AM  0.82
P3  R2  SC  0.77
P4  R1  SC  0.70
P4  R2  AM  0.60
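To illustrate the relationship between the wide and long formats, the following Python sketch converts two wide-format records into the four long-format rows shown above. The wide-format field names (`Sequence`, `AM`, `SC`) and the function `wide_to_long` are hypothetical, chosen only for this example; the analysis itself was carried out in R.

```python
def wide_to_long(wide_rows):
    """Turn one wide record per participant into two long records,
    one for each time period."""
    long_rows = []
    for row in wide_rows:
        first, second = row["Sequence"]  # techniques in the order they were used
        for period, technique in (("R1", first), ("R2", second)):
            long_rows.append({
                "ID": row["ID"],
                "TimePeriod": period,
                "Technique": technique,
                "Comp_Level": row[technique],
            })
    return long_rows


# Hypothetical wide-format records matching the two participants shown above.
wide = [
    {"ID": "P3", "Sequence": ("AM", "SC"), "AM": 0.82, "SC": 0.77},
    {"ID": "P4", "Sequence": ("SC", "AM"), "AM": 0.60, "SC": 0.70},
]

long_rows = wide_to_long(wide)
for record in long_rows:
    print(record["ID"], record["TimePeriod"], record["Technique"], record["Comp_Level"])
```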
The results of using the lme4 package to analyze the Scanniello data are shown in Output 3.^{21} ID is treated as a random effects term, whereas Time Period and Technique are treated as fixed effects terms. Unlike Vegas et al., who include a Sequence effect (which Equation (23) confirms is a means of testing the period by treatment interaction) as well as Time Period and Technique effects, we adopted Senn’s approach, as discussed in Section 4.2.2, and did not include a parameter related to the period by treatment interaction term (\(\lambda_{AB}\)).
Output 3
The effect size related to TechniqueSC is − 0.0217 and the effect size related to TimePeriodR2 is 0.035. The non-standardized effect size variance is the square of the technique effect standard error (i.e., 0.0388^{2} = 0.001505). The value for the period effect is the same as we found in our manual analysis, but the value of the technique effect is minus the value we found in our analysis. This is because the package calculated \( \hat {\tau }_{BA}\) rather than \( \hat {\tau }_{AB}\). We treated AM as the experimental effect and associated it with the sequence \(SG_1\) as defined in Table 1. However, the lme4::lmer function in R treats the labels given to different categorical variables as arbitrary, and uses the category corresponding to the larger alphanumeric label as the one for which it will calculate the effect size^{22}. Since SC is greater than AM alphabetically, it calculates the effect size for SC − AM and labels the effect size TechniqueSC.
The variance term associated with the Residual is \({s^{2}_{w}}\), and the variance term associated with ID is \({s^{2}_{b}}\) giving \(s^{2}_{IG}={s^{2}_{w}}+{s^{2}_{b}}= 0.014012\) compared with our manual estimate of 0.01416. Also, \(\hat {\rho }=\frac {{s^{2}_{b}}}{s^{2}_{IG}}= 0.3546\) compared with our manual estimate of 0.36135. Minor differences between estimates of the variances and the correlation are to be expected when comparing a mixed effects analysis based on maximum likelihood estimation with a manual analysis.
6.3.2 Analyzing the simulated data
The results of the analysis of the simulated data set are shown in Output 4 and can be compared with the values shown in Table 8. The estimates of the period effect sizes are the same but, again, the estimate of the technique effect is negative. This occurs for the same reason that the sign of the technique effect changed for Scanniello’s data. From the variances reported in Output 4, \(s^{2}_{IG}= 10.12 + 5.66 = 15.78\), which compares with the manual estimate of 15.7351, and \(\hat {\rho }= 10.12/15.78 = 0.6413\), which compares with the manual estimate of 0.6401.
Output 4
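The arithmetic that turns the two variance components reported by a mixed model into \(s^{2}_{IG}\) and \(\hat{\rho}\) can be sketched as follows. The function name `variance_components` is hypothetical; the inputs are the between-participant (ID) and residual variances quoted in the text for the simulated data.

```python
def variance_components(s2_between, s2_within):
    """Combine the between-participant and residual variances from a
    mixed model into the independent-groups variance and the
    repeated measures correlation."""
    s2_ig = s2_between + s2_within
    rho = s2_between / s2_ig
    return s2_ig, rho


# Variance components quoted in the text for the simulated data.
s2_ig, rho = variance_components(s2_between=10.12, s2_within=5.66)
print(f"s2_IG = {s2_ig:.2f}, rho = {rho:.4f}")
```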
6.4 Calculating Standardized Effect Sizes and their Variances
Table 10 Example standardized effect sizes
Effect Size  Scanniello data Estimate (SC − AM)  Simulation data (T2 − T1)  Theoretical Value  Percent Relative Error
\(d_{RM}\)  − 0.2278  − 3.7175  − 4  7.0625
\(d_{IG}\)  − 0.183  − 2.2268  − 2  − 11.3384
Table 11 Example standardized effect size adjustment
Effect Size  Adjustment Scanniello Data c(10)  Scanniello data Revised Estimate  Adjustment Sim Data c(28)  Simulation Revised Estimate
\(g_{RM}\)  0.9231  − 0.2103  0.973  − 3.617
\(g_{IG}\)  0.9231  − 0.169  0.973  − 2.1666
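The standardized effect sizes and their small sample adjustments can be sketched in Python as shown below. The sketch assumes the commonly used approximation c(df) = 1 − 3/(4df − 1) for the adjustment factor, which reproduces the tabulated values of c(10) and c(28). The function names are hypothetical, and for \(s^{2}_{IG}\) we use the mixed-model estimate 15.78 from Section 6.3.2, which appears to be the value behind the tabulated \(d_{IG}\).

```python
import math


def hedges_correction(df):
    """Approximate small sample adjustment factor c(df) (an assumption
    of this sketch)."""
    return 1 - 3 / (4 * df - 1)


def standardized_effect_sizes(tau, s2_w, s2_ig, df):
    """Return unadjusted (d) and adjusted (g) standardized effect sizes
    for the repeated measures (RM) and independent groups (IG) cases."""
    d_rm = tau / math.sqrt(s2_w)
    d_ig = tau / math.sqrt(s2_ig)
    c = hedges_correction(df)
    return {"d_RM": d_rm, "d_IG": d_ig, "c": c,
            "g_RM": c * d_rm, "g_IG": c * d_ig}


# Simulated data: technique effect as reported by the mixed model (T2 - T1).
sim = standardized_effect_sizes(tau=-8.8467, s2_w=5.6632, s2_ig=15.78, df=28)
for name, value in sim.items():
    print(f"{name}: {value:.4f}")
```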
Table 12 Standardized effect size variances and their approximations
Statistic  Equation number  Scanniello data  Simulation data
\(var(d_{RM})\)  49  0.2117  0.3412
\(var(d_{IG})\)  51  0.1366  0.1224
\(var(g_{RM})\)  50  0.1804  0.3231
\(var(g_{IG})\)  52  0.1164  0.1159
\(var(d_{RM})_{Approx}\)  53  0.1699  0.3318
PRE of \(var(d_{RM})_{Approx}\)  57  19.7557%  2.763%
\(var(d_{IG})_{Approx}\)  55  0.1096  0.1191
PRE of \(var(d_{IG})_{Approx}\)  57  19.7557%  2.763%
\(var(g_{RM})_{Approx}\)  54  0.1689  0.3003
PRE of \(var(g_{RM})_{Approx}\)  57  6.3835%  7.0461%
\(var(g_{IG})_{Approx}\)  56  0.109  0.1077
PRE of \(var(g_{IG})_{Approx}\)  57  6.3835%  7.0461%
The percentage relative error (PRE) is the same for \(var(d_{RM})_{Approx}\) and \(var(d_{IG})_{Approx}\). This occurs because the small sample and medium sample variances of \(d_{RM}\) and \(d_{IG}\) are simply a function of the variance of t multiplied by a constant, which cancels out when the relative error is calculated. The same is true for \(var(g_{RM})_{Approx}\) and \(var(g_{IG})_{Approx}\).
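The cancellation can be made explicit with a short derivation. Here k stands for whichever constant converts t to the effect size in question (the symbol k is introduced only for this illustration), and \(var_{Approx}(t)\) denotes the medium sample approximation to the variance of t:

```latex
\[
\mathrm{PRE}
= 100 \times \frac{var(d) - var(d)_{Approx}}{var(d)}
= 100 \times \frac{k^{2}\,var(t) - k^{2}\,var_{Approx}(t)}{k^{2}\,var(t)}
= 100 \times \frac{var(t) - var_{Approx}(t)}{var(t)}
\]
```

Since the final expression does not involve k, it takes the same value whether k is the constant for \(d_{RM}\) or for \(d_{IG}\).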
7 Discussion
This paper is intended to follow up some additional issues arising from the recent paper by Vegas et al. (2016) identifying problems with the analysis of crossover experiments. Vegas et al. discussed four repeated measures designs other than the simple AB/BA crossover. However, all those designs are extensions of the AB/BA crossover, including additional sequences, additional periods, or repetition of the same techniques, so in order to understand these extensions it is important to understand the basic crossover design. We provide a discussion of the model underlying the AB/BA crossover design, so that issues connected with the construction of effect sizes and effect size variances can be properly understood.
7.1 Impact of Incorrect Analysis on Effect Sizes and their Variances
Vegas et al. (2016) reported that many researchers using crossover designs did not account for the repeated measures in their analysis. For an AB/BA crossover, analyzing the data without including a factor relating to individual participant effects would lead to an overestimate of the degrees of freedom available for statistical tests, by using \(df = 2(n_1 + n_2) - 2\) instead of \(df = n_1 + n_2 - 2\). In this section, we consider, hypothetically, what the impact on the effect sizes and their variances would be if the subset of the Scanniello data and the simulated data analyzed in Section 6 were analyzed ignoring the repeated measures and period effects.
From the viewpoint of constructing effect sizes, the variances used to standardize the technique effect would be based on the pooled within technique group data. However, in the presence of a significant period effect, this would be a biased estimate of \(\sigma ^{2}_{IG}\) because the period effect would systematically inflate the variance. This is the inverse of removing a significant blocking effect in an analysis of variance in order to reduce the residual error term. If period is a significant blocking effect, failing to remove its effect leaves an inflated variance. In contrast, ignoring the repeated measures might deflate the variance because, if there is a strong correlation between the repeated measures, there will be less variability among the observations in each technique group than if the observations in each group were completely independent.
The value of the technique effect will be correctly estimated by the difference between the means of the values in each technique group. It is only likely to be biased if the number of observations in each sequence group is unequal (i.e., \(n_1 \neq n_2\)). Thus, the effect size calculated by dividing the treatment effect by the pooled within group standard deviation will be a slightly biased estimate of \(\delta_{IG}\).
To convert to Hedges’ g, the estimate of \(\delta_{IG}\) is multiplied by c(df). If the analysis has ignored the repeated measures, df will be \(2n_1 + 2n_2 - 2\) rather than \(n_1 + n_2 - 2\), and c(df) will be slightly closer to one than it should be.
Table 13 Effect of incorrect analyses
Statistic  Scanniello’s data  Simulated data
\(s^{2}_{IG}\)  0.0142  15.7351
\(s^{2}_{IGbiased}\)  0.01393  22.9002
\(d_{IG}\)  − 0.183  − 2.2268
\(d_{IGbiased}\)  − 0.1835  − 1.849
c(df)  0.9231  0.973
c(dfWrong)  0.9655  0.9870
\(g_{IG}\)  − 0.169  − 2.1666
\(g_{IGbiased}\)  − 0.177  − 1.8247
\(var(g_{IG})\)  0.1164  0.1159
\(var(g_{IGbiased})\)  0.1709  0.09720
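The effect of using the wrong degrees of freedom and the inflated variance can be illustrated for the simulated data with a short Python sketch. As before, the adjustment factor c(df) = 1 − 3/(4df − 1) is an assumption of the sketch (it reproduces the tabulated values), and the variable names are hypothetical.

```python
import math


def hedges_correction(df):
    """Approximate small sample adjustment factor c(df) (an assumption
    of this sketch)."""
    return 1 - 3 / (4 * df - 1)


# Simulated data: 15 participants in each sequence group.
n1 = n2 = 15
df_correct = n1 + n2 - 2        # repeated measures taken into account
df_wrong = 2 * n1 + 2 * n2 - 2  # observations wrongly treated as independent

tau = -8.8467          # technique effect (T2 - T1)
s2_ig_biased = 22.9002 # pooled within technique group variance, period ignored

d_ig_biased = tau / math.sqrt(s2_ig_biased)
g_ig_biased = hedges_correction(df_wrong) * d_ig_biased

print(f"c(df) = {hedges_correction(df_correct):.4f}, "
      f"c(dfWrong) = {hedges_correction(df_wrong):.4f}")
print(f"d_IGbiased = {d_ig_biased:.4f}, g_IGbiased = {g_ig_biased:.4f}")
```

The sketch shows that, for the simulated data, the reduction in the magnitude of the effect size comes almost entirely from the inflated variance; the change in c(df) is small by comparison.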
If researchers do not analyze their crossover data as a repeated measures study, they are likely to estimate the variance of their biased estimate of \(g_{IG}\) as if the study was an independent groups study. In Table 13, we compare the correct estimate of \(var(g_{IG})\) with \(var(g_{IGbiased})\) based on the formula for the variance of an adjusted effect size estimate of an independent groups study (see Hedges and Olkin 1985). For the Scanniello data, \(abs(g_{IGbiased})\) is greater than \(abs(g_{IG})\), and the estimate of \(var(g_{IGbiased})\) is greater than \(var(g_{IG})\). For the simulated data, \(abs(g_{IGbiased})\) is less than \(abs(g_{IG})\) and \(var(g_{IGbiased})\) is less than \(var(g_{IG})\). This happens because the formula for \(var(g_{IG})\) includes the term \(g_{IG}^{2}\). So, the larger the effect size, the larger the effect size variance.
Overall, we can say that if \(\delta_{RM} \approx 0\) and ρ ≈ 0, analyzing a crossover study incorrectly is unlikely to lead to an incorrect assessment of the significance of the technique effect. Furthermore, if the effect size is very large, we are likely to find that the effect is statistically significant. That is, for very small effects and very large effects, the incorrect analysis will lead to accidentally correct assessments of significance. However, for small to medium effects it is quite possible that real effects will be considered chance effects, or chance effects considered significant. In addition, in all cases where the non-standardized effect size, or ρ, or the period effect are non-zero, any estimates of the standardized effect sizes and their variances will be unreliable.
7.2 Standardized Effect Sizes and their Variances
Our presentation of the crossover model raises several issues that have not been fully discussed in the software engineering literature. In particular, we point out that for crossover designs, there are two different standardized effect sizes that can be calculated. Furthermore, each standardized effect size has a different formula for its variance. We also point out that standardized effect sizes and their variances are different for different design types. These issues have implications for meta-analysis in software engineering where, as far as we are aware, only Madeyski (2010) has explicitly discussed the fact that experimental design type impacts the calculation of standardized effect sizes.
The results of our study have implications for reporting the descriptive data from crossover designs. Our examples in Section 6 show what sample statistics need to be reported to allow effect sizes and their variances to be easily calculated from descriptive statistics. Specifically, researchers should report the mean, sample size and standard deviation (or variance) for all four technique and period groups, as well as the crossover difference mean and standard deviation for each sequence^{23}. We also suggest graphical representations of crossover data that allow readers to easily visualize the results of the study.
7.3 Implications for Planning Experiments
The crossover model has limitations and, in particular, we have not identified any method to properly address the risk of a significant period by technique interaction biasing any analysis of crossover data. The specific effect of the bias is uncertain because the direction of the period by technique effect can be positive or negative. Assuming \(\tau_{AB}\) is positive, Equation (16) confirms that if \(\lambda_{AB}\) is positive, the estimate of the non-standardized effect size will be decreased, and if it is negative, the estimate will be increased.
In the case of software engineering techniques, it is difficult to provide convincing a priori arguments that techniques do not interact. Indeed, the subset of Scanniello’s software engineering data that we show in Fig. 1b seems to suggest an interaction term is present, since the results of participants who used AM first did not seem to be correlated while the results of participants who used SC first did seem to be correlated.
Another concern is that unless the repeated measures correlation, ρ, is relatively large, the reduction in the variance used for statistical testing will be relatively small. Equation (30) confirms that the percentage reduction in variance is \( 100 \times (\sigma ^{2}_{IG} - {\sigma ^{2}_{w}})/\sigma ^{2}_{IG}= 100 \rho \). Thus, for Scanniello’s data, the percentage reduction in the variance given the repeated measures correlation of 0.3613 is approximately 36%. In contrast, for the simulated data, the percentage reduction in the variance is approximately 64%. Thus, unless we know in advance the likely value of the repeated measures correlation, we may radically underestimate or overestimate the impact of the crossover design and could adopt an inappropriate sample size.
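As a small numerical illustration of this relationship, the following Python sketch computes the percentage reduction in variance from the sample variances of the simulated data reported in Section 6.2. The function name `percent_variance_reduction` is hypothetical.

```python
def percent_variance_reduction(s2_ig, s2_w):
    """Percentage by which the within-participant variance is smaller
    than the independent groups variance (equal to 100 * rho)."""
    return 100 * (s2_ig - s2_w) / s2_ig


# Sample variances for the simulated data.
sim_reduction = percent_variance_reduction(s2_ig=15.7351, s2_w=5.6632)
print(f"Simulated data: {sim_reduction:.1f}% reduction")  # roughly 64%
```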
This suggests that before we could rely on an AB/BA crossover design to investigate some new topic, we would need to undertake an experiment in order to investigate both the nature of the period by technique interaction and the repeated measures correlation. We might envisage an investigatory crossover experiment aimed at providing such information, where the information from the first period could be used to test the difference between techniques and estimate effect sizes, using a between groups design, and the information from the second period used to investigate the interaction term and the correlation parameter. The problem is that to provide reliable information concerning the interaction, an experiment would have to have a sample size as large as a between groups experiment.
Another option is to consider an alternative to a crossover design that allows the impact of skill levels to be removed. This can be done using what Morris and DeShon (2002) refer to as an independent-groups pretest-posttest design^{24}. In such a design, all subjects do the same task using technique A and the same materials M1, then the participants are split into two groups and each group learns a new technique (i.e., techniques B and C) to perform the experimental task. In practice, one of the new techniques might simply be extra coaching for technique A, but it is likely that the design would be fairer if A was a control method and B and C were different competing methods. The difference scores can be used to estimate the difference between technique B and technique C with the effects of individual differences removed. The disadvantage of this design is that it assumes the time effect is equal for both groups.
Vegas et al. (2016) mentioned four other possible repeated measures designs but, unlike the independent-groups pretest-posttest design, those designs have not been discussed in the statistical literature. In our opinion, before such designs are considered for adoption, the full model underlying the design needs to be articulated, as we did for the simple crossover in Section 4. This should include defining how to calculate appropriate effect sizes and their variances, as well as specifying the theoretical assumptions and practical limitations of the design.
Our best advice is to avoid over-complex designs that are not fully understood and always aim for the largest sample size. If large sample sizes are not possible, consider planning a distributed experiment as proposed by Budgen et al. (2013). In a distributed experiment, all related experiments must use the same protocol and results are aggregated as a single nested experiment. Kitchenham et al. (2017a) report the analysis of the data that was collected from this distributed experiment.
7.4 Implications for Meta-Analysis
The results in this paper make it clear that it is possible to aggregate experiments that used independent groups designs with experiments that used crossover designs. The crossover designs can be aggregated using \(d_{IG}\), since this is comparable with the usual standardized effect size for independent groups experiments. The experiments that used an independent groups design should use the usual standardized mean difference. It is, however, important to use the correct effect size variance, which is based on the variance of the related t-variable. The appropriate formula for the standardized difference of independent groups studies and its variance can be found in Hedges and Olkin (1985). We note that the \(g_{RM}\) values from three studies reported by Laitenberger et al. (2001) were used without adjustment in a meta-analysis that involved crossover studies, independent groups studies and repeated measures before/after studies (see Ciolkowski 2009). For example, one of Laitenberger’s studies reported \(g_{RM} = 1.46\)^{25} but, with \(\hat {\rho }= 0.77\), the value of \(g_{IG}\) is 0.70. The results reported in this paper will, in the future, allow meta-analysts to select the most appropriate effect size for crossover studies, before-after studies and independent groups studies.
7.5 Non-Normality and Unstable Variances
We discuss the issue of non-normality briefly in Section 4.2.3, but a more detailed investigation is needed to determine the most appropriate non-parametric effect sizes and the variance of non-parametric effect sizes. The issue of meta-analysis of non-parametric effect sizes also needs to be investigated, not only when all effect sizes are non-parametric but also when there is a mixture of non-parametric and parametric effect sizes.
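The conversion in the Laitenberger example can be sketched as follows. The sketch assumes the relationship \(s^{2}_{w} = s^{2}_{IG}(1 - \rho)\) implied by the definitions of \(s^{2}_{IG}\) and \(\hat{\rho}\) in Section 6.3, so that an independent groups effect size is the repeated measures effect size multiplied by \(\sqrt{1 - \rho}\). The function name `rm_to_ig` is hypothetical.

```python
import math


def rm_to_ig(effect_rm, rho):
    """Convert a repeated measures standardized effect size into the
    equivalent independent groups effect size, assuming
    s2_w = s2_IG * (1 - rho)."""
    return effect_rm * math.sqrt(1 - rho)


# Values quoted in the text for one of Laitenberger's studies.
g_ig = rm_to_ig(effect_rm=1.46, rho=0.77)
print(f"g_IG = {g_ig:.2f}")
```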
8 Conclusions
This paper provides a discussion of standardized effect size calculations and their variances for crossover designs. This is becoming important because, as Vegas et al. (2016) point out, many software engineering researchers are employing crossover designs and analyzing them incorrectly. Furthermore, crossover designs are often used in families of experiments, where researchers attempt to aggregate their results using meta-analysis.
The main contributions of this paper are:

To provide equations for nonstandardized and standardized effect sizes. We explain the need for two different types of standardized effect size, one for the repeated measures design and one that would be equivalent to an independent groups design.

To provide formulas for both the small sample size effect size variance and the medium sample size approximation to the effect size variance, for both types of standardized effect size.

To explain how the different effect sizes can be obtained either from standard descriptive statistics or from information provided by the linear mixed model package lme4 in R.

We suggest that an AB/BA crossover design should be used only if:

Previous research has suggested that ρ is greater than zero and preferably greater than 0.25.

There is either strong theoretical argument, or empirical evidence from a wellpowered study, that the period by technique interaction is negligible.
Having reproducible research in Empirical Software Engineering in mind we would be happy (after acceptance of the paper and obtaining a permission from the Editor) to make available the source version of the paper with embedded R code (in addition to the reproducer R package already available from CRAN) along with the article in the PDF format.
Footnotes
 1.
For simplicity, we shall refer to these simply as the standardized effect sizes and will not continually repeat the terms mean difference.
 2.
Our package should not be confused with the knitr package we used to embed R code chunks in the paper.
 3.
For readability, we sometimes omit the term AB/BA when referring to the crossover design, but any reference to a crossover design or model in this paper, refers to an AB/BA crossover, which is based on two techniques and two time periods.
 4.
This is referred to as a between groups design or an independent groups design. We prefer the term independent groups in this paper to contrast with repeated measures designs.
 5.
In medical experiments, the period by technique interaction term is often referred to as carryover. This is because crossover designs are often used for testing drugs and the effect of the first drug taken may interact with the second drug in an adverse way. Medical experiments therefore leave an appropriate washout period to allow the effect of the first drug to be eliminated from participants before they are given a second drug. We use the term period by technique interaction because carryover and a washout period are not really appropriate concepts for SE experiments. In fact, in the context of training, it might be argued that we want to encourage ‘carryover’ of acquired skills and minimize their ‘washout’.
 6.
By expected outcome, we mean the outcome based on the model excluding any error term. We explain error terms and variances in Section 4.2.
 7.
 8.
In other disciplines, these are also referred to as pretest-posttest designs.
 9.
The variance of \(\hat {\tau }_{AB}\) is exactly the same as the variance of \(\hat {\tau }_{BA}\), so for variances and standard deviations we refer to \(\hat {\tau }\) without any additional subscript.
 10.
Power is the probability of rejecting the null hypothesis when the null hypothesis is false.
 11.
See also the discussion in Senn (2002), Section 3.1.4, which presents the arguments against a two-stage analysis, where analysts first check for a significant period by treatment interaction. Then, if there is one, they analyze only the data from the first period, and if there is not, they perform a standard crossover analysis.
 12.
 13.
It should be noted that Hedges and Olkin (1985) refer to the small sample size adjusted estimate as d and the unadjusted estimate as g.
 14.
This follows because, if a variable x has variance \(s^{2}\), then for a constant b, \(var(b \times x) = b^{2} s^{2}\).
 15.
The data for participant 2, who is labeled as being in Group 2, are inconsistent. That is, for participant 2, the labels identifying the system and the time period are the same as for participants in Group 1.
 16.
According to Laitenberger et al. (2001) the results from one team had a large impact on this correlation coefficient. When removing this observation the correlation coefficient changes from − 0.02 to 0.47.
 17.
Of course, it is also possible for a period effect to be negative.
 18.
We omit an estimate of the period by technique interaction term \(\lambda_{AB}\) because we omitted any such term from our simulation model.
 19.
The data in Table 4 are in the wide format, where there is one entry for each participant, and the outcomes for each treatment, the sequence order, and the participant identifier are recorded.
 20.
The data we used for the analysis in this section is exactly the same as the data we used in Section 6.1
 21.
The term “+ (1 | ID)” in the formula identifies the factor ID as a random effects term.
 22.
This is the same for all R ANOVAlike functions.
 23.
It is also acceptable to report the value of \(\hat {\rho }\) instead of the crossover difference statistics.
 24.
Morris and DeShon also report the formulas for the effect sizes and effect size variances for this design.
 25.
This was referred to as d in Laitenberger et al. (2001).
Notes
Acknowledgments
We thank Prof. Scanniello and his coauthors (Scanniello et al. 2014) for sharing their data set. We also thank reviewers for their help in improving our manuscript.
References
 APA (2010) Publication manual of the American Psychological Association, 6th edn. American Psychological Association, WashingtonGoogle Scholar
 Abrahao S, Gravino C, Insfran Pelozo E, Scanniello G, Tortora G (2013) Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: results from a family of five experiments. IEEE Trans Softw Eng 39 (3):327–342. https://doi.org/https://doi.org/10.1109/TSE.2012.27 CrossRefGoogle Scholar
 Arcuri A, Briand L (2014) A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw Test Verification Reliab 24 (3):219–250. https://doi.org/10.1002/stvr.1486 CrossRefGoogle Scholar
 Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixedeffects models using lme4. J Stat Softw 67(1):1–48. https://doi.org/10.18637/jss.v067.i01 CrossRefGoogle Scholar
 Becker BJ (1988) Synthesizing standardized meanchange measures. Br J Math Stat Psychol 41:257–278CrossRefzbMATHGoogle Scholar
 Borenstein M, Hedges LV, Higgins JPT, Rothstein HT (2009) Introduction to metaanalysis. Wiley, NYCrossRefzbMATHGoogle Scholar
 Budgen D, Kitchenham B, Charters S, Gibbs S, Pothong A, Keung J, Brereton P (2013) Lessons from conducting a distributed quasiexperiment. In: Proceedings A C M/I E E E International Symposium on Empirical Software Engineering and MeasurementGoogle Scholar
 Ciolkowski M (1999) Evaluating the effectiveness of different inspection techniques on informal requirements documents. PhD thesis, University of Kaiserslautern, KaiserslauternGoogle Scholar
 Ciolkowski M (2009) What do we know about perspectivebased reading? an approach for quantitative aggregation in software engineering. In: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement. IEEE Computer Society, Washington, ESEM ’09. https://doi.org/10.1109/ESEM.2009.5316026, pp 133–144
 Cumming G, Finch S (2001) A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions. Educ Psychol Meas 61(4):532–574MathSciNetCrossRefGoogle Scholar
 Cumming G (2012) Understanding the new statistics. Effect Sizes, Confidence Intervals and MetaAnalysis. Routledge Taylor & Francis Group, New YorkGoogle Scholar
 Curtin F, Altman DG, Elbourne D (2002) Metaanalysis combining parallel and crossover clinical trials. I: Continuous outcomes. Stat Med 21:2132–2144. https://doi.org/10.1002/sim.1205 Google Scholar
 Dieste O, Fernández E, GarciaMartinez R, Juristo N (2011) The risk of using the Q heterogeneity estimator for software engineering experiments. In: Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement, IEEE Computer Society, ESEM ’11. https://doi.org/10.1109/ESEM.2011.15, pp 68–76
 Dunlap WP, Cortina JM, Vaslow JB, Burke MJ (1996) Metaanalysis of Experiments with Matched Groups or Repeated Measures Designs. Psychol Methods 1 (2):170–177CrossRefGoogle Scholar
 Foss T, Stensrud E, Myrtveit I, Kitchenham B (2003) A simulation study of the model evaluation criterion mmre. IEEE Trans Softw Eng 29(11):985–995CrossRefGoogle Scholar
 Freeman P (1989) The performance of the twostage analysis of twotreatment, twoperiod crossover trials. Stat Med 8:1421–1432CrossRefGoogle Scholar
 Hedges LV, Olkin I (1985) Statistical methods for meta-analysis. Academic Press, Orlando, Florida
 Johnson NL, Welch BL (1940) Applications of the non-central t-distribution. Biometrika 31(3–4):362–389
 Jureczko M, Madeyski L (2015) Cross–project defect prediction with respect to code ownership model: an empirical study. e-Informatica Softw Eng J 9(1):21–35. https://doi.org/10.5277/eInf150102
 Kampenes VB, Dybå T, Hannay JE, Sjøberg DIK (2007) Systematic review: A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086. https://doi.org/10.1016/j.infsof.2007.02.015
 Kitchenham BA, Madeyski L (2016) Meta-analysis. In: Kitchenham BA, Budgen D, Brereton P (eds) Evidence-Based Software Engineering and Systematic Reviews. CRC Press, chap 11, pp 133–154
 Kitchenham B, Madeyski L, Budgen D, Keung J, Brereton P, Charters S, Gibbs S, Pohthong A (2017a) Robust statistical methods for empirical software engineering. Empir Softw Eng 22(2):579–630. https://doi.org/10.1007/s1066401694375
 Kitchenham B, Madeyski L, Curtin F (2017b) Corrections to effect size variances for continuous outcomes of crossover clinical trials. Statistics in Medicine https://doi.org/10.1002/sim.7379, http://madeyski.einformatyka.pl/download/KitchenhamMadeyskiCurtinSIM.pdf, (accepted)
 Laitenberger O, Emam KE, Harbich TG (2001) An internally replicated quasi-experimental comparison of check list and perspective-based reading of code documents. https://doi.org/10.1109/32.922713, vol 27, pp 387–421
 Lakens D (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol 4(Article 863):1–12. https://doi.org/10.3389/fpsyg.2013.00863
 Madeyski L (2010) Test-driven development: an empirical evaluation of agile practice. Springer, Heidelberg. http://www.springer.com/9783642042874, Foreword by Prof. Claes Wohlin
 Madeyski L, Orzeszyna W, Torkar R, Józala M (2014) Overcoming the equivalent mutant problem: a systematic literature review and a comparative experiment of second order mutation. IEEE Trans Softw Eng 40(1):23–42. https://doi.org/10.1109/TSE.2013.44
 Madeyski L, Jureczko M (2015) Which process metrics can significantly improve defect prediction models? an empirical study. Softw Qual J 23(3):393–422. https://doi.org/10.1007/s1121901492417
 Madeyski L (2017) reproducer: Reproduce Statistical Analyses and Meta-Analyses. http://madeyski.einformatyka.pl/reproducibleresearch/, R package version 0.1.9 http://CRAN.Rproject.org/package=reproducer
 Morris SB (2000) Distribution of the standardized mean change effect size for meta-analysis on repeated measures. Br J Math Stat Psychol 53:17–29
 Morris SB, DeShon RP (2002) Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychol Methods 7(1):105–125. https://doi.org/10.1037//1082989X.7.1.105
 R Core Team (2016) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.Rproject.org/
 Scanniello G, Gravino C, Genero M, Cruz-Lemus JA, Tortora G (2014) On the impact of UML analysis models on source-code comprehensibility and modifiability. ACM Trans Softw Eng Methodol 23(2):13:1–13:26. https://doi.org/10.1145/2491912
 Senn S (2002) Crossover trials in clinical research, 2nd edn. Wiley, NY
 Shepperd M, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE Trans Softw Eng 27(11):1014–1022
 Stout DE, Ruble TL (1995) Assessing the practical significance of empirical results in accounting education research: the use of effect size information. J Account Educ 13(3):281–298
 Urdan TC (2005) Statistics in plain English, 2nd edn. Routledge, UK
 Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132. https://doi.org/10.3102/1076998602500.2101
 Vegas S, Apa C, Juristo N (2016) Crossover designs in software engineering experiments: benefits and perils. IEEE Trans Softw Eng 42(2):120–135. https://doi.org/10.1109/TSE.2015.2467378
 Wilcox RR (2012) Introduction to robust estimation and hypothesis testing, 3rd edn. Elsevier Inc., Amsterdam
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.