Randomised controlled trials provide the best quality evidence in medical research,[1] but they require a large commitment of time and effort, certainly from the investigators and often from participants. As a result, trials can be expensive. For these reasons, investigators may consider evaluating more than one intervention in the same study. For a controlled trial of two interventions, one could consider a parallel three-arm trial, or even a four-arm trial if two distinct control groups are required. An example is a comparison of mailed guidelines with and without an educational outreach visit from community pharmacists to improve prescribing in general practice.[2] If target differences for both interventions are identical, three-arm and four-arm designs would require increases in sample size of 50% and 100% respectively compared with a two-arm trial. Correspondingly, each analysis would involve only two thirds or half of the total sample size. Since the power to detect treatment differences depends on the number of participants in the groups being compared rather than the total number in the trial, this can represent a rather inefficient use of resources.
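The arithmetic behind the 50% and 100% figures can be sketched in a few lines of Python. The sketch assumes equal arm sizes, the same target difference for every comparison, and that each comparison involves exactly two arms; the function name `parallel_design_cost` is purely illustrative.

```python
from typing import Tuple


def parallel_design_cost(n_arms: int) -> Tuple[float, float]:
    """Return (total sample size relative to a two-arm trial,
    fraction of all participants used in any one pairwise comparison).

    Assumes equal arm sizes and an unchanged per-arm sample size.
    """
    relative_total = n_arms / 2          # arms are added, per-arm size is fixed
    fraction_per_comparison = 2 / n_arms  # each comparison uses two arms only
    return relative_total, fraction_per_comparison


for arms in (2, 3, 4):
    total, fraction = parallel_design_cost(arms)
    print(f"{arms} arms: {total:.2f}x the two-arm total, "
          f"{fraction:.0%} of participants per comparison")
```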

An alternative may be a factorial trial, where participants are allocated to receive neither intervention, one or the other, or both. An example of such a trial is an evaluation of two decision aids for newly diagnosed hypertensive patients – that is, individual decision analysis and an information video plus leaflet.[3] Other examples are a factorial trial of two interventions to improve attendance for breast screening,[4] and a factorial trial of two interventions to improve adherence to antidepressant drugs.[5]

Although their use to date may have been limited,[6] factorial trials have the potential to confer advantages over the standard parallel-groups design. First, they enable efficient simultaneous investigation of two interventions by including all participants in both analyses. Second, it is possible in a factorial trial to consider both the separate effects of each intervention and the benefits of receiving both interventions together. In order to realise these advantages, however, factorial trials require some special considerations, particularly at the design and analysis stages. Although these issues have been discussed previously,[7] factorial trials are still often analysed and interpreted inappropriately. The aim of this paper is to explore these issues in the context of an individually randomised 2 × 2 factorial trial, although in principle the methods generalise to trials of more than two interventions.


Design considerations

The prime issue here is the sample size of the trial. The most common procedure is to perform a separate calculation based on target effect sizes for each of the interventions compared with their respective controls (Table 1). The trial sample size is then simply the larger of these, and the trial is said to be powered to detect the main effects of each intervention. However, this sample size is based on the crucial assumption that there is no interaction between the interventions – in other words, that the effect of intervention A does not differ depending on whether participants also receive intervention B. This will by no means always be a reasonable assumption, especially where interventions involve behavioural and/or organisational change.

Table 1 Sample sizes required for 90% power and 1% two-sided alpha: main effects

| Intervention | Target difference (SDs) | Total sample size | Allocated to intervention | Allocated to relevant control |
| --- | --- | --- | --- | --- |
| A | 0.35 | 486 | 243 | 243 |
| B | 0.3 | 664 | 332 | 332 |

A total sample size of n = 664 participants yields 90% power to detect differences of 0.3 SDs for Intervention B and 97% power to detect differences of 0.35 SDs for Intervention A.
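The figures in Table 1 can be approximated with the usual normal-approximation formula for comparing two means, n per group = 2(z_{1-α/2} + z_{power})² / δ², where δ is the target difference in SD units. The following is a minimal sketch under that assumption, not a reconstruction of the method actually used for Table 1; the function names are illustrative. For a difference of 0.3 SDs the approximation gives a group size one smaller than the tabulated 332, as would be expected if the table was produced by a slightly different method.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def n_per_group(delta_sd: float, alpha: float = 0.01, power: float = 0.90) -> int:
    """Participants per group needed to detect a difference of delta_sd
    standard deviations (two-sided test, normal approximation)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / delta_sd ** 2)


def power_for(delta_sd: float, n_group: int, alpha: float = 0.01) -> float:
    """Approximate power to detect delta_sd with n_group participants per group."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(delta_sd * sqrt(n_group / 2) - z_alpha)


print("Intervention A (0.35 SDs), per group:", n_per_group(0.35))
print("Intervention B (0.30 SDs), per group:", n_per_group(0.30))
# Power for each target difference at the trial size in Table 1 (332 per group)
print("Power for A at 332 per group:", round(power_for(0.35, 332), 3))
print("Power for B at 332 per group:", round(power_for(0.30, 332), 3))
```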

If a trial is to have adequate power to detect an interaction, then the sample size will in general need to be increased. For example, to detect with the same power an interaction of the same magnitude as the main effects, a fourfold increase in sample size is required (Table 2).[8] With no increase in sample size, the interaction would need to be at least twice as large as the main effects to be detected with the same power;[8] this is very unlikely to be the case in practice. Smaller, more plausible interactions would require greatly increased sample sizes. If the interaction is of primary interest then it is essential that the trial is powered to detect a reasonable target interaction effect.
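The fourfold increase follows from the variances of the two contrasts. The sketch below assumes four equally sized cells and a common outcome variance σ². A main effect then compares two halves of the trial, with variance 4σ²/N, whereas the interaction is a difference of differences across all four cells, with variance 16σ²/N. The constants and function names are illustrative, and the normal approximation is used throughout.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()

# Variance of each contrast is (factor * sigma^2 / N) for total sample size N,
# assuming four equal cells of size N/4.
MAIN_EFFECT = 4     # two margins of size N/2: 2 * sigma^2 / (N/2)
INTERACTION = 16    # four cells of size N/4:  4 * sigma^2 / (N/4)


def _z_sum(alpha: float, power: float) -> float:
    return _norm.inv_cdf(1 - alpha / 2) + _norm.inv_cdf(power)


def total_n(delta_sd: float, variance_factor: int,
            alpha: float = 0.01, power: float = 0.90) -> float:
    """Total trial size needed to detect a contrast of delta_sd SDs."""
    return variance_factor * _z_sum(alpha, power) ** 2 / delta_sd ** 2


def detectable_delta(n_total: float, variance_factor: int,
                     alpha: float = 0.01, power: float = 0.90) -> float:
    """Smallest contrast (in SDs) detectable with the stated power."""
    return _z_sum(alpha, power) * sqrt(variance_factor / n_total)


n_main = total_n(0.3, MAIN_EFFECT)
n_inter = total_n(0.3, INTERACTION)
print("Ratio of required sample sizes:", n_inter / n_main)
print("Interaction detectable at the main-effects sample size (SDs):",
      detectable_delta(n_main, INTERACTION))
```

Under these assumptions the ratio of required sample sizes is four, and at the main-effects sample size the detectable interaction is twice the target main effect, in line with the statements above.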

Table 2 Sample sizes required for 90% power and 1% two-sided alpha: interaction

If the primary comparisons are the main effects then the approach in Table 1 is justifiable on grounds of efficiency. At the same time, it should be appreciated that the resultant precision for the interaction may be inadequate to exclude such an effect – that is, the confidence interval for the interaction will be relatively wide. In other words, the sample size will be insufficient to investigate the initial assumption that the interaction is unimportant. Virtually identical arguments apply to interactions for binary outcomes, although if logistic regression is used then the relative sizes of the interaction and main effects in Table 2 relate to the log odds scale.

Analytical considerations

The second consideration is the analytical strategy, which should follow CONSORT guidelines.[9] In particular, the primary analyses should address the principal research questions. Table 3 presents the basic descriptive statistics for the analysis of an example 2 × 2 factorial trial.[3] To evaluate the decision analysis intervention, we compare patients who received both interventions plus those who received decision analysis only with patients who received video and leaflet only plus those who received neither intervention. In general, the correct analysis of such data requires the use of a multivariable regression model, especially if the numbers of subjects in each of the four combinations shown in Table 3 are unequal (in technical terms, if the design is 'unbalanced').

Table 3 Descriptive statistics for the primary outcome (crude mean decisional conflict scores[3]) for the analysis of a 2 × 2 factorial trial

The approach in such models is essentially to obtain an average of the two differences (28–44) and (27–33), weighted according to the sample sizes. Regardless of the technical details, conceptually the primary analysis is a comparison of the margins of the 2 × 2 table. In the regression analyses, the effect of each intervention is adjusted for the other intervention as well as any necessary covariates, such as the outcome measure at baseline and stratification variables. In the context of a randomised trial with a continuous outcome, such adjustments are primarily to improve precision, especially for individually randomised trials.[10, 11] For binary outcomes, a multivariable (logistic) regression analysis is required in order to obtain correct estimates of the effects and their standard errors.
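The weighted average can be made concrete with a short sketch. It assumes an additive linear model containing only the two interventions (no covariates), in which case the least-squares estimate of one intervention's effect is the average of the differences within each level of the other intervention, weighted by n1 × n0 / (n1 + n0) for the two cells being compared. The cell means are those implied by the two differences quoted above; their assignment to cells and all cell sizes are hypothetical, chosen only for illustration.

```python
def adjusted_effect_a(means: dict, sizes: dict) -> float:
    """Effect of intervention A adjusted for B from cell summaries.

    means and sizes are keyed by (a, b), where a and b are 0 (not allocated)
    or 1 (allocated). Assumes an additive model with no other covariates.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for b in (0, 1):
        diff = means[(1, b)] - means[(0, b)]   # effect of A within level b of B
        n1, n0 = sizes[(1, b)], sizes[(0, b)]
        weight = n1 * n0 / (n1 + n0)           # precision weight for this level
        weighted_sum += weight * diff
        weight_total += weight
    return weighted_sum / weight_total


# Crude cell means implied by the differences (28-44) and (27-33);
# a = decision analysis, b = video plus leaflet (assignment inferred).
cell_means = {(0, 0): 44, (1, 0): 28, (0, 1): 33, (1, 1): 27}

balanced = {(0, 0): 50, (1, 0): 50, (0, 1): 50, (1, 1): 50}    # hypothetical
unbalanced = {(0, 0): 60, (1, 0): 40, (0, 1): 50, (1, 1): 50}  # hypothetical

print("Balanced design:  ", adjusted_effect_a(cell_means, balanced))
print("Unbalanced design:", adjusted_effect_a(cell_means, unbalanced))
```

With equal cell sizes the weights are equal and the estimate is the simple average of the two differences; with unequal cell sizes the estimate shifts towards the difference that is estimated more precisely.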

In focussing on the average effect of each intervention, however, the above analysis assumes that the effect of each intervention is uninfluenced by the presence or absence of the other – that is, there is no interaction between them. Since factorial trials are rarely powered to detect interactions between the interventions, such effects are usually investigated as a secondary analysis. Such analyses are readily performed as extensions to the multivariable regression models described above, by simply introducing the appropriate interaction terms. However, the precision of the estimates of interaction is very likely to be too poor for large effects to be ruled out. In particular, a high p-value will most likely reflect low power and so cannot be taken as evidence for no interaction.
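For a continuous outcome, the interaction term in a regression model with 0/1 indicators for each intervention corresponds, in its crude form, to a difference of differences between the cell means. The sketch below computes this contrast from the cell means implied by the differences (28–44) and (27–33) quoted earlier; the assignment of means to cells is inferred from the text, and the function names are illustrative. Estimates from an adjusted model would generally differ from these crude values.

```python
def simple_effects_a(means: dict) -> tuple:
    """Effect of intervention A without B and with B, from cell means
    keyed by (a, b) with 0 = not allocated and 1 = allocated."""
    without_b = means[(1, 0)] - means[(0, 0)]
    with_b = means[(1, 1)] - means[(0, 1)]
    return without_b, with_b


def interaction_contrast(means: dict) -> float:
    """Difference of differences: effect of A with B minus effect of A without B."""
    without_b, with_b = simple_effects_a(means)
    return with_b - without_b


# a = decision analysis, b = video plus leaflet (assignment inferred)
cell_means = {(0, 0): 44, (1, 0): 28, (0, 1): 33, (1, 1): 27}

print("Effect of A without B, with B:", simple_effects_a(cell_means))
print("Crude interaction contrast:   ", interaction_contrast(cell_means))
```

Here the two simple effects have the same sign but differ in size, so the contrast is non-zero: the effect of one intervention is smaller in magnitude when the other is also received.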

A special consideration for binary outcomes is the choice of regression method. Logistic regression is commonly used since, among other advantages, predicted proportions from this model are constrained to be in the allowable range (that is, between zero and one). [12] Logistic regression estimates odds ratios for the interventions and assumes that these effects operate multiplicatively on this scale. [13]

Presentation of results from a factorial trial

Regarding the results obtained at the main trial follow-up, the primary analysis relating to the margins of the 2 × 2 table should give estimates (such as a difference, odds ratio or risk ratio) and 95% confidence intervals comparing those individuals allocated to receive an intervention with those allocated to not receive it. The number of such comparisons will be equal to the number of interventions investigated in the trial. A common misunderstanding is that the outcome measures should be analysed and presented separately for each of the four factorial cells, but to do so would fail to realise the full efficiency and purpose of the factorial design. Even in trials powered for main effects, a test and confidence interval for the interaction should be provided. An indication of the imprecision of the results for the interaction is especially important given the above concerns about the adequacy of the sample size to investigate such effects. Table 4 demonstrates how the results of the primary analyses in our example trial were presented. [3]

Table 4 Presentation of the results of the primary analyses in a 2 × 2 factorial trial[3]

In addition to the primary comparative statistics noted above, it is also advisable to present descriptive statistics for outcome measures at follow-up within each of the factorial 'cells' in the trial (four in the case of a 2 × 2 design). These can either be tabulated or included in the text of the paper along with the regression coefficient and 95% confidence interval for the interaction term. This allows interpretation of the magnitude of any antagonism or synergism between the interventions, and would of course be essential if the interaction was the primary effect of interest. In our example, there was a significant antagonistic interaction, such that there was no added benefit from a second intervention (Tables 3 and 4).

The most appropriate presentation of baseline data depends on the original primary research question and the results obtained. If an interaction is either posited or observed, then descriptive baseline data for the four cells are more helpful; otherwise, the margins are more relevant to the issue of baseline comparability and correspond to the primary analysis. With more than two interventions the marginal approach increasingly becomes the only feasible option.


Factorial designs provide an efficient method of evaluating more than one intervention in the absence of interactions. This raises the question, however, of the degree of certainty one might have in advance that there is no interaction between the interventions. Although Bayesian methods might be helpful here in that they formalise such prior information/beliefs, in practice there will be much uncertainty, and so the issue is rather one of a judgement as to how influential any likely interaction might be in the context of the trial. In particular, if the direction of the effect of intervention A is different for the levels of intervention B (a 'qualitative' interaction) then a factorial trial would be appropriate if this interaction was of key interest, in which case the trial should be powered to detect the interaction. If there is likely to be only a minor difference in magnitude in the effect of intervention A across the levels of intervention B (that is, a small 'quantitative' interaction) then a factorial trial powered to detect the main effects is more appropriate.[3] In any case, the practical question of how to present the intervention effects in the presence of a sizeable interaction remains. If the interaction is qualitative then the main effects will almost certainly be misleading and the cell means and interaction effect together with separate estimates and confidence intervals for the relevant subgroups will be the only option. [14] For quantitative interactions such as in our example, the main effects will over-estimate the effect for some individuals and under-estimate it for others. Whilst the interaction and the cell means must still be presented, the main effects may nonetheless be a reasonable representation of the intervention effects, both separately and combined.

A factorial trial would be unsuitable for interventions that could not be used in conjunction with one another, such as two different minor surgical procedures for a dermatological problem. For interventions such as those in Table 3, though, factorial trials are an especially useful option if the principal interest is in comparing each intervention with its respective control and also in considering if there is any suggestion of an interaction between them. Indeed, an appropriately powered factorial trial is the only design that allows such effects to be investigated. Conversely, factorial designs would be contra-indicated if primary interest was in the direct comparison of the two interventions applied individually – for example, decision analysis alone versus video/leaflet alone.

The decision as to the suitability of the factorial design must therefore take a number of issues into account – in particular, the nature of the interventions, the setting of the study including the participants, the comparisons of interest and the outcome measure. For instance, interactions may be considered to be more likely with behavioural interventions when, as in our example, the benefits may be achieved with either intervention and there is relatively little additional benefit from receiving a second intervention.[3] In terms of the outcome measure, a consideration for binary variables beyond the issues covered in this paper is the choice of the statistical model employed – that is, whether the effects of the interventions are presumed to work additively in a linear model for proportions, or multiplicatively as in a linear logistic model.[12] Since the presence or absence of interactions for a binary outcome depends on the statistical model employed, choice of the latter is an important issue.
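The dependence of the interaction on the chosen scale can be illustrated with a small numerical sketch. The cell proportions below are entirely hypothetical and are chosen so that the interventions act additively on the proportion scale; the same proportions nonetheless imply a non-zero interaction on the log odds scale used by logistic regression.

```python
from math import exp, log


def odds(p: float) -> float:
    return p / (1 - p)


# Hypothetical outcome proportions keyed by (a, b): 0 = not allocated, 1 = allocated
props = {(0, 0): 0.10, (1, 0): 0.20, (0, 1): 0.30, (1, 1): 0.40}

# Additive scale: difference between the two risk differences for intervention A
risk_diff_interaction = ((props[(1, 1)] - props[(0, 1)])
                         - (props[(1, 0)] - props[(0, 0)]))

# Multiplicative scale: difference between the two log odds ratios for A
log_or_without_b = log(odds(props[(1, 0)]) / odds(props[(0, 0)]))
log_or_with_b = log(odds(props[(1, 1)]) / odds(props[(0, 1)]))
log_odds_interaction = log_or_with_b - log_or_without_b

print("Interaction on the risk difference scale:", round(risk_diff_interaction, 6))
print("Interaction on the log odds scale:       ", round(log_odds_interaction, 3))
print("Ratio of the two odds ratios:            ", round(exp(log_odds_interaction), 3))
```

In this constructed example, a linear model for proportions would show no interaction, whereas a logistic model fitted to the same data would need an interaction term to reproduce the cell proportions.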


Difficulties in interpreting the results of factorial trials if an influential interaction is observed should be recognised as the cost of the potential for efficient, simultaneous consideration of two or more interventions. As described in this paper, factorial trials can in principle be designed to have adequate power to detect realistic interactions, but this has major implications for the sample size. On the other hand, unlike parallel-groups trials, a factorial design does enable investigation of interactions in the analysis, albeit with limited power. Researchers should be aware of such issues when using factorial designs.