Introduction

Comparative bioavailability (BA) studies, designed to demonstrate bioequivalence (BE) between two products, are an essential part of the generic approval process (1,2,3,4,5). They are also used to bridge an innovator's product from the formulation used in clinical phase III to the to-be-marketed formulation (6), to support major variations of an approved product (7), and to assess potential food effects (8), drug-drug interactions (9, 10), and dose-proportionality (6). Such studies often involve multiple groups of subjects. This division is usually necessitated by logistical constraints, such as limited bed capacity or staffing levels at a single site. In many cases, groups are admitted in a staggered manner over the course of a few days but are recruited from the same subject pool. Studies conducted across multiple sites are beyond the scope of this research.

Given the close temporal proximity and the shared subject pool, one would generally not expect a relevant group effect to be introduced. However, deviations between the groups' point estimates could indicate a true group-by-treatment interaction, meaning that the treatment effect is not independent of the group, although such deviations could also arise by chance. Hence, the validity of naïvely pooling data across staggered groups can be questioned.

We assessed the relevance and impact of group-by-treatment interactions through simulations and a meta-study comprising over 240 well-controlled trials.

Methods and Materials

The simulations and evaluation of datasets in the meta-study were performed in R 4.3.1 (11).

Models

The following linear models of \(\log_e\)-transformed pharmacokinetic (PK) responses with all fixed effects were used:

  1. Model 1: group, sequence, treatment, subject(group × sequence), period(group), group × sequence, group × treatment

  2. Model 2: group, sequence, treatment, subject(group × sequence), period(group), group × sequence

  3. Model 3: sequence, subject(sequence), period, treatment

The first public information about the use of Model 1 to test for a group-by-treatment interaction in the 2-treatment 2-sequence 2-period crossover design (2 × 2 × 2) by the US Food and Drug Administration (FDA) became available in 1999 (12); there, subject(group × sequence) was considered a random effect. Due to the group × treatment term, the main effect of treatment cannot be interpreted and, hence, must not be used to assess bioequivalence. The FDA suggested testing the group-by-treatment interaction at the 0.1 level (12, 13). If it is significant, data of the groups must not be pooled; bioequivalence can still be demonstrated in one of the groups by Model 3, provided that the group meets the minimum requirements for a complete BE study. This might lead to the paradoxical situation that BE is demonstrated in a small group but fails in larger ones. If it is not significant, the pooled data can be analyzed by Model 2. More details were given by the FDA later (14, 15), but without specifying the level of the test.

Model 2 takes the multi-group nature of the study into account and provides an unbiased estimate of the treatment effect. In the Eurasian Economic Union, Model 2 is mandatory, unless a justification to use Model 3 is stated in the protocol and discussed with the competent authority (16). Health Canada and the FDA recommend mixed-effects models, where subject-related effects are random and all others are fixed (2, 12, 14). Model 3 is the standard model for bioequivalence (e.g., 4, 5) with all effects fixed (analysis of variance, ANOVA).
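For illustration, the three models could be specified as all-fixed-effects lm() fits in R, as in the following minimal sketch; the data frame and its column names are assumptions for this example, not taken from the evaluated datasets.

```r
# Minimal sketch, assuming a data frame 'd' with factor columns group,
# sequence, subject, period, treatment and the log-transformed response
# logPK; all names are illustrative.

# Model 1: pre-test model including the group-by-treatment interaction;
# the main treatment effect must not be interpreted in this model.
m1 <- lm(logPK ~ group + sequence + treatment + group:sequence +
                 group:sequence:subject + group:period + group:treatment,
         data = d)
anova(m1)["group:treatment", "Pr(>F)"]  # p(G x T) of the pre-test

# Model 2: accounts for the multi-group structure (no G x T term)
m2 <- lm(logPK ~ group + sequence + treatment + group:sequence +
                 group:sequence:subject + group:period,
         data = d)

# Model 3: conventional model for bioequivalence
m3 <- lm(logPK ~ sequence + sequence:subject + period + treatment,
         data = d)
```

Note that with uniquely coded subjects (see below), some of these terms are over-specified; lm() handles the resulting rank deficiency by dropping aliased coefficients.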

In Model 2, the residual degrees of freedom are \(df=\sum n_i-2-(n_G-1)\), where \(n_i\) is the number of subjects in sequence i and \(n_G\) is the number of groups; in Model 3, \(df=\sum n_i-2\). In both models, the back-transformed (1 − 2α) confidence interval (CI) is calculated as

$$\text{CI}=100\,\exp\left(\overline{\log_e x_\text{T}}-\overline{\log_e x_\text{R}}\mp t_{df,\alpha}\sqrt{m\cdot MSE\sum_{i=1}^{s}\frac{1}{n_i}}\right),$$

where \(\overline{\log_e x_\text{T}}\) and \(\overline{\log_e x_\text{R}}\) are the means of the \(\log_e\)-transformed responses of the test and reference treatments, \(t_{df,\alpha}\) is the t-value for df degrees of freedom at level α (commonly 0.05), m is the design constant (e.g., 1/2 in a 2 × 2 × 2 crossover design, 3/8 in a two-sequence three-period full replicate design, 1/4 in a two-sequence four-period full replicate design, 1/6 in a three-sequence three-period partial replicate design), MSE is the residual mean squares error, s is the number of sequences, and \(n_i\) is the number of subjects in sequence i.
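To make the arithmetic concrete, a minimal sketch of this computation in R is shown below; the function name and arguments are illustrative, not part of the evaluated scripts.

```r
# Minimal sketch of the back-transformed (1 - 2*alpha) CI given above,
# assuming the point estimate and MSE were obtained from Model 2 or 3;
# all names are illustrative.
ci_be <- function(pe,          # point estimate exp(mean(log xT) - mean(log xR))
                  mse,         # residual mean squares error
                  df,          # residual degrees of freedom
                  n,           # subjects per sequence, e.g., c(24, 24)
                  m = 1/2,     # design constant (1/2 for a 2x2x2 crossover)
                  alpha = 0.05) {
  se <- sqrt(m * mse * sum(1 / n))
  100 * pe * exp(c(-1, +1) * qt(1 - alpha, df) * se)
}

# 48 subjects in two balanced sequences: Model 3 gives df = 48 - 2 = 46,
# Model 2 with two groups gives df = 48 - 2 - (2 - 1) = 45
ci_be(pe = 0.95, mse = 0.045, df = 46, n = c(24, 24))
```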

The MSE generally differs slightly between Models 2 and 3, whereas the point estimate (PE) is identical if sequences are balanced and group sizes are equal, but differs in the case of imbalanced sequences and unequal group sizes. All else being equal, the fewer degrees of freedom render the CI of Model 2 wider than that of Model 3.

It should also be noted that in comparative BA studies, subjects are uniquely coded (17, 18). Thus, the sequence and related nested effects recommended in all guidelines lead to over-specified models; they can be removed entirely without affecting the estimated treatment effect and its associated MSE.

Simulation Scenarios

Monte Carlo simulations were performed based on the fact that the mean μ follows a lognormal distribution and the variance s² follows a χ²-distribution with n − 2 degrees of freedom (19). We simulated 100,000 studies in each scenario using the Mersenne–Twister pseudo-random number generator (20) with a fixed seed of 123456 to support reproducibility and assessed them for the group-by-treatment interaction. In scenarios 1–12, we simulated 2 × 2 × 2 designs with two groups. In scenarios 1–10, we simulated a sample size of 48 subjects to achieve ≥ 90% power for a geometric mean ratio (GMR) of 1 and CVw of 33.5%. This sample size was selected to align closely with the median sample size of 47 in the meta-study (see below). In scenarios 11 and 12, we simulated a sample size of 80 subjects to achieve ≥ 80% power for GMR = 0.90. To simulate unequal variances of groups, variance ratios of 0.667 and 1.5 were explored.
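For orientation, a minimal sketch of this summary-level simulation approach is given below. It is an illustration under the stated distributional assumptions, not the script provided in the Online Resource, and all names are illustrative.

```r
# Minimal sketch: simulate the summary statistics of one group of a
# 2x2x2 study under the stated assumptions.
set.seed(123456)                              # fixed seed, as in the paper
sim_group <- function(n, gmr, cv) {
  s2  <- log(cv^2 + 1)                        # log-scale variance from CVw
  mse <- s2 * rchisq(1, df = n - 2) / (n - 2) # variance ~ scaled chi-squared
  pe  <- exp(rnorm(1, mean = log(gmr),        # point estimate ~ lognormal;
                   sd = sqrt(2 * s2 / n)))    # var of the log-PE is 2*s2/n
  c(pe = pe, mse = mse)
}
# Scenario 1: two groups of 24 subjects, GMR = 1, CVw = 33.5%; each
# simulated study is subsequently assessed for the G x T interaction.
g1 <- sim_group(24, gmr = 1, cv = 0.335)
g2 <- sim_group(24, gmr = 1, cv = 0.335)
```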

The level of the test of the group-by-treatment interaction was set to 0.05 (21). If no true group-by-treatment interaction was simulated, the fraction of studies with p(G × T) ≤ 0.05 represents the empirical α, whereas if a true group-by-treatment interaction was simulated, it represents the empirical power. The p-values of the group-by-treatment interaction tests are expected to follow a standard uniform distribution on [0, 1] and were assessed by the Kolmogorov–Smirnov test. Supplementary graphs illustrating the distribution of these p-values for each scenario are included to complement the Kolmogorov–Smirnov test findings, and the R-script to reproduce the simulations is provided in the Online Resource.
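As an illustration of the uniformity assessment, assuming a hypothetical vector pvals holds the 100,000 simulated p-values of one scenario:

```r
# Under the null, the pre-test p-values should be uniform on [0, 1];
# 'pvals' is a hypothetical vector of simulated p(G x T) values.
ks.test(pvals, "punif")   # two-sided KS test against the standard uniform
```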

Table I summarizes the simulation scenarios by group sizes (n), equal or unequal variances of groups with the corresponding CV, the GMR in each group, and the presence or absence of a true group-by-treatment interaction. A detailed breakdown of these scenarios follows:

  1. Two groups of 24 subjects each, equal variances of groups, GMR = 1 in both groups, no group-by-treatment interaction

  2. Two groups of 24 subjects each, unequal variances of groups (variance ratio 0.667), GMR = 1 in both groups, no group-by-treatment interaction

  3. Two groups of 24 subjects each, unequal variances of groups (variance ratio 1.5), GMR = 1 in both groups, no group-by-treatment interaction

  4. n1 = 38, n2 = 10, equal variances of groups, GMR = 1 in both groups, no group-by-treatment interaction

  5. Two groups of 24 subjects each, equal variances of groups, GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; pooled GMR = 1

  6. Two groups of 24 subjects each, unequal variances of groups (variance ratio 0.667), GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; pooled GMR = 1

  7. Two groups of 24 subjects each, unequal variances of groups (variance ratio 1.5), GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; pooled GMR = 1

  8. n1 = 38, n2 = 10, equal variances of groups, GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; weighted GMR = 1

  9. n1 = 38, n2 = 10, unequal variances of groups (variance ratio 0.667), GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; weighted GMR = 1

  10. n1 = 38, n2 = 10, unequal variances of groups (variance ratio 1.5), GMR = 0.95 in the first group, GMR = 1.0526 in the second group, true group-by-treatment interaction; weighted GMR = 1

  11. n1 = n2 = 40, equal variances of groups (CVw = 30%), GMR = 0.90 in both groups, no group-by-treatment interaction

  12. n1 = 64, n2 = 16, unequal variances of groups (variance ratio 1.5), GMR = 0.8290 in the first group, GMR = 1.2500 in the second group, true group-by-treatment interaction; weighted GMR = 0.9000

Table I Simulation Scenarios

Meta-study

The meta-study included a total of 328 datasets of AUC and 331 of Cmax from 249 comparative BA studies (BE, food effect, drug-drug interaction, dose-proportionality) covering 157 analytes: 242 2 × 2 × 2 designs, 33 two-sequence four-period full replicate designs, three partial replicate designs, and 46 incomplete block designs extracted from six-sequence three-period and four-sequence four-period Williams' designs. The studies consisted of two to seven groups, with a median sample size of 47 subjects (range 15–176) and a median interval of six days separating groups (range 1–62 days). The extreme interval of two months in one study was due to COVID-19 restrictions; the next largest interval was 18 days. In 76.3% of the studies, the interval was one week or less; in 30.8%, it was only one or two days. There are more datasets than studies because some studies contained more than one analyte (fixed-dose combinations or parent and metabolite). The datasets were assessed by all models. Since in some of the datasets bioequivalence of Cmax was assessed by reference-scaling or with wider fixed limits, only AUC targeting BE with the conventional limits of 80.00–125.00% was assessed by a recently proposed method (22), where

  • a “concordant quantitative interaction” was defined as the case where the treatment effect is equivalent overall as well as in all groups but differs in magnitude,

  • a “concordant qualitative interaction” was defined as the case where the treatment effect is equivalent overall and in at least one group, not equivalent in at least one group, and the treatment effects in all groups point in the same direction, and

  • a “discordant qualitative interaction” was defined as the case where the overall treatment effect is equivalent, the treatment effect in some groups is not equivalent, and the treatment effects in some groups may point in opposite directions.

We restricted the method to two groups because more groups would result in a multidimensional problem. Of note, a manipulation (i.e., an undocumented interim analysis after the first group and switching test (T) with reference (R) in the second) would only be possible if groups are separated by a long interval. Such a suspected manipulation could be easily detected by plotting T/R-ratios against subject ID, as sketched below. Details of the datasets are given in the Online Resource.
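A minimal sketch of such a screening plot, assuming a data frame wide with one row per subject, numeric subject IDs in enrollment order, and the individual test and reference responses xt and xr (all names illustrative):

```r
# T/R-ratios against subject ID; a step between the first and second
# group would be suggestive of a manipulation (illustrative names).
plot(wide$id, wide$xt / wide$xr, log = "y",
     xlab = "subject ID", ylab = "T/R-ratio")
abline(h = 1, lty = 2)   # unity line for reference
```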

Results

Simulations

Table II presents the results of the simulation scenarios: the empirical α or power, i.e., the fraction of studies with a significant group-by-treatment interaction in Model 1 when no, or a true, group-by-treatment interaction was simulated, respectively, together with the p-values of the Kolmogorov–Smirnov test.

Table II Results of 100,000 Simulated Studies in each Scenario

To provide a clearer and more synthesized understanding of the simulation results in Table II, we have categorized the key findings as follows:

  1. Simulations without group-by-treatment interaction (Scenarios 1–4 and 11): In these scenarios, where no group-by-treatment interaction was introduced, the proportion of studies detecting a statistically significant interaction was close to the anticipated significance level of approximately 0.05.

  2. Simulations with group-by-treatment interaction (Scenarios 5–10 and 12): When a group-by-treatment interaction was introduced into these simulations, the empirical power increased in relation to the absolute value of the difference between the population means of the two groups.

In our first simulation scenario, used as an illustrative example in Fig. 1, the interaction was detected in 4.97% of cases, although no true group-by-treatment interaction was simulated. This detection rate lies below the upper 95% significance limit of the binomial test (0.0511), indicating a rate of false positives consistent with the nominal level. Additionally, the uniformity of the p-values, assessed by the Kolmogorov–Smirnov test (p = 0.756), confirms that their distribution aligns with the expected uniform pattern under the null hypothesis.
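One plausible way to obtain such a limit (an assumption about its derivation, not stated in the text) is the 95th percentile of the number of significant tests expected under the null hypothesis, expressed as a proportion:

```r
# Upper 95% significance limit of the binomial test for 100,000
# simulations at the nominal level 0.05 (assumed derivation).
qbinom(0.95, size = 1e5, prob = 0.05) / 1e5   # approximately 0.0511
```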

Fig. 1 p(G × T) = 0.0497, p(unif.) = 0.756 (simulation scenario 1)

Meta-study

In 15 (4.57%) of the AUC datasets and 18 (5.44%) of the Cmax datasets, a significant (p < 0.05) group-by-treatment interaction was detected, which is approximately the level of the test and does not exceed the upper 95% significance limits of the binomial test (0.0731 for n = 328 and 0.0725 for n = 331); see also Figs. 2 and 3, as well as Table III. Neither a concordant nor a discordant interaction was detected in the eligible AUC datasets (Fig. 4). In the dataset with the largest interval of 62 days separating groups, the PE was 95.37% in the first group and 100.92% in the second. The subjects' T/R-ratios showed no trend (see the Online Resource).

Fig. 2 AUC, p(G × T) = 0.0457, p(unif.) = 0.661 (meta-study, n = 328)

Fig. 3 Cmax, p(G × T) = 0.0544, p(unif.) = 0.483 (meta-study, n = 331)

Table III Results of the Meta-study
Fig. 4 PEs of AUC, analysis of interaction (22) (meta-study, n = 226 targeting BE by Model 3; center square quantitative, yellow areas concordant qualitative, orange areas discordant qualitative, 95% confidence ellipse in green, unity line in bright green)

Discussion

As demonstrated in the simulations, significant group-by-treatment interactions were detected at approximately the level of the test although none was simulated; consequently, these cases are false positives. When true group-by-treatment interactions were simulated, the test failed to detect them in most cases, i.e., it showed low empirical power. Only with large sample sizes and extremely different group sizes was a true group-by-treatment interaction correctly detected with sufficient power. Heteroscedasticity did not affect the results, which is not surprising, since the pooled-data models assume homoscedasticity.

The simulations underscored a crucial consideration in group-by-treatment interaction testing: the smaller the true interaction, the more difficult it is to detect. This prompts the question of what is “small enough to be ignored for practical purposes.” Conversely, a substantial group-by-treatment interaction is necessary for the test to be of any value in studies designed to demonstrate bioequivalence. This is corroborated by the empirical power results presented in Table II and Fig. 13 of the Online Resource.

Based on the meta-study of well-controlled studies, it appears that significant group-by-treatment interactions are detected merely by chance and can be considered “statistical artifacts,” i.e., false positives. Although only 226 AUC datasets with two groups were eligible for the recently proposed method (22), neither a concordant nor a discordant interaction was detected. The utility of testing for a group-by-treatment interaction to detect data manipulation is limited, since there is no evidence that manipulation is linked to clinic groups.

When the datasets of the meta-study were evaluated by Model 2, about 6.4% fewer passed the conventional BE limits of 80.00–125.00% than with Model 3. This difference can be attributed to the fewer degrees of freedom, leading to slightly wider confidence intervals, to different residual errors, and, in the case of imbalanced sequences together with unequal group sizes, to a different estimate of the treatment effect introduced by the group-related terms (i.e., subject(group × sequence), period(group), group × sequence). This finding is similar to another meta-study (23), in which fewer studies passed with a carryover term in the model than without it. It is impossible to predict whether the additional group terms of Model 2 can “explain” part of the variability, i.e., whether its residual MSE is smaller or larger than that of Model 3.

In light of these results, we consider that Model 1, originally proposed by the FDA (12, 13) as a pre-test, should be avoided due to the risk of type I error inflation. Well-known examples where a pre-test inflates the type I error are assessing variance homogeneity (24) and testing for a sequence effect in comparative bioavailability (25, 26). For this reason, we recommend using Model 2 (or 3) instead. This investigation is reminiscent of the discussion of the subject-by-formulation variance component, with a similar result: the estimate of this variance component was positively biased, leading to substantial false-positive tests (27). Analogously, none of the published adaptive sequential methods contains a “poolability criterion” (28,29,30,31,32,33,34); instead, data are always pooled, regardless of the results of the stages. As recently recommended, the planned model and procedures should be unambiguously stated in the protocol (5, 14, 15). Subgroup results should always be interpreted cautiously (35). Bayesian shrinkage analysis of subgroups (36), intended to increase power, must only be applied if specified a priori and not post hoc (i.e., after detecting a significant group-by-treatment interaction). Data-driven post hoc analyses are also discouraged by the International Council for Harmonisation (5).

It must be mentioned that in frequentist statistics, the outcome of any level-α test is dichotomous: the null hypothesis is either rejected or not rejected; this outcome cannot be represented by a probability. It is a common fallacy to regard the p-value as the probability that the null hypothesis is true, or that the alternative hypothesis is false (37, 38). It is well known that interaction terms have a higher standard error than main effects. This is even more pronounced here, since the interaction involves a comparison between subjects, whereas the main comparison is within subjects and thus has a lower residual variance. Moreover, the main analysis in an equivalence study is based on a Neyman–Pearson (NP) test, designed after a review of the evidence (published or not) in favor of {θ1, θ2}, the limits of the clinically not relevant difference. That is, the alternative hypothesis H1 is an interval θ1 < μT − μR < θ2. Furthermore, the sample size has been determined to obtain the desired power, taking into account the standard error of the estimator of the main comparison. The higher standard error of the interaction leads to lower power, which, added to the lack of prior support for its size Δ, explains the results obtained in this study and is summarized by the well-known joke: “Enjoy your unexpectedly significant results, … because you will not see them again.”

To recapitulate, the test for a group-by-treatment interaction is a standard significance test lacking both power and prior support for H1, in contrast to the NP test of the main comparison. We must therefore distinguish between NP and significance testing (39), and remember the advice about lack of power and prior support for H1 (40).

Conclusion

Testing for a group-by-treatment interaction is neither useful nor appropriate. When a group-by-treatment interaction does not exist in the data, it will incorrectly be detected in approximately α of studies. Even when a true group-by-treatment interaction exists, it will likely not be detected, except in the case of large sample sizes and extremely different group sizes, because in crossover designs T vs. R is tested with a greater effective sample size than the G × T interaction: for the former, all subjects are used, whereas for G × T, the subjects are split into groups and compared between them. Since the test has low power yet will be significant at the α level even in the absence of a true group-by-treatment interaction, it is not clear how it could contribute to regulatory decision-making. This work demonstrates the lack of utility of including a group-by-treatment interaction term in the model for the assessment of single-site comparative bioavailability studies in which subjects are divided into groups for logistical reasons. The authors thus see no particular merit in this test for regulatory submissions anywhere.