Between-Batch Bioequivalence (BBE): a Statistical Test to Evaluate In Vitro Bioequivalence Considering the Between-Batch Variability

Bioequivalence testing is an essential step during the development of generic drugs. Regulatory agencies have drafted recommendations and guidelines to frame this step, but without reaching a consensus, and different methodologies are applied depending on the geographical region. For instance, in the EU, the EMA recommends the average bioequivalence (ABE) test, while in the USA, the FDA recommends the population bioequivalence (PBE) test. Both methods present limitations (e.g., when batch variability is non-negligible) that make it difficult to conclude equivalence without increasing the sample size. This article proposes an alternative method to evaluate bioequivalence: between-batch bioequivalence (BBE). It is based on the comparison between the mean difference (Reference − Test) and the Reference between-batch variability. After presenting the theoretical concepts, BBE relevance is evaluated through simulation and real case (nasal spray) studies. Simulation results showed high performance of the method based on false positive and false negative rate estimations (type I and type II errors, respectively). In particular, BBE showed significantly greater true positive rates than ABE and PBE when the Reference residual standard deviation is higher than 15%, depending on the between-batch variability and the number of batches. Finally, real case applications revealed that BBE is more efficient than ABE and PBE at demonstrating equivalence in some well-known situations where the between-batch variability is not negligible. These results suggest that BBE could be considered as an alternative to the state-of-the-art methods, allowing less costly development. Electronic supplementary material: the online version of this article (10.1208/s12248-020-00486-5) contains supplementary material, which is available to authorized users.


INTRODUCTION
In the pharmaceutical industry, generics are becoming more and more important, mainly driven by countries' regulations for giving patients easier access to drugs. From an industrial point of view, the final goal is to have a generic product that complies with all the regulatory requirements and is safe for the patient, i.e., has the same final therapeutic effect as the brand-name product. One way to prove equivalence, without long and expensive clinical trials, is to conduct an in vitro bioequivalence evaluation between the Innovator (Reference product) and the proposed generic (Test product). The demonstration of equivalence can be conducted through statistical methods (1-6). In vitro bioequivalence testing is not considered and evaluated in the same way by all the bodies taking part in the process (7,8). Indeed, in the USA, the FDA recommends the use of population bioequivalence (PBE) (9,10), whereas Europe recommends using average bioequivalence (ABE) (11,12).
On one side, ABE (12) is based on the two one-sided t tests (TOST) procedure developed by Schuirmann in 1987 (13). The method consists of comparing the difference between the Reference and Test means (arithmetic or geometric) to a preset equivalence limit $\theta_{ABE}$, by computing the 90% confidence interval of the mean difference. A log transformation may be applied prior to the ABE application depending on the data distribution (e.g., with AUC). In contrast, PBE (2,14-19) systematically applies a log transformation to the data and scales the difference between the two geometric means according to the variability of the Reference product. Moreover, PBE induces an asymmetrical formula through the computation of the variance difference $\sigma_T^2 - \sigma_R^2$. Therefore, PBE is more restrictive than ABE if the Reference product variability is low. Conversely, PBE is less restrictive than ABE if the Reference variability is high, even more so if the Test product variability is lower than that of the Reference product (11).
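As a hedged sketch, the two criteria described above can be written as follows. The notation ($\widehat{SE}$, $t_{0.95,\nu}$, $\theta_{PBE}$) is assumed here for illustration only, and only the reference-scaled form of PBE is shown; the regulatory guidances remain the authoritative definitions.

```latex
% ABE (TOST): the 90% confidence interval of the mean difference
% must lie entirely within the preset equivalence limits
\left[(\bar{x}_T - \bar{x}_R) \pm t_{0.95,\nu}\,\widehat{SE}(\bar{x}_T - \bar{x}_R)\right]
\subset \left[-\theta_{ABE},\ \theta_{ABE}\right]

% PBE (reference-scaled form, log-transformed data): the squared mean
% difference plus the variance difference is scaled by the Reference variance
\frac{(\mu_T - \mu_R)^2 + (\sigma_T^2 - \sigma_R^2)}{\sigma_R^2} \le \theta_{PBE}
```

The asymmetry discussed in the text is visible in the numerator of the PBE criterion: a Test variance lower than the Reference variance makes $\sigma_T^2 - \sigma_R^2$ negative, which offsets part of the squared mean difference.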
These statistical methods are different and can consequently provide different conclusions (1). As a result, demonstrating bioequivalence for one device with all the existing statistical methods may be impossible. Recently, the European Federation for Pharmaceutical Sciences initiated work on a harmonization process to evaluate bioequivalence across the world with the aim of obtaining a robust and common approach (1). In addition, a recent report highlighted the statistical difficulties relating to in vitro or in vivo bioequivalence testing (2). This report highlights the necessity to develop "more consistent, better aligned, science-based approaches" across countries (20).
ABE has a bioequivalence limit of 15%. The guidance recommends the calculation of 90% confidence intervals for the observed in vitro differences (2,21,22). Contrary to the FDA (14), the EMA does not account for the parameter specificities and, particularly, the variability. This aspect, also called the one-size-fits-all criterion, may be a major concern for the bioequivalence test (23). Indeed, the confidence interval inflation, which depends on the variability and heteroscedasticity, may lead to rejecting bioequivalence even when both products appear intuitively equivalent (5).
In the last decade, PBE has been challenged by several studies (1,2,9,21), especially in the context of in vitro bioequivalence testing. The main limitation is the asymmetry of the method, which favors situations with higher variability on the Reference product than on the Test product (9,21). In addition, the bioequivalence limit is fixed by the guidance, which may be questionable (21,24). Finally, as highlighted in Morgan et al. (2018), the IPAC-RS PBE working group recommends further investigation into the appropriateness of a log transformation for in vitro data in equivalence assessments (9,21). Therefore, the asymmetrical formula of PBE may lead to accepting bioequivalence even when both products appear intuitively not equivalent. This may increase the type I error, which reflects the probability of incorrectly accepting equivalence. This situation arises especially when the Reference variability and heteroscedasticity are large (11).
To prevent the risk of erroneously accepting equivalence in cases of high Reference variability, it is recommended to consider more samples (4,5,25,26). This approach also improves the power of the test (i.e., the true positive rate, TPR) for PBE and ABE calculations. However, while this solution could be acceptable for in vitro studies, it could be ethically questionable for in vivo studies, where the exposure of patients should be limited.
While both ABE and PBE tests consider the device total variability, the between-batch variability, which is part of the latter, is not considered individually in the mathematical definitions proposed by the FDA and EMA guidelines. Recent works recommended considering the between-batch variability in bioequivalence tests (2,9,21,27,28). Indeed, neither method considers it explicitly. On one side, ABE only considers within-product variability. On the other side, the FDA guidance on Budesonide (29) recommends decomposing the variance as a composite of a super-batch variability (i.e., after pooling all batches per product) and the within-individual variability to study the life-stage (beginning, middle, and end of use) equivalence. Burmeister et al. (27) illustrated the inability of the state-of-the-art methods to prove the equivalence between two batches of the same Reference product, which are by definition equivalent. Furthermore, Burmeister et al. (28) showed an increased probability, around +25%, of incorrectly concluding equivalence (i.e., the type I error or false positive rate, FPR) in the presence of between-batch variations. In addition, Morgan et al. (9) confirmed an increase of at least 15% in the probability of incorrectly concluding equivalence (type I error) when neglecting between-batch variability in PBE studies. This same study also highlights an increase, close to 20%, in the probability of incorrectly rejecting equivalence (i.e., the type II error or false negative rate, FNR) in high between-batch variability cases. Indeed, the authors obtained a type II error of 6% with no between-batch variability and 24% with 50% of the variability attributed to the between-batch component, when the relative standard deviation is 10%. The same magnitude is observed with higher relative standard deviation on the Reference (9).
Thus, considering this between-batch variability in the statistical formula has a potential to improve the probability to correctly accept the equivalence.
From this statement, an alternative statistical test, named between-batch bioequivalence (BBE), is proposed to assess in vitro bioequivalence. This statistical approach is based on the comparison between the mean difference (Reference − Test) and the Reference between-batch variability. The main hypothesis is that, by considering the between-batch variability of the Reference, the BBE test will be more appropriate to demonstrate equivalence in the case of variable drug products, without needing to increase the total required sample size. This statistical method can deal with data on the original scale as well as with log-transformed data. As a first step, this study focused on the development of the statistical method, including an exact procedure to implement the test statistic and a confidence interval approximation to graphically illustrate the test results. Then, the type II error of the BBE method, i.e., the FNR, was estimated by simulations and compared with the two mainly recognized statistical methods (ABE and PBE). In a second step, the type I error of the BBE method, i.e., the FPR, was estimated by simulations to ensure that it remains of the theoretical order of 5%. In a third step, the BBE type II error was assessed and compared with the reference methods through a real case application on nasal spray in vitro performance tests performed on two Reference products from the market (whose batches are by definition equivalent). Finally, the results are interpreted and conclusions are drawn.

MATERIAL AND METHODS
Theory: The Between-Batch Bioequivalence Procedure

Statistical Model
As reported in several studies, neglecting non-zero between-batch variability can have a strong impact on the bioequivalence conclusions (21,27). Thus, the method developed in this work proposes to include the Batch factor into the statistical equivalence test. In fact, the multiple measurements on a single batch induce a dependence between data and a violation of the independence assumption with the state-of-the-art methods ABE and PBE. The assumption of independence could be satisfied if the Batch factor is considered in the model. This BBE statistical context is close to the nested mixed model with a fixed effect Product, and a random effect Batch nested in Product.

$Y \sim \text{Product} + \text{Batch}[\text{Product}]$
For such models, the significance test for the Product fixed effect is performed through the mean square (MS) ratio MS(Product)/MS(Batch[Product]), which follows an F distribution under the null hypothesis. Conceptually, the nested mixed model computes the ratio of the difference between the two product means (Reference and Test) on one side to the mean difference between batches within each product on the other side.
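The mean square ratio described above can be sketched as follows for a balanced design. This is a minimal illustration, assuming the same number of batches per product and the same number of measurements per batch; the function name `nested_f_ratio` is hypothetical and not taken from the article.

```python
from statistics import mean


def nested_f_ratio(reference, test):
    """Return MS(Product) / MS(Batch[Product]) for a balanced nested design.

    `reference` and `test` are lists of batches; each batch is a list of
    measurements. Balance (equal batch counts and batch sizes) is assumed.
    """
    products = [reference, test]
    n_batches = len(reference)        # batches per product
    n_per_batch = len(reference[0])   # measurements per batch
    grand_mean = mean(x for p in products for b in p for x in b)

    # Product mean square: two products, hence 1 degree of freedom
    ss_product = sum(
        n_batches * n_per_batch
        * (mean(x for b in p for x in b) - grand_mean) ** 2
        for p in products
    )
    ms_product = ss_product / (len(products) - 1)

    # Batch-within-Product mean square: batch means around their product mean
    ss_batch = 0.0
    for p in products:
        product_mean = mean(x for b in p for x in b)
        ss_batch += sum(n_per_batch * (mean(b) - product_mean) ** 2 for b in p)
    ms_batch = ss_batch / (len(products) * (n_batches - 1))

    return ms_product / ms_batch
```

In this balanced setting, the ratio would be compared with an F distribution with 1 and 2(n_batches − 1) degrees of freedom; the p-value computation is left out to keep the sketch free of third-party dependencies.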
The mixed-effect model described above is a difference test, i.e., the alternative hypothesis assumes a difference between the two products. In the context of equivalence testing, the alternative hypothesis assumes no difference between the two products. Furthermore, the mean square Batch[Product] considers the batches of the two products. Conceptually, the equivalence test should compare the difference between the two products with the differences between Reference batches. BBE can be formulated through the two following hypotheses: $H_0: |\mu_T - \mu_R| / \sigma_{BBR} \ge \theta$ versus $H_1: |\mu_T - \mu_R| / \sigma_{BBR} < \theta$, where $\mu_T$ and $\mu_R$ stand for the means of the Test and Reference product measurements, $\sigma_{BBR}$ for the between-batch variability on the Reference product, and $\theta$ for the bioequivalence limit. Thus, the method is based on the comparison between the mean difference (Reference − Test) and the Reference between-batch variability. The proposed approach has been patented under the reference WO2020/053223 A1 (30).

Exact Procedure
Let us suppose throughout this section that the observation of each batch follows a normal distribution with a part of common variance. More specifically, $\sigma_{BBR}^2 = \sigma_{BBT}^2$, where $\sigma_{BBT}^2$ stands for the between-batch variance of the Test product.
Under this assumption, the batch mean samples of the Reference and Test, respectively $\bar{x}_{B_iR}$ and $\bar{x}_{B_iT}$, follow Gaussian distributions $\bar{x}_{B_iR} \sim N(\mu_R, \sigma_{BBR}^2)$ and $\bar{x}_{B_iT} \sim N(\mu_T, \sigma_{BBR}^2)$. From the $H_0$/$H_1$ hypothesis testing, BBE can be endorsed if the probability of $-\theta < \delta < \theta$ is greater than the confidence level $(1 - \alpha)$, with $\delta = (\mu_T - \mu_R)/\sigma_{BBR}$. An estimator of $\delta$ can be expressed as $g = (\bar{x}_T - \bar{x}_R)/s_{BBR}$. However, the sampling distribution of $g$ cannot be analytically determined. Based on (27), the sampling distribution of $gK$ can be calculated if $K$ is a constant that depends only on $n_{BR}$ and $n_{BT}$, where $n_{BR}$ denotes the number of sampled batches for the Reference product and $n_{BT}$ is the number of sampled batches for the Test product.
$gK$ follows a noncentral Student distribution ($T_{nc}$) with $n_{BR} - 1$ degrees of freedom and a noncentrality parameter that depends on $\delta$ and $K$. However, $g = (\bar{x}_T - \bar{x}_R)/s_{BBR}$ is a biased estimator of $\delta$. An unbiased estimator of $\delta = (\mu_T - \mu_R)/\sigma_{BBR}$ can be obtained through the application to $g$ of a correction function, $c$, that only depends on the degrees of freedom of $s_{BBR}$ and that can be approximated in closed form. Thus, the BBE test statistic follows a noncentral Student distribution with $n_{BR} - 1$ degrees of freedom.
The BBE test statistic is referred to $T$, a noncentral Student law with $n_{BR} - 1$ degrees of freedom and a noncentrality parameter determined by the bioequivalence limit $\theta$. Thus, the method of the BBE test consists of calculating the test statistic; the $H_1$ hypothesis (i.e., the equivalence between the two products) is then endorsed by comparing this statistic with the appropriate quantile of $T_{nc}$, the noncentral Student distribution.
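The distributional results used in this procedure can be sketched as follows. This is a hedged reconstruction from standard results on standardized mean differences, not the authors' exact formulas: the constant $K$ and the Hedges-type approximation of $c$ are assumptions, with $\bar{x}_R$ and $\bar{x}_T$ taken as the means of the batch means and $s_{BBR}$ as the standard deviation of the Reference batch means.

```latex
% Assumed setting: equal between-batch variances, n_BR Reference and n_BT Test batches
\bar{x}_T - \bar{x}_R \sim N\!\left(\mu_T - \mu_R,\ \sigma_{BBR}^2\left(\tfrac{1}{n_{BR}} + \tfrac{1}{n_{BT}}\right)\right),
\qquad
\frac{(n_{BR}-1)\,s_{BBR}^2}{\sigma_{BBR}^2} \sim \chi^2_{n_{BR}-1}

% A constant K for which gK is noncentral Student (assumption)
K = \sqrt{\frac{n_{BR}\,n_{BT}}{n_{BR}+n_{BT}}},
\qquad
gK = \frac{\bar{x}_T - \bar{x}_R}{s_{BBR}}\,K \sim T_{nc}\!\left(n_{BR}-1,\ \delta K\right)

% Hedges-type approximation of the bias correction, with m = n_BR - 1 (assumption)
c(m) \approx 1 - \frac{3}{4m - 1},
\qquad
\hat{\delta} = c(n_{BR}-1)\,g
```

With very few Reference batches the degrees of freedom are small, so this approximation of $c$ should be read as indicative only.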

Bioequivalence Limits
Conceptually, the Test can be considered statistically equivalent to the Reference if its mean lies within the 95% tolerance interval of the Reference batch means. Based on the central limit theorem, it can be inferred that the means of the Reference batches follow a normal distribution of mean $\mu_R$ and variance $\sigma_{BBR}^2$: $\bar{x}_{B_iR} \to N(\mu_R, \sigma_{BBR}^2)$. Thus, 95% of the means of the Reference product batches are included in the interval $[\mu_R - 1.96\,\sigma_{BBR};\ \mu_R + 1.96\,\sigma_{BBR}]$.
Therefore, in the remaining parts of this work, the Test will be considered statistically equivalent to the Reference if $|\mu_T - \mu_R| < 1.96\,\sigma_{BBR}$, i.e., $\theta = 1.96$. This BBE bioequivalence limit will be challenged on TPR estimation through different real cases.

Estimate the BBE test statistic
Estimate the bioequivalence limits
The point estimate of the BBE CI test is denoted $E_\lambda$. The upper limit of the 95% confidence interval of $E_\lambda$ must be strictly negative to accept the alternative hypothesis. This upper limit is computed from a coefficient noted $H_\lambda$, using a Student coefficient whose degrees of freedom (df) are obtained after Welch-Satterthwaite correction.

Experimental: Simulation Design
Simulation studies were run to estimate the type I and type II errors of the proposed method, i.e., the false positive and false negative rates, respectively. The type II error simulations were computed for the three equivalence methods (ABE, PBE, and BBE) in order to compare their ability to accurately prove equivalence. Simulations consisted of 10,000 replications for each triplet ($\sigma_R$, $\pi_{BB}$, $n_{Batches}$). As detailed in Table I, a wide range of input parameters was used to generate the simulated datasets. These parameters, including the number of batches (from 3 to 10), the RSD (from 5% to 40%), and the part of the variability attributed to between-batch variability (from 10% to 90%), were chosen to cover a large variety of real cases.
Four different mean values of the Test population were used:

• The first one, $\mu_T = \mu_R$, was used to estimate the true positive rate (TPR), i.e., the power of the test (1 − type II error). The results are analyzed and compared for the three equivalence methods (ABE, PBE, and BBE) with respect to the triplet [$\sigma_R$, $\pi_{BB}$, $n_{Batches}$].

• The three last ones were computed to estimate the false positive rate (FPR), i.e., the type I error. The results are analyzed for the BBE method with respect to the same triplet:
  • $\mu_T = \mu_R + 1.96\,\sigma_{BBR}$, corresponding to the BBE equivalence limit;
  • $\mu_T = \mu_R + 1.96\,\sigma_{BBR} \times 1.05$, representing a deviation of 5% from the bioequivalence limit;
  • $\mu_T = \mu_R + 1.96\,\sigma_{BBR} \times 1.1$, representing a deviation of 10% from the bioequivalence limit.
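The four Test mean settings listed above can be written as a small helper. This is an illustrative sketch; the function name and dictionary keys are hypothetical labels, not taken from the article.

```python
def scenario_means(mu_r, sigma_bbr, limit=1.96):
    """Return the four Test means used in the simulation design.

    The first entry is used for the TPR (power) estimation, the other
    three for the FPR (type I error) estimation.
    """
    shift = limit * sigma_bbr  # distance from mu_R to the BBE limit
    return {
        "tpr_equal_means": mu_r,
        "fpr_at_limit": mu_r + shift,
        "fpr_limit_plus_5pct": mu_r + shift * 1.05,
        "fpr_limit_plus_10pct": mu_r + shift * 1.1,
    }
```

With $\mu_R = 10$ and $\sigma_{BBR} = 1$, the three FPR scenarios place the Test mean at 11.96, 12.058, and 12.156, respectively.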
The number of products (n PB ) composing each batch is defined as the round value of the ratio n/n Batches . Then, n = 32 for 4 and 8 batches; n = 30 for 3, 5, 6, and 10 batches; n = 28 for 7 batches; and n = 27 for 9 batches.
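The sample sizes quoted above can be reproduced with a short sketch. It assumes a nominal total of n = 30 products per product arm before rounding, which is inferred from the reported totals rather than stated in the text; the function name is hypothetical.

```python
def products_per_batch(n_batches, target_n=30):
    """Return (n_PB, effective n) for a given number of batches.

    n_PB is the rounded ratio target_n / n_batches; the effective total
    is n_PB * n_batches. target_n = 30 is an assumption inferred from
    the totals reported in the text.
    """
    n_pb = round(target_n / n_batches)
    return n_pb, n_pb * n_batches


# Effective totals for 3 to 10 batches
totals = {k: products_per_batch(k)[1] for k in range(3, 11)}
```

Under this assumption, the totals match the text: 30 for 3, 5, 6, and 10 batches, 32 for 4 and 8 batches, 28 for 7 batches, and 27 for 9 batches. Python rounds exact halves to the nearest even integer, which gives 8 for the only half-way case here (30/4 = 7.5).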
To compute the between-batch ($\sigma_{BBR}$) and the within-batch ($\sigma_{WBR}$) variabilities, the sums of squares had to be calculated:
• the between-batch sum of squares $SS_{BB} = \pi_{BB} \times SS_{Tot}$,
• the within-batch sum of squares $SS_{WB} = SS_{Tot} \times (1 - \pi_{BB})$,
where $\pi_{BB}$ denotes the percentage of the total variability attributed to the between-batch variability. $\sigma_{BBR}$ and $\sigma_{WBR}$ were then derived from these sums of squares. The following procedure was used to simulate the Reference data (the same approach is applied to simulate the Test product).

1. Estimation of batch mean values
2. Estimation of the j products parameter values inside each batch
a. Generation of a normal distribution $x_{B_{ij}R0}$ with null mean and standard deviation $\sigma_{WBR}$.
b. Due to simulation noise, the real mean of this distribution is not exactly equal to 0. The distribution is therefore centered back to a mean of 0 by subtracting from $x_{B_{ij}R0}$ the mean of $x_{B_{ij}R0}$.
c. Finally, each product (j) parameter value ($x_{B_{ij}R}$) of each batch (i) is obtained by adding $x_{B_{ij}R0}$ to its corresponding batch mean value.
ABE ($T_{ABE}$), PBE ($T_{PBE}$), and BBE ($T_{BBE}$) test statistics were estimated for the 10,000 replications of each triplet [$\sigma_R$, $\pi_{BB}$, $n_{Batches}$] and the TPR of each method was calculated. As required, a log transformation was applied to the data before estimating the PBE test statistic. No data transformation was applied to perform ABE and BBE. An equivalent procedure was applied to estimate the BBE FPR. Simulations and statistical analysis were computed using R version 3.6.1 (31).
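The data-generation steps above can be sketched as follows. The original analysis was run in R; this Python version is only an illustration, and it assumes a direct variance split ($\sigma_{BBR}^2 = \pi_{BB}\,\sigma_R^2$ and $\sigma_{WBR}^2 = (1 - \pi_{BB})\,\sigma_R^2$) in place of the sum-of-squares computation, whose exact form is not reproduced here. The function name and arguments are hypothetical.

```python
import random
from statistics import mean


def simulate_product(mu, sigma_total, pi_bb, n_batches, n_per_batch, rng):
    """Simulate one product as a list of batches of measurements.

    Assumption: pi_bb is the fraction of the total variance attributed
    to the between-batch component.
    """
    sigma_bb = (pi_bb ** 0.5) * sigma_total          # between-batch SD
    sigma_wb = ((1.0 - pi_bb) ** 0.5) * sigma_total  # within-batch SD
    batches = []
    for _ in range(n_batches):
        # Step 1: draw the batch mean
        batch_mean = rng.gauss(mu, sigma_bb)
        # Step 2a: draw within-batch deviations with null mean
        deviations = [rng.gauss(0.0, sigma_wb) for _ in range(n_per_batch)]
        # Step 2b: re-center the deviations to an exact mean of 0
        centre = mean(deviations)
        deviations = [d - centre for d in deviations]
        # Step 2c: add the deviations to the batch mean
        batches.append([batch_mean + d for d in deviations])
    return batches


# Usage: 5 batches of 6 products, RSD of 10% around a mean of 100
reference = simulate_product(100.0, 10.0, 0.5, 5, 6, random.Random(1))
```

Because of the re-centering in step 2b, the mean of each simulated batch equals the batch mean drawn in step 1, up to floating-point error.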

Experimental: Real Cases
The BBE approach has been applied to two nasal sprays, Nasonex® (Merck) and Flonase® (GSK), which are corticosteroids used to treat nasal symptoms such as congestion, sneezing, and runny nose caused by seasonal or year-round allergies. These two products are already commercialized and considered individually as Reference products. In accordance with FDA guidance (16,19,33), two main parameters of in vitro spray performance were considered for the bioequivalence tests. The control of these parameters ensures the quality of the nasal spray product and, ultimately, the efficacy of patient treatment. First, the Dv50 (or D50), related to the droplet size distribution (DSD) (34), was measured by laser diffraction using a Spraytec (Malvern, UK) and an automatic actuator (Proveris, USA). Second, the spray pattern (area) was measured using a Sprayview system (Proveris, USA). This parameter, describing the shape of the spray (35), is known for its high variability (36). The study was performed at two distances (3 cm and 6 cm), as recommended in the guidance (18), with an actuation speed of 80 mm/s. Only data for Dv50 at a distance of 3 cm and for the spray pattern area at a distance of 6 cm are presented here.
The statistical comparison of the three methods (ABE, PBE, and BBE) was performed separately on batches of Flonase and Nasonex products. After checking the normality of the distribution, the raw data were left untransformed for ABE and BBE; a log transformation was applied only for PBE, which requires it. The aim was to evaluate whether the three methods are able to correctly conclude equivalence for each product, in a context where equivalence holds. For that purpose, the true positive rate of each method was estimated and compared. For both criteria (D50 and area) and both products (Nasonex and Flonase), batches of each product were randomly selected to be considered as Reference or Test for the purpose of the bioequivalence evaluation (26).
Twenty-three batches of Flonase, each composed of 10 products, and 16 batches of Nasonex, each composed of 6 products, were used. For the Flonase product, all the possible combinations (without repetition and without replacement) of twice 3 batches (3 considered as the Reference and 3 considered as the Test) were computed. In the same way, all the possible combinations (without repetition and without replacement) of twice 5 batches of the Nasonex product (5 considered as the Reference and 5 considered as the Test) were computed. For both the products, this resulted in comparing 30 products in the Reference and Test populations. A total of 2,018,940 combinations for the Flonase product and 2,018,016 combinations for the Nasonex were obtained and the 3 methods were applied to the D50 and area criteria.
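The combination counts reported above can be checked with a short sketch. It assumes that each comparison is an ordered assignment of two disjoint batch sets, one labeled Reference and one labeled Test, which is the reading that reproduces the reported totals. The function name is hypothetical.

```python
from math import comb


def n_reference_test_splits(n_available, n_per_arm):
    """Count ordered (Reference, Test) choices of disjoint batch sets.

    Choose n_per_arm batches for the Reference, then n_per_arm batches
    for the Test among the remaining ones.
    """
    return comb(n_available, n_per_arm) * comb(n_available - n_per_arm, n_per_arm)


flonase = n_reference_test_splits(23, 3)  # 1771 * 1140
nasonex = n_reference_test_splits(16, 5)  # 4368 * 462
```

This gives 2,018,940 comparisons for Flonase and 2,018,016 for Nasonex, and each arm contains 30 products in both cases (3 × 10 and 5 × 6).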
As both Flonase and Nasonex are commercial products, the selected batches can be considered as equivalent (all the Flonase batches are equivalent to one another, and all the Nasonex batches are equivalent to one another). Thus, the ratio of the number of positive equivalence results to the total number of tests (i.e., the total number of combinations) reflects the TPR.

Simulation Results: Type II Error
The true positive rate (TPR) comparison between the 3 statistical methods is compiled in Fig. 1 with respect to the Reference variability. From a general point of view, we can observe that the more batches in the Reference, the greater the TPR, whatever the bioequivalence method. Furthermore, each test exhibits a specific trend when looking at the relationship between the TPR and the total variability on the Reference.
For ABE, a decreasing sigmoid is observed, showing that the higher the Reference variability, the lower the ABE TPR. Moreover, an inflection point can be identified in terms of CV R . Before this point, the ABE TPR is close to 1, while after the ABE TPR decreases and tends to 0 when CV R tends to infinity. This inflection point depends on both the number of batches in the Reference and the proportion of the total variability attributed to the between-batch variability.
For PBE, an inflection point can also be identified. Before this inflection point, the PBE TPR decreases from values close to 1 to its minimal value, while after, the PBE TPR slowly increases and tends to 1. However, contrary to ABE, this inflection point has a fixed CV R value around 10%. It should be noted that the results only present situations where both the Reference and Test variabilities are equal. The simulations are performed with equal variances, i.e., under homoscedasticity assumption.
An inflection point is also observed for BBE, corresponding to a CV R value of 6%. This point is characterized by a sharp rise of TPR for CV R values lower than 6%. TPR approaches the asymptote y = 1 when CV R values are greater than 6%.
In more detail, a plateau is observed for ABE, with TPR values close to 1, whose length depends on both the number of batches in the Reference and the proportion of the total variability attributed to the between-batch variability ($\pi_{BB}$). Indeed, the more batches in the Reference, the longer the plateau. Conversely, the higher the between-batch variability, the shorter the plateau. Thus, the ABE performance is the highest for small $CV_R$ values, small $\pi_{BB}$, and a high number of batches.
The PBE true positive rate also depends on the number of batches in the Reference, with higher TPR values observed for the highest number of batches in the Reference. However, no strong relationship was observed between the TPR and the between-batch variability.
Concerning the BBE test, the TPR also depends on the number of batches in the Reference, rising with the number of batches. The BBE TPR also depends on the between-batch variability, with an overall increase of the TPR as a function of $\pi_{BB}$. However, this dependence is less marked when the number of Reference batches increases. For instance, considering 3 batches and a total variability of 10%, the BBE TPR goes from 58% when $\pi_{BB}$ = 20% to 84% when $\pi_{BB}$ = 80%, while considering 10 batches and a total variability of 10%, the BBE TPR goes from 97% when $\pi_{BB}$ = 20% to 98% when $\pi_{BB}$ = 80%.
The true positive rates (TPR) of the two state-of-the-art methods were compared with the BBE TPR, and the differences between them are reported in Fig. 2. From a general point of view, results showed that the BBE method is more appropriate than the ABE and PBE methods for high values of the Reference total variability. In more detail, when compared with ABE, the BBE method showed higher TPR from $CV_R$ values around 15% when the between-batch variability equals 20% of the total variability, between 10% and 15% when 50% of the total variability is attributed to the between-batch variability, and around 10% when the between-batch variability represents 80% of the total variability. Moreover, the higher the total variability, the higher the TPR gain.
When comparing BBE with PBE, the CV R threshold to reach higher TPR is lower than the one obtained with ABE, while the TPR gain is lower. Indeed, this threshold is reached for CV R values slightly varying between 5 and 10%, depending on the proportion of the total variability attributed to the between-batch variability. For high CV R values, the TPR difference between PBE and BBE decreases and tends to 0. The maximal gain between PBE and BBE is around 30%, while this value grows to more than 90% when considering ABE versus BBE.
Thus, depending on the properties of the sampled batches, one method can reach higher TPR than the two others. Cutoff values have been identified and reported in Fig. 3. This figure delimits the area where the performance of one method is stronger than the others with respect to these properties.
The relation between the graphical illustration of BBE (BBE confidence interval: BBE$_{CI}$) and BBE is illustrated in Fig. 4. The graph on the left represents the BBE$_{CI}$ TPR as a function of the BBE TPR. Results are colored according to different configurations of between-batch variability. This figure shows that the vast majority of the measurements lie close to the line y = x, meaning that the TPRs of both tests are close. In addition, for a small number of observations, measurements slightly deviate from this line, with the BBE$_{CI}$ power always being lower than the BBE power. On the right side, the graph represents the ratio between the BBE$_{CI}$ power and the BBE power with respect to the between-batch variability. When the between-batch variability increases, this ratio increases and converges to 1, showing that BBE$_{CI}$ is particularly relevant in non-zero between-batch variability situations (ratio greater than 0.9 when $\pi_{BB}$ > 30%).

Simulation Results: Estimation of the Relative Difference Between Means
With a similar simulation approach, the BBE type II error was also estimated with respect to the relative difference between Reference and Test means. To that end, the Reference mean was fixed at $\mu_R = 10$ and the Test mean was varied over $\mu_T \in [10, 20]$, corresponding to a relative difference between means in [0%, 100%].
First, Fig. 5 confirmed the TPR values observed in Fig. 1 for a relative difference between means equal to 0. The main objective of Fig. 5 is to illustrate the acceptable relative difference between means at fixed TPR levels as a function of the RSD, $\pi_{BB}$, and $n_{BR}$ values.
As expected, the acceptable relative difference increases with the RSD and $\pi_{BB}$, which together determine the Reference between-batch variability ($\sigma_{BBR}$). Indeed, considering 5 batches on the Reference product and $CV_R$ = 40%, the acceptable relative difference to reach a TPR greater than 80% is around 8% when the Reference between-batch variability represents 20% of the total variability ($\pi_{BB}$ = 20%), around 15% with $\pi_{BB}$ = 50%, and around 20% with $\pi_{BB}$ = 80%. On the other hand, considering 5 batches on the Reference product and $\pi_{BB}$ = 50%, the acceptable relative difference to reach a TPR greater than 80% is around 2% when the Reference relative standard deviation is equal to 10% ($CV_R$ = 10%), around 8% with $CV_R$ = 20%, around 12% with $CV_R$ = 30%, and around 14% with $CV_R$ = 40%.
Similarly, the acceptable relative difference increases with the number of batches. For instance, with $\pi_{BB}$ = 50%, $CV_R$ = 40%, and TPR = 80%, the maximum difference between Reference and Test means is around 8% with 3 batches, around 15% with 5 batches, around 18% with 7 batches, and finally around 22% with 10 batches.
Simulation Results: Type I Error
Figure 6 illustrates the BBE false positive rate (type I error) as a function of the Reference variability. Observations are organized by the number of studied batches (columns, from 3 to 10 batches) and by the part of the Reference variability explained by the between-batch component (rows, from 20 to 80%). The 3 curves represent different values of the Test mean. The first one corresponds to the BBE bioequivalence limit, the second and third ones to the bioequivalence limit plus a small deviation.
The first observation on this figure is that the BBE type I error for all configurations is equal to 5% or less. From the analysis of the figure, the type I error is independent of the part of the Reference variability that is linked to the between-batch component, whereas the type I error decreases when the number of studied batches increases. Figure 7 illustrates the BBE type I error mean values with respect to the number of studied batches for three different values of the Test mean, showing a mean value slightly greater than 5% with 3 batches in the Reference (type I error = 5.4%) at the bioequivalence limit. The type I error continuously decreases when the number of studied batches increases.
Fig. 4. Left panel: colors represent the between-batch variability ($\pi_{BBR}$), expressed as a percentage of the total variability, arbitrarily discretized to the values of 20%, 50%, and 80%. The right panel shows the BBE$_{CI}$ to BBE power ratio with respect to the between-batch variability; colors represent the number of batches.

Real Case Results: Challenge of the BBE Bioequivalence Limit
The objective of this section is to confirm the BBE bioequivalence limit initially set at 1.96. Beyond the (1 − β, α) performance obtained with the prefixed BBE bioequivalence limit, it remains relevant to confirm this bioequivalence limit on real data. The 23 batches of Flonase and the 16 batches of Nasonex made it possible to study the evolution of the TPR with respect to different acceptance criteria values for BBE in an equivalence context. Table II summarizes the estimated BBE bioequivalence limits required to reach acceptable levels of TPR (80%, 85%, 90%, 95%). With more than 3 batches, the $\theta$ = 1.96 bioequivalence limit made it possible to reach a TPR of at least 80%. Moreover, the TPR level is greater than 90% with 7 batches. Figure 8 illustrates the evolution of the TPR with respect to $\theta$. The higher the pair ($CV_R$, $\pi_{BB}$), i.e., the higher $\sigma_{BBR}$, the lower the BBE bioequivalence limit required to reach an acceptable value of TPR. These analyses show that the BBE bioequivalence limit fixed at $\theta$ = 1.96 is a reliable compromise for achieving a TPR level of at least 80%, especially with more than 3 batches.

Fig. 6. BBE type I error as a function of the relative standard deviation on the Reference. Type I errors are estimated for three values of the Test mean, namely $\mu_T = \mu_R + 1.96\,\sigma_{BBR}$, corresponding to the bioequivalence limit, $\mu_T = \mu_R + 1.96\,\sigma_{BBR} + \epsilon$, and $\mu_T = \mu_R + 1.96\,\sigma_{BBR} + 2\epsilon$ (legend entries: $\mu_T = \mu_R + 1.96 \times \sigma_{BBR}$, $\mu_T = \mu_R + 1.96 \times \sigma_{BBR} \times 1.05$, and $\mu_T = \mu_R + 1.96 \times \sigma_{BBR} \times 1.1$).

Real Case Results: True Positive Rate
As reported in Table II, the results observed in the real case application confirm those of the simulation study. Indeed, when the total variability of the Reference is low (around 5% for the Dv50 criterion), the two state-of-the-art methods reached high true positive rates, with values greater than 99%. In this situation, the BBE equivalence test reached significantly lower true positive rates, depending on the between-batch variability (69% with π BB = 25% and 84% with π BB = 43%). On the other hand, for the Area parameter, which exhibits a higher Reference variability (27% and 18% for the Flonase and Nasonex products, respectively), the BBE method showed higher true positive rates than ABE and PBE (70% versus 14% and 60% for Flonase, and 85% versus 62% and 69% for Nasonex) (Table III).

Real Case Results: Concrete Examples
In order to illustrate the relative performance of the equivalence tests, concrete examples are given in this section. Table IV summarizes the results on the Flonase real case. The left part of Figure 9 exhibits the raw data for the Flonase® nasal spray batches, each composed of 10 samples. On the right side, the graphs illustrate the bioequivalence test using the 3 statistical methods (ABE, PBE, and BBE). For the Dv50 criterion, the ratio between the geometric means of Reference and Test is close to 1, the difference between the arithmetic means is low, and both variances are low (around 5%) whatever the number of batches. Thus, the three tests were able to recognize equivalence between Reference and Test. For the Area criterion, the ratio of geometric means is still close to 1 (from 0.90 to 0.99). However, the difference in arithmetic means is greater (around 10% of the Reference mean), and the Reference variability is higher than the Test one (above 30% versus around 20%, respectively). In such a situation, PBE and BBE were able to recognize the equivalence between Reference and Test from 3 batches, while ABE needed an increased number of batches (6 batches).
Fig. 7. BBE type I error as a function of the number of batches (x-axis: number of batches; y-axis: type I error estimation). Type I errors are estimated for three values of the Test mean, namely μ T = μ R + 1.96σ BBR , corresponding to the bioequivalence limit, μ T = μ R + 1.96σ BBR + ϵ, and μ T = μ R + 1.96σ BBR + 2ϵ.

Nasonex®
Table V summarizes the results on the Nasonex real case. Figure 10 represents the bioequivalence test results for the Nasonex® nasal spray batches, each composed of 6 samples. For the Dv50 criterion, the ratio between the geometric means of Reference and Test is close to 1, the difference between the arithmetic means is low, and both variances are low (around 5%) whatever the number of batches. Thus, the three tests were able to recognize equivalence between Reference and Test. For the Area criterion, the variance of the Reference product is lower than that of the Test product, and both remain moderate (below 20%). The boxplots in Fig. 10 reveal that two Test batches (tests 2 and 3) present lower values than the others, and lower than the Reference observations. The geometric means ratio increases with the number of batches (from 0.95 with 5 batches to 1.01 with 8 batches). This increase of the geometric means ratio, in combination with the increasing number of batches, allowed PBE to accept bioequivalence from 8 batches.
Fig. 8. Bioequivalence true positive rates (power test profiles, y-axis) with respect to the relative acceptance bioequivalence limit. The study is done taking into account two parameters (Dv50 and Area, in columns) and three different numbers of batches (3: Flonase®, 5: Nasonex®, 7: Nasonex®). Each curve corresponds to different values of the couple (CV R , π BB ).

DISCUSSION
In this work, an alternative bioequivalence method is proposed to mitigate the limitations of the two state-of-the-art methods (PBE and ABE). This method, named between-batch bioequivalence (BBE), is based on the comparison between the mean difference (Reference − Test) and the Reference between-batch variability. The theoretical justification of this approach was established through the derivation of the relevant test statistic (T BBE ), which follows a noncentral Student's t distribution. An exact procedure was developed to implement the BBE approach. In addition, this work introduces an illustrative method (BBE CI ) to compute BBE through confidence interval estimation. This illustrative method provides a simple and robust way to quickly illustrate the BBE results. The BBE performance was evaluated through the estimation of its true positive rate on simulated data, and these true positive rates were compared with those of ABE and PBE. In a second step, the false positive rate of the BBE method was evaluated on the simulated data. Finally, to illustrate the advantages and drawbacks of BBE compared with ABE and PBE, the three methods were applied to two commercial nasal spray products and their ability to demonstrate equivalence was evaluated.
Regulatory authorities in the United States (FDA) and the European Union (EMA) recommend the application of PBE and ABE, respectively, to assess equivalence between a generic and a reference drug. However, these two methods do not take into account the distribution of the Reference and Test products into batches. More importantly, they do not consider the variability between the different batches, which may be a strong limitation, as reported by several studies (2,9,21,27,28). More recently, the International Pharmaceutical Aerosol Consortium on Regulation and Science (IPAC-RS) PBE working group evaluated the impact of increasing the number of Reference batches and of considering the between-batch variability in ABE and PBE (37). This report highlighted the beneficial effect of increasing the number of sampled batches and of considering the between-batch variability in bioequivalence tests (both ABE and PBE), which increases the true positive rate and decreases the false positive rate. However, considering the between-batch variability in ABE and PBE did not overcome the inherent limitations of these methods at high CV R values, i.e., the decline of the ABE true positive rate and the increased permissiveness of PBE to a large difference between the Reference and Test means.
The differences observed between the true positive rates of the three bioequivalence methods can be explained by their specific mathematical definitions and thus their definitions of biological equivalence. Indeed, ABE only considers the product variabilities to construct the confidence interval of the mean difference. Thus, the ABE type II error increases when the variability, and consequently the between-batch variability, increases. On the other side, PBE considers the variability difference between Reference and Test. In contrast, BBE compares the mean difference (μ T − μ R ) with the Reference between-batch variability, which is usually related to the total Reference variability. Thus, the BBE type II error decreases when the Reference between-batch variability increases. As expected, the higher the Reference variability, and especially the between-batch variability, the greater the BBE true positive rate. Even with a relatively low between-batch variability (e.g., 20%), the BBE TPR is higher than 80% with 5 batches and higher than 90% with 7 batches, from a CV R of 6%. However, below a CV R of 6%, the BBE TPR decreases and BBE thus seems less appropriate. The ABE TPR increases with the number of Reference batches considered, even with the same total number of samples (e.g., considering 30 samples, the ABE TPR is greater than 80% for CV R up to 10% with 3 batches, whereas with 10 batches this holds for CV R up to 15%). However, a marked loss of ABE statistical power is observed when CV R increases. In other words, ABE is less appropriate from a CV R of 10% with 3 batches (TPR < 80%).
A high PBE TPR is observed for CV R up to 10% and for high CV R values, from 30%. This behavior can be explained by the PBE formula itself (29): the denominator of the PBE statistic takes the maximum between the Reference variability and 10%, explaining the inflection point observed in the TPR curves. Beyond this 10% threshold, the PBE statistic depends inversely on the CV R value. Thus, the PBE test is more permissive to high mean differences for the highest Reference variabilities (11). Between these two CV R values (10% to 30%), the PBE TPR goes from 50 to 80%, depending on the number of Reference batches: the higher the number of batches, the higher the TPR.
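The effect of the maximum in the PBE denominator can be illustrated with a short sketch. It uses only the point estimate of a PBE-type criterion on the log scale, not the confidence bound on which the regulatory decision is based, and the constants (σ T0 = 0.1, ε p = 0.01, and the resulting θ p ≈ 2.09) are assumed values commonly associated with the FDA criterion rather than values taken from this article. The function names are hypothetical.

```python
import math

SIGMA_T0 = 0.1       # assumed regulatory floor on the Reference standard deviation
EPSILON_P = 0.01     # assumed variance allowance
THETA_P = (math.log(1.11) ** 2 + EPSILON_P) / SIGMA_T0 ** 2   # about 2.0891


def pbe_point_estimate(mean_diff, sigma_t, sigma_r):
    """Point estimate of a PBE-type criterion (log scale, illustrative only)."""
    numerator = mean_diff ** 2 + sigma_t ** 2 - sigma_r ** 2
    denominator = max(sigma_r ** 2, SIGMA_T0 ** 2)   # the maximum discussed in the text
    return numerator / denominator


def max_admissible_diff(sigma_t, sigma_r):
    """Largest |mean difference| for which the point estimate stays below THETA_P."""
    budget = THETA_P * max(sigma_r ** 2, SIGMA_T0 ** 2) - (sigma_t ** 2 - sigma_r ** 2)
    return math.sqrt(budget) if budget > 0 else 0.0


if __name__ == "__main__":
    # Equal Test and Reference variabilities: the admissible difference is flat
    # up to the 10% floor, then grows with the Reference variability.
    for sigma in (0.05, 0.10, 0.20, 0.30):
        print(sigma, round(max_admissible_diff(sigma, sigma), 4))
```

With equal Test and Reference variabilities, the admissible mean difference is constant up to the 10% floor and then grows with the Reference variability, which is the permissiveness described above.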
This study suggests that the number of batches used to evaluate bioequivalence between a generic product and its Reference should be increased. There is a strong dependence between the number of Reference batches and the true positive rate, which can be explained by a better estimation of the mean and the variability of the population when more batches are considered. The US FDA (17) recommends a minimum of 30 samples (i.e., 3 batches of 10 samples). Based on this study, the recommendation would be to use at least 5 batches of 6 units, keeping the total number of samples at 30. This recommendation is aligned with the recent report of Chen et al., which suggests increasing the number of sampled batches rather than the total number of measurements (37). A goodness-of-fit study was performed on the real data to validate that the estimates of the means are sufficiently accurate with 6 units per batch in comparison with 10 units per batch. This study is detailed in the supplementary materials. At a constant sample size per product (n R = n T = 30), it seems more appropriate to take 5 batches of 6 units rather than 3 batches of 10 units. In summary, increasing the number of batches may constitute a valuable alternative to increasing the total sample size. This approach is more compatible with the context of generic drug development, as it results in a better characterization of both the Reference and Test populations without increasing the total number of samples.
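The benefit of spreading a fixed number of samples over more batches can be checked with the textbook variance of a product mean under a balanced nested design, Var = σ B ²/b + σ W ²/(b × n), where b is the number of batches and n the number of units per batch. The sketch below is only an illustration of that formula; the function name is hypothetical and π BB is assumed to be the between-batch share of the total variance.

```python
def variance_of_product_mean(n_batches, units_per_batch, sigma_total, pi_bb):
    """Variance of the estimated product mean for a balanced nested design.

    Assumption: pi_bb is the between-batch share of the total variance.
    """
    var_between = pi_bb * sigma_total ** 2
    var_within = (1.0 - pi_bb) * sigma_total ** 2
    return var_between / n_batches + var_within / (n_batches * units_per_batch)


if __name__ == "__main__":
    # Same total of 30 samples, allocated as 3 x 10 or as 5 x 6.
    for pi_bb in (0.2, 0.5, 0.8):
        v_3x10 = variance_of_product_mean(3, 10, sigma_total=1.0, pi_bb=pi_bb)
        v_5x6 = variance_of_product_mean(5, 6, sigma_total=1.0, pi_bb=pi_bb)
        print(f"pi_BB={pi_bb:.1f}  3x10: {v_3x10:.4f}  5x6: {v_5x6:.4f}")
```

At a fixed total of 30 samples the within-batch term is identical in both designs, so the whole gain comes from dividing the between-batch variance by 5 instead of 3; the gain vanishes only when the between-batch variability is zero.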
Considering the batch factor in the bioequivalence test seems essential to better evaluate the between-batch variability. As previously reported in several studies (9,21,39), this lack of consideration of the between-batch variability (not taken into account in the PBE and ABE formulas) can lead to a decrease of the true positive rate when this variability is non-negligible. While this should not induce a higher risk for the customer, it may increase the risk of erroneously rejecting equivalence and thus, indirectly, the development costs. From a different perspective, regulatory authorities recommend increasing the sample size when the true positive rate is low (4,5,25,26). However, this could be counterproductive: as demonstrated in Chen et al. (37), increasing the number of measurements within a batch inflates the false positive rate when the between-batch variability is high and the number of batches is low. The BBE method is based on a mixed effect model, taking the batch into account as a nested factor. Thus, the batch factor and the between-batch variability are considered in the BBE formula.
(Figure caption fragment) [...], and between-batch bioequivalence (BBE) results, using 3 (N = 30), 4 (N = 40), 5 (N = 50), and 6 (N = 60) batches for both the Reference and Test products.
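As an illustration of how a between-batch variance can be separated from the within-batch variance when the batch is treated as a random factor, the sketch below applies the classical one-way ANOVA (method of moments) estimator to a balanced design. This is a standard estimator used here as an assumption for illustration; it is not necessarily the estimation procedure used by the authors, and the function name is hypothetical.

```python
from statistics import mean


def variance_components(batches):
    """ANOVA (method of moments) estimates for a balanced one-way random model.

    `batches` is a list of equally sized lists of measurements.
    Returns (between-batch variance, within-batch variance).
    """
    n_batches = len(batches)
    n_units = len(batches[0])
    if n_batches < 2 or n_units < 2 or any(len(b) != n_units for b in batches):
        raise ValueError("need at least 2 balanced batches of at least 2 units")

    batch_means = [mean(b) for b in batches]
    grand_mean = mean(batch_means)

    # Mean squares between and within batches.
    ms_between = n_units * sum((m - grand_mean) ** 2 for m in batch_means) / (n_batches - 1)
    ss_within = sum((x - m) ** 2 for b, m in zip(batches, batch_means) for x in b)
    ms_within = ss_within / (n_batches * (n_units - 1))

    # The between-batch estimate is truncated at zero when MSB < MSW.
    var_between = max((ms_between - ms_within) / n_units, 0.0)
    return var_between, ms_within


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(variance_components(data))
```

With only 3 batches the between-batch mean square has 2 degrees of freedom, which gives a sense of why the between-batch variability is poorly estimated when few batches are sampled.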
Overall, the BBE true positive rate analysis highlights high and stable values. Indeed, from a Reference variability of 6%, the BBE TPR remains stable over the studied range of Reference variability values. Furthermore, with at least 5 batches, the TPR is greater than 80% (when CV R ≥ 6%). Thus, this method makes it possible to reach reasonable performance without increasing the total sample size. The benefits of BBE have been illustrated through a real case study of two commercially available nasal sprays using bioequivalence assumptions. In contrast to ABE and PBE, BBE was able to prove bioequivalence in all cases with 30 samples, whereas additional experimental measurements were required for the other methods, increasing the sample size up to 60 samples in the case of ABE for Flonase and up to 48 samples in the case of PBE for Nasonex. This real case application highlighted the strengths and weaknesses of each method. ABE is strongly dependent on the within-product variability because this method is performed through the confidence interval of the mean difference. Furthermore, the Welch-Satterthwaite correction of the degrees of freedom expands the confidence interval under heteroscedasticity, which is quite often observed in real cases. On the other side, PBE is strongly dependent on the difference between the Reference and Test variabilities, favoring situations where the Test variability is lower than the Reference variability, but penalizing situations with a lower Reference within-product variability. By considering the between-batch variability, BBE is useful when the Reference between-batch variability is a non-negligible part of the Reference within-product variability.
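The effect of the Welch-Satterthwaite correction mentioned above can be seen from the standard two-sample formula for the approximate degrees of freedom. The sketch below is a generic illustration with a hypothetical function name; how the correction is combined with the batch structure in the ABE implementation of this study is not shown here.

```python
def welch_satterthwaite_df(var_1, n_1, var_2, n_2):
    """Approximate degrees of freedom for the difference of two sample means."""
    term_1 = var_1 / n_1
    term_2 = var_2 / n_2
    return (term_1 + term_2) ** 2 / (term_1 ** 2 / (n_1 - 1) + term_2 ** 2 / (n_2 - 1))


if __name__ == "__main__":
    # Equal variances, 30 samples per product: the pooled value 2 * (30 - 1).
    print(welch_satterthwaite_df(1.0, 30, 1.0, 30))
    # Unequal variances: fewer degrees of freedom, hence a larger t quantile.
    print(welch_satterthwaite_df(9.0, 30, 1.0, 30))
```

Fewer degrees of freedom mean a larger Student quantile, so for the same standard error the confidence interval of the mean difference is wider when the two variances differ.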
One of the main results of this study was the identification of ranges, in terms of total variability, between-batch variability, and number of batches, where the BBE TPR was higher than that of the two other methods. Indeed, results showed that whatever the total number of Reference batches, the BBE TPR is always higher than the ABE TPR when CV R is greater than 15%. This threshold is reduced when the between-batch variability increases, going from CV R = 15% when the between-batch variability represents 20% of the total variability to CV R = 10% when the between-batch variability increases to 80% of the total variability. When comparing BBE with PBE, the threshold seems independent of the between-batch variability and of the number of Reference batches. However, the true positive rate gain itself depends on the number of batches, going from a minimal value of 15% with 3 batches to 20% with 10 batches.
While the BBE method showed strengths and advantages as compared with the ABE and PBE methods, it also has some limitations. First of all, BBE requires at least 5 batches to reach a suitable performance. Another limitation of the BBE method comes from its dependence on the total variability of the Reference. In particular, the method showed weak true positive rates for the lowest values of CV R (CV R ≤ 6%). This observation emphasizes that the BBE method may not be appropriate in low variability situations, especially when few batches are considered. Regarding the bioequivalence limits, FDA and EMA have defined bioequivalence limits in their guidance (15,22). The BBE bioequivalence limit has been defined at 1.96 and evaluated through the simulation studies described above. However, this value still requires an evaluation by regulatory authorities. Finally, the BBE method assumes equality of the Reference and Test between-batch variances. Indeed, only the Reference between-batch variability is considered in the equation of the model. Even though this assumption may appear as a limitation of BBE, in real life the Reference between-batch variability is quite often higher than that of the Test, because of a longer time interval between batch productions. Thus, the Test between-batch variability is underestimated, leading to a greater stringency of the method and avoiding false positives (i.e., the inflation of the type I error). In other words, BBE will avoid considering the two products as equivalent if the Test between-batch variability is higher than that of the Reference. Also due to the longer time interval between batch productions, the Reference within-product variability is mainly explained by the between-batch variability, or at least the between-batch variability will be high enough to ensure the statistical reliability of BBE.
As a first perspective, the dependency of both the ABE and PBE TPR on the number of batches raises the question of whether their false positive rates show a similar dependency. Indeed, this aspect was partially covered by the IPAC-RS PBE working group. A more systematic study may be useful to ensure the safety of the final consumer. In this study, the real case application was limited to in vitro bioequivalence of nasal sprays. We recommend conducting further bioequivalence studies to evaluate the method and prove its relevance for both in vivo and in vitro bioequivalence testing. Further studies should also be conducted to explore the PBE TPR under heteroscedasticity. Last but not least, it could be interesting to set up a protocol that combines the advantages of each method. To that end, a single criterion, defined as a function of the variability (between-batch and within-product), should be established to characterize the validity range of each method.

CONCLUSION
The purpose of this work was to propose an alternative statistical method to evaluate equivalence. In contrast to ABE and PBE, the BBE method considers the between-batch variability.
Simulation and real data studies showed the robustness of BBE compared with ABE and PBE, especially in a non-zero between-batch variability context. Moreover, BBE does not require increasing the number of samples in well-known cases where ABE and PBE have a lower TPR.
This work highlighted the BBE prerequisites, namely at least five batches per product (Reference and Test) and a relative standard deviation on the Reference product (CV R ) greater than 6%. The BBE computation, described in the section "BBE computation," can be accomplished with elementary calculations or with the dedicated web application (40).
Thus, BBE may be of particular interest to optimize the generic development in both in vitro and in vivo contexts.

ACKNOWLEDGMENTS
The authors would like to thank Dr. Laurent Vecellio, Scientific Director at Nemera and researcher at the University of Tours (France), for his help with the writing and reviewing of this manuscript. The authors would also like to thank Mr. Maxime Rioland and Dr. Frank Sauvage, data scientists at Seenovate, for their contribution to the development of the statistical method.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.