Introduction

Biosimilars are highly similar to a previously licensed biological reference product (U. S. Food and Drug Administration Guidance for Industry 2015; European Medicines Agency 2014). Likewise, regulators require that biological medicines undergoing a manufacturing process change be demonstrated to be highly similar to the version of the medicine before the process change (ICH Harmonised tripartite guideline Q5E 2004). In both cases, a thorough comparison of quality attributes, such as physicochemical and functional attributes, lays the foundation for establishing high similarity. Clinical studies are rarely needed for the introduction of manufacturing process changes and are used in a tailored manner during biosimilar development, which places the major burden of proof on the comparison of quality attributes (U. S. Food and Drug Administration Guidance for Industry 2015; European Medicines Agency 2014; McCamish and Woollett 2012).

Notably, the definition of “high similarity” includes a certain discretion and allows for differences between the two products in quality attributes if sufficient product understanding is available to conclude that those differences are clinically meaningless. Many quality attributes may also show a certain but controlled variability between different production lots of a given product (Lamanna et al. 2017; Schiestl et al. 2011; Kim et al. 2017). Regulatory guidelines require manufacturers to control the critical quality attributes of biologics within appropriate ranges or limits, so that quality and clinical properties remain consistent over time (ICH Harmonised tripartite guideline Q7 2000; ICH Harmonised tripartite guideline Q8(R2) 2009). Comparing quality attributes in a similarity exercise therefore requires the comparison of different lots to describe and analyze the variability between them. Different non-inferential and inferential statistical approaches have been proposed with the aim of increasing the objectivity and robustness of these assessments (Tsong et al. 2017; Chow et al. 2016). The simplest approach is the visual display. Other approaches compare the test sample with observed (MinMax) or estimated ranges of the reference sample, such as xSigma or tolerance intervals. Yet another approach is equivalence testing of means.

However, the application of statistics is not as easy as it may appear at first glance. In 2018 the FDA withdrew a dedicated draft guidance with specific proposals, which may illustrate the difficulties in establishing standards in this area (U. S. Food and Drug Administration 2018). A first caveat when applying statistical tests is the essential flexibility of the requirement of “high similarity”, which allows for differences if they are clinically meaningless. Statistics may facilitate the detection of differences, e.g. in data distributions or ranges, but determining whether or not these differences are clinically relevant is a scientific question that cannot be addressed by a statistical approach alone. This article focuses on the meaningful detection of statistical differences. In the event that differences are detected, the next step in the evaluation of claims for comparability of a manufacturing change or for biosimilarity, i.e. the determination of the clinical relevance of the detected differences, requires the assessment of all relevant product and process knowledge, including structure-function relationships, understanding of the mode of action, safety data, and clinical experience with the product.

A comparison of the different statistical approaches or tests requires the calculation of their operating characteristics based on a clear hypothesis for accepting a claim of statistically equivalent quality. The concept of statistically equivalent quality is the scientific basis behind existing regulations for manufacturing process changes as well as for the variability of quality attributes in routine production. This article proposes such a hypothesis and a tool that allows the calculation and comparison of operating characteristics such as the average false acceptance rate and the average false rejection rate. A false rejection means that a product is rejected although it fulfills our hypothesis for equivalent quality, whereas a false acceptance means that a product is accepted although it does not fulfill our hypothesis. The error risks calculated by the tool are relative rather than absolute because they depend on the calculation parameters. However, they allow meaningful comparisons of the different statistical tests with regard to their utility in similarity exercises. For simplicity, we use the terms statistical approach and statistical test synonymously in this article.

Materials and methods

Hypothesis for accepting a claim for statistical equivalence for the analyzed quality attribute

The variability of the reference product defines the acceptable quality for the test product population. Equivalence for the quality attribute is established if the population of the test product lies within the population of the reference product. The width of a population is described by 3σ because 3σ is commonly used as a threshold to delineate a population. For example, in statistical process control, data points beyond 3σ are investigated because they might result from special cause variability and thus not belong to the population.

The population of the test product therefore lies within the population of the reference product if μtest − 3σtest > μref − 3σref and μtest + 3σtest < μref + 3σref. In other words, and assuming normal distributions, if at least the central 99.7% of the test product population lies within the central 99.7% of the reference product population. The equivalence region described by this definition is illustrated as a triangle in Fig. 1.
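
After normalizing by σref, with the difference in means expressed as (μtest − μref)/σref and the SD ratio as σtest/σref, the two inequalities collapse to the single linear constraint |μtest − μref|/σref + 3·σtest/σref ≤ 3, which is the triangle of Fig. 1. A minimal sketch of this region check is given below; the function name and the non-strict handling of the triangle boundary are our choices, not taken from the article.

```python
def in_equivalence_region(mean_diff, sd_ratio):
    """Check whether the test population lies within the reference population.

    mean_diff: (mu_test - mu_ref) / sigma_ref, difference in means in units of sigma_ref
    sd_ratio:  sigma_test / sigma_ref

    The two conditions mu_test +/- 3*sigma_test inside mu_ref +/- 3*sigma_ref
    collapse to one linear constraint: the triangle of Fig. 1 with corners
    (SD ratio / difference in means) of (0/0), (1/0) and (0/3).
    """
    return abs(mean_diff) + 3.0 * sd_ratio <= 3.0
```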

Fig. 1

Definition of the statistical equivalence region: The dark grey region is the defined true similarity region and indicates those combinations of SD ratio and difference in means where the test population is within the reference population. The population width is described by μ ± 3σ, which translates into a triangle with corners (SD ratio / difference in means) of (0/0), (1/0) and (0/3). The light grey region is the statistical non-equivalence (false similarity) region, indicating where the test population is not considered to be within the reference population

The tool - calculation of the average false acceptance and false rejection rates

Acceptance rates, i.e. the likelihood of passing the test, are calculated by Monte Carlo methodology. Under the assumption of normally distributed data, the reference population and test population are distinguished by a relative difference in means (μtest − μref)/σref and the ratio of SD (standard deviation) σtest/σref. For any given sample sizes for the reference product and test product, nref and ntest respectively, nref and ntest samples are drawn repeatedly (nsim = 1000) and randomly from the defined reference and test populations and evaluated for acceptance using the statistical test (MinMax, 3Sigma, tolerance interval, equivalence test for means). The acceptance rate is calculated as the proportion of accepted samples among the generated samples (nsim). For any given sample size combination, acceptance rates can be calculated systematically for all relevant combinations of the difference in means and the ratio of SD. In this article, acceptance rates were calculated for a grid covering all combinations of the difference in means from 0 to 4 σref with a step size of 0.1 σref and ratios of SD from 0 to 4 with a step size of 0.1.
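
A minimal sketch of this simulation loop is shown below. Without loss of generality the reference population is standardized to N(0, 1), so the test population is N(difference in means, SD ratio). The names are illustrative and are not taken from the code in Additional file 1; test_fn stands for any of the acceptance rules defined under Statistical tests below.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is reproducible

def acceptance_rate(test_fn, n_ref, n_test, mean_diff, sd_ratio, n_sim=1000):
    """Monte Carlo acceptance rate for one (mean_diff, sd_ratio) grid point.

    test_fn(ref_sample, test_sample) -> bool, True if the test is accepted.
    The reference population is standardized to N(0, 1), so mean_diff is in
    units of sigma_ref and sd_ratio equals sigma_test / sigma_ref.
    """
    accepted = 0
    for _ in range(n_sim):
        ref = rng.normal(0.0, 1.0, n_ref)
        test = rng.normal(mean_diff, sd_ratio, n_test)
        accepted += test_fn(ref, test)
    return accepted / n_sim
```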

Calculated acceptance rates can be visualized by plotting the acceptance rate as a function of the difference in means (μtest − μref)/σref and the ratio of SD σtest/σref (see Additional file 1: Figure S1 for an example contour plot).
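
Such a contour plot could be produced, for example, with matplotlib, building on the acceptance_rate sketch above; the plotting details are our own and not taken from Additional file 1.

```python
import matplotlib.pyplot as plt

def plot_acceptance_surface(test_fn, n_ref, n_test, step=0.1, n_sim=1000):
    """Contour plot of the acceptance rate over the (SD ratio, mean diff) grid."""
    diffs = np.arange(0.0, 4.0 + step / 2, step)    # difference in means, in sigma_ref
    ratios = np.arange(0.0, 4.0 + step / 2, step)   # sigma_test / sigma_ref
    rates = [[acceptance_rate(test_fn, n_ref, n_test, d, r, n_sim)
              for r in ratios] for d in diffs]
    plt.contourf(ratios, diffs, rates, levels=10)
    plt.xlabel("ratio of SD (sigma_test / sigma_ref)")
    plt.ylabel("difference in means ((mu_test - mu_ref) / sigma_ref)")
    plt.colorbar(label="acceptance rate")
    plt.show()
```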

Average false acceptance rates (false positive) are calculated as the average of the acceptance rates over all grid points in the statistical non-equivalence region, where any acceptance is by definition a false acceptance. Average false rejection rates (false negative) are calculated as the average of the rejection rates (1 − acceptance rate) over all grid points in the statistical equivalence region.
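
Combining the sketches above with the region check from the hypothesis section, the averaging step could look as follows (again a sketch under our naming assumptions, not the authors' Additional file 1 code):

```python
def average_error_rates(test_fn, n_ref, n_test, step=0.1, n_sim=1000):
    """Average false acceptance and false rejection rates over the full grid."""
    diffs = np.arange(0.0, 4.0 + step / 2, step)
    ratios = np.arange(0.0, 4.0 + step / 2, step)
    false_acceptances, false_rejections = [], []
    for d in diffs:
        for r in ratios:
            rate = acceptance_rate(test_fn, n_ref, n_test, d, r, n_sim)
            if in_equivalence_region(d, r):
                false_rejections.append(1.0 - rate)  # rejected although equivalent
            else:
                false_acceptances.append(rate)       # accepted although not equivalent
    return np.mean(false_acceptances), np.mean(false_rejections)
```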

The code for these calculations is provided in the Additional file 1.

Statistical tests

a) MinMax: A MinMax range is defined by the lowest and highest value of a sample. The MinMax test is accepted if the MinMax range of the test sample is within the MinMax range of the reference sample (minTest > minRef and maxTest < maxRef).

b) 3Sigma: The 3Sigma range is calculated for the reference sample as (μref − 3σref, μref + 3σref). The 3Sigma test is accepted if the MinMax range of the test sample is within the 3Sigma range.

c) Tolerance interval (TI): The tolerance interval is calculated for the reference sample as (μref − k·σref, μref + k·σref). The k-factor is calculated two-sided with a confidence level of 0.9 and a proportion of the population covered by the tolerance interval of P = 0.99. The tolerance interval test is accepted if the MinMax range of the test sample is within the tolerance interval calculated for the reference sample.

d) Equivalence testing of means (EQT): A two one-sided t-tests (TOST) procedure is used to test for equivalence of the means of the reference product and the test product. The equivalence margin is defined as δ = 1.5·sref (standard deviation of the reference product sample), and the Type I error probability is controlled at level α = 0.05 for a conclusion of equivalence. The test is accepted if the (1 − 2α)·100% = 90% confidence interval for the difference in the means is within (−δ, +δ). A sketch of all four acceptance rules follows this list.
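
The four acceptance rules could be implemented along the following lines. This is a sketch under stated assumptions rather than the authors' code: the article does not specify how the tolerance k-factor is computed (Howe's approximation is used here), nor whether the TOST confidence interval uses pooled or Welch-type variances (Welch's approximation is used here); all function names are ours.

```python
import numpy as np
from scipy import stats

def minmax_test(ref, test):
    # a) Accept if the test sample range lies within the reference sample range.
    return test.min() > ref.min() and test.max() < ref.max()

def three_sigma_test(ref, test):
    # b) Accept if the test sample range lies within mean +/- 3 SD of the reference sample.
    m, s = ref.mean(), ref.std(ddof=1)
    return test.min() > m - 3 * s and test.max() < m + 3 * s

def tolerance_interval_test(ref, test, conf=0.90, coverage=0.99):
    # c) Two-sided normal tolerance interval; the k-factor uses Howe's
    # approximation (an assumption -- the article does not name a method).
    n = len(ref)
    z = stats.norm.ppf((1 + coverage) / 2)
    chi2 = stats.chi2.ppf(1 - conf, n - 1)
    k = z * np.sqrt((n - 1) * (1 + 1 / n) / chi2)
    m, s = ref.mean(), ref.std(ddof=1)
    return test.min() > m - k * s and test.max() < m + k * s

def equivalence_test(ref, test, alpha=0.05, margin_factor=1.5):
    # d) TOST via confidence-interval inclusion: accept if the 90% CI for the
    # difference in means lies within (-delta, +delta), delta = 1.5 * s_ref.
    delta = margin_factor * ref.std(ddof=1)
    n1, n2 = len(ref), len(test)
    v1, v2 = ref.var(ddof=1), test.var(ddof=1)
    se = np.sqrt(v1 / n1 + v2 / n2)
    # Welch-Satterthwaite degrees of freedom (an assumption; a pooled-variance
    # t-statistic would be an equally plausible reading of the article).
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    t = stats.t.ppf(1 - alpha, df)
    diff = test.mean() - ref.mean()
    return diff - t * se > -delta and diff + t * se < delta
```

With these pieces in place, a single point of the comparison in Fig. 2 would correspond to a call such as average_error_rates(minmax_test, n_ref=10, n_test=10).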

Results

The operating characteristics of a statistical test depend on the underlying hypothesis of statistical equivalence for the quality attribute, which, for this article, is fulfilled if the population of the test product lies within the population of the reference product. This hypothesis reflects the current regulation of manufacturing processes, which requires that critical quality attributes be controlled within ranges or limits. Under the assumption of normally distributed populations, Fig. 1 illustrates the resulting region of statistical equivalence for a test product population that is distinguished from the reference population by a difference in means and the ratio of the distribution widths (ratio of SD). The triangle illustrates the equivalence region as the area where the conditions for statistical equivalence defined above are met. Average false rejection and average false acceptance rates were calculated for the different statistical tests as described in the Methods section and are displayed in Fig. 2, which compares the statistical tests with respect to these error rates for sample sizes of nref = ntest = 10.

Fig. 2

Average operating characteristics: The average operating characteristics are shown for the tests MinMax, 3Sigma, equivalence testing of means and tolerance interval for a sample size of nref = 10 and ntest = 10. The horizontal axis shows the average false acceptance rate, i.e. the risk of a statistically false positive conclusion on similarity. The vertical axis shows the average false rejection rate, i.e. the risk of a statistically false negative conclusion on similarity

Figure 3 shows the impact of varying sample size on the average false acceptance and rejection rates for MinMax, 3Sigma and equivalence testing of means. For MinMax, increasing nref lowers the false rejection rate without a large impact on the false acceptance rate, while increasing ntest reduces the false acceptance rate but slightly increases the false rejection rate. Similar trends with different magnitudes are observed for 3Sigma, where increasing nref reduces the false rejection rate and also slightly reduces the false acceptance rate, and increasing ntest strongly reduces the false acceptance rate while having only a marginal impact on the false rejection rate. While increasing sample sizes reduce the false acceptance rate for MinMax and 3Sigma as expected, they increase the likelihood of passing the test, and the associated false acceptance rate, for equivalence testing of means. This effect is especially pronounced with increasing test sample size. This different behavior of equivalence testing can be attributed to the lack of alignment of the equivalence test with the proposed equivalence hypothesis, which requires that the population of the test product lie within the population of the reference product. Range-based tests are in general better aligned with this equivalence hypothesis. Tolerance interval testing generally shows a low false rejection rate, but at small sample sizes also a high false acceptance rate. However, increasing sample sizes, especially nref, reduce the false acceptance rate to levels comparable with the other statistical tests (see Fig. 4).
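
Using the helper functions sketched in the Methods section, such a sample size sweep could be scripted as follows. This is a minimal sketch for one acceptance rule; nsim is reduced here only to keep the runtime of the illustration short.

```python
# Sweep reference and test sample sizes for one acceptance rule (here: MinMax)
# to generate the kind of data underlying Fig. 3.
for n_ref in range(4, 31, 2):
    for n_test in range(4, 31, 2):
        fa, fr = average_error_rates(minmax_test, n_ref, n_test, n_sim=200)
        print(f"n_ref={n_ref:2d}  n_test={n_test:2d}  "
              f"avg false acceptance={fa:.3f}  avg false rejection={fr:.3f}")
```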

Fig. 3

Sample size dependency of average operating characteristics: The average operating characteristics are shown for different test and reference product sample sizes (nref = 4, 6, 8, ..., 30; ntest = 4, 6, 8, ..., 30) for the tests MinMax, 3Sigma and equivalence testing of means. Arrows indicate the effect of increasing test product and reference product sample size (annotated as ntest and nref, respectively). The grey background polygons accentuate the area of operating characteristics for the individual tests

Fig. 4

Sample size dependency of average operating characteristics (Tolerance Interval): The average operating characteristics are shown for different test and reference product sample sizes (nref = 4, 6, 8, ..., 30; ntest = 4, 6, 8, ..., 30) for the test Tolerance Interval. Arrows indicate the effect of increasing test product and reference product sample size (annotated as ntest and nref, respectively). In contrast to Fig. 3, the x-axis for the false acceptance rates is extended in order to visualize all data points

Discussion

From a statistical viewpoint, without further knowledge of the impact of differences in quality attributes on safety and efficacy and without taking into account any risk mitigation by a proper control strategy in manufacturing, the average false acceptance and rejection rates represent estimates for false positive and false negative decisions on similarity between quality attributes of two products. Both error rates are important and should be as low as possible; however, a small false acceptance rate is even more desirable because it might impact risks posed to the patient, whereas the false rejection rate primarily impacts the risk for the manufacturer. The tool is therefore well suited to compare different statistical tests for their applicability in similarity assessments. Any specific application in a similarity exercise additionally requires consideration of potential multiplicity effects, as typically many quality attributes are compared in parallel (Bretz et al. 2010). The tool also assumes normally distributed data and process variability without special cause variation, meaning that the analytical variability is negligible and the sample data do not shift over time. Non-normally distributed data and special cause variation require additional considerations with regard to sampling distributions and data evaluation.

The results provided in this article reveal that MinMax is a conservative approach with a low false acceptance rate, but it has a high false rejection rate. Equivalence testing also has a high false rejection rate and, with increasing sample size, a considerable false acceptance rate. The 3Sigma approach provides a more practical compromise between the error rates, which further improves with larger sample size. Tolerance interval testing is only usable if the sample size is sufficiently large.

A frequent practical question in the evaluation of similarity is how many test samples are needed for robust decision making. The tool clearly shows that very small sample sizes can considerably increase the false acceptance rates for the range-based tests. The tool allows definition of acceptable sample sizes based on desired operating characteristics and/or investigation of alternative strategies to control the false acceptance rate.

For the equivalence test, on the other hand, an increasing sample size leads to greater precision in estimating the difference in means. In combination with the lack of alignment of the EQT with the equivalence hypothesis (test population within the reference population), this leads to an undesired increase of the false acceptance rate with increasing sample size.

While the examples illustrate the impact of sample size, the tool can also be used to assess the impact of other statistical testing parameters on the false acceptance and rejection rates. Finally, alternative hypotheses for statistical equivalence of the quality attributes can easily be assessed. For example, the equivalence region can be defined differently, on the one hand to allow a small difference in means even when σtest equals σref, and on the other hand to exclude the uncomfortable, although highly unlikely, situation that a very narrowly distributed test population is located in the far tail of the reference distribution. Such a hypothesis could define equivalence of the quality attribute if the central 95% of the test population is within the central 99% of the reference population (see Additional file 1: Figure S2). For the operating characteristics of such an alternate hypothesis, see Additional file 1: Figure S3 (MinMax, 3Sigma, equivalence testing of means) and Fig. 4 (TI).
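
Under normality this alternative hypothesis again reduces to a linear constraint, since the central 95% of a normal population spans μ ± 1.96σ and the central 99% spans μ ± 2.576σ. A generalized version of the region check from the Methods section might look as follows (a sketch; the parameterization and name are our own):

```python
from scipy import stats

def in_alt_equivalence_region(mean_diff, sd_ratio, p_test=0.95, p_ref=0.99):
    """Alternative hypothesis: the central p_test of the test population must
    lie within the central p_ref of the reference population.

    mean_diff is (mu_test - mu_ref) / sigma_ref; sd_ratio is sigma_test / sigma_ref.
    With p_test = p_ref = 0.997 this approximately recovers the 3-sigma
    triangle of Fig. 1.
    """
    z_test = stats.norm.ppf((1 + p_test) / 2)  # ~1.960 for the central 95%
    z_ref = stats.norm.ppf((1 + p_ref) / 2)    # ~2.576 for the central 99%
    return abs(mean_diff) + z_test * sd_ratio <= z_ref
```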

Conclusion

Regulatory guidelines for biosimilar evaluation and for comparability of manufacturing process changes require highly similar quality attributes between the biosimilar candidate and the reference medicine, and between the pre- and post-change product, respectively (U. S. Food and Drug Administration Guidance for Industry 2015; European Medicines Agency 2014; ICH Harmonised tripartite guideline Q5E 2004). The definition of “high similarity” of quality attributes includes a range of variability, and even statistically significant differences can be acceptable if sufficient knowledge allows the conclusion that such differences are clinically meaningless. Nonetheless, statistical approaches may facilitate the comparison of quality attributes by identifying statistical differences, which then require further scientific evaluation before a conclusion can be drawn on whether a claim of high similarity is fulfilled or not. The tool presented in this article provides a means to calculate relevant operating characteristics of different statistical approaches, such as the average false acceptance (false positive) and average false rejection (false negative) rates. These properties allow a meaningful comparison and proper selection of statistical tests and may inspire research into novel statistical approaches for comparing quality attributes.