In this part, case studies are presented to illustrate how components of a comparability exercise interrelate and whether any uncertainties around biosimilarity can be reduced by other data from the comparability exercise.
The review focuses on biosimilars for which differences were observed in different parts of the comparability exercise and on the justification for why these differences did not preclude regulatory approval. The cases are summarised in Table 2 and further discussed in the following sections.
For one biosimilar infliximab, extensive analytical tests showed physicochemical and structural comparability except for a small difference in the proportion of afucosylated forms. This glycoform is associated with the binding affinity of the molecule to the FcγRIIIa receptor expressed on various immune cells. The biosimilar and the reference infliximab demonstrated comparable binding to complement receptor and all types of Fc-receptors except for FcγRIIIa/b, where an approximately 20% reduction of binding was seen. This was accompanied by a 20% lower antibody-dependent cell-mediated cytotoxicity (ADCC) activity measured for the biosimilar in a highly sensitive assay using Jurkat cells overexpressing tumour necrosis factor receptors and natural killer cells. However, this difference disappeared under more physiological conditions, e.g. when serum was added or peripheral blood mononuclear cells were used as effector cells, suggesting that the observed difference in FcγRIIIa binding was not relevant in vivo. These differences were confirmed not to be clinically relevant in a large PK study in patients with ankylosing spondylitis and a large phase III equivalence trial in patients with RA.
All indications of the reference product were approved in the EU and in the USA but initially not in Canada, where extrapolation from autoimmune arthritis to inflammatory bowel disease was not accepted, as the ADCC mechanism of action was viewed as more important in the latter. Health Canada’s decision was reversed in 2015 based on additional functional data and the totality of evidence.
Comparability Regarding Pharmacokinetics and/or Pharmacodynamics
Two examples of an initially failed and subsequently successful PK study conducted with adalimumab biosimilars in healthy volunteers have been published [36, 37]. In both instances, it was argued that the differences in glycan structures known to affect PK (high mannose content) were too small to explain the initially observed PK differences, as only high mannose content of at least 20% would have the potential to alter the systemic exposure because of increased receptor-mediated elimination.
In one of these cases, the initial failed PK study was performed with a clinical trial formulation exhibiting differences in the buffering system compared with the commercial formulation. Post-hoc analyses correcting, for example, for body weight and protein content as covariates were performed but were unable to provide a satisfactory rationale for the observed differences. PK similarity was demonstrated in a second, improved PK study using the formulation intended for commercialization, a larger subject sample size considering the high PK variability, a standardized injection site and the predefined covariates body weight and age.
In the other case, an extensive root cause investigation was performed, including batch selection, investigational medicinal product (IMP) storage and transport, IMP preparation, IMP administration, PK sampling, PK sample shipping and testing, impact of body weight and antidrug antibody (ADA) development and other subject characteristics. However, no root cause driving the negative outcome of the PK study could be identified by the applicant.
Therefore, a second PK study was conducted with an adapted study design aiming at reducing inter-subject variability (body mass index [BMI] restriction, inclusion of only male subjects, increased sample size). IMP handling and dosing were also simplified by using prefilled syringes, which, unlike the presentation used in the previous study, required no IMP compounding. With this improved design, PK similarity could be shown.
Generally, life-cycle management on the part of the originator, such as the recent formulation changes to originator adalimumab, may pose a further challenge to biosimilar developers. Differences in formulation strengths will require analytical comparability testing in line with ICH Q5E. Depending on the results, the need for (non-)clinical comparability studies will be identified, but at any rate a clinical PK study is highly recommended.
In conclusion, a failed comparative PK study requires a thorough root cause analysis. The insights, if any, should lead to improved design of a subsequent study. To decrease the notably high PK variability of many biologicals, test conditions should be standardized as much as possible.
Teriparatide is a relatively simple molecule, as it is a 34 amino acid monomer and contains no glycosylation or other post-translational modifications.
To claim similarity of a biosimilar teriparatide to its reference product, a thorough physicochemical, structural and biological characterization as well as impurity profiling was performed. The clinical development programme consisted of one single-dose comparative PK study in 54 healthy women.
The predefined equivalence range of 80–125% for the relative bioavailabilities was met, but 100%, i.e. unity, was not included in the 90% confidence interval (CI), indicating a statistically significant difference, which is acceptable with appropriate justification. Mean exposure and peak serum concentrations were around 8% lower with the biosimilar than with the reference product, but the clinical impact of the observed difference was considered to be negligible given the available literature on the reference product regarding the impact of body weight and administration site on exposure. No relevant differences were identified by further analysing delivered volumes, active substance content, structure of the active substance and the PK assay.
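The two findings in this case, a CI inside the equivalence range and a CI that excludes unity, are independent checks. A minimal sketch of the distinction follows; the function name and the example interval are illustrative assumptions, not figures from the teriparatide study.

```python
def assess_pk_ratio(ci_lower, ci_upper, margin_lower=0.80, margin_upper=1.25):
    """Assess a 90% CI for the test/reference geometric mean ratio.

    Returns two independent findings: whether the whole CI lies inside the
    equivalence margins, and whether the CI excludes unity (1.0).
    """
    within_margins = margin_lower <= ci_lower and ci_upper <= margin_upper
    excludes_unity = ci_upper < 1.0 or ci_lower > 1.0
    return {"within_margins": within_margins, "excludes_unity": excludes_unity}


# Hypothetical interval for a point estimate about 8% below the reference;
# these bounds are assumed for illustration only.
result = assess_pk_ratio(0.87, 0.97)
print(result)
```

An interval of this kind satisfies the equivalence criterion while still showing a statistically significant difference, which is why a separate justification of clinical relevance is needed.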
In terms of the PD properties of teriparatide, the applicant measured serum calcium at several timepoints during the comparative PK study, as teriparatide is known to cause transient increases in calcium after each dose. An equivalence margin was not prespecified, but statistical analyses of all serum calcium concentrations showed close similarity between test and reference.
The close analytical and functional similarity, together with the similar PD profiles and the absence of a relevant difference in PK supported CHMP’s conclusion of biosimilarity.
Pegfilgrastim is associated with notably high PK variability. Therefore, every effort should be undertaken to minimize this variability in comparative PK studies. The dose-exposure relationship of pegfilgrastim is greatly disproportional: in healthy subjects, a tenfold increase in dose was shown to lead to an approximately 75-fold increase in exposure. Correction for protein content using linear models is therefore inappropriate, and attention should also be paid to administering exactly the same dose of test and reference product.
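To see why a linear correction for protein content falls short, the reported tenfold-dose, 75-fold-exposure relationship can be approximated by a power model. This is a rough sketch only: it assumes that a single power law, exposure ∝ dose^b, holds across the whole dose range, which the source does not state.

```python
import math

DOSE_FOLD = 10.0      # tenfold increase in dose (from the text)
EXPOSURE_FOLD = 75.0  # approximately 75-fold increase in exposure (from the text)

# Assumed power model: exposure is proportional to dose ** exponent.
exponent = math.log(EXPOSURE_FOLD) / math.log(DOSE_FOLD)


def exposure_ratio(dose_ratio, b=exponent):
    """Exposure ratio implied by a given dose ratio under the power model."""
    return dose_ratio ** b


# Illustrative case: test product delivers 5% more protein than reference.
dose_ratio = 1.05
print(f"exponent: {exponent:.3f}")
print(f"linear expectation: {dose_ratio:.3f}")
print(f"power-model expectation: {exposure_ratio(dose_ratio):.3f}")
```

Under this assumption, a 5% difference in delivered dose translates into a difference in exposure of nearly 10%, so a linear adjustment would under-correct by roughly half.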
Interestingly, the high PK variability is not paralleled by high PD variability. On the contrary, the exposure-response relationship appears rather flat, having led to highly similar PD responses (i.e. ANCs) even in cases of high PK variability or failed PK similarity, thus rendering PD endpoints less sensitive than PK endpoints to detect potential differences between two pegfilgrastim-containing products.
Two of six marketing authorization applications for biosimilar pegfilgrastims originally had failed PK trials using traditional comparability margins of 80–125% [41, 42].
Despite highly similar PD responses and data from phase III trials showing similar clinical performance, CHMP did not accept the failure to meet the predefined PK similarity margins, since PK studies are generally considered to be more sensitive to detect product-related differences than PD endpoints or clinical trials in patients. Both products were withdrawn during the initial review process but were later resubmitted with results from new PK studies to support biosimilarity. However, justification is always needed as to why evidence of similar PK profiles from new studies could outweigh results of failed PK comparability in previous studies. In one of the cases just described, the following reasons were accepted by CHMP. The initial study was a parallel-group study that was underpowered based on a grossly underestimated inter-subject variability. The second PK study employed a crossover design, which per se reduces variability, and was sufficiently powered to show PK similarity. In addition, sample heterogeneity was further reduced in the second study by formulating stricter inclusion requirements for ANC and BMI, both of which are known to be key drivers of pegfilgrastim PK. Thus, insights gained from the failed first study were used to design a more appropriate and successful second study. In the other case, the CHMP was concerned about the validity of results from the PK studies, and the applicant chose to withdraw the second application.
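The variance argument behind the switch from a parallel-group to a crossover design can be sketched with a standard normal-approximation sample size formula for equivalence testing on the log scale. The coefficients of variation below are hypothetical, the true ratio is assumed to be exactly 1, and the function names are illustrative; this is not the calculation used by either applicant.

```python
import math
from statistics import NormalDist


def _z_sum_squared(alpha=0.05, power=0.80):
    # Normal-approximation term for two one-sided tests when the true ratio
    # is 1: the type II error is split across both sides.
    z = NormalDist().inv_cdf
    return (z(1 - alpha) + z(1 - (1 - power) / 2)) ** 2


def log_variance(cv):
    # Log-scale variance of a log-normal variable with coefficient of variation cv.
    return math.log(1 + cv ** 2)


def parallel_total_n(cv_total, margin=1.25, alpha=0.05, power=0.80):
    # Parallel groups: the treatment contrast carries the total
    # (between- plus within-subject) variance.
    per_arm = 2 * log_variance(cv_total) * _z_sum_squared(alpha, power) / math.log(margin) ** 2
    return 2 * math.ceil(per_arm)


def crossover_total_n(cv_within, margin=1.25, alpha=0.05, power=0.80):
    # 2x2 crossover: each subject is their own control, so only the
    # within-subject variance enters.
    total = 2 * log_variance(cv_within) * _z_sum_squared(alpha, power) / math.log(margin) ** 2
    return math.ceil(total)


# Hypothetical variability: total CV 60%, within-subject CV 30%.
print("parallel total N:", parallel_total_n(0.60))
print("crossover total N:", crossover_total_n(0.30))
```

The sketch also shows how sensitive the required sample size is to the assumed variability, which is the mechanism by which an underestimated inter-subject CV leaves a parallel-group study underpowered.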
For the other four approved pegfilgrastim biosimilars, similar PK profiles were shown in crossover studies. The new draft revision of the respective guideline suggests widening of the acceptance range may be possible for the main PK parameters of pegfilgrastim since the PK variability is high and the PK–PD relationship is rather flat. In contrast, narrow comparability margins of 90–111% are suggested for the main PD parameter, ANC. Of note, based on the consideration that the endpoint ‘duration of neutropenia’ in clinical trials in patients receiving chemotherapy is rather insensitive with regard to detecting product-related differences of two (peg)filgrastim-containing medicinal products, and a suitable PD endpoint is available and measurable in healthy subjects, the requirement for a clinical trial has been lifted in the new draft revision of the guideline.
For one biosimilar insulin glargine, the analysis of the main PD endpoint ‘glucose infusion rate’ (GIR) in the insulin clamp study presented by the applicant yielded results within the predefined comparability range. However, study subjects with very low glucose requirements (below approximately 5% of mean values) were excluded, an exclusion that was not prespecified in the statistical analysis plan. When glucose profiles from all subjects were included in the analysis, the predefined acceptance range was not met. Therefore, PD similarity was formally not shown.
Nevertheless, the following findings and arguments led to the conclusion of no relevant product-related PD differences:
Firstly, variable response to insulin is known from the literature, and cases of low glucose requirements were seen equally with test and reference product.
Secondly, sensitivity analyses using different cut-offs for exclusion of ‘low glucose requirements’ yielded point estimates consistently around one, suggesting intra-individual variability rather than product-related differences being the likely cause for not meeting the primary PD objective of the clamp study.
Thirdly, the primary analysis was based on log-transformed values, which is problematic for values close to zero and increases variance. An analysis of non-transformed data including all study subjects resulted in 95% CIs meeting predefined equivalence margins.
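The effect of log transformation on values close to zero can be illustrated with a small numerical sketch. The responses below are hypothetical, and the comparison of simple sample standard deviations is a simplification of the actual clamp analysis.

```python
import math
from statistics import stdev

# Hypothetical GIR-type responses (arbitrary units); the added subject has a
# glucose requirement close to zero.
typical = [100.0, 110.0, 95.0, 105.0]
with_low_responder = typical + [2.0]


def sd_inflation(values_without, values_with, transform=lambda x: x):
    """Factor by which the sample SD grows when the low responder is added."""
    sd_without = stdev([transform(v) for v in values_without])
    sd_with = stdev([transform(v) for v in values_with])
    return sd_with / sd_without


raw_inflation = sd_inflation(typical, with_low_responder)
log_inflation = sd_inflation(typical, with_low_responder, transform=math.log)
print(f"SD inflation on raw scale: {raw_inflation:.1f}x")
print(f"SD inflation on log scale: {log_inflation:.1f}x")
```

In this toy data set, a single near-zero value inflates the spread far more on the log scale than on the original scale, consistent with the argument that the log-transformed primary analysis was unduly driven by the low responders.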
Finally, the guideline on biosimilar insulin and insulin analogues provides the possibility of considering PD endpoints as secondary (thus allowing descriptive analyses) if analytical and functional similarity can be concluded from comprehensive characterisation and comparison of test and reference product and if the PD data reasonably support PK results.
All these requirements were considered fulfilled. Therefore, biosimilarity was concluded based on the totality of data and scientific considerations, and the biosimilar was finally approved.
Comparability Regarding Clinical Efficacy Outcomes
To date, five trastuzumab biosimilars have been approved in Europe, three based on a pivotal comparability study in early breast cancer (EBC) and two based on a study in metastatic breast cancer, with different endpoints and equivalence margins depending on the indication, backbone therapy and reference studies. Both clinical development paths are viewed as feasible: both are sensitive models that allow for meaningful results and extrapolation to other lines of therapy, a view shared by regulators in other jurisdictions.
For two trastuzumab biosimilars, the phase III study in patients with human epidermal receptor-2 (HER2)-positive EBC/locally advanced breast cancer did not formally meet the upper bound of the predefined equivalence margins for the primary endpoint (pCR), confirming non-inferiority but not formally excluding the possibility of superior efficacy [19, 46]. Overall, structural and functional similarity was shown in a comprehensive head-to-head comparison. However, the applicants noted a slightly reduced ADCC activity of some more recent batches of the reference product Herceptin used in the clinical trial, an observation also described in the literature. The overall contribution of ADCC activity versus antiproliferative effects through inhibition of ligand-independent HER2 signalling to the therapeutic benefit of trastuzumab is unknown. The observed difference in pCR was considered at least partly confounded by the small shift in ADCC activity in a number of the Herceptin batches used in the pivotal trial. Overall, it was considered doubtful that a shift as small as the one observed would have any significant impact in terms of clinical outcomes, although numerically it is thought to have contributed to a more extreme location of the point estimate and upper bound of the CI, shifting the latter beyond the prespecified equivalence margin.
In both instances, no clinically meaningful differences in the safety profile were found, notably no differences in cardiac toxicity as measured by left ventricular ejection fraction and incidence of symptomatic heart failure, which could be of concern with truly increased efficacy. Furthermore, the small difference was viewed as not being clinically significant given the small magnitude and the unlikely impact of small changes in pCR on clinically relevant time-dependent clinical endpoints.
From a PK perspective, comparability between the biosimilars and the reference product was demonstrated in healthy volunteers since the ratios (90% CI) of geometric means for both primary PK endpoints were within the acceptability range of 80–125%. Similarly, minimum concentrations measured at steady state (C trough) during the phase III trial supported similarity between treatments.
Given the totality of data submitted, similarity between the two trastuzumab biosimilars and the reference product was considered sufficiently established.
During the development of a rituximab biosimilar, shifts in some quality attributes (e.g. charge variants, glycan structures, ADCC) were noted for the EU reference product and US Rituxan. As both versions of the reference products were on the market simultaneously, both were considered safe, effective and appropriate for use in comparability studies.
Similar efficacy and safety of the biosimilar and the reference product were concluded based on the comparative clinical trial in patients with previously untreated, advanced-stage follicular lymphoma who received rituximab in combination with cyclophosphamide, vincristine and prednisone (CVP). The difference in ORR was only −0.40% (95% CI −5.94 to 5.14), and the entire 95% CI for the difference in ORR between the two treatments was within the prespecified equivalence margin of ±12%, allowing conclusion of equivalence.
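The equivalence conclusion for ORR rests on a simple containment check, sketched below using the interval and margin reported above. The function name is an illustrative assumption.

```python
def ci_within_margin(ci_lower, ci_upper, margin):
    """True if the whole CI for a difference lies inside (-margin, +margin)."""
    return -margin < ci_lower and ci_upper < margin


# 95% CI for the ORR difference and the equivalence margin, both in
# percentage points, as reported in the text.
orr_ci = (-5.94, 5.14)
print(ci_within_margin(*orr_ci, margin=12.0))
```

Both bounds must lie inside the margin; an interval that crossed only one side would support non-inferiority or non-superiority, but not equivalence.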
However, at the time of data cut-off, more patients in the biosimilar group than in the reference product group had progressed or died, with an adjusted hazard ratio estimate (biosimilar/reference product) for progression-free survival (PFS) of 1.33 (90% CI 0.98–1.80). CHMP considered the PFS data immature, as 30.3% and 25.4% of patients had < 6 months and 6–12 months of PFS follow-up, respectively, and the majority of PFS events occurred during the maintenance phase of the trial. In conclusion, the difference in PFS was considered to be due to patient heterogeneity or random data variation rather than a real treatment effect.
In comparison, for another biosimilar rituximab, the pivotal efficacy trial was an equivalence trial conducted in patients with RA, and a second, supportive study was performed in patients with advanced follicular lymphoma covering eight treatment cycles to demonstrate similar PK and non-inferior efficacy of the biosimilar to the reference product. The primary endpoint was met in both trials; however, in the supportive trial, again, the secondary endpoints PFS and overall survival were inconclusive at the time of analysis which, based on the totality of the evidence, did not preclude approval in the EU but led to additional requests by the US FDA. Therefore, another prospectively designed comparative clinical trial was conducted with rituximab monotherapy in patients with low tumour burden follicular lymphoma. The primary endpoint was ORR at 7 months, and the FDA-requested 90% CI for the difference in ORR between the two treatments fell within the prespecified margins, thus leading to FDA approval and confirming the positive scientific opinion reached by CHMP before.
Taken together, clinical comparability between a biosimilar candidate and rituximab originator can be demonstrated based on a PK trial in healthy subjects together with an efficacy trial performed in an autoimmune or oncological indication and a PK bridging study in the other indication. In the future, CHMP may revisit the value of the currently requested additional PK bridging data.
Comparability Regarding Immunogenicity
For one biosimilar infliximab, the primary efficacy analysis of the comparative phase III study in patients with RA demonstrated equivalent ACR20 response rates at week 30 with biosimilar and reference product. However, ADA rates measured with a highly sensitive assay were about 5–12% higher in the biosimilar cohort at the individual time points of determination (with about 50% of patients in the biosimilar cohort determined ADA positive at any time in the trial).
Some CHMP members argued that, although the predefined equivalence margins for the efficacy endpoint were met, the point estimate was consistently below one, indicating that efficacy of the biosimilar may be somewhat lower than that of the reference product. They further argued that it was not possible to exclude with reasonable certainty that this was the result of the higher incidence of ADAs, considering that ADAs were shown to reduce efficacy. There was also concern about extrapolation to other indications as patients with RA are treated concomitantly with immunosuppressive therapy, which is not the case for other infliximab-licensed indications. Therefore, the difference in ADA development could be even greater in the other indications.
However, the majority view was that the numerical differences in ADA rates had no meaningful impact on any of the efficacy parameters analysed, as the primary endpoints fell within the predefined comparability margins. Additional data showed that a similar percentage of patients in both treatment arms required increased doses of study drug irrespective of ADA status, which provided further evidence that ADAs did not have a relevant impact on efficacy. Further, the numerically higher incidence of ADAs in patients treated with the biosimilar was not associated with an unfavourable safety profile compared with the reference product. Moreover, despite concomitant treatment with methotrexate, ADA development is reportedly highest in patients with RA, indicating that immunogenicity would not be of greater concern if the biosimilar were to be used in the other licensed indications of the reference product.
From an analytical, functional and PK perspective, convincing similarity of the biosimilar and the reference product was shown, further supporting the conclusion of biosimilarity and leading to approval.
The company submitted 78-week follow-up data post-authorisation, which showed slightly higher ADA rates for patients treated with Flixabi but without any clinical impact, i.e. no statistically significant or clinically meaningful difference between treatment groups.
For a biosimilar etanercept, analytical, functional and PK data supported biosimilarity. The pivotal efficacy study was conducted in patients with RA and provided robust evidence of equivalent efficacy between the biosimilar and the reference product based on ACR20 response at week 24, supported by most secondary efficacy parameters and sensitivity analyses.
However, in terms of immunogenicity, there was a significant difference in overall ADA formation at week 24. While only three biosimilar-treated patients tested positive for ADAs at some point of the study, 39 patients tested positive in the reference product group, one of whom also tested positive for neutralizing antibodies. The clinical impact of the difference in ADAs seemed negligible and the difference largely vanished after 8 weeks of treatment. In addition, the applied electrochemiluminescence assay suffered from a low drug tolerance, rendering the finding of reduced ADA incidence with the biosimilar uncertain.
In summary, the ADA data suggested reduced immunogenicity of the biosimilar, but this would not preclude a conclusion of biosimilarity.