Across the review period, 45 biosimilar drug substance (INN) development programs were reviewed. Of these, 42 programs obtained marketing authorization in the EU and 23 were licensed in the US. Three of the 45 programs obtained neither EU marketing authorization nor US licensure and are discussed later in the text. All 42 approved programs had conducted at least one PK study (Table 1), while 38 programs conducted comparative efficacy and safety studies (Table 2). The source data and references for all programs cited in Tables 1 and 2 are provided in Supplementary Table 1 (see electronic supplementary material).
Table 1 Summary of clinical PK studies of development programs that led to marketing approval
Table 2 Summary of clinical comparative efficacy studies of development programs that led to marketing approval
Evaluation of Pharmacokinetic Studies
Table 1 summarizes the PK studies that were part of EU- and/or US-approved biosimilar development programs. In six (14%) of the 42 programs, at least one PK study failed to show bioequivalence in relevant study endpoints. The drug substances (INN) and numbers of programs that did not meet relevant PK study endpoints were adalimumab (2), filgrastim (1), pegfilgrastim (2), and etanercept (1). The probable reasons for not showing bioequivalence included problems in study design or underestimated variability of the serum concentrations [12, 13, 14, 15, 19, 20, 21, 22]. For one adalimumab program that had a PK study failure, the company demonstrated that inadequate stratification of the study population with respect to their propensity for developing anti-drug antibodies influenced serum product levels and bioequivalence [23].
Clinical PK is the only test to assess and compare the combined impact of protein, device, and formulation on systemic exposure. This test is especially important because both device and formulation may differ between the biosimilar and its reference product. In 14% of the ultimately successful development programs, at least one PK study did not meet the primary endpoints, which illustrates the discriminatory power of the test. These failures were shown to be due to methodological issues and not to differences in product quality. Accordingly, meeting PK equivalence is a strong confirmation of biosimilarity. On the other hand, a failure of PK endpoints requires investigation to determine the root cause, which could be either a study design issue or a true difference that might be clinically relevant.
To this point, following a review of their own biosimilar experience to date, EU regulators concluded that PK studies will remain a sine qua non in biosimilar development [3].
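The effect of underestimated variability on a PK comparison can be illustrated with a minimal sketch. It is not the analysis used in any of the cited programs. It assumes a two-arm parallel design, log-normally distributed exposure with the same coefficient of variation (CV) in both arms, a normal approximation in place of the t-distribution, and the conventional 80-125% acceptance limits for the 90% confidence interval of the geometric mean ratio, which are not stated in this review. The function names and the example values for ratio, CV, and sample size are hypothetical.

```python
import math
from statistics import NormalDist

# Conventional acceptance limits for the geometric mean ratio (assumption:
# the review itself does not state the margins used in the cited studies).
BE_LOWER, BE_UPPER = 0.80, 1.25


def gmr_confidence_interval(gmr, cv, n_per_arm, level=0.90):
    """Approximate CI for a test/reference geometric mean ratio.

    Assumes a two-arm parallel design, log-normal exposure with the same
    between-subject CV in both arms, and a normal (not t) quantile.
    """
    sigma = math.sqrt(math.log(1.0 + cv ** 2))   # SD on the log scale
    se = sigma * math.sqrt(2.0 / n_per_arm)      # SE of the log-mean difference
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided quantile
    half_width = z * se
    return gmr * math.exp(-half_width), gmr * math.exp(half_width)


def is_bioequivalent(ci):
    """True if the whole interval lies inside the acceptance limits."""
    lower, upper = ci
    return lower >= BE_LOWER and upper <= BE_UPPER


# Same observed ratio and sample size; only the variability differs.
planned = gmr_confidence_interval(gmr=1.05, cv=0.30, n_per_arm=40)
observed = gmr_confidence_interval(gmr=1.05, cv=0.60, n_per_arm=40)

print("planned CV :", planned, is_bioequivalent(planned))
print("observed CV:", observed, is_bioequivalent(observed))
```

With these hypothetical inputs, the interval stays inside the limits at the planned CV but crosses the upper limit at the higher CV, although the point estimate is unchanged. This is the pattern of a study that fails for methodological reasons rather than because of a product difference.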
Evaluation of Comparative Efficacy Studies
The clinical comparative efficacy studies of development programs that led to marketing approval are summarized in Table 2.
Of the 42 approved biosimilar development programs, 38 had a clinical efficacy study. Four biosimilar programs (two pegfilgrastims, one enoxaparin, and one teriparatide) were approved in the EU without comparative efficacy studies, on the basis of a data package that included clinical studies conducted with suitable biomarkers. Notably, enoxaparin and teriparatide are not regulated as biologics in the US, and follow-on versions were approved there as drugs rather than as biosimilars. The majority of the biosimilar development programs with comparative efficacy studies [33 (87%); Table 2, category a] met primary efficacy endpoints and showed comparable safety/immunogenicity.
Another three (8%) programs initially failed primary endpoints in the comparative efficacy studies but still obtained biosimilar approval after post-hoc analysis and/or scientific justification (Table 2, category b). Two of these were trastuzumab biosimilar development programs. According to their EPARs, the comparative efficacy studies did not formally meet the primary endpoint, and superiority of the biosimilar over the reference product could not be excluded [24, 25]. In the US-FDA assessment, this was raised only for the ABP980 program, whereas the efficacy trial of the SB3 program met the primary endpoint, presumably because the US-FDA and EMA had different expectations for the analysis plan of the same study [26, 27]. During the EMA assessment, both companies argued that the issue was caused by a temporary reduction in antibody-dependent cellular cytotoxicity (ADCC) potency in a substantial number of the reference product batches used in the clinical trial [24, 25]. In the reference product arms, 40% (program SB3) and 20% (program ABP980) of the patients received batches with low ADCC. The companies argued that this numerically affected the primary endpoint, shifting the confidence interval to where it slightly exceeded the predefined equivalence margin. This explanation was supported by, for example, post-hoc analyses that excluded patients treated with low-ADCC reference product batches. The shift in ADCC activity was observed in reference product batches with expiry dates between 2018 and 2019. It was confirmed with in vitro functional assays and by quantifying the afucosylated glycans present in the Fc domain of the mAb, a glycan moiety known to affect ADCC function [28, 29]. These findings point to a presumably unintended variation introduced by the reference product manufacturer and, more generally, to the importance of strict adherence to ICH Q5E and adequate manufacturing controls for all biologics.
While some level of batch-to-batch variability is inevitable, it is important that this variability remains within acceptable ranges to avoid any detrimental impact on clinical outcome [30]. The trastuzumab biosimilar examples show that, for products with substantial differences in critical quality attributes (CQAs) linked to a contributory mechanism of action, comparative efficacy studies can be sensitive enough to detect a difference in clinical endpoints. These examples demonstrate the value of efficacy studies in confirming biosimilarity. However, they also show that physicochemical methods and functional assays can detect differences in functional attributes with much greater sensitivity. In this case, the reduction in ADCC activity was caused by a decrease in the amount of afucosylated glycans, which likely led to a detectable difference in efficacy outcomes.
The third case in category b concerned one pegfilgrastim biosimilar development program [31]. A comparative efficacy trial in breast cancer patients compared the biosimilar candidate with EU-sourced and US-sourced reference product. The study met its primary endpoint in demonstrating equivalent efficacy between the biosimilar candidate and the EU reference product. However, the primary endpoints did not show equivalence between EU- and US-sourced reference product and, consequently, between the biosimilar and US-sourced reference product. The biosimilar was approved in the EU because all requirements for comparing the biosimilar and the EU reference product were met, including a PK/PD study with EU-sourced reference product [31]. The biosimilar had not been approved in the US by the data cut-off date of this review [32]. However, three other independent pegfilgrastim biosimilar development programs led to approvals in both the EU and US [14, 15, 21, 32, 33]. All three programs confirmed the similarity of EU and US reference product with extensive bridging data. Therefore, it is likely that the failure of the bridging comparison between EU and US reference products in the pegfilgrastim program mentioned above was due to other issues in the comparative efficacy trial, rather than to any clinically relevant differences between EU and US reference products.
In two biosimilar development programs, efficacy studies failed to demonstrate comparable immunogenicity and required further optimization of the manufacturing process to improve product quality and to enable biosimilar approval (Table 2, category c).
The first example is a biosimilar somatropin [34]. The comparative efficacy study performed in 1999 confirmed that clinical efficacy endpoints were all met but there were higher rates of immunogenicity with the biosimilar candidate. The root cause analysis revealed a correlation between immunogenicity rate and higher amounts of host cell protein (HCP) impurities in the biosimilar, which were not detected by the commercial HCP assay used for process development and clinical trial batch release. Subsequently, the manufacturing process was optimized to purge the HCP impurities and further clinical studies confirmed comparable immunogenicity rates between the reference product and the biosimilar. The improved product quality together with the confirmatory clinical data enabled approval in 2006.
The second case affected a biosimilar epoetin [35, 36]. A comparative efficacy study, undertaken to complete the data package for gaining approval for subcutaneous (SC) treatment of chronic kidney disease patients, was stopped in 2009 after two patients developed neutralizing antibodies. The root cause analysis revealed that residual tungsten in the syringe, once in contact with epoetin solution, catalyzed the formation of insoluble aggregates, which are thought to increase the risk of immunogenicity. The clinical comparative efficacy trial was successfully repeated with clinical study material filled in low-tungsten syringes, which enabled the approval of the SC administration in the EU in 2016 [37].
Both cases in category c were traced to elevated levels of process-related impurities and not product-related impurities. The presence of HCP impurities depends on a number of factors, including cell line, fermentation conditions, and the manufacturing process, whilst the ability to detect and quantify them depends on the sensitivity and selectivity of the assay. From a regulatory perspective, the expectation is that the amount of HCP impurities in the drug substance should be as low as possible. The HCP issue seen with somatropin is unlikely to happen again because the state of the art in HCP control has advanced substantially in the last two decades. For example, there is now a greater understanding of the risk of HCP impurities, there are better assays for detection, and pharmacopeia guidelines are available for HCP analysis [38, 39]. Another research group also investigated the issue of residual tungsten originating from syringe manufacturing and how it could induce protein aggregation [40]. Given these lessons and the additional knowledge now available, a repetition of the epoetin case is unlikely.
In general, the control of process-related impurities and the prevention of unwanted immunogenicity are required for all biologics throughout their product life cycle, including manufacturing changes. Extensive knowledge of these impurities and other risk factors for unwanted immunogenicity helps to design manufacturing controls and regulatory oversight to achieve comparably low immunogenicity [41, 42].
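The counts and percentages reported for the approved programs can be cross-checked with a short tally. This is only a sketch: the numbers are copied from the text, while the dictionary labels and the `percent` helper are illustrative names, not terms from Table 1 or Table 2.

```python
# Counts as reported in the text; nothing here is new data.
approved_programs = 42

# Programs with at least one PK study that did not meet relevant endpoints.
pk_failures_by_inn = {
    "adalimumab": 2,
    "filgrastim": 1,
    "pegfilgrastim": 2,
    "etanercept": 1,
}

# Outcome categories of programs with comparative efficacy studies (Table 2).
efficacy_categories = {
    "a: met primary efficacy endpoints": 33,
    "b: approved after post-hoc analysis or justification": 3,
    "c: immunogenicity not comparable, process optimized": 2,
}


def percent(part, whole):
    """Percentage rounded to the nearest whole number, as in the text."""
    return round(100 * part / whole)


pk_failures = sum(pk_failures_by_inn.values())
with_efficacy_study = sum(efficacy_categories.values())
without_efficacy_study = approved_programs - with_efficacy_study

print(f"PK failures: {pk_failures}/{approved_programs} "
      f"= {percent(pk_failures, approved_programs)}%")
for label, count in efficacy_categories.items():
    print(f"{label}: {count}/{with_efficacy_study} "
          f"= {percent(count, with_efficacy_study)}%")
print(f"Approved without comparative efficacy study: {without_efficacy_study}")
```

The three efficacy categories add up to the 38 programs with a comparative efficacy study, which leaves the four programs approved on the basis of biomarker studies.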
Biosimilar Development Programs that Did Not Receive Approval or are Currently on Hold in the EU and/or US
Because this analysis covers all biosimilar programs that received at least one approval in the EU or US, it is subject to sampling bias. To counter this bias, it is also important to evaluate the contribution of clinical data in programs that did not result in biosimilar approval in either region. We found three published examples of programs that entered the clinical stage of development and are currently on hold, received a negative opinion from the regulators and were not approved, or were withdrawn by the company before the end of the formal review process.
In the EU, an interferon alfa biosimilar candidate received a negative EMA opinion in 2006. Most importantly, the analytical and functional data was not deemed by the regulators to be comparable between the reference and biosimilar products. Furthermore, Study 002 for this proposed biosimilar showed highly anomalous PK data. There were also uncertainties in PD equivalence, especially in viral response that could not be resolved. Nonetheless, the primary and secondary endpoints of the comparative efficacy trial were met [43].
In the EU, an insulin biosimilar candidate was withdrawn by the company in 2012. The analytical and functional data was not deemed by regulators to be comparable between the reference and biosimilar products. In this case, comparative clinical PK studies demonstrated bioequivalence, there were similar PD outcomes, and the supportive comparative efficacy studies showed similar efficacy and safety. However, in addition to the lack of analytical and functional comparability there were other relevant good manufacturing practice (GMP) issues and the clinical trial material was not demonstrated to be representative of the proposed commercial product [44].
An abatacept biosimilar candidate missed the primary endpoint in a three-arm PK study with US- and EU-sourced reference product. The program is on hold [45]. No further information was found in the public domain.
In summary, in the EU, both the interferon alfa and insulin biosimilar candidates failed to demonstrate analytical comparability, and the abatacept candidate failed to show PK bioequivalence with EU- and US-sourced reference product. These cases therefore demonstrate that issues in showing biosimilarity were identified at either the analytical or the clinical PK level, prior to entering the comparative efficacy trial. The interferon example also illustrates that a successful comparative efficacy study cannot compensate for gaps in the analytical data.