Key Points for Decision Makers

The dominant approaches to modeling the cost effectiveness of immuno-oncology (IO) therapies in advanced non-small cell lung cancer are Markov and partitioned survival models. There is substantial variability in the handling of related methodological challenges.

The lack of long-term trial results for IO therapies led to the use of real-world data for survival extrapolation and ad hoc assumptions related to long-term benefits of IO. A treatment effect lasting for 3 or 5 years after the initiation of the therapy was commonly assumed.

For utility modeling, health state-based utilities and time-to-death models are widely used.

1 Introduction

Lung cancer is among the most common types of cancer, affecting approximately 2.2 million people worldwide in 2020 [1]. It is the leading cause of global cancer-related mortality, resulting in 1.80 million deaths annually [1]. Non-small cell lung cancer (NSCLC) accounts for 85–90% of lung cancers [2].

The treatment landscape in lung cancer has been significantly affected by advances in the categorization of patients into biomarker-based subgroups. Today, drug development has a strong focus on oncologic biomarkers, such as programmed death-ligand 1 (PD-L1), epidermal growth factor receptor (EGFR) mutation, anaplastic lymphoma kinase (ALK) mutation, B-Raf proto-oncogene (BRAF) mutation, MET exon 14 skipping, neurotrophic tyrosine receptor kinase (NTRK) 1/2/3, ROS proto-oncogene 1 (ROS1), and ret proto-oncogene (RET) [3]. Although chemotherapy was long a cornerstone of treatment for advanced or metastatic NSCLC, new strategies specific to genetically differentiated subpopulations, along with tests to identify them, offer alternative treatments for patients with specific characteristics [4, 5]. Immune checkpoint inhibitors, such as those blocking the programmed cell death protein-1 (PD-1)/PD-L1 pathway (hereafter referred to as immuno-oncology therapy [IO]), are among the most-researched new treatment strategies for NSCLC and other cancer types [6]. Many of these drugs are being developed in conjunction with companion diagnostic assays, which help determine the eligibility of patients and aim to predict outcomes linked to a specific drug. If available, use of a specific test is frequently part of a drug’s labelling [7, 8]. Since 2016, several IOs have been approved for NSCLC as first-line (1L) treatment in patients whose tumors have no EGFR or ALK mutations and as second-line (2L) treatment irrespective of EGFR and ALK status [4, 5]. As with other new therapies, the benefits of IO must be weighed against treatment costs compared with conventional chemotherapy.

Our aim was to review economic evaluations of IO for previously untreated patients with advanced and metastatic NSCLC. Specifically, our objectives were to identify methodological challenges associated with modeling the cost effectiveness of IOs in the 1L treatment of advanced or metastatic NSCLC and to understand how these challenges have been handled in published studies and health technology assessments (HTAs). Our motivation for concentrating on the 1L setting stemmed from a focus on methodology, since issues such as PD-L1 testing and switching to IOs are more pronounced in 1L models. Economic evaluations of IOs in the 1L treatment of NSCLC have been reviewed previously [9]. We are expanding on previous findings by analyzing further methodological aspects of these models.

Against this background, a systematic literature review (SLR) was undertaken to provide a comprehensive overview of economic models used in cost-effectiveness analysis (CEA) as well as cost-utility, cost-minimization, cost-benefit, and cost-consequence analyses of IOs as 1L treatments in advanced or metastatic NSCLC. Hereinafter, we use the term CEA broadly to refer to all of these analysis types, as is frequently done in practice.

Throughout the review, we focused on specific methodological topics such as statistical bias arising from treatment switching (switch), survival extrapolation and its validation (extrapolation), using external real-world data (RWD), estimation of long-term treatment effect (effect), assumptions related to duration of treatment (DOT), PD-L1 testing (assay), and utility estimation (utility). Of interest were the justifications that authors of the reviewed articles (or technology appraisal [TA] submissions) provided for their methodological choices as well as the strengths and limitations reported. Finally, we identified, where possible, emerging methodological approaches that may become usual practice. Although other methodological topics are related to assessment of oncology therapies in NSCLC, we directed our focus to factors that are likely key drivers of cost-effectiveness results. As more IOs become available for patients with NSCLC, it is important that we learn from previous practices and better approach these issues in future CEAs of IOs. This analysis aimed to spur this debate by providing an overview of current practices, offering insights for assessing their suitability, and developing recommendations for future CEAs of IOs.

2 Methods

The SLR was conducted in accordance with Cochrane Collaboration [10] and PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [11]. Literature searches were performed using MEDLINE, Embase, and EconLit. The literature search covered 1 January 2009 through 22 January 2020, the date on which the search strategy was applied. Searches were limited to studies published from 2009 onwards, as older economic evaluations may no longer be applicable to current clinical practice given the rapid change in the treatment landscape and the fact that IOs have only recently been approved in NSCLC. Searches were conducted using combinations of keywords for recurrent or metastatic treatment-naïve NSCLC, relevant treatments approved or under investigation for this patient population, and economic study designs of interest. Search terms were adapted to each database by using appropriate indexing terms (e.g., Emtree in Embase and medical subject headings in MEDLINE) and search syntax. The search strategies can be found in the electronic supplementary material (ESM)-1–3.

Recent scientific conferences may provide details on research that has not yet been published as a manuscript in a peer-reviewed journal. For this reason, we searched the 2016–2019 meetings of the American Society of Clinical Oncology, American Association for Cancer Research, European Society for Medical Oncology, and Professional Society for Health Economics and Outcomes Research to identify any economic studies not indexed in Embase. Relevant TA documents from the UK National Institute for Health and Care Excellence (NICE), Scottish Medicines Consortium (SMC), German Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (IQWiG), French Haute Autorité de Santé (HAS), Canadian Agency for Drugs and Technologies in Health, and Australian Pharmaceutical Benefits Advisory Committee (PBAC) were also considered. The bibliographies of relevant published SLRs and TA submission documents were checked as a quality-assurance step.

2.1 Study Selection

Predefined criteria for selecting economic models (i.e., population, interventions and comparisons, outcomes, study design, and time period [PICOS-T]) are outlined in Table 1. Articles were included if they met all of the following predefined criteria: economic evaluations (i.e., cost-benefit, cost-effectiveness, cost-utility, cost-minimization, or cost-consequence analysis) of IOs for patients with recurrent or metastatic, EGFR/ALK mutation-negative NSCLC who had no prior systemic treatment for recurrent or metastatic disease (treatment naïve). Articles were also considered for inclusion if no information on EGFR/ALK mutation status was provided. Only articles and submission documents (or summaries of submission documents) published in English were considered. No geographic limits were applied. Budget impact analyses were not considered in this review as their purpose differs from that of economic evaluations [12].

Table 1 PICOS-T criteria

For articles identified through database searching, two reviewers conducted title and abstract screening and full-text review independently. Disagreements between the reviewers were resolved by a third reviewer. Manual searches of conference abstracts and TAs were conducted by one reviewer, with search approaches and results validated by a second reviewer.

2.2 Data Extraction

Data extraction was performed by one reviewer, with a second reviewer validating all data. A standardized data extraction template with prespecified key data elements, focused on the major elements of modeling methodology, ensured a systematic approach to data extraction.

Data elements included study characteristics and model specifications, CEA model structure (e.g., Markov, partitioned survival [PS]), time horizon, perspective of the evaluation (e.g., societal, health system), health utility modeling, patient population, and interventions under consideration. Additional extraction of methodological elements provided information on model assumptions (switch, extrapolation, effect, DOT, assay, and utility). The initial set of topics to be explored was specified a priori based on the authors' prior research and expert agreement, and centered on factors likely to drive cost-effectiveness outcomes. The list was revised after the first round of data extraction to further expand on approaches that were not consistently applied among identified CEAs. Some of the original topics of interest were not pursued as they were not considered in any of the identified CEAs (e.g., application of cure models).

2.3 Quality Assessment

The quality of the CEAs reported in peer-reviewed, full-text articles was described using the Drummond checklist [7]. Quality assessment was not performed for models reported in conference abstracts because available information was limited. Model specifications in the context of TAs had already been critically reviewed by the HTA agencies, with model quality described as part of the critique. Particularly in the case of NICE TAs, detailed information on model specifications and any related challenges highlighted by the evidence review group (ERG) were included in submission documents. Further quality assessment of models in TAs was therefore not warranted. We did not exclude any studies from this review based on their quality as the aim of our research was to provide a comprehensive overview of the modeling methods used for IOs rather than to synthesize the cost effectiveness of specific therapies or inform decision making regarding 1L treatment of NSCLC.

2.4 Evidence Summary

To align with the objective of this research, the study characteristics and model specifications were qualitatively summarized. Section 3 first reports the main characteristics of the included studies and our findings related to each of the methodological questions raised. Economic models included in TA submissions were discussed separately from those published as full-text articles or conference abstracts to recognize that authors of journal articles are allowed and encouraged to explore unconventional modeling techniques, whereas methods applied in TAs are more likely to reflect current expectations of HTA bodies. Broader context and interpretation of our findings on each modeling topic are provided in Sect. 4. We completed our analysis by identifying key results and providing recommendations for future CEAs in Sect. 5.

3 Results

3.1 Literature Search Results

Database searches resulted in 1724 records. Following removal of duplicates, 1362 abstracts were screened, of which 217 were considered for full-text review. Upon full-text screening, 27 articles identified through database searches ultimately met the inclusion criteria, with an additional 19 records (three conference abstracts, one full-text publication, and 15 TAs) identified through other sources or bibliography checks. In total, 46 records on economic models (18 full texts [9,10,11, 13,14,15,16,17,18,19,20,21,22,23,24,25,26,27], 13 conference abstracts [28,29,30,31,32,33,34,35,36,37,38,39,40], and 15 TAs [41,42,43,44,45,46,47,48,49,50,51,52,53,54,55]) met eligibility criteria and were ultimately included in the review (see Fig. 1).

Fig. 1 PRISMA diagram. (a) The references of relevant SLRs were checked for eligibility as a quality-assurance step. 1L first line, ALK anaplastic lymphoma kinase, EGFR epidermal growth factor receptor, IO immuno-oncology, NSCLC non-small cell lung cancer, PICOS population, intervention, control, outcomes, study design, PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses, SLR systematic literature review, TA technology appraisal

Of the 31 models published as full-text articles or presented at conferences, five were adaptations of other models included in the SLR (three models published as full texts [9, 16, 21] and two presented in a TA [33, 34]). In total, 15 relevant, company-submitted TAs and all related evaluation documents (e.g., ERG analyses) were found for NICE (n = 4), SMC (n = 4), PBAC (n = 5), and the pan-Canadian Oncology Drug Review (pCODR; n = 2). No models for 1L IOs were included in assessments by France’s HAS or Germany’s IQWiG. Of the 15 identified TAs, three were resubmissions [46, 47, 52] for which the last model version was included in this summary to avoid counting a model twice. However, model information that was included in the original submission but not mentioned in the resubmitted documents was assumed to also apply to the finalized model.

When summarizing our findings, we covered 38 unique models from 26 publications and 12 TAs (15 full texts [10, 11, 13,14,15, 17,18,19,20, 22,23,24,25,26,27], 11 conference abstracts [28,29,30,31,32, 35,36,37,38,39,40], and 12 TAs [41,42,43,44,45, 48,49,50,51, 53,54,55]) and discussed adaptations only when explicitly mentioned.

3.2 Summary of the Included Studies

Tables 2 and 3 present the characteristics of the included CEAs. Of the unique models, most were of a European (including the UK) perspective. Model populations were similar across CEAs; almost two-thirds of models were in patients with metastatic NSCLC only [10, 11, 14, 17,18,19, 23, 24, 26, 28,29,30, 35,36,37,38, 40], whereas more than one-third included patients with advanced or metastatic disease [13, 15, 20, 22, 25, 27, 31, 32, 39]. All 38 models compared the cost effectiveness of IOs (atezolizumab, nivolumab, and pembrolizumab, alone or in combination with other treatments) against standard chemotherapy; eight models [11, 14, 18, 19, 29, 41, 43, 49] included more than one IO strategy, of which three [11, 14, 29] concerned different IOs (Fig. 2). However, only one of the three studies also directly compared IOs that had corresponding PD-L1 assays with each other [11], whereas the remaining two compared IOs versus chemotherapy only [14, 29]. Most studies were Markov cohort models [15, 20, 22,23,24,25,26,27, 31, 35, 36, 45] or PS models [10, 17,18,19, 28,29,30, 37,38,39,40,41,42,43,44, 48,49,50,51, 53,54,55], with two patient simulation models [11, 13] and one analysis comparing pharmacological costs and overall survival (OS) gains across treatment options [14]. One study from the USA did not report the type of model used [32].

Table 2 Included full-text publications and conference abstracts
Table 3 Included technology appraisals
Fig. 2 Treatment comparisons in the identified cost-effectiveness analysis publications and technology appraisals (TAs)

Regarding the quality of the models, we identified weaknesses in data collection in all 15 unique models [10, 11, 13,14,15, 17,18,19,20, 22,23,24,25,26,27] published in full-text articles, particularly surrounding details of the design and results of the study providing effectiveness results (where based on a single study), as well as the patients from whom evaluations had been obtained. In two cases, an indirect treatment comparison was used to derive hazard ratios (HRs) versus the comparator of interest [18, 19]. We also had concerns regarding inflation adjustment and currency conversion of prices and the methods used for estimating quantities and unit costs. All but one publication provided sufficient information on health states, economic inputs (including unit costs and quantities), and the model used. In contrast, Giuliani and Bonetti [14] used a simple approach that included only pharmaceutical costs; no health states were included in the model. Analyses and results were generally well-reported, although information on choice of a discount rate and any issues regarding the generalizability of results to other settings or patient groups was lacking in more than half of publications. Detailed quality assessment is summarized in ESM-4.

3.3 Findings on Methodological Challenges

Table 4 summarizes findings and key methodological challenges, including switch, extrapolation, effect, DOT, assay, and utility.

Table 4 Methodology of included models

3.3.1 Switch

None of the models published as full-text studies carried out adjustments for treatment switching in the base case, although one explored this adjustment in a scenario analysis [17]. One conference abstract [38] mentioned applying an adjustment in the base-case analysis. It was somewhat more common for models included in TAs to adjust for treatment switching. Three TAs [42, 45, 53] (two NICE TAs and one SMC TA) included an adjustment in the base case, and one [50] explored the impact of switching in scenario analyses; in another TA [51], it was the review group that carried out the analysis. The other TAs did not carry out switching adjustment or did not provide relevant information.

Two NICE TAs [41, 42] (an assessment of pembrolizumab combined with chemotherapy in patients with non-squamous NSCLC and an assessment of pembrolizumab monotherapy in PD-L1-positive patients with NSCLC) provided OS estimates using three adjustment methods for treatment switching: rank-preserving structural failure time (RPSFT), inverse probability of censoring weights (IPCW), and the two-stage method. In one NICE TA [42], the ERG expressed concerns regarding the validity of various adjustment approaches, considering the two-stage method to be most appropriate because patients could switch shortly after progression, thus creating a defined secondary baseline. In SMC 1239/17 [53] (an assessment of pembrolizumab monotherapy in patients with PD-L1 ≥ 50%), the two-stage adjustment approach was used in the base case; IPCW and RPSFT were tested as scenario analyses. Similarly, the two-stage adjustment was used in SMC 2127 (an assessment of pembrolizumab combined with chemotherapy in patients with non-squamous NSCLC) [54], with IPCW explored in a scenario analysis. All other identified TAs that adjusted for treatment switching, in the base-case or scenario analyses, used the two-stage method. However, none of the models that performed treatment switching adjustment in the base case provided a rationale for their choice of adjustment method.
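To illustrate the idea behind the RPSFT method named above: the model relates each patient's observed survival time to a counterfactual time without the experimental therapy by rescaling the time spent on that therapy by a factor exp(psi). The following is a minimal sketch under stated assumptions, not the implementation used in any of the reviewed TAs. The function name, the example patients, and the value of psi are hypothetical; estimation of psi (typically by g-estimation) and recensoring are omitted.

```python
import math

def rpsft_counterfactual_time(time_off_io, time_on_io, psi):
    """Counterfactual survival time had the patient never received IO.

    RPSFT model: U = T_off + exp(psi) * T_on, where psi < 0 means that
    IO extends survival. psi is taken as given here (hypothetical value);
    in practice it has to be estimated from the trial data.
    """
    return time_off_io + math.exp(psi) * time_on_io

# Hypothetical control-arm patients:
# (months before switching, months after switching to IO)
control_arm = [(6.0, 0.0), (4.0, 10.0), (8.0, 5.0)]

# Assumed value: each month on IO corresponds to half a month without IO
psi = math.log(0.5)

adjusted = [rpsft_counterfactual_time(off, on, psi) for off, on in control_arm]
print(adjusted)
```

Patients who never switched keep their observed time, whereas the post-switch time of switchers is shrunk. The adjusted control-arm times would then feed the OS analysis in place of the observed ones.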

3.3.2 Extrapolation

Eight models in published studies (including four adaptations) [9, 11, 16,17,18,19, 21, 34] and three TAs [43, 50, 54] used RWD for extrapolating OS. None of the conference abstracts mentioned external data for extrapolation or validation. Given the relatively short time that IOs have been a part of the NSCLC treatment landscape, RWD on the use of IOs are scarce. Hence, external data used in these studies were directly applicable only to the extrapolation of the standard chemotherapy arm. For extrapolation of the OS of the IO arm, additional assumptions were needed about the long-term relative efficacy of IO and chemotherapy. All but one of the models using RWD for OS extrapolation used Surveillance, Epidemiology, and End Results (SEER) data (n = 10, including model adaptations) [11, 16,17,18,19, 21, 34, 43, 50, 54]. The remaining study used the Swiss National Institute for Cancer Epidemiology and Registration (NICER) database [9]. Two studies (one using SEER [17] and one using NICER [9] data for extrapolation) used the US Flatiron database to validate the long-term OS extrapolation.

All models using external data for OS extrapolation extrapolated only from a specified timepoint onwards. Before that timepoint, the Kaplan–Meier (KM) curve from trial data, parametric extrapolations, or both were used. In one model [21], base-case OS curves were parametric, but in scenario analysis, a constant HR derived from SEER data was used to extrapolate OS from year 5 in both treatment arms. In NICE TA 600 [43] and a corresponding publication [34], as well as in Harding et al. [33] but not in its corresponding TA [41], authors used SEER data to derive mortality risks beyond the trial for the patients in the standard-of-care arm. In SMC 2187 [55], OS from 12 months onward in the standard-of-care arm was based on SEER data; in the treatment arm, a trial-based risk ratio was applied to the SEER-based projection.
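The piecewise approach described above (trial-based survival up to a cutoff, external data afterwards) can be sketched as follows. This is an illustrative simplification rather than the method of any specific model: the function and parameter names are hypothetical, the external hazard is assumed constant over time, and all numeric inputs are invented for the example.

```python
import math

def extrapolated_os(t, trial_survival, cutoff, external_hazard, hazard_ratio=1.0):
    """OS at time t (months): trial-based curve up to `cutoff`, external data after.

    Beyond the cutoff, survival declines at the external (e.g. registry-based)
    hazard, multiplied by `hazard_ratio` for the IO arm (1.0 for chemotherapy).
    A constant external hazard is a simplifying assumption of this sketch.
    """
    if t <= cutoff:
        return trial_survival(t)
    extra_time = t - cutoff
    return trial_survival(cutoff) * math.exp(-external_hazard * hazard_ratio * extra_time)

def trial_curve(t):
    """Hypothetical trial-based OS curve (exponential, monthly hazard 0.04)."""
    return math.exp(-0.04 * t)

# Hypothetical inputs: 60-month cutoff, monthly external hazard 0.02
os_chemo_10y = extrapolated_os(120, trial_curve, cutoff=60, external_hazard=0.02)
os_io_10y = extrapolated_os(120, trial_curve, cutoff=60, external_hazard=0.02,
                            hazard_ratio=0.7)
print(os_chemo_10y, os_io_10y)
```

The `hazard_ratio` argument makes explicit the additional assumption that is needed for the IO arm, because the external data describe chemotherapy-treated patients only.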

3.3.3 Effect

Among the 15 models published as full-text studies, four [9, 11, 17,18,19] claimed to have constrained treatment benefit to 0 at a given timepoint in the base case. For example, in studies by Insinga et al. [18, 19], the same mortality risks based on SEER data were assumed for chemotherapy and for pembrolizumab in combination with chemotherapy after 5 years.

Long-term treatment benefit was not mentioned in conference abstracts, whereas the approaches in the TAs varied. Most TAs (n = 8) [41,42,43, 47, 51, 53,54,55] considered treatment-effect waning only in scenario analyses, and only two [44, 50] did so in the base case; two of three Australian TAs did not pursue this approach in either setup [48, 49]. Treatment-effect waning was considered in most studies by assuming treatment effect would last between 3 and 5 years; one TA [51] also explored a treatment-effect duration of 2 years. SMC 2187 explored 3-year and 5-year cutoffs in separate scenarios [54].
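A treatment effect of limited duration can be illustrated with a short sketch in which the hazard ratio applies until a cutoff and is set to 1 afterwards. The function name and all numeric inputs are hypothetical, the chemotherapy hazard is assumed constant, and the abrupt loss of effect is only one possible form of waning (a gradual decline is another).

```python
import math

def io_survival(t_years, chemo_hazard, hazard_ratio, effect_duration_years):
    """IO-arm survival when the HR applies until the cutoff and equals 1 afterwards.

    Assumes a constant annual chemotherapy hazard (a simplification of this sketch).
    """
    time_with_effect = min(t_years, effect_duration_years)
    time_without_effect = max(t_years - effect_duration_years, 0.0)
    cumulative_hazard = chemo_hazard * (
        hazard_ratio * time_with_effect + time_without_effect
    )
    return math.exp(-cumulative_hazard)

# Hypothetical inputs: annual chemotherapy hazard 0.5, HR 0.6, 10-year horizon.
# Effect durations of 3 and 5 years are compared with a lifetime effect.
for duration in (3, 5, float("inf")):
    print(duration, io_survival(10, 0.5, 0.6, duration))
```

Comparing the three durations shows how strongly long-term survival in the IO arm depends on the assumed duration of effect, which is why this assumption is usually tested in scenario analyses.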

3.3.4 Duration of Treatment

In our analysis, 12 models [17,18,19, 41,42,43,44, 47,48,49, 51, 55] used the observed time-on-treatment distribution from the trial, whereas eight studies [10, 13, 15, 20, 22, 24, 25, 54] used progression-free survival (PFS) as a proxy for DOT. This information was not reported for the remaining models.

Regarding the treatment-stopping rules, 13 models for pembrolizumab [15, 17,18,19,20, 22, 25, 40, 42, 43, 47, 53, 54] applied 2-year stopping rules (35 cycles) in the base case, and one model [24] explored this option as a scenario for atezolizumab. Applying stopping rules to duration of IO use in a model built on trial data with longer DOT can raise concerns about the validity of using trial efficacy data without adjustment. None of the studies adjusted the efficacy data to address this concern.
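The effect of a maximum-cycle stopping rule on modeled drug use can be sketched as follows. The function name and the time-on-treatment curve are hypothetical. Note that the sketch caps treatment exposure (and hence drug costs) only, and leaves efficacy inputs untouched, which is precisely the situation that gives rise to the concern described above.

```python
def expected_treatment_cycles(on_treatment_prob, max_cycles=35):
    """Expected number of IO cycles per patient under a maximum-cycle stopping rule.

    `on_treatment_prob[k]` is the proportion of patients still on treatment at
    the start of cycle k (e.g. read from a time-on-treatment or PFS curve).
    Cycles from `max_cycles` onwards are not counted.
    """
    return sum(on_treatment_prob[:max_cycles])

# Hypothetical curve: 5% of the remaining patients discontinue in each cycle
tot_curve = [0.95 ** k for k in range(60)]

capped = expected_treatment_cycles(tot_curve, max_cycles=35)
uncapped = expected_treatment_cycles(tot_curve, max_cycles=len(tot_curve))
print(capped, uncapped)
```

The difference between the capped and uncapped values corresponds to the treatment cycles, and therefore the drug costs, that the stopping rule removes from the model.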

3.3.5 Assay

Our review found that only one model compared therapies that had different corresponding PD-L1 assays, but as the patient population was not restricted based on PD-L1 level, concordance issues did not arise [11].

3.3.6 Utility

Two approaches in modeling health utilities were found. A total of 15 studies [10, 13, 15, 20, 22,23,24,25,26,27,28, 40, 48,49,50] modeled utilities by health status (i.e., progression free vs. progressed), whereas another four studies [11, 18, 19, 38] applied the time-to-death approach. Ten studies [17, 41,42,43,44,45, 51, 53,54,55] explored both. None of the models combined health status and time-to-death–based utilities in a single framework in the base case; however, this option was explored by the ERG in NICE TA 557 [41].

Most models primarily defined health states based on progression when applying state-based utilities, except for one model [15, 24], which defined health states based on treatment line. Time-to-death intervals were identical in all but one study pursuing this approach: ≥ 360 days, 180–360 days, 30–180 days, and < 30 days prior to death. In NICE TA 584 [44], time periods were slightly different: ≥ 211 days, 75–210 days, 35–75 days, and < 35 days prior to death. Eight models assumed identical utility values across treatment arms without accounting for adverse event (AE) disutility and without exploring the possibility of differentiation in a scenario analysis [11, 13, 17, 20, 23, 27, 48, 50]. In 17 models, in base-case or scenario analyses, utilities were different across treatment arms, by differentiation of base utility values or by application of AE disutilities [10, 15, 18, 19, 22, 24,25,26, 41,42,43,44, 47, 49, 52, 53, 55]. Four models [42, 43, 53, 55] applied age adjustment to utilities, of which one [43] applied sex-specific utilities.
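The time-to-death approach assigns utility according to how far a patient is from death rather than according to progression status. A minimal sketch using the most common interval boundaries reported above is shown here. The utility values are hypothetical placeholders and are not taken from any of the reviewed studies; treating each lower bound as inclusive is also an assumption, as the reviewed sources do not specify how boundary days were handled.

```python
# Lower bounds of the intervals (days before death) and their utilities.
# Utility values are hypothetical placeholders for illustration only.
TTD_UTILITIES = [
    (360, 0.80),  # >= 360 days before death
    (180, 0.70),  # 180 to < 360 days
    (30, 0.60),   # 30 to < 180 days
    (0, 0.40),    # < 30 days
]

def time_to_death_utility(days_to_death):
    """Return the utility for a patient who is `days_to_death` days from death."""
    if days_to_death < 0:
        raise ValueError("days_to_death must be non-negative")
    for lower_bound, utility in TTD_UTILITIES:
        if days_to_death >= lower_bound:
            return utility

print(time_to_death_utility(400), time_to_death_utility(45))
```

In a cohort model, the proportion of patients in each interval at a given time would be derived from the OS curve and multiplied by the corresponding utility.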

Utility values corresponding to health states or time-to-death intervals were sourced only from published literature in 11 models [11, 13, 15, 20, 23,24,25,26,27, 35, 39]. EQ-5D data collected during clinical trials were used in 17 models [10, 17,18,19, 40,41,42,43,44, 47,48,49,50,51,52,53, 55]. One study mentioned adjusting literature-sourced utility values using data from European Organization for the Research and Treatment of Cancer Quality of Life Questionnaire Core 30 items from KEYNOTE-024 to reflect utility difference between treatment arms [22].

4 Discussion

To our knowledge, this is the first study to systematically investigate methodological challenges of CEAs for IOs as 1L treatment in NSCLC and how they were handled in published CEAs. We focused on key challenges, including switch, extrapolation, effect, DOT, assay, and utility. Some of these challenges may apply to CEAs of other anticancer treatments, whereas others are specific to IOs in NSCLC. The identified CEAs differed markedly in some methodological aspects. However, this variability was usually confined to different choices from a limited number of traditional options as opposed to innovative approaches. Many challenges in IO CEAs could be attributed to lack of data, particularly because IOs are relatively new. For example, few utility data are collected after disease progression, and there is considerable uncertainty about the optimal treatment durations of IOs. In addition, some challenges, such as lack of concordance in PD-L1 assays, were not adequately addressed, again possibly because no concordance data were available.

PS and Markov cohort models are the dominant approaches in modeling the cost effectiveness of IOs for advanced NSCLC and were the only approaches in identified TAs. Among published studies, we identified two patient-level simulations [11, 13] and one study [14] that simply compared OS and pharmacological costs between the different arms of each trial.
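The structural difference between the two dominant approaches can be made concrete. In a PS model, state occupancy is read directly from the OS and PFS curves, whereas a Markov cohort model derives it from transition probabilities between states. The sketch below shows the PS calculation for a three-state model; the function name and the curve values are hypothetical.

```python
def partitioned_survival_states(os_prob, pfs_prob):
    """Split a cohort into progression-free, progressed, and dead at one timepoint."""
    # PFS cannot exceed OS; cap it in case independently fitted curves cross
    pfs_prob = min(pfs_prob, os_prob)
    return {
        "progression_free": pfs_prob,
        "progressed": os_prob - pfs_prob,
        "dead": 1.0 - os_prob,
    }

# Hypothetical OS and PFS values at years 1, 2, and 3
os_curve = [0.70, 0.45, 0.30]
pfs_curve = [0.40, 0.20, 0.12]

for year, (os_prob, pfs_prob) in enumerate(zip(os_curve, pfs_curve), start=1):
    print(year, partitioned_survival_states(os_prob, pfs_prob))
```

Costs and utilities are then attached to the state proportions in each model cycle and summed over the time horizon.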

In addition to OS, PFS was another key efficacy measure in all models reviewed. However, none of the models included response as a clinical outcome. Given the extended duration of response seen with IOs, it is possible, though not yet confirmed, that response may adequately act as a surrogate for OS in models of IOs. Response-based modeling has been employed in other indications but not yet in NSCLC. Gibson et al. [56] compared economic evaluations of nivolumab in combination with ipilimumab in patients with metastatic melanoma using a conventional three-state PS approach and a five-state PS approach that split the pre-progression state by response status and the post-progression state by 1L treatment type (IO vs. conventional therapies). The OS and PFS estimates from both approaches were identical, whereas the five-state approach provided more granularity in terms of response breakdown; consequently, response-specific utility data can be applied to differentiate between the utility levels of pre-progression patients with and without response. The incremental cost-effectiveness ratio (cost per quality-adjusted life-year) was lower with the five-state PS approach; however, it could not be concluded that this lower ratio was due to inclusion of response in the model structure, given the presence of other confounding factors (e.g., the utility data employed in the two approaches were from different sources). From the reimbursement perspective, response-based models have not gained acceptance in TA submissions. The NICE ERG mainly criticized the insufficient justification for a response-based approach over the conventional one, which added uncertainty to the cost-effectiveness estimates [57]. Further clinical research is required to understand the impact of response on OS, health-related quality of life, or medical resource use. If such a relationship exists, a response-based model may be an appropriate methodology to capture the health economic implications of IOs.

The important role of PFS in CEAs of IOs raises further methodological challenges in comparing therapies across different trials with an unanchored comparison (e.g., comparison with a single-arm study). When the schedule of study visits for assessing progression differs significantly between trials, the estimated relative efficacy of treatments will be biased [58]. We recommend that future CEAs consider whether time-assessment bias may be present in the analysis and adjust for it if necessary, using published assessment-schedule matching techniques [58]. This concern can be more pronounced when assessing the cost effectiveness of new IOs in earlier cancer stages, in which PFS is more frequently used as a primary endpoint and the evaluation is conducted at an early stage when only a single-arm study is available (e.g., phase II data).

4.1 Switch

Treatment switching (i.e., patients switching from the control group to IOs following disease progression) occurs frequently in oncology trials, particularly as IOs have become part of the 2L treatment landscape. Without an adjustment to the standard-of-care arm, switching risks underestimating the treatment effect of the experimental IO. It should be noted that, as IOs became widespread in later lines and their use after progression on chemotherapy became standard practice in patients whose tumors expressed PD-L1 (e.g., in the UK and USA), it became unnecessary to apply a switching adjustment in model analyses, provided that the switching pattern in the trial (i.e., treatments, proportion of switching) adequately reflects clinical practice in the country of the analysis. In addition, for this assumption to hold, the population of the analysis should represent the population indicated for the use of IOs in subsequent lines.

Several of the included models adjusted for treatment switching, but none of those adjusting in the base case provided a rationale for the adjustment method selected. This lack of adequate justification for method choice is in line with the findings of an SLR of published trials and industry submissions in oncology [59], which found that the quality of reporting on implementation of a switching adjustment was generally poor. In agreement with the recommendations of Sullivan et al. [59], future CEAs that employ treatment-switching adjustment should discuss whether the underlying assumptions corresponding to the chosen adjustment method are met in the specific application and should present a visual comparison of observed and adjusted OS curves. A comparison of different switching adjustment methods and recommendations for their application is available in published literature [59] and in a NICE technical support document [60]. The most appropriate adjustment method must be carefully evaluated against the assumptions required by each method, which depend on the data and characteristics of the trials [60].

4.2 Extrapolation

OS is one of the key parameters affecting cost-effectiveness results, yet OS extrapolation is particularly challenging for IOs given frequent delays in clinical benefit. This delay may result in an initially higher rate of progression or death than in the chemotherapy arm [61,62,63]. Related to this concern is the sustained response to treatment that may be experienced by surviving patients receiving IO. This sustained response is reflected in a plateau of the tail of the OS curve, making it difficult to credibly extrapolate OS based on trial data and parametric extrapolations only. According to HTA guidelines, external data can be used to inform long-term survival or assess the plausibility of extrapolated OS [64]. Choosing between parametric extrapolation based on trial data and the use of external data involves the usual trade-off between internal and external validity. Researchers must carefully assess whether the population and treatment options in external data are in line with those of the trial.

We identified seven independent models using RWD (all of which used SEER data) [11, 17,18,19, 43, 50, 54] and one adaptation using NICER data [9] for long-term extrapolation. We are not currently aware of any data that would be more suitable for the purpose, and perhaps for this reason, the use of SEER data for OS extrapolation was accepted in the reviewed NICE, SMC, and pCODR submissions, even though SEER data are based on the US population. It is important to acknowledge the limitations that come with SEER data, and we recommend that future assessments comment on how these limitations may affect results. A common limitation is that applying SEER data in model adaptation for a country other than the USA may not accurately reflect that country’s population. We recommend using geography-specific data whenever available. In addition, the SEER database contains patients, namely EGFR- and ALK-positive patients, who may live longer than the target population of IOs. In NICE TA 600 [43], the ERG highlighted the limitations of using SEER data even for extrapolation of OS for patients receiving chemotherapy, given that these patients did not have access to 2L IOs. Although this limitation is not specific to SEER, but is rather related to the limited time since the introduction of IOs in NSCLC, it is nevertheless important to assess its impact on estimations.

While OS extrapolation in general is found to be a key driver of incremental cost-effectiveness ratios (ICERs), analyses explicitly comparing scenarios with and without RWD for OS extrapolation reported relatively small differences between the ICERs. For example, in one model, the authors claimed that using parametric extrapolations instead of NICER data increased the ICER by 8.3% compared with the base-case value [9], whereas in pCODR 10101 [50], the ICER increased by only 2.6% with SEER data. Quantifying the impact of the extrapolation method on the ICER is informative, but caution is advised when generalizing these results because different functional forms in the parametric approach could lead to different results. In addition, in most cases the reported differences in ICERs also reflected differences in implicit assumptions on long-term treatment effect between a scenario based on separate parametric extrapolations and one based on external data.

Finally, it is good practice to validate OS extrapolations against external data. We found only one model using RWD (i.e., US Flatiron data) for validation of its SEER-based OS extrapolation [17].

However, the lack of long-term data for patients receiving 1L IOs presents a challenge when using external data sources for OS extrapolation and validation. Currently, external data on chemotherapy-treated patients have been used to extrapolate OS for those patients, and assumptions on relative efficacy are applied to these curves to produce an OS extrapolation for IO-treated patients in the reviewed studies. As more and longer follow-up RWD on IOs become available for 1L treatment of advanced or metastatic NSCLC, it will be possible to differentiate extrapolations of OS for chemotherapy- and IO-treated patients [65]. Despite the plateau observed in the tail of OS curves of some IOs, we did not identify any studies using a cure model.
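The extrapolation approach described above (an external chemotherapy-based tail, with an assumed relative effect applied for IO-treated patients) can be illustrated with a minimal sketch. All numbers, the function name `extend_curve`, and the use of constant monthly hazards are assumptions made for illustration; they are not taken from any reviewed model.

```python
import math

def extend_curve(observed, external_hazards, hazard_ratio=1.0):
    """Append an extrapolated tail to an observed survival curve.

    observed: survival proportions at months 0..T from the trial (KM estimates).
    external_hazards: monthly hazards for months T+1, T+2, ... taken from an
        external source on chemotherapy-treated patients.
    hazard_ratio: assumed relative effect applied to the external hazards
        (1.0 reproduces the chemotherapy-based tail unchanged).
    """
    curve = list(observed)
    for h in external_hazards:
        # Survival over one month with hazard h scaled by the assumed HR.
        curve.append(curve[-1] * math.exp(-hazard_ratio * h))
    return curve

# Hypothetical inputs, for illustration only.
chemo_trial = [1.00, 0.95, 0.89, 0.83, 0.77]
io_trial = [1.00, 0.94, 0.90, 0.86, 0.82]
registry_hazards = [0.06] * 12  # assumed constant monthly hazard from external data

chemo_os = extend_curve(chemo_trial, registry_hazards)
io_os = extend_curve(io_trial, registry_hazards, hazard_ratio=0.7)
```

In this sketch the IO tail depends entirely on the assumed hazard ratio, which is why the plausibility of that assumption needs to be examined in scenario analyses.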

4.3 Effect

Without a full understanding of the long-term benefit of IO, estimating the extent and duration of treatment benefit is a particularly challenging methodological issue to address in models. Lack of RWD on long-term benefits of IOs makes it challenging to develop robust assumptions for model development. Related to this challenge is how to model delayed clinical benefit and how to address truncated survival curves for patients receiving IO [56].

Although some evidence exists of sustained response following discontinuation of IOs, the extent of long-term relative treatment effect after cessation of IO is still uncertain given the limited availability of long-term data [66]. In the reviewed studies [11, 17,18,19, 41,42,43,44,45, 50, 51, 53,54,55], treatment-effect waning was implemented by setting the HR to 1.0 at a prespecified timepoint, as opposed to having a single OS curve for the time period starting at the cutoff. Given the lack of evidence on long-term outcomes, assuming that the treatment effect stops after 3–5 years can be considered conservative. The cutoff timepoint is rather arbitrary for now, but more real-world evidence on patients receiving IO is accumulating and will inform this decision. It should be noted that uncertainty in long-term benefit may also be affected by patients receiving treatment for shorter periods than observed in clinical trials.

The assumption of treatment-effect waning affects the ICER to a varying degree depending on model structure and parameter values, including the shape of the OS curves. Therefore, the generalizability of estimates from any specific model is limited. Treatment-effect waning implies a lower efficacy of the IO, leading to a higher ICER when comparing IO with chemotherapy. The sooner the treatment effect is assumed to disappear, the higher the effect on the ICER. The quantitative importance of the difference between ICER values corresponding to scenarios with and without treatment-effect waning varies across models. One analysis experimenting with multiple cutoff points found that the ICER increased by 30% compared with the base case without treatment-effect waning if the treatment effect stopped at 5 years and by 79% if it stopped at 3 years [55]. A more moderate effect was observed in an SMC TA, in which assuming that the treatment effect discontinued at 5 years increased the ICER by 8% compared with the base-case value [52].
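The direction of this effect can be illustrated with a minimal sketch in which the HR is set to 1.0 from a prespecified cutoff onward. The constant comparator hazard, the HR, the fixed incremental cost, and the function names are all hypothetical simplifications; in a real model, costs would also change with survival and outcomes would be discounted.

```python
import math

def life_years(hazard, hr, cutoff_years, horizon_years=20, steps_per_year=12):
    """Undiscounted life years under a constant annual comparator hazard.

    The hazard is hr * hazard before cutoff_years and equals the comparator
    hazard (HR = 1.0) from the cutoff onward. cutoff_years=None means no waning.
    """
    dt = 1.0 / steps_per_year
    surv, total = 1.0, 0.0
    for i in range(horizon_years * steps_per_year):
        waned = cutoff_years is not None and i >= cutoff_years * steps_per_year
        h = hazard if waned else hr * hazard
        nxt = surv * math.exp(-h * dt)
        total += 0.5 * (surv + nxt) * dt  # trapezoidal area under the curve
        surv = nxt
    return total

def icer(delta_cost, delta_effect):
    """Incremental cost per unit of incremental effect."""
    return delta_cost / delta_effect

HAZARD = 0.6           # hypothetical annual mortality hazard, chemotherapy arm
HR = 0.7               # hypothetical hazard ratio, IO vs. chemotherapy
DELTA_COST = 80_000.0  # hypothetical incremental cost, held fixed for simplicity

ly_chemo = life_years(HAZARD, 1.0, None)
icers = {}
for cutoff in (None, 5, 3):
    delta_ly = life_years(HAZARD, HR, cutoff) - ly_chemo
    icers[cutoff] = icer(DELTA_COST, delta_ly)
    label = "no waning" if cutoff is None else f"waning at {cutoff} years"
    print(f"{label}: incremental LY = {delta_ly:.3f}, ICER = {icers[cutoff]:,.0f} per LY")
```

Under these assumptions, an earlier cutoff yields fewer incremental life years and therefore a higher ICER; the size of the change depends on the hazard, the HR, and the time horizon chosen.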

4.4 Duration of Treatment

Determining the optimal DOT for IOs remains a challenge in clinical practice [67]. Treatment-stopping rules varied in clinical trials, ranging from continuation until disease progression or unacceptable toxicity to discontinuation after 2 years of treatment. Time to treatment discontinuation (TTD) and PFS were the most commonly used DOT measures in existing models. When data on the distribution of TTD (i.e., KM curves representing the proportion of patients remaining on treatment) are available for all treatments in the comparison, those data should be preferred over PFS data. When neither patient-level data nor the KM curve of TTD is available (e.g., likely the case for comparators not included in the trial of the study drug), the median DOT commonly reported in trial publications can be considered. To use median DOT in models, however, an assumption about the distribution has to be made. PFS is often used as a proxy for TTD, especially for conventional anticancer therapies. Although this approach may be appropriate in markets with strict reimbursement policies and treatment-stopping rules, it risks underestimating the DOT of IOs given that patients may be allowed to continue IO beyond disease progression [68].
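As an illustration of the distributional assumption needed when only a median DOT is reported, the sketch below assumes an exponential distribution for TTD. This is one possible choice, not a recommendation from the reviewed models; the median value and the function names are hypothetical.

```python
import math

def exponential_rate(median_months):
    """Monthly discontinuation rate implied by a median DOT, assuming exponential TTD."""
    return math.log(2) / median_months

def proportion_on_treatment(t_months, median_months):
    """Proportion of patients still on treatment at time t under the same assumption."""
    return math.exp(-exponential_rate(median_months) * t_months)

def mean_dot(median_months, max_months=None):
    """Mean months on treatment; max_months applies a stopping rule (None = no cap)."""
    rate = exponential_rate(median_months)
    if max_months is None:
        return 1.0 / rate
    # Area under the exponential TTD curve up to the stopping rule.
    return (1.0 - math.exp(-rate * max_months)) / rate

median = 7.0  # hypothetical reported median DOT in months
uncapped = mean_dot(median)
capped = mean_dot(median, max_months=24)  # e.g., a 2-year stopping rule
```

Because drug costs are driven by mean rather than median time on treatment, the choice of distribution (and whether a stopping rule truncates it) can materially change the estimated treatment cost.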

Based on the reviewed models, we found an evidence gap related to the DOT for IOs. Until long-term RWD on DOT are available to support estimations of DOT in the model, alternative approaches should continue to be explored. An analysis of TTD and PFS data for IOs, targeted therapy, and chemotherapy in NSCLC was provided by Blumenthal et al. [67]. Their results may also inform the choice of the most appropriate approach for modeling TTD in any specific model. Inspired by the high observed correlation between TTD and PFS [67], one approach we propose for consideration relies on TTD and PFS curves being available for one or more of the treatments being compared. The relationship between the TTD and PFS curves of one treatment (e.g., the HR of PFS vs. TTD) can be assumed to hold for comparator treatments, so that TTD curves for those treatments are estimated by applying the same HR to their PFS curves. This method assumes that the relationship between TTD and PFS for one treatment is applicable to other treatments. In a similar approach, an HR between the PFS curves of two therapies can be derived and applied to the TTD curve of one treatment to obtain a proxy for the TTD curve of the other treatment. This method uses the observed difference in PFS across treatments to estimate unobserved differences in TTD across treatments (i.e., assuming that the relative treatment effect on PFS also applies to TTD). However, although the estimated TTD curve will mimic the shape of the observed PFS curve in the first approach, it will mimic the shape of the TTD curve corresponding to the comparator treatment in the second. Any assumptions should be carefully validated with clinical experts as, to our knowledge, these methods have not been applied in economic models of 1L NSCLC treatments.
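The two proposed approaches can be sketched as follows, assuming proportional hazards so that a curve scaled by an HR is the baseline curve raised to the power of that HR. The curves are hypothetical exponential curves, the function names are illustrative, and the crude HR estimate (an average ratio of cumulative hazards) is an assumption of this sketch; with patient-level data, a regression-based HR would normally be used instead.

```python
import math

def cumulative_hazard_ratio(num_curve, den_curve):
    """Crude HR: mean ratio of cumulative hazards, -ln S(t), across time points."""
    ratios = [math.log(a) / math.log(b)
              for a, b in zip(num_curve, den_curve)
              if 0 < a < 1 and 0 < b < 1]
    return sum(ratios) / len(ratios)

def apply_hr(base_curve, hr):
    """Scale a survival-type curve by an HR under proportional hazards."""
    return [s ** hr for s in base_curve]

months = range(0, 25)
# Hypothetical curves: both TTD and PFS observed for the reference treatment,
# only PFS observed for the comparator.
ref_pfs = [math.exp(-0.10 * t) for t in months]
ref_ttd = [math.exp(-0.08 * t) for t in months]
comp_pfs = [math.exp(-0.15 * t) for t in months]

# Approach 1: TTD-vs-PFS relationship from the reference, applied to comparator PFS.
hr_ttd_vs_pfs = cumulative_hazard_ratio(ref_ttd, ref_pfs)
comp_ttd_1 = apply_hr(comp_pfs, hr_ttd_vs_pfs)

# Approach 2: PFS HR between treatments, applied to the reference TTD curve.
hr_pfs = cumulative_hazard_ratio(comp_pfs, ref_pfs)
comp_ttd_2 = apply_hr(ref_ttd, hr_pfs)
```

With these idealized exponential inputs the two approaches coincide; with real KM curves they generally differ, because the first inherits the shape of the comparator's PFS curve and the second inherits the shape of the reference treatment's TTD curve.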

4.5 Assay

CEAs comparing different IOs face an additional difficulty because each IO drug has a corresponding PD-L1 assay: PD-L1 IHC 22C3 pharmDx assay (Agilent; Santa Clara, CA, USA) for pembrolizumab, VENTANA PD-L1 SP142 assay (Roche Diagnostics; Basel, Switzerland) for atezolizumab, PD-L1 IHC 28-8 pharmDx (Agilent) for nivolumab, and VENTANA PD-L1 SP263 assay (Roche Diagnostics) for durvalumab or pembrolizumab. PD-L1 IHC 73-10 assay (Agilent) is under development for avelumab. These assays have different sensitivities, and there is no perfect concordance between their results [69, 70].

In our review, we identified only one model comparing different IOs (atezolizumab in combination with chemotherapy vs. pembrolizumab in combination with chemotherapy); however, the patient population was not restricted by PD-L1 level in the trials compared, so the concordance issue did not arise [11]. Evidence on comparability across assays can help map populations and treatment efficacy across trials. In a study [69] with 81 NSCLC tumor samples, classification of patients by PD-L1 expression level yielded comparable results when using the PD-L1 IHC 22C3 pharmDx, PD-L1 IHC 28-8 pharmDx, or VENTANA PD-L1 SP263 assays. However, employing the VENTANA PD-L1 SP142 assay or the PD-L1 IHC 73-10 assay created a marked difference in classification. For any specified PD-L1 staining threshold, the proportion of patients classified as positive was lower with the VENTANA PD-L1 SP142 assay and higher with the PD-L1 IHC 73-10 assay than with the PD-L1 IHC 22C3 assay. Another recent study [70] compared PD-L1 staining with the PD-L1 IHC 73-10 assay and the PD-L1 IHC 22C3 pharmDx assay in clinical NSCLC samples and illustrated comparable staining between ≥ 80% cutoff for the 73-10 assay and ≥ 50% cutoff for the 22C3 assay. Detailed results on the comparability of PD-L1 assays are summarized in ESM-5. In economic analyses, differences in sensitivity should be accounted for, and scenario analyses should be carried out to explore the sensitivity of results to this factor.

4.6 Utility

Health utility inputs have an important impact on ICERs. Beyond methodological preference, the quality of available data can also drive the choice of utility modeling approach. For example, the time-to-death approach requires sufficient sample sizes in each time-to-death window, and the resulting utility values must have face validity.

Justifications for choosing the time-to-death approach include that it can capture decline in health-related quality of life towards death and that the approach can distinguish between health states in a more refined way than health state-based utilities [43]. All but one of the models reviewed [44] applied the same time windows, and none of the papers provided justifications. All models reporting on the utility modeling method (n = 29 [10, 11, 13, 15, 17,18,19,20, 22,23,24,25,26,27, 35, 38, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55]) were published since 2017, with more than half (n = 17 [10, 11, 19, 20, 22,23,24,25,26,27, 43, 44, 48, 49, 51, 52, 54, 55]) published since 2019. Based on this limited time period, we cannot assess which approach is increasingly accepted in the literature. The lack of a statistically significant difference between utility values provides justification for pooling utility values across treatment arms.

Neither health state-based nor time-to-death utility models were clearly favored by the ERGs. In NICE TA 600 [43], the ERG expressed concerns related to time-to-death–based utilities due to potential overestimation of utilities for patients with longer time-to-death categories. In NICE TA 584 [44], the ERG confirmed that the time-to-death approach had more face validity than a utility approach based on progression status. In NICE TA 557 [41], the ERG implemented two scenarios. In the first scenario, time-to-death utilities were complemented by a utility decrement for progressed patients, whereas in the second scenario, progression-based utilities were adjusted by a utility decrement for patients having < 360 days until death. The company submission in NICE TA 584 claimed to have explored separate utility models for patients on and off treatment but abandoned the approach because of a lack of clear statistical difference between coefficients estimated for the two models [44].
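The structure of the two hybrid scenarios described for NICE TA 557 can be sketched as below. Only the 360-day threshold comes from the text above; the time-to-death windows, all utility values, the decrements, and the function names are hypothetical placeholders chosen for illustration.

```python
# Hypothetical time-to-death windows: (upper bound in days, utility).
TTD_UTILITIES = [(30, 0.40), (180, 0.60), (360, 0.70)]
TTD_UTILITY_BEYOND = 0.80  # hypothetical utility for >= 360 days until death

# Hypothetical progression-based utilities.
STATE_UTILITIES = {"progression_free": 0.78, "progressed": 0.68}

def time_to_death_utility(days_to_death):
    """Utility assigned by time-to-death window."""
    for upper_bound, utility in TTD_UTILITIES:
        if days_to_death < upper_bound:
            return utility
    return TTD_UTILITY_BEYOND

def scenario_1(days_to_death, progressed, decrement=0.05):
    """Time-to-death utility, minus a decrement if the patient has progressed."""
    utility = time_to_death_utility(days_to_death)
    return utility - decrement if progressed else utility

def scenario_2(days_to_death, progressed, decrement=0.10):
    """Progression-based utility, minus a decrement if < 360 days until death."""
    state = "progressed" if progressed else "progression_free"
    utility = STATE_UTILITIES[state]
    return utility - decrement if days_to_death < 360 else utility

# Example: a progressed patient 200 days before death under each scenario.
u1 = scenario_1(200, progressed=True)
u2 = scenario_2(200, progressed=True)
```

Both scenarios combine the same two sources of information; they differ in which one defines the base utility and which one enters as an adjustment.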

4.7 Limitations

This review is subject to some limitations. We only captured publications and TA submission documents available in English and published before January 2020. As a result, more recently published models, which may have addressed some of the challenges highlighted here, may not have been included. Although we did not review conference abstracts released before 2016, this is unlikely to have affected our conclusions because models presented at scientific conferences before that date have most likely since been published in peer-reviewed journals. In addition, IOs have only recently been approved for 1L treatment of NSCLC.

Where possible, we avoided directly comparing methodological issues related to the model approaches across publications and TAs because of differences in the level of reporting.

The initial set of methodological topics to be explored in this review were specified a priori based on initial desk research and expert agreement. The list was subsequently expanded following insight gained through the data extraction process. Some other important methodological aspects of modeling IOs may nevertheless have been missed.

Finally, we did not systematically examine differences in methodology across subpopulations within the same model. Although including subgroup analyses in our review could change the count of models applying various reported methodological techniques, it would be unlikely to introduce additional methodological topics as we did not observe any additional challenges associated with subgroup analyses within our review.

5 Conclusions

Our review demonstrated that conventional oncology models (PS and Markov cohort models) were the most commonly used modeling approaches. Only a few published studies and none of the identified TAs used patient-level simulations. Treatment-switching adjustment is gradually becoming less relevant as IOs gain ground in the 2L treatment of NSCLC. Only one-fifth of identified models used RWD for survival extrapolation or validation; however, we expect more models will use this approach as RWD for IOs with longer follow-up become available. Treatment-effect waning was explored in base-case or scenario analyses in most TAs but in only a minority of published models. Following previously accepted TA submissions, most models imposed a 2-year stopping rule for DOT with IO. PD-L1 assay concordance was not discussed in any identified studies. As the landscape becomes more crowded and therapies are targeted to patient groups with different PD-L1 expression levels, assay concordance will become a critical aspect of modeling. Health state-based and time-to-death utility modeling were widely used, with a large fraction of models exploring both options.

Most models used similar basic approaches, but heterogeneity regarding methodological aspects was substantial. We encourage researchers from different sectors (e.g., HTA bodies, pharmaceutical industry, academia) to develop guidance on modeling approaches, foster the convergence of methods, and explore the possibility of collaborating to construct a scientifically sound and transparent model that could serve as a benchmark for future CEAs of IOs in the 1L treatment of NSCLC.