Key Points for Decision Makers

Many existing CEAs of LDCT lung screening include only a small range of screening intervals, which can lead to a failure to identify the optimally cost-effective strategy.

Although some recent studies have considered alternative risk eligibility criteria for screening, the assumption of a common strategy for all those eligible means the potential benefits of risk stratification may be overlooked.

Policy makers considering adopting lung screening should be aware of the limitations within the LDCT CEA literature to date.

1 Introduction

Lung cancer is the most common cause of cancer mortality, responsible for 18% of all cancer deaths worldwide [1]. Low-dose computed tomography (LDCT) screening can detect asymptomatic lung cancer, enabling earlier intervention at a more treatable stage [2]. Both the US National Lung Screening Trial (NLST) and the Dutch-Belgian Lung Cancer Screening (NELSON) randomised controlled trials (RCTs) found a cause-specific mortality reduction and the NLST has also found an all-cause mortality reduction [2, 3]. Despite this, lung screening is not yet widely recommended. A recent survey of 21 high-income countries found LDCT screening is only recommended in the USA and Canada (annual screening in both cases), while the UK and Australia actively recommend against it [4,5,6].

Several existing reviews have examined the cost-effectiveness evidence for CT lung screening [7,8,9,10,11,12,13,14,15,16]. Peters et al provide a recent, detailed and comprehensive analysis of methodological variation between published analyses [15]. That review and Raymakers et al note the wide range in incremental cost-effectiveness ratio (ICER) estimates, varying from $US1464/quality-adjusted life-year (QALY) to $US207,000/QALY [8, 15]. Ngo et al provide a detailed review of utility weights in CEAs of lung screening [16]. Several reviews have commented critically on the methodological shortcomings within the literature, especially in older pre-NLST/NELSON studies [9, 11, 13, 15]. Both Peters et al and Raymakers et al report most analyses compare annual screening relative to no screening [8, 15], the significance of which this review examines in detail.

This review addresses two issues regarding modelling methods in CEAs of LDCT lung screening. The first is the breadth of choice of strategies compared within analyses. An aspect of strategy choice that has been recognised as particularly relevant to cancer screening is sufficient variation in the screening intervals compared. Specifically, Torrance et al note that to adequately assess the cost effectiveness of a given strategy it is important to compare against an alternative with a longer screening interval in order to estimate the incremental benefits of the strategy in question [17]. For example, to estimate the incremental benefit of an annual screening strategy it would typically be necessary to compare it to a biennial screening strategy rather than simply comparing to no screening.

Comparisons of costs and health effects of a given strategy versus no intervention generate what are termed average cost-effectiveness ratios (ACERs), while incremental comparisons among the efficient set of possible policies generate ICERs. An ACER within the cost-effectiveness threshold is a necessary condition for a strategy to be cost effective, while an ICER within the threshold is a sufficient condition. So, while ACERs can have some usefulness in ruling out cost-ineffective strategies, ICERs are more useful when seeking optimal policies. In principle, health gain maximisation from a given budget constraint can be achieved when allocating resources according to ICERs, but not according to ACERs [18, 19]. For this reason, CEA methods guidelines recommend simulating as many feasible strategies as possible and using ICERs rather than ACERs to assess their cost effectiveness [20,21,22].
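The distinction between ACERs and ICERs can be illustrated with a minimal sketch. All strategy names, costs, effects and the threshold below are hypothetical values chosen for illustration (totals per 1,000 people) and are not taken from any reviewed study. The sketch removes dominated strategies, computes ICERs between neighbours on the efficient frontier, and shows a case in which annual screening has an ACER within the threshold but an ICER above it.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    cost: float    # hypothetical total cost per 1,000 people
    effect: float  # hypothetical total QALYs gained per 1,000 people


def ratio(s: Strategy, comparator: Strategy) -> float:
    """Extra cost per extra unit of health effect of s versus comparator."""
    return (s.cost - comparator.cost) / (s.effect - comparator.effect)


def efficient_frontier(strategies):
    """Drop strongly dominated and extendedly dominated strategies."""
    frontier = []
    for s in sorted(strategies, key=lambda s: (s.cost, -s.effect)):
        if frontier and s.effect <= frontier[-1].effect:
            continue  # strongly dominated: costs at least as much, no more effective
        frontier.append(s)
        # Extended dominance: ratios must rise along the frontier
        while (len(frontier) >= 3
               and ratio(frontier[-2], frontier[-3]) >= ratio(frontier[-1], frontier[-2])):
            del frontier[-2]
    return frontier


THRESHOLD = 30_000  # hypothetical cost-effectiveness threshold per QALY
NO_SCREENING = Strategy("no screening", 0, 0)
STRATEGIES = [
    NO_SCREENING,
    Strategy("triennial", 300_000, 20),
    Strategy("biennial", 500_000, 28),
    Strategy("annual", 1_000_000, 36),
]

# ACER: every strategy compared with no screening
acers = {s.name: ratio(s, NO_SCREENING) for s in STRATEGIES if s is not NO_SCREENING}

# ICER: each frontier strategy compared with the next less costly frontier strategy
frontier = efficient_frontier(STRATEGIES)
icers = {b.name: ratio(b, a) for a, b in zip(frontier, frontier[1:])}

# Optimal strategy: the most effective one whose ICER is within the threshold
optimal = NO_SCREENING.name
for name, value in icers.items():
    if value <= THRESHOLD:
        optimal = name

print(acers)
print(icers)
print(optimal)
```

In this made-up example every ACER lies below the threshold, so an analysis comparing annual screening only with no screening would appear to support it, whereas the incremental analysis identifies biennial screening as optimal.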

A second methods issue particularly relevant to lung screening is the role of disease risk in determining eligibility. The cost effectiveness of screening typically improves as disease incidence increases within the screened population, all else equal. As lung cancer is associated with smoking and other risk factors, there may be scope to optimise screening by offering more intense strategies to those at greatest risk. Some trials disaggregate participants by smoking history, which has informed modelling of various risk eligibility thresholds. Two broad alternative approaches to risk selection are observed in the CEA literature and have been examined recently [23]. One is the simulation of alternative strategies in separate strata-specific analyses. This implies strata-specific strategies are considered feasible. Alternatively, analyses may compare alternative screening policies over a range of alternative screening eligibility criteria with the assumption that all those eligible for screening are offered a common strategy. That is, although alternative risk eligibility thresholds are considered, there is no stratification in the provision of screening within those deemed eligible. It matters whether CEAs assume that a single screening policy is offered to all eligible screenees, or whether stratified policies are available whereby different screening strategies are offered to those at different risks. This is because, in principle, offering risk-tailored screening can achieve greater efficiency than simply varying the eligibility threshold for a common strategy [23].

While previous reviews have addressed methodological concerns in general, this study provides a focused analysis of both strategy comparison and risk stratification in CEAs of lung screening. The first study objective is to assess how broad a range of screening intervals has been assessed in CEAs to date and to examine the implications for ICER estimation and policy choice. The second objective is to characterise and assess what forms of risk stratification have been undertaken by CEAs of LDCT. This review deliberately does not address whether LDCT screening is cost effective or seek to aggregate or summarise cost-effectiveness estimates. It is hoped that this study will help refine CEA modelling methods in lung screening research, thereby reducing variance in cost-effectiveness estimates and clarifying optimal policies.

2 Methods

We conducted a systematic search of the PubMed, Embase and Web of Science databases to retrieve CEAs of LDCT lung cancer screening. The search was conducted in February 2022. The search strings are given in the electronic supplementary material Appendix.

Title and abstract screening was conducted by four reviewers (MF, KH, ON, MÓG). We excluded articles if they were reviews, assessed an irrelevant intervention or if they were not CEAs using model-based analyses. The restriction to model-based analyses excludes studies based directly on trial results. The rationale for this exclusion is that such analyses cannot vary the screening strategy employed in terms of screening frequency or age range. We excluded studies that did not numerically report estimated costs and health effects. We only included studies reporting health effects as either life years gained (LYG), quality-adjusted life years (QALYs) or other directly analogous measures of unadjusted or quality-of-life (QoL)-adjusted health gain. We only included analyses considering LDCT screening and those published in English. Records were reviewed independently with at least two reviewers assessing each record. Disagreement regarding inclusion was resolved by all reviewers who reviewed the articles together to reach consensus.

We extracted details of the screening strategies assessed and the outcomes reported. These data were recorded in spreadsheet software, with each study assessed by one reviewer. We recorded basic study information including the year of publication and national setting. We recorded what screening strategies were simulated in terms of the screening age ranges and screening frequencies. We only considered strategies assessed within the base-case analyses and not those within sensitivity analyses as we assume consideration of additional results for alternative parameter values would not meaningfully influence our analysis. We assessed what different risk subgroups were considered, if any. We considered subgroups differentiated by sex, smoking history or ethnicity all to represent differences in risk groups. Regarding age, when a study presented cost-effectiveness estimates for various age subgroups in separate analyses, we considered the analysis stratified by age. Conversely, when cost-effectiveness ratios were estimated incrementally between alternative screening age ranges, such studies were not considered as stratifying by age. We recorded whether results were presented as LYG or QALYs, or both.

Our first stage of assessment examined the range of strategies compared within the analyses, the implications for the cost-effectiveness ratio estimates and the resulting policy recommendations. At the most basic level we assessed how many analyses simply compared one screening strategy to no screening, thereby only permitting the estimation of an ACER. The next level of analysis applied to studies that compared multiple strategies. It considered how many studies varied the interval length and determined how many studies could estimate ICERs based on comparisons of alternative intervals. Specifically, we examined how many analyses included comparators of 2 years and longer, thereby potentially offering the basis for incremental comparisons for annual screening and biennial screening. To examine the relevance of intervals of 2 years or longer we identified which studies found strategies with intervals of 2 years or more to be optimal. In addition, we considered the relevance of QoL-adjustment of health effects to the choice of strategies modelled. Our second stage of assessment examined if risk selection was considered in the analyses and considered how this was implemented. We determined if the studies identified subgroup-specific policies in separate analyses, implying strata-specific policies were considered feasible, or if they considered alternative screening eligibility criteria within a single analysis, implying one common screening strategy will be provided to all deemed eligible.

3 Results

3.1 Overview

Our search yielded 49, 56 and 91 titles from PubMed, Embase and Web of Science, respectively. Combining titles and removing duplicates resulted in 92 articles for abstract review, following which there were 52 articles for full text review. We found studies that were effectively duplicates in two cases. Results of a revision by Jaine et al from an earlier publication were published as a correction, so we used the latter analysis [24, 25]. Griffin et al is the academic publication of a health technology assessment by Snowsill et al [13, 26]. As the latter is more comprehensively reported we excluded the former. Three additional studies were found from citation tracking [27,28,29]. Ultimately, 33 articles were accepted for critical appraisal. A PRISMA flow diagram is presented in Fig. 1.

Table 1 details the reviewed studies. All were published since 2001 and over half were published since 2018. The studies are almost exclusively from high-income countries, with the exception of one Iranian and two Chinese analyses. The USA is the most frequent country of origin with 12 studies, followed by Canada with 3.

To clarify the comparisons that have been made between strategies, Table 1 categorises the literature according to the number of intervals and age ranges simulated and how they are compared. Where multiple subgroups were reported in the original studies, we have reported results only for the first or most prominently reported for brevity. The risk subgroups column reports if alternative risk groups are considered in separate analyses (“stratified”) or if a single analysis is used to assess alternative eligibility thresholds under the assumption that all screened individuals receive a common strategy (“uniform strategies”). The table reports the smoking history categories considered, if any; whether or not health gains are QoL-adjusted; whether the resulting cost-effectiveness ratio is judged an ACER or ICER; the optimally cost-effective strategy identified in each study and its comparator; and, where relevant, the reported cost-effectiveness ratio of the optimal strategy.

The first category is of analyses that only consider a single screening interval, of which there are 23 (70%). These comprise studies with either only one age range (Category 1A, n = 13; 39%), or multiple age ranges (Category 1B, n = 6; Category 1C, n = 4). The multiple alternative age ranges considered in Category 1B represent different potential recipient subgroups of different ages compared independently (assessing policies over separate potential recipient cohorts of different ages), whereas in Category 1C they represent competing policy options compared to each other for a given birth cohort (assessing different screening ages within one potential recipient cohort). The second category is of analyses with multiple screening intervals with either one age range (Category 2A, n = 1) or multiple age ranges as alternative potential recipient subgroups (Category 2B, n = 1). The third category is of analyses with multiple screening intervals and multiple alternative screening age ranges in which the alternative age ranges are competing strategies compared to each other (Category 3, n = 8; 24%).

Fig. 1 PRISMA flow diagram performed 3rd February 2022 with literature search protocol and exclusion criteria

Table 1 Studies categorised by number of screening intervals and age ranges compared

3.2 Comparisons and Cost-Effectiveness Ratios

As there is no variation in the screening frequency or the age range within the strategies presented in Category 1A, the cost-effectiveness ratios estimated by most are necessarily ACERs. There are four possible exceptions. Marshall et al and Whynes only consider one-time screening [30, 31], so the ICERs of the single strategy in these cases are necessarily synonymous with ACERs. Allen et al and Kowada consider alternative test modalities and so can estimate ICERs, but both do so on the basis of just one screening interval [32, 33]. The studies in Category 1B all estimate the ACERs of the given strategies relative to no screening. The four studies in Category 1C estimate ICERs between alternative screening ages, but do not vary the screening frequency to include intervals of 2 years or longer [28, 34,35,36]. Within Category 2, Goffin et al estimate an ICER for annual versus biennial screening [37]. McMahon et al estimate an ICER for annual screening relative to no screening, as the alternative one-time screening strategy does not lie on the efficient frontier and each of the alternative age ranges is considered separately [38]. All the cost-effectiveness ratios estimated by studies in Category 3, except those of Kim et al, are ICERs as they are incremental comparisons between both alternative screening intervals and alternative candidate age ranges. While Kim et al do consider alternative intervals and age ranges, the reported ICERs are actually ACER comparisons to no screening [39].

In total, 16 (48%) studies present ACERs rather than ICERs [25, 30, 39,40,41,42,43,44,45,46,47,48,49,50,51,52]. An additional 6 (18%) studies estimate ICERs of annual screening without assessment against biennial comparators [28, 32,33,34,35,36]. A further 2 (6%) studies are classified as reporting ICERs as they have assessed one-time screening per lifetime [30, 31].

Nine of the ten studies that considered multiple screening intervals assessed annual and biennial screening [13, 27, 29, 37, 39, 53,54,55,56]. In these cases, biennial screening offers a lower intensity strategy against which to assess the incremental cost effectiveness of annual screening. Four of the ten studies included comparator strategies with intervals longer than 2 years, thereby permitting assessment of the incremental cost effectiveness of biennial screening. McMahon et al considered one-time screening; Snowsill et al considered 1- and 3-times per lifetime screening; Tomonaga et al included triennial screening; and Diaz et al considered screening intervals of between 1 and 5 years and one-time screening [13, 38, 53, 56].

Among the ten studies that assessed multiple intervals, five found screening intervals of 2 years or more to be optimal [13, 27, 29, 37, 53]. Therefore, had these ten studies not featured screening intervals of 2 years or more, half would have necessarily reached different conclusions regarding what strategy was optimal and half would not. Similarly, among the four studies that considered strategies with intervals longer than 2 years, two would have reached different conclusions had these intervals been omitted [13, 53], while two would not [38, 56]. Whether or not the omission of strategies would lead to the choice of a more intense strategy depends on the particular study. For instance, had Snowsill et al and Diaz et al simulated only annual screening, then an examination of the estimated costs and effects against the cost-effectiveness thresholds stated in each study would have shown that the former would have found that no screening was optimal, while the latter would favour annual screening [13, 53].

3.2.1 QoL Adjustment

Quality-adjusted life-years or other QoL-adjusted health effects are reported by 21 (64%) studies, while the remainder report LYG or other unadjusted measures. Of the 21 reporting QALYs, 8 report both QALYs and LYG.

The relevance of QoL adjustment to the choice of strategies compared is revealed by the few studies that vary the screening interval and report both LYG and QALY estimates. We present the results from two such analyses as examples. The efficient frontiers under LYG and QALY analyses from Toumazis et al are reproduced in Fig. 2A, B respectively [27]. The optimal strategy given a cost-effectiveness threshold of $US100,000 per unit of health gain is circled in each case. Annual screening is optimally cost effective under a LYG analysis, but biennial screening is optimal under a QALY assessment. The explanation is that the disutility arising from screening, which is greater for more frequent strategies, is sufficient to increase the ICER of annual screening to above the threshold in a quality-adjusted analysis. Analogous estimates by Snowsill et al are shown in Fig. 3A, B (some strategies omitted for clarity) [13]. The threshold is GBP30,000 per unit of health gain. The optimal strategy under the LYG analysis is three screens per lifetime but only one per lifetime under the QALY analysis. These examples show that the application of QoL adjustment can be sufficient to change the optimal screening interval.
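The mechanism described above can be sketched numerically. The figures below are hypothetical totals per 1,000 people and are not the estimates of Toumazis et al or Snowsill et al; the per-screen QALY loss is likewise an assumed value, and biennial screening is assumed to be the relevant frontier comparator for annual screening. The sketch shows how subtracting a small disutility for each screening round can push the ICER of annual screening from below to above a threshold.

```python
THRESHOLD = 100_000  # hypothetical threshold per unit of health gain

# Hypothetical totals per 1,000 people: cost, life years gained, screens per person
BIENNIAL = {"cost": 500_000, "lyg": 30.0, "screens": 10}
ANNUAL = {"cost": 1_000_000, "lyg": 36.0, "screens": 20}

# Assumed QALY loss per screening round, summed over 1,000 people (hypothetical)
DISUTILITY_PER_SCREEN = 0.125


def life_years(strategy):
    """Unadjusted health effect."""
    return strategy["lyg"]


def qalys(strategy):
    """Life years gained minus the assumed QALY loss from each screening round."""
    return strategy["lyg"] - strategy["screens"] * DISUTILITY_PER_SCREEN


def icer(more_intense, less_intense, effect):
    """Incremental cost per incremental unit of the chosen effect measure."""
    extra_cost = more_intense["cost"] - less_intense["cost"]
    extra_effect = effect(more_intense) - effect(less_intense)
    return extra_cost / extra_effect


icer_lyg = icer(ANNUAL, BIENNIAL, life_years)
icer_qaly = icer(ANNUAL, BIENNIAL, qalys)

print(f"ICER of annual vs biennial per LYG:  {icer_lyg:,.0f}")
print(f"ICER of annual vs biennial per QALY: {icer_qaly:,.0f}")
```

Under these assumed values the incremental cost is unchanged, but the incremental health gain of annual screening shrinks once the additional screening rounds carry a disutility, so the ratio rises above the threshold.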

Fig. 2 Cost-effectiveness planes from Toumazis et al showing efficient frontiers and optimal policy options (circled) for the cost-effectiveness threshold of $100,000 per unit of health gain when health effects are estimated in terms of LYG (A) or QALYs (B)

Fig. 3 Cost-effectiveness planes from Snowsill et al showing efficient frontiers and optimal policy options (circled) for the cost-effectiveness threshold of £30,000 per unit of health gain when health effects are estimated in terms of LYG (A) or QALYs (B)

3.3 Risk Stratification

Alternative risk strata are considered by 20 (61%) studies, 14 of which conduct conventional stratification whereby each subgroup is considered in separate analyses. These 14 studies are listed in Table 1 as “stratified” analyses under the risk subgroups column [25, 29, 31,32,33, 38, 43, 44, 48,49,50,51,52, 54]. Examples include stratification by sex [52], smoking status [43] and sex, smoking status and ethnicity [25]. The six remaining analyses consider alternative risk-based eligibility criteria as competing strategies that are compared against each other within a single analysis under the assumption that a common strategy will be offered to all recipients [13, 27, 35, 36, 55, 56]. These are listed in Table 1 as assessing “uniform strategies”. For example, Snowsill et al consider alternative age ranges and screening intervals over three different cumulatively relaxed estimated risk thresholds. Tomonaga et al, ten Haaf et al and both studies led by Toumazis consider alternative screening age ranges and intervals over different smoking histories [27, 36, 55, 56]. Treskova et al consider alternative age ranges over different smoking histories [35].

4 Discussion

While previous reviews have examined the modelling evidence for LDCT screening, this analysis is the first to provide a detailed analysis of strategy comparisons and the role of risk stratification. We first discuss the findings regarding strategy comparison and ICERs. We then briefly consider the relevance of QoL adjustment. Finally, we examine the issues surrounding risk subgroup selection.

4.1 Strategy Choice and Incremental Comparisons

That a large proportion of studies only assessed one screening interval means that many published cost-effectiveness ratios are ACERs, even though most such studies describe the estimates as ICERs. As noted above, an ACER within the cost-effectiveness threshold is a necessary but not sufficient condition for cost effectiveness and cannot identify optimal policies. Decision makers seeking the optimally cost-effective strategy among potential alternatives therefore require an analysis that includes these potential alternatives and an incremental analysis among them.

Our review finds only nine analyses that have assessed both annual and biennial screening [13, 27, 29, 37, 39, 53, 54, 56, 57], meaning less than one-third of the reviewed literature offers an incremental basis for assessing annual screening. Of these, only three considered intervals longer than 2 years [13, 53, 56], meaning even fewer offer incremental evidence for biennial screening.

Our examination of studies with multiple screening intervals shows that the choice of strategies matters to policy recommendations. Half of those studies would have identified different policies as optimal had they omitted intervals of 2 years or more. Naturally, this can depend on study-specific factors such as the estimated cost and effects and the prevailing cost-effectiveness threshold. Given that it likely cannot be anticipated whether the inclusion of intervals of 2 years or more will influence the optimal policy prior to conducting a simulation, any analysis seeking the optimal screening policy should include such intervals.
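A minimal sketch of this dependence on the strategy set follows. It uses net monetary benefit (threshold multiplied by health effect, minus cost), which selects the same strategy as applying the threshold to ICERs along the efficient frontier. The costs, effects and threshold are hypothetical values per 1,000 people, not estimates from any reviewed study, and in this made-up case omission happens to favour more intensive screening, which will not hold for every study.

```python
THRESHOLD = 30_000  # hypothetical cost-effectiveness threshold per QALY

# Hypothetical (cost, QALYs gained) per 1,000 people for each strategy
FULL_SET = {
    "no screening": (0, 0),
    "triennial": (300_000, 20),
    "biennial": (560_000, 28),
    "annual": (1_000_000, 36),
}


def net_monetary_benefit(cost, effect, threshold=THRESHOLD):
    """Health gain valued at the threshold, net of cost."""
    return threshold * effect - cost


def optimal_strategy(strategies, threshold=THRESHOLD):
    """Name of the strategy with the highest net monetary benefit."""
    return max(
        strategies,
        key=lambda name: net_monetary_benefit(*strategies[name], threshold),
    )


def restrict(strategies, names):
    """Keep only the named strategies, as if the others were never simulated."""
    return {name: strategies[name] for name in names}


annual_only = optimal_strategy(restrict(FULL_SET, ["no screening", "annual"]))
with_biennial = optimal_strategy(
    restrict(FULL_SET, ["no screening", "biennial", "annual"])
)
with_triennial = optimal_strategy(FULL_SET)

print(annual_only, with_biennial, with_triennial)
```

Each widening of the comparator set changes the recommended policy in this example, which is why the optimal interval cannot be identified from an analysis that omits the longer intervals.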

We believe the lack of interval variation within existing studies is problematic. Low-dose computed tomography screening is not yet widely adopted, and decision makers may currently be reviewing evidence to inform policy. If policy makers are unaware of the limitations of comparisons of a single strategy versus no screening, then they may accept an ACER within the threshold as an endorsement of a strategy as cost effective without the awareness that other strategies could be superior. While the limitations of ACERs have long been recognised within CEA, these limitations may not be appreciated by policy makers. The differences in expenditure and screening capacity requirements between annual, biennial and triennial strategies are not trivial and it is important that decisions on screening intervals are well-informed. Furthermore, once a screening policy has been adopted it might become difficult to revise, especially if this means disinvesting to a less effective policy with a longer screening interval. Health economists have long recognised the need to include as many feasible strategies as possible in general, the relevance of varying intervals in screening analyses in particular, and the importance of incremental analysis [17]. Given that concerns about strategy choice and ICER estimation are not novel, it is worth considering why few lung screening CEAs have considered biennial intervals and longer.

The choice of intervals simulated appears to be informed by those assessed within RCTs. The NLST considered three annual screens while NELSON assessed intervals of 1, 2 and then 2.5 years [58]. Other trials have primarily investigated annual or biennial screening [59]. While close correspondence of simulations to RCTs is understandable, we must appreciate the implications of not considering longer intervals. If CEAs are restricted to the trialled intervals, then we will lack outcome estimates for the comparator strategies required to make incremental comparisons. For instance, simulating annual and biennial screening strategies permits estimating ICERs for annual screening, but not biennial screening, as that requires outcome estimates for intervals of 3 years or more.

There are important constraints to acknowledge regarding simulating longer intervals. Modelling intervals longer than those trialled implies a heavier reliance on simulation evidence, which may not be acceptable to all. Furthermore, this implies considering strategies expected to be less effective than those trialled and recommended within emerging guidelines. Indeed, a 2.5-year interval has already been shown by an RCT to be inferior to annual screening in terms of cancer detection [60]. Health economists might be comfortable contemplating less than optimally effective strategies for two reasons. The first is their broad perspective, which considers overall population health gain net of the costs of screening in terms of other healthcare foregone. The second is their recognition of the need to simulate both policies of interest and their comparators. Others may be reluctant to consider less effective strategies, even as comparators. Although there is a clear CEA rationale for considering longer intervals, there are challenges in terms of evidence and acceptability.

4.1.1 Relevance of QoL Adjustment

A previous review of comparator strategies in cervical screening CEAs showed that omitting longer interval strategies leads to biased ICER estimates [61]. The degree of bias depends in part on the degree of convexity of the efficient frontier. The more convex the frontier, the greater the likely bias caused by comparator omission, all else equal. The degree of convexity depends on several factors, all of which vary between cancers. These include the length of the preclinical duration, test performance and the disutility from screen-related harms. Quality-of-life adjustment also influences convexity and so is relevant to concerns regarding comparator inclusion. The examples of Toumazis et al and Snowsill et al in Figures 2 and 3 show how the frontier becomes more convex with QoL adjustment. This is as expected since we anticipate more screen-related harms for more intense screening strategies and this accords with CEAs of other cancers [62, 63].

The relevance of QoL adjustment to considerations of strategy choice is illustrated by the only two reviewed studies (Snowsill et al and Diaz et al [13, 53]) that include both comparator strategies with intervals longer than 2 years and QoL-adjusted health outcomes. Snowsill et al find that one screen per lifetime is optimal, while Diaz et al find triennial screening is optimal, both of which contrast with the many other studies that conclude in favour of annual or biennial screening. Clearly, it would be unwise to over-interpret the relevance of two studies given the contingencies of context-specific factors including the UK and Spanish cost-effectiveness thresholds. Moreover, the finding of Snowsill et al, that screening annually is less effective than biennially, is not found in the four other studies appraising annual and biennial screening using QoL adjustment [27, 37, 39, 53]. Despite this, it seems important to consider what Snowsill et al and Diaz et al indicate regarding the relevance of longer intervals.

4.2 Risk Stratification

While many analyses have considered alternative risk subgroups in separate analyses, some recent studies have considered alternative risk eligibility criteria as competing policies compared within single analyses. Note that the latter approach implies a common strategy for all those screened. For example, McLeod et al stratify by ethnicity and sex, considering the cost-effectiveness of screening in four separate subgroup analyses of Māori and non-Māori men and women, respectively [50]. Conversely, Tomonaga et al consider various policies over a range of alternative eligibility criteria corresponding with those in the NLST and NELSON trials within a single analysis [56].

While breast, cervical and colorectal screening programmes have been tailored for specific risk subgroups such as those with known elevated disease risk [64,65,66], the general approach within population screening is to offer a common programme for all eligible individuals. The LDCT analyses that consider alternative risk thresholds but assume a common strategy for all those screened are therefore arguably consistent with common screening practice. Despite this, there may be particular reasons to consider risk-stratified policies in the lung screening context. In principle, the relationship between smoking history as a risk factor and disease risk offers a quantifiable basis for stratification. Adopting a “one size fits all” approach may be inefficient as it could lead to low-intensity strategies being withheld from low-risk groups and high-intensity strategies withheld from high-risk groups [23, 67]. Moreover, unstratified approaches may miss opportunities to improve health equity if they fail to identify how services can be tailored to the needs of specific groups. It is therefore notable that some more recent CEAs that consider multiple alternative screening strategies and eligibility criteria have not considered the potential for stratified policies. In principle, existing analyses could be extended to examine such stratification.
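The potential inefficiency of a common strategy can be sketched with two hypothetical risk groups. All costs, effects, group definitions and the threshold below are assumed values per 1,000 people in each group, chosen only for illustration. The uniform policy picks one eligibility threshold and one strategy for everyone eligible, while the stratified policy picks the best strategy separately for each group; both are scored by total net monetary benefit.

```python
THRESHOLD = 30_000  # hypothetical cost-effectiveness threshold per QALY

# Hypothetical (cost, QALYs gained) per 1,000 people, by risk group and strategy
GROUPS = {
    "high risk": {"none": (0, 0), "biennial": (500_000, 30), "annual": (900_000, 45)},
    "low risk": {"none": (0, 0), "biennial": (500_000, 18), "annual": (900_000, 22)},
}

# Cumulatively relaxed eligibility: high risk only, then high and low risk
ELIGIBILITY_OPTIONS = [("high risk",), ("high risk", "low risk")]


def nmb(cost, effect, threshold=THRESHOLD):
    """Net monetary benefit: health gain valued at the threshold, net of cost."""
    return threshold * effect - cost


def best_uniform(groups, eligibility_options):
    """Best single strategy offered to all eligible groups; others get no screening."""
    best = None
    for eligible in eligibility_options:
        for strategy in ("none", "biennial", "annual"):
            total = sum(
                nmb(*groups[g][strategy if g in eligible else "none"]) for g in groups
            )
            if best is None or total > best[2]:
                best = (eligible, strategy, total)
    return best


def best_stratified(groups):
    """Best strategy chosen separately for each risk group."""
    choice = {g: max(opts, key=lambda s: nmb(*opts[s])) for g, opts in groups.items()}
    total = sum(nmb(*groups[g][s]) for g, s in choice.items())
    return choice, total


uniform = best_uniform(GROUPS, ELIGIBILITY_OPTIONS)
stratified = best_stratified(GROUPS)

print("uniform:", uniform)
print("stratified:", stratified)
```

With these assumed numbers the best uniform policy offers annual screening to the high-risk group only, so the low-risk group forgoes a biennial strategy that would have been worthwhile for it, and the stratified policy achieves a higher total net benefit.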

4.3 Limitations

This review is focused on issues of strategy comparison and stratification. Peters et al provide a comprehensive analysis of other issues regarding methodology that we did not address [15]. We did not consider the adequacy of parameterisation given RCT and epidemiological data or the models’ structural assumptions. Our analysis is limited to base-case results, meaning strategies considered in sensitivity analyses were not included.

5 Conclusions

The current CEA evidence on LDCT lung cancer screening is marked by little variation in the screening interval. This means many studies cannot offer incremental evidence of the cost effectiveness of potentially relevant strategies. Consideration of QoL-adjusted health outcomes reinforces the need to consider lower intensity strategies in simulation analyses. Some studies fail to fully examine the potential benefits of risk stratification. Addressing these modelling issues will require more data. Clarifying the appropriate modelling methods could greatly enhance the usefulness of evidence for policy makers and lead to considerable population health benefits.