Key Points

Service evaluations and associated evidence do not tend to receive the same peer-reviewed scrutiny, governing oversight, time or budgetary allowances relative to research as defined from a study ethics perspective. However, as a vehicle for producing evidence, the same rigorous methods that are associated with conducting research should be considered if permitted

Guidance and checklists exist for conducting economic evaluations alongside randomised controlled trials and as part of health technology assessments. However, as service evaluations usually serve a different purpose, useful and fundamental methods may be overlooked when assessing ‘value for money’

It is difficult to suggest a single method to produce the ‘value for money’ evidence needed. Evaluators need to consider not only the needs of the commissioner but also what is required to produce the ‘best’ possible evidence (i.e. less uncertain and biased, but potentially costly) to inform the resource allocation decision problem

Evaluators should be transparent about the limitations of analyses that they conduct and should reflect on the impact that choices in the methods used may have on the results, conclusions and recommendations

1 Introduction

A consensus exists that policy and clinical decisions that affect the public and individual patients should be evidence-based [1, 2]. However, there is less clarity as to what constitutes ‘appropriate’ economic or ‘value for money’ (VfM) evidence, by which resource-allocation decision-making can be informed. In many cases within healthcare, particularly health technology assessments (HTAs) of new and existing medicines and treatments [3], the primary source of evidence may derive from research in the form of randomised controlled trials (RCTs). In the UK, these HTA processes are undertaken on behalf of the National Institute for Health and Care Excellence (NICE) and referred to as technology appraisals (TAs) [3]. NICE has influenced international processes, guidelines [4], and subsequently the HTA evidence base with Sculpher and Palmer [5] suggesting NICE could be considered a method innovator. Even before NICE’s influence, internationally RCTs are often considered the ‘gold standard’ vehicle for producing evidence related to causation such as ‘treatment effects’ [6]. In this article, we argue that RCTs and research more broadly, here defined as an exercise designed to inform decisions with clearly defined questions, aims and objectives to provide generalisable results [7], represent a subset of the possible relevant evidence base for a given decision problem. Indeed, in many cases conducting RCTs to gather evidence may not be appropriate, practical, affordable, ethical or even possible [8, 9].

In this article, we focus on interventional studies that use non-randomised study designs, including those conducted as service evaluations. From an ethics perspective, research is a vehicle for obtaining evidence with an ethical and underpinning legislative framework [7, 10]. As an alternative to research, service evaluations do not require research ethics approval (often referred to as ‘NHS ethics’ in the UK) [7]. Obtaining research ethics approval can be time consuming and may delay the start of a study [11]. Therefore, avoiding research ethics processes can be perceived as desirable in some circumstances (e.g. in the case of tight time and budget restrictions). However, an associated restriction for service evaluations is that random allocation to treatment options is not permitted [7]. Additionally, the use of routine data is encouraged for service evaluations relative to primary data collection [7]. Service evaluations are not intended to provide generalisable estimates of intervention efficacy or effectiveness relative to an existing intervention, but they may be the source of the best evidence available. Although internationally alternatives to ‘research’ might be called or defined differently, evaluations that utilise non-randomised data are often used to inform decision-making such as by charities and local and national governing bodies [12, 13]. Non-randomised data and service evaluations present a different challenge to analysts from that commonly associated with research, RCTs and HTAs. These evaluations usually serve a different purpose and require specific study design and analytic considerations. However, there are useful and fundamental methods associated with conducting research and HTA processes that can be appropriately applied if governing, time and budget restrictions allow.

This article provides an educational review of considerations when conducting VfM analyses in non-randomised studies including service evaluations. In doing so, we often make direct comparisons to RCTs and research as the current ‘gold standard’ to highlight relative strengths and weaknesses. We discuss to what extent there is a difference between ‘service evaluation’ and ‘research’ in terms of governing ethics and methodologies that could be used. We describe alternative methods for VfM analyses used depending on the resource allocation interests of the decision maker. Aspects to consider in terms of study design, statistical methods to control for confounding and bias, and how to quantify and describe uncertainty and opportunity costs to decision makers in any resulting VfM estimates are also described. The possible implications of producing VfM evidence when these aforementioned aspects are not taken into account are a key point for discussion. We provide a range of references for further reading, including a glossary of cross-referenced key terms and methods provided in Appendix S1 of the Electronic Supplementary Material (ESM). Therefore, this article should be used as a reference guide when generating VfM evidence from non-randomised data including service evaluations, rather than a technical document describing in detail specific methodologies.

2 Evidence-Based Healthcare Decision-Making: Are Randomised Controlled Trials the Only Choice?

Randomised controlled trials are considered the ‘gold standard’ for producing an evidence base related to causation, but their limitations have been noted [13, 14]. Validity comparisons of the evidence generated from non-randomised studies and RCTs have dispelled many misperceptions of the former as a viable option for producing an evidence base for healthcare decision-making [12,13,14,15]. Although the controlled nature of RCTs is often what enables them to produce explanatory results and have high ‘internal’ validity [16, 17], moving to real-world decision-relevant settings may be considered to improve the relative ‘external’ validity of such results [13].

Decision makers such as charities and local and national governing bodies are unlikely to commission RCTs in part because of their associated time and financial costs. In the UK, this includes Clinical Commissioning Groups (CCGs) who are clinician-led bodies responsible for commissioning healthcare services within a local area. Since the UK’s Health and Social Care Act of 2012 [18], CCGs have become increasingly responsible for the health needs of their district of responsibility. Furthermore, in the UK, nationally funded initiatives such as projects within the National Health Service (NHS) Test Beds programme [19] have relied on service evaluations to produce an evidence base on intervention effectiveness and VfM in localised areas. Although these are UK examples, the financial sustainability of healthcare systems internationally is reliant on the careful management and commissioning of the plethora of services and interventions that are available for a variety of decision makers to fund. The role of regional and local decision makers is explicitly mentioned in the Helsinki Statement on Health in All Policies (HiAP) [20, 21]. This call upon governments worldwide states that “health authorities at all levels (national, regional, local) are key actors in promoting HiAP” (p. 17), which includes “building knowledge by providing evidence of success and lessons learnt” (p. 18) [21]. The limited ability of many decision makers internationally to generate high-quality evidence and analyses necessitates the use of appropriate and timely approaches to inform their decision-making based on an accurate and relevant evidence base. Internationally, few decision makers have the available finances to produce evidence such as the National Institute for Health Research (NIHR) as the largest national funder of clinical research in Europe [22]. As such, there has been much international interest in alternatives to RCTs to produce an evidence base including related to VfM [23].

3 Research and Service Evaluations: Why are They Used and is There a Difference?

In the UK, service evaluations are generally used when the intervention of interest has been, or is about to be, implemented within a specific care setting. Unlike research, service evaluations are generally not based on well-formed aims and objectives to answer specific hypotheses to produce generalisable results. Instead, they are used to assess ‘what standard does this service achieve’ in a more general sense, with aspects of interest often being related to effectiveness and/or VfM.

As an overview of our perspective when assessing VfM, we do not believe there is a clear dichotomy between research and service evaluation other than from an ethics perspective and associated legislative underpinnings. Even from an ethics perspective, this dichotomy is not always clear. What defines ‘research’ from an ethics perspective internationally is complex, with aspects for consideration described by Gevers [10] including the seminal Declaration of Helsinki [24] for informing international human research ethics. From a UK perspective, Table 1 provides an overview of what the NHS Health Research Authority (HRA) Research Ethics Service (RES) considers ‘research’ relative to ‘service evaluation’. The HRA also has set standard operating procedures for Research Ethics Committees (RECs) to try and make processes more standardised and transparent including when judging what is ‘research’ [25]. A key point from Table 1 is that randomisation is only permitted within research. Therefore, a key consideration for service evaluations is conducting analyses using non-randomised data.

Table 1 Defining service evaluations relative to research from UK Health Research Authority guidance.

Suggesting what interventions can be evaluated within a ‘research’ ethics framework or otherwise (e.g. as a service evaluation), in the UK or internationally, is outside the scope of this article. However an evaluation is defined ethically, when it comes to evaluating VfM, we suggest the same rigorous considerations are required. This includes when developing the study design (without purposeful or random allocation of the intervention for service evaluations), analytical methods and reporting of evidence. We further discuss this perceived dichotomy between ‘research’ and ‘service evaluation’ from a UK perspective with some international considerations in Appendix S2 of the ESM.

4 Economic Evaluations and Partial Evaluations: Methods and Distinctions

Economic evaluation is widely used for the appraisal of healthcare programmes, taking into account both the costs and the consequences (i.e. effects or outcomes) of two (or more) alternatives [26]. There are multiple forms of economic evaluation. However, all are VfM analyses, with ‘costs’ representing an integral aspect of the evaluation process owing to the resultant opportunity costs (Appendix S1 of the ESM) from resources not being available for other purposes [27,28,29]. Traditionally, at the local level, there has been very little use of economic evaluation evidence, although there is a suggestion that this has increased over time, particularly in the UK [30, 31]. There is a need to explain and rationalise the purpose of economic evaluation to decision makers, with a particular focus on commissioners of the evaluation and the decision makers they represent. A commissioner is an individual (or group) who has a legitimate authority to make decisions, such as a representative of a local governing body being put in charge of identifying relevant experts to conduct the VfM analyses (e.g. health economists). From a commissioning perspective, requesting a VfM analysis may be more pertinent than requesting an economic evaluation. The term ‘value for money’ tends to mean more to commissioners than ‘economic evaluation’, partly because it is a politically motivated and widely recognised term. The NHS Constitution for England states: “The NHS is committed to providing best value for taxpayers’ money” [32], which makes ‘value for money’ a political ‘buzz word’ when discussing care funding and provision.

In the UK, cost-effectiveness analysis (CEA) in the form of cost-utility analysis (CUA) based on quality-adjusted life years (QALYs) has become the most popular form of economic evaluation. This is partly because it is NICE’s ‘gold standard’ reference case, with NICE’s guidelines [3] informing HTA processes internationally [4]. There are several papers debating the extent to which economic evaluations of different forms of interventions fit a typical HTA [33, 34], such that current guidelines may not be fully applicable, including: public health [35,36,37], antimicrobials [38], diagnostics [39], medical devices [40], genetics [41], digital [42], environmental [43], and service and delivery interventions [33, 44]. Whatever the intervention of interest, Drummond et al. [26] suggest that an economic evaluation would “explicitly consider the relative consequences of the alternatives and compare them with the relative costs” (p. 5). Anything other than the aforementioned “economic evaluation” is a “partial evaluation” (see Table 2).

Table 2 Overview of types of analysis focussed on costs and consequences.

Common economic evaluation methods include cost–benefit analysis (CBA), which uses individuals’ values for their outcomes to convert them into a monetary unit, and CEA/CUA, which value outcomes in natural units (usually health outcomes that for CUA are preference-based) [26]. Each of the aforementioned tends to represent VfM as a single outcome, normally as a ratio of outcomes relative to costs, e.g. incremental cost-effectiveness ratios (ICERs) and benefit–cost ratios (the equations for these ratios are presented in Appendix S3 of the ESM). Alternative economic evaluation methods include cost-minimisation analysis (CMA) and cost-consequence analysis (CCA). Cost-minimisation analysis is considered flawed based on its underlying assumption that outcomes can be equivalent between alternatives [26, 45] (a review on using CMA to inform NICE is underway [46]). Cost-consequence analysis offers flexibility for representing VfM in a disaggregated or aggregated manner [47,48,49,50], but using a single outcome for cross-comparison decision-making has comparative advantages and disadvantages [51, 52]. Budget impact analysis (BIA) [48, 53] is a ‘partial evaluation’ method that only accounts for costs to address the expected changes in the expenditure of a healthcare system after the adoption of a new intervention [48]. Budget impact analysis is recommended to be included alongside economic evaluation methods such as CEA [48] and CCA [54, 55].
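As an illustration of how a single-outcome ratio is formed, the following minimal sketch computes an ICER and a benefit–cost ratio using their standard textbook definitions. The function names and all numerical values are hypothetical and are not taken from Appendix S3 of the ESM.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the incremental effect is zero")
    return delta_cost / delta_effect


def benefit_cost_ratio(monetised_benefits, costs):
    """Benefit-cost ratio: monetised benefits per unit of cost."""
    if costs == 0:
        raise ValueError("Benefit-cost ratio is undefined when costs are zero")
    return monetised_benefits / costs


# Hypothetical mean values per patient (assumed for illustration only).
example_icer = icer(cost_new=12000, cost_old=10000, effect_new=2.5, effect_old=2.0)
example_bcr = benefit_cost_ratio(monetised_benefits=15000, costs=10000)
```

A ratio on its own can be misleading (for example, a negative ICER can arise from either a dominant or a dominated intervention), so the incremental costs and effects should normally be reported alongside it.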

Other cost-related methods used within service evaluations, commonly associated with evaluating public health programmes [56], include return on investment (ROI) analysis, often based on cost savings as the ‘return’ (thus a partial evaluation) [57, 58], and social ROI analysis based on natural outcomes given monetary weights [59,60,61]. Whether a social ROI analysis can be considered a full economic evaluation rather than a partial evaluation depends on what costs are taken into account and whether there is a comparison with an alternative (these are also factors for consideration with ROI analyses, with Sect. 5 including a relevant discussion related to costs). Often there is confusion between evaluation approaches because of how outcomes are presented to decision makers. However, certain approaches may contain the same information, but with a different presented outcome, e.g. as an ICER rather than a benefit–cost ratio (Appendix S3 of the ESM further discusses this aspect).
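To make the point about presentation concrete, the sketch below computes an ROI using one common convention (net return divided by investment) and shows that it carries the same information as a ratio of returns to costs. The convention, the function name and the figures are assumptions for illustration; ROI definitions vary between studies.

```python
def return_on_investment(monetary_return, investment):
    """ROI as net return per unit invested (one common convention, assumed here)."""
    if investment == 0:
        raise ValueError("ROI is undefined when the investment is zero")
    return (monetary_return - investment) / investment


# Hypothetical figures: cost savings treated as the 'return'.
cost_savings = 150000
intervention_cost = 100000

roi = return_on_investment(cost_savings, intervention_cost)
ratio_of_return_to_cost = cost_savings / intervention_cost

# Same information, different presentation: ROI equals the ratio minus one.
same_information = abs(roi - (ratio_of_return_to_cost - 1)) < 1e-12
```

Neither figure says anything about an alternative unless the savings and costs are themselves incremental to a comparator.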

5 Costing Perspective: Intervention Costs, Future Costs and Other Considerations

The costs to include in a VfM analysis are dependent on what question needs to be answered for the decision context being informed [26, 28, 62]. Fundamentally, when a care service introduces an intervention or delivery model, the evaluation should include the direct costs of this aspect referred to as ‘intervention costs’ (Appendix S1 of the ESM). For example, if a new member of staff was introduced within a care system, the cost of this staff member over the time horizon of interest should be included. When comparing between two alternatives, incremental costs associated with the intervention relative to the alternative(s) assessed (e.g. usual or previous care model) are of interest. These direct intervention costs should always be included in the VfM analysis. However, what other costs should be included is the focus of considerable debate and research. de Vries et al. [63] have attempted to classify potential other costs into three categories, each described as ‘future costs’ (Appendix S1 of the ESM): (1) future related medical costs; (2) future unrelated medical costs; and (3) future non-medical costs. de Vries et al. [63] state that the literature suggests that inclusion of ‘related’ and ‘unrelated’ medical costs is required to obtain optimal outcomes from available resources irrespective of the costing perspective adopted. The inclusion of medical costs is referred to as the care-payer perspective (Appendix S1 of the ESM). However, ‘unrelated’ costs are typically difficult to define and thus often excluded/ignored [64, 65]. A case for also collecting non-medical costs as part of a societal perspective (Appendix S1 of the ESM) has been made by Jönsson [66]. A framework for including the societal perspective and associated complications is described by Walker et al. [67].

Obtaining future cost data can be costly, time consuming and resource intensive (depending on whether data are readily available) [68]. As such, for service evaluations, the future costs included alongside the intervention costs may be limited to where benefits could be observed rather than accounting for opportunity costs across the wider care system. This may be more pertinent with ROI analyses, which often focus just on intervention costs relative to beneficial returns without accounting for wider cost implications. For example, introducing set ‘inpatient bed days’ could reduce short-term hospital costs, which in a ROI analysis could seem beneficial, but this ignores wider morbidity, mortality, readmission and care costs associated with discharging patients too early. It seems reasonable to suggest that a full economic evaluation should attempt to account for future costs over an appropriate time horizon to capture resource use implications of the intervention of interest. The exact future costs to include have yet to be firmly established and thus may be informed by the decision maker’s perspective with potential implications for the evaluation (see Sect. 9). In some cases, there is also a suggestion to include ‘implementation’ costs, such that timely implementation of recommended interventions can provide health benefits to patients and cost savings to health service providers [69]. There are debates and complications with the inclusion of such costs [69,70,71], with a discussion on the economic evaluation of implementation strategies in healthcare by Hoomans and Severens [72].

6 Routine Data for Estimating Resource Use, Costs and Non-monetary Consequences

Primary data collection can be time consuming and costly, and may also have implications that require ethical consideration (e.g. if talking to vulnerable groups). As such, using routinely collected data is recommended for service evaluations [7]. The data available for analysis may restrict the VfM method (Sect. 4) and costing perspective (Sect. 5), but could also inform the potential study design (Sect. 7).

Cost data for VfM analyses are estimated based on care resource-use data to which unit costs are attached (Appendix S1 of the ESM) [28]. There are self-reported and routinely collected resource-use data methods as described by Franklin and Thorn [68], noting their (and our) examples are based on routine data sources in England. If the consequences of interest are also related to resource use (e.g. cost per inpatient bed days avoided), then the source of routine data for costs and consequences may be the same. The range of resource-use information required will depend on the costing perspective (Sect. 5). However, different care services tend to collect their resource-use information on different electronic systems, which often do not tend to be linked at the patient or service level (e.g. primary care and hospital care) [68]. Although large linked databases may exist, their use has complications [68, 73]. In England, commissioners can supplement their local data flows with data from the Secondary Uses Service (SUS) [74]. Such data could be used for service evaluations, albeit with a variety of time, monetary, technical and information governance restrictions [68]. As such, routine data are recommended but often difficult to utilise [28, 68, 75].
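The basic costing step described above (attaching unit costs to resource-use counts) can be sketched as follows. The resource categories, unit costs and patient records are entirely hypothetical; in practice, unit costs would come from a published national or local source appropriate to the costing perspective.

```python
# Hypothetical unit costs (currency units per contact or day); not real tariffs.
unit_costs = {
    "gp_visit": 40.0,
    "ae_attendance": 150.0,
    "inpatient_bed_day": 400.0,
}

# Hypothetical routinely collected resource-use counts per patient.
patients = [
    {"id": "p1", "arm": "intervention", "use": {"gp_visit": 3, "inpatient_bed_day": 2}},
    {"id": "p2", "arm": "control", "use": {"gp_visit": 5, "ae_attendance": 1, "inpatient_bed_day": 4}},
]


def patient_cost(resource_use, unit_costs):
    """Total cost for one patient: sum of (count x unit cost) over resource items.

    Raises KeyError if a resource item has no unit cost, so that gaps in the
    costing are noticed rather than silently costed at zero.
    """
    return sum(count * unit_costs[item] for item, count in resource_use.items())


costs_by_patient = {p["id"]: patient_cost(p["use"], unit_costs) for p in patients}
```

Keeping resource use and unit costs separate, as here, makes it straightforward to rerun the costing under a different perspective or price year.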

For CEA, consequences are measured in natural units that could be based on routinely collected data (e.g. ‘cost per death avoided’ based on mortality data). For CUA, preference-based values for health-related outcomes are required to calculate the QALY (Appendix S1 of the ESM) [3]. Obtaining such preference-based values can be problematic if they are not routinely collected (e.g. via the EQ-5D, the NICE-preferred preference-based measure [3]). As an example of using indirectly collected preference-based data, Franklin et al. [76] suggest a method for attaching preference-based values to routinely collected, health-related events of interest (i.e. asthma exacerbations) to conduct a CUA. Preference-based values, often referred to as ‘utility’ values, could be sourced via the ScHARR Health Utilities Database (ScHARRHUD) [77]. If clinical or condition-specific measures relevant to the intervention of interest are routinely collected and could be used for CEA, then a ‘mapping’ or ‘cross-walk’ algorithm may exist to allow the statistical prediction of utility values from that measure to conduct CUA. The purpose and procedures of statistical mapping are described by Longworth and Rowen [78], with a systematic review of mapping studies by Mukuria et al. [79], and an online database of mapping studies also currently available (HERC database of mapping studies [80]).
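As a worked illustration of how utility values become QALYs, the sketch below uses the area-under-the-curve approach with linear interpolation between assessment points. This is a commonly used approach that we assume here for illustration; the function name, time points and utility values are hypothetical.

```python
def qalys_area_under_curve(times_in_years, utilities):
    """QALYs as the area under the utility curve (trapezium rule).

    Assumes utility changes linearly between consecutive assessment points.
    """
    if len(times_in_years) != len(utilities) or len(times_in_years) < 2:
        raise ValueError("Need at least two matched time points and utilities")
    total = 0.0
    for i in range(1, len(times_in_years)):
        interval = times_in_years[i] - times_in_years[i - 1]
        if interval <= 0:
            raise ValueError("Time points must be strictly increasing")
        total += interval * (utilities[i] + utilities[i - 1]) / 2.0
    return total


# Hypothetical patient: utility measured at baseline, 6 months and 12 months.
example_qalys = qalys_area_under_curve([0.0, 0.5, 1.0], [0.6, 0.7, 0.8])
```

Because baseline utility enters the calculation directly, between-group differences in baseline utility should be adjusted for when comparing QALYs across groups.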

7 Statistical Considerations Based on Study Design, Underlying Data Attributes, and for Reporting Uncertainty

The desired form of VfM analysis should be accounted for at the study design stage, given that not all study designs are good vehicles for VfM analyses [26]. When time and finance are restricted, the study design might be dependent on data availability (Sect. 6). Value for money analyses are ‘analytical’ in nature because there is an intention to infer VfM as a causal factor related to an intervention compared to an alternative (e.g. usual or previous care model). As such, there will be the need to choose an analytical study design and associated appropriate statistical method(s). Hinde et al. [81] have explored the possible scenarios that could occur when seeking to conduct a quantitative evaluation of an intervention at the local level, specifically with regard to availability of evidence, the subsequent statistical method chosen and the resulting impact on ‘effectiveness’ evidence.

Analytic study designs as required for VfM analyses can be broadly classified as observational (e.g. case series, cohort, cross-sectional and case–control study designs [82, 83]), or experimental (e.g. before-and-after studies [84, 85], comparative/controlled trials and RCTs [82]). These are different to ‘descriptive’ studies, which could include describing costs, qualitative studies or cross-sectional surveys [86]. In analytic studies, participants are identified and observed, and characteristics including outcomes and costs are recorded. Additionally, for experimental studies, the setting should be equivalent across all participants, an intervention is used and is part of the assessment and there is an observation/evaluation of the effects of the intervention with causality being of particular interest (relative to association as a common interest in observational studies). When causality is of particular interest, there is a need to reduce chance, eliminate bias and account for confounding (Appendix S1 of the ESM). Although these aspects can be accounted for using statistical methods, good study designs reduce reliance on statistical methods with experimental studies generally regarded as being less susceptible to bias than observational studies.

Experimental designs such as comparative trials are generally preferred when inferring causality, with a preference for randomised trials [82]. Randomisation to treatment groups is preferred as the process reduces chance and bias in resulting study estimates, but RCTs themselves have limitations [13, 14]. In any case, randomisation to treatment groups is not ethically permitted outside of research (see Table 1 for a UK perspective). Therefore, non-randomised and historical control designs may be options for service evaluations. Historical controls alone have been shown to overestimate new treatment benefits [85, 87]. Authors such as Goodacre [84] have made a case why before-and-after studies without a comparison group and/or appropriate statistical methodology (e.g. interrupted time-series analysis, described later in this section) should be discouraged for evaluations. For non-randomised comparative trials (supplemented with or without historical data for both groups) particularly as part of a service evaluation, there are often difficulties when trying to recruit and perform primary data collection for a control group (or obtain relevant and necessary historical data retrospectively). Primary data collection and recruitment is expensive, time consuming, may have ethical considerations, and is thus often deemed undesirable by the service evaluation funder. Service evaluations could be conducted as ‘natural experiments’. Natural experiments are defined as: “naturally occurring circumstances in which subsets of a population have different level of exposure to a supposed causal factor, in a situation resembling an actual experiment where human subjects would be randomly allocated to groups” [88, 89]. Deidda et al. [88] have developed an economic evaluation framework when using natural experiments with a specific focus on public health interventions. 
As the framework was developed mainly to evaluate public health interventions, not all aspects of the framework may be relevant nor necessary for all service evaluations. For example, under ‘costs’, the framework suggests a societal perspective would be of interest; however, such a perspective may not be required/possible for all interventions to be evaluated (Sect. 5).

When economic evaluations are conducted as part of non-randomised study designs, the need to account for the non-randomised nature of the data is not always recognised [23]. There are suggested statistical methods/guidance to mitigate confounding and bias using observational [90] and ‘real-world’ data [91]. As an example, guidance by Faria et al. [90] provides an overview of a method described as ‘Matching’, which aims to replicate randomisation by identifying/matching control individuals who are similar to those receiving the intervention in one or more characteristic. Matching could be conducted within routinely collected datasets, assuming enough patient characteristics exist for matching, and subsequently used for the VfM analysis as part of a service evaluation (Sect. 6). There has been much interest in how such methods can be applied to improve VfM analyses (particularly CEA) using non-randomised data. As examples, using propensity score matching methods for CEAs has been explored by Manca and Austin [92]. Using regression-adjusted matching and double-robust methods for estimating average treatment effects in health economic evaluations has been explored by Kreif et al. [93]. The use of propensity score matching against other methods used in observational data such as difference-in-difference and regression models for (health) economic analysis has been explored by Crown [94]. Guidance on choosing an appropriate weighting mechanism for propensity score matching is described by Desai and Franklin [95]. These ‘matching’ methods are useful when there is interest in better defining a group for comparison to reduce bias. In comparison, interrupted time-series analysis, a statistical method using longitudinal data, has been preferred for single-arm before-and-after studies without a comparator [84] and has been used to inform modelling-based (Appendix S1 of the ESM) economic evaluations [96]. 
A short tutorial for using interrupted time-series to evaluate public health interventions is described by Bernal et al. [97], which outlines the data needed for interrupted time-series analyses. How to combine statistical methods that account for the non-randomised aspects of the data among other considerations pertinent to VfM analyses (e.g. comparison between alternatives, and accounting for costs and outcomes) is still an area for further research and guidance.
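A minimal sketch of an interrupted time-series analysis is given below, using a commonly used segmented-regression form with a level change and a trend change at the intervention point. The data are simulated without noise so that the known effects are recovered exactly; the function name and all values are hypothetical. The sketch deliberately ignores autocorrelation and seasonality, which a real analysis would need to address.

```python
import numpy as np


def fit_segmented_regression(outcome, intervention_index):
    """Fit y = b0 + b1*time + b2*post + b3*time_since_intervention by least squares."""
    y = np.asarray(outcome, dtype=float)
    time = np.arange(len(y), dtype=float)
    post = (time >= intervention_index).astype(float)      # 1 after the intervention starts
    time_since = (time - intervention_index) * post        # 0 before, then 0, 1, 2, ...
    design = np.column_stack([np.ones_like(time), time, post, time_since])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    names = ["baseline_level", "pre_trend", "level_change", "trend_change"]
    return dict(zip(names, coefficients))


# Simulated monthly cost series: 12 months before and 12 months after (hypothetical).
months = np.arange(24)
after = months >= 12
monthly_cost = 100 + 2 * months - 15 * after - 1 * (months - 12) * after

its_fit = fit_segmented_regression(monthly_cost, intervention_index=12)
```

Here `level_change` estimates the immediate step in the outcome at the intervention point and `trend_change` the change in slope afterwards, both relative to the projected pre-intervention trend.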

There are specific statistical considerations pertinent to VfM analyses that need to be accounted for alongside the non-randomised nature of the data. Two educational reviews already describe the use of utility data for CUA [98] and costs for CEA [28], both of which describe statistical considerations such as: assessing cost and consequence (utility) data and its distribution; baseline covariate adjustments; and dealing with missing data. We do not wish to repeat these educational reviews. Instead, we shall summarise a few key points and suggested statistical methods, focussing particularly on costs as a common factor in all VfM analyses. Controlling for baseline covariates (i.e. aspects that influence costs, e.g. age, frailty, health status) is a simple method for making adjustments to improve precision and correct for between-group imbalances [99], particularly for non-randomised groups. Regression-based methods are typically used to account for baseline covariates when making estimations. In the case of costs, the case has been made to use [28]: (1) parametric methods including ordinary least squares (OLS), generalised linear models (GLMs), extended estimating equations (EEE), multi-level models and generalised estimating equations (GEE) models; and (2) non-parametric methods including bootstrapping and the two-stage bootstrap. All forms of VfM should include unadjusted (i.e. observed) and adjusted analyses [28]. Although such methods have long been used as part of statistical analyses related to clinical outcomes [99], their use for VfM analyses has not always been fully recognised [28, 98].
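The following sketch illustrates why both unadjusted and adjusted analyses are informative, using ordinary least squares with a single baseline covariate. The data are simulated without noise (an assumption made purely so the result is transparent): the intervention group is older, age raises costs, and the true incremental cost is 200. Real cost data are typically skewed, so one of the other regression approaches listed above may be more appropriate in practice.

```python
import numpy as np

# Hypothetical non-randomised groups with an age imbalance at baseline.
age = np.array([60, 65, 70, 75, 70, 75, 80, 85], dtype=float)
treated = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Simulated costs: 30 per year of age plus a true incremental cost of 200.
cost = 500 + 200 * treated + 30 * age

# Unadjusted (observed) difference in mean costs between groups.
unadjusted_difference = cost[treated == 1].mean() - cost[treated == 0].mean()

# Adjusted difference: coefficient on the treatment indicator in an OLS model
# that also includes the baseline covariate.
design = np.column_stack([np.ones_like(age), treated, age])
coefficients, *_ = np.linalg.lstsq(design, cost, rcond=None)
adjusted_difference = coefficients[1]
```

In this constructed example the unadjusted difference overstates the incremental cost because it also captures the cost of the intervention group being ten years older on average.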

There are also statistical methods that allow a better reflection of the uncertainty around estimates, which should be applied to costs and consequences. Common methods applied to economic evaluations include bootstrapping for within-trial evaluations and probabilistic sensitivity analysis (PSA) using Monte Carlo simulation for modelling-based analyses (Appendix S1 of the ESM). These methods allow the random resampling of the observed data over a specified number of iterations either non-parametrically or parametrically. The resulting estimates can be presented to decision makers in cost-effectiveness planes and cost-effectiveness acceptability curves (CEACs) to indicate the probability of achieving a specified outcome over a range of monetary valuations of consequence outcomes [100], e.g. ‘cost-effectiveness thresholds’ [101, 102]. An alternative, which quantifies the effect of changing parameter value(s) when we are particularly unsure about point estimate value(s), is one- or multi-way sensitivity analysis, whereby point-estimate input values are changed (e.g. average intervention cost) and the resulting change in the outcome is reported (e.g. change in ICER value). An overview of the application of these methods to costs is described by Franklin et al. [28]. Such methods should be applied irrespective of the type of VfM analysis conducted, as they represent statistical methods to quantify and account for the uncertainty around the parameters associated with the VfM analyses that should be presented to decision makers.
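A minimal sketch of a non-parametric bootstrap used to derive points on a CEAC is shown below. Patients are resampled with replacement within each group, keeping each patient’s cost and effect together, and the proportion of replicates with a positive incremental net monetary benefit is reported at each threshold. The function name, the simulated patient-level data and the threshold values are all hypothetical; with non-randomised data, any covariate adjustment or matching would need to be repeated inside each bootstrap replicate.

```python
import numpy as np


def bootstrap_ceac(cost_int, effect_int, cost_ctl, effect_ctl,
                   thresholds, n_boot=1000, seed=0):
    """Probability that the intervention is cost-effective at each threshold."""
    rng = np.random.default_rng(seed)
    cost_int, effect_int = np.asarray(cost_int), np.asarray(effect_int)
    cost_ctl, effect_ctl = np.asarray(cost_ctl), np.asarray(effect_ctl)
    n_int, n_ctl = len(cost_int), len(cost_ctl)
    delta_cost = np.empty(n_boot)
    delta_effect = np.empty(n_boot)
    for b in range(n_boot):
        # Resample patient indices so cost and effect stay paired.
        idx_int = rng.integers(0, n_int, size=n_int)
        idx_ctl = rng.integers(0, n_ctl, size=n_ctl)
        delta_cost[b] = cost_int[idx_int].mean() - cost_ctl[idx_ctl].mean()
        delta_effect[b] = effect_int[idx_int].mean() - effect_ctl[idx_ctl].mean()
    # Incremental net monetary benefit = threshold * delta_effect - delta_cost.
    return np.array([np.mean(lam * delta_effect - delta_cost > 0) for lam in thresholds])


# Simulated patient-level data (hypothetical): skewed costs, QALYs as effects.
data_rng = np.random.default_rng(42)
cost_ctl = data_rng.gamma(shape=2.0, scale=1000.0, size=80)
cost_int = data_rng.gamma(shape=2.0, scale=1200.0, size=80)
qaly_ctl = data_rng.normal(0.70, 0.10, size=80)
qaly_int = data_rng.normal(0.75, 0.10, size=80)

thresholds = [0, 20000, 30000]  # illustrative monetary values per QALY
ceac = bootstrap_ceac(cost_int, qaly_int, cost_ctl, qaly_ctl, thresholds)
```

The paired replicates of incremental cost and incremental effect generated inside the function are the same points that would be plotted on a cost-effectiveness plane.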

Another aspect for consideration is using a relevant and appropriate time horizon. For service evaluations, particularly if informing policy decisions that require timely evidence, the ability to collect primary data over a relevant time horizon (whereby ‘relevant’ is dependent on the decision context) is potentially limited. There are various methods for extrapolating results beyond an observed time horizon, such as survival analysis [103] and other methods to account for censoring [104,105,106], dependent on the parameter of interest (e.g. longer-term mortality or costs). Economic modelling is often rationalised based on the inability to collect sufficient parameter information over a relevant time horizon within a single study to inform the decision problem and thus could be an alternative option [107]. However, modelling analyses and subsequent estimates will be driven by the data used to inform the model, e.g. a key model driver will be the input parameter estimate of intervention treatment effectiveness. If the intention is to use a service evaluation to produce the estimates of treatment effectiveness that will drive the model, then the statistical methods described in this section will still be needed when estimating treatment effects from non-randomised data. Examples of CUA modelling studies born of a service evaluation include Franklin and Hunter [108] (fall-screening and fall-prevention intervention) and Hunter et al. [109] (major system change in acute stroke services). It should be noted, however, that some decision makers may be interested in short-term costs and consequences (monetary or otherwise), for example over 1 year because of yearly budget allocations (often associated with the ‘financial year’), rather than in long-term planning, dependent on the decision problem (see Sect. 9).

8 Quantifying the Value of Information

As stated by Sculpher et al. [9]: “It is argued that any framework for economic analysis can only be judged insofar as it can inform two key decisions and be consistent with the objectives of a health care system subject to its resource constraints. The two decisions are, firstly, whether to adopt a health technology given existing evidence and, secondly, an assessment of whether more evidence is required to support this decision in the future”. The methods described in this article so far relate to the first of these decisions, whereas value of information (VOI) (Appendix S1 of the ESM) relates to the second. Value of information represents the monetary value of collecting more information that could inform an investment decision. There are three types of VOI worth considering: expected value of perfect information (EVPI), expected value of partial perfect information (EVPPI) and expected value of sample information (EVSI). An overview of these methods is provided by Jackson et al. [110], with simplified descriptions provided in Appendix S1 of the ESM.

There are two key issues with VOI, both in general and specific to service evaluations. First, traditionally, VOI analyses are computationally complicated and time consuming. However, there are suggested methods [111,112,113] and free-licence software (e.g. the Sheffield Accelerated Value of Information tool [112]) that can speed up and simplify the process, with a practical VOI guide by Wilson [114] and a description of emerging good practice VOI analytical methods by Rothery et al. [115]. Furthermore, although the output from a PSA or other Bayesian framework is needed to calculate VOI [111], parameter values from within-study VfM analyses can be placed within a simple model (e.g. decision tree) to run a PSA and subsequent VOI analysis (Appendix S1 of the ESM) [116]. The second issue is related to understanding the outputs from VOI, explaining the implications to decision makers and why they should pay attention to VOI. For trials, VOI can be particularly useful for pilot and feasibility studies, as it places a monetary value on the worth of conducting the next stage trial design (e.g. RCT). When informing local or national decision makers, the purpose is to highlight the potential monetary consequence of beginning or continuing to invest in an intervention based on the current information available. As stated in Sect. 3: “service evaluations are generally used when the intervention of interest has been, or is about to be, implemented within a specific care setting”. As the decision maker may have already made the investment in an intervention, point estimates from any VfM analysis should confirm (or not) the decision already made to invest. However, such point estimates do not indicate whether the service evaluation has provided enough information to inform the decision to invest in the future for as long as the investment decision is relevant (e.g. over the next 1–5 years). In addition, as many decision makers are responsible for a plethora of care interventions, another consideration is which interventions should be the focus of further evaluation in the future to check on their investment. Value of information can help prioritise and monetise the investment in the service evaluation as well as the investment in the intervention. Decision makers may not be able to fully comprehend the impact of investing in an intervention or service evaluation based on the information provided to them, particularly related to the uncertainty around estimates. Value of information can quantify this aspect as a monetary value to be considered alongside other evidence provided.
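To illustrate how parameter estimates can be placed within a simple model to run a PSA and a subsequent VOI analysis, the Python sketch below uses a two-option decision tree and calculates the per-person EVPI as the expected net benefit with perfect information minus the expected net benefit of the best option under current information. Every distribution, parameter value and the £20,000 valuation of a QALY used here is a hypothetical placeholder, and the function names are illustrative; dedicated tools such as those cited above would normally be used in practice.

```python
import random
import statistics


def psa_net_benefit(n_sim=5000, threshold=20000.0, seed=1):
    """PSA of a two-option decision tree.

    Returns one (usual care, intervention) pair of per-person net monetary
    benefits for each simulation. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_sim):
        p_event_usual = rng.betavariate(30, 70)    # event probability, usual care
        rel_risk = rng.lognormvariate(-0.2, 0.15)  # relative risk with intervention
        p_event_new = min(1.0, p_event_usual * rel_risk)
        cost_event = rng.gammavariate(25, 200)     # cost per event (mean 5000)
        qaly_loss = rng.betavariate(5, 45)         # QALY loss per event (mean 0.1)
        cost_new = 150.0                           # intervention cost per person
        loss_per_event = cost_event + threshold * qaly_loss
        nb_usual = -(p_event_usual * loss_per_event)
        nb_new = -(cost_new + p_event_new * loss_per_event)
        samples.append((nb_usual, nb_new))
    return samples


def evpi(samples):
    """Per-person EVPI = E[max over options of NB] - max over options of E[NB]."""
    n_options = len(samples[0])
    expected = [statistics.fmean(s[d] for s in samples) for d in range(n_options)]
    expected_of_max = statistics.fmean(max(s) for s in samples)
    return expected_of_max - max(expected)


samples = psa_net_benefit()
print(f"Per-person EVPI: {evpi(samples):.2f}")
```

A population-level value could then be approximated by multiplying the per-person EVPI by the number of people affected over the period for which the investment decision is relevant, which is itself an assumption that needs to be justified.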

9 Informing Decisions in Healthcare: A Discussion Related to Value for Money

Within this article, we have described a variety of matters to consider when conducting VfM analyses alongside non-randomised study designs including service evaluations. Reflecting on the UK, NICE has issued guidance on how to conduct economic evaluations for HTAs that was developed with allocative efficiency in mind across the whole NHS [3]. NICE’s processes and guidelines have influenced reimbursement agencies internationally specific to their HTA processes [4, 5]. However, such guidelines have been described as not always practical or relevant in every decision-making context [33, 35,36,37,38,39,40,41,42,43,44]. Such guidelines may also align more with research practices than service evaluations, whereby the former could include conducting expensive RCTs whereas the latter may be a more ‘budget and time’ conscious approach. There may be a need to move away from guidelines such as NICE’s HTA processes as the ‘gold standard’ for evaluating care interventions, but careful consideration and rationale need to be given when moving away from ‘gold standards’. This includes moving away from the RCTs that dominate the HTA evidence base and the CUA preferred by many reimbursement agencies internationally [4].

There are key differences between producing evidence for a reimbursement agency like NICE and producing it for decision makers that are part of government. For example, in the UK, NICE currently acts as an independent reimbursement agency for the NHS (although it was once a special health authority for the NHS), is not part of any government body, and NICE’s evidence review groups (as an external academic organisation independent of NICE) review the evidence that informs the HTA process [117, 118]. NICE also has principles [119] that align with the NHS Constitution [32], which is to provide “the best value for taxpayers’ money and the most effective, fair and sustainable use of finite resources” (NICE principle 7, point 22) [119]. When rationalising its stance on cost per QALY as part of its allocative efficiency objective, NICE states: “[Cost per QALY] takes into account the ‘opportunity cost’ of recommending one intervention instead of another, highlighting that there would have been other potential uses of the resource. It includes the needs of other people using services now or in the future who are not known and not represented” (NICE principle 7, point 23) [119]. NICE has incorporated a level of independence between the evidence reviewers and final decision makers, while also producing guidance that aligns with its allocative efficiency objectives. In contrast, when producing evidence for local and national governing bodies, the decision maker may also be the commissioner of the evaluation, may have a narrower perspective when assessing ‘opportunity costs’ and a shorter time horizon of interest for the evaluation. Each of these features may be undesirable. The role of local relative to national government when providing healthcare has long been a point of debate internationally, with the World Health Organization reflecting on localised decision-making in their 1997 report “The role of local government in health: comparative experiences and major issues” [120]. More recently and focussed on the NHS, a question has been raised of “should local government run the NHS?” [121], which aligns with the powers given to local agencies within the Health and Social Care Act of 2012 [18]. The arguments for local government made by Furber [121] mainly focus on local government’s ability to deal with localised public health concerns and inequality issues, relative to national concerns including opportunity costs across the whole NHS budget. From a sceptic’s perspective, local government agencies may differ in the extent to which they desire good-quality, unbiased estimates over a relevant time horizon relative to evidence that simply confirms an investment was ‘correct’ [122]. Additionally, localised decision makers may wish to focus only on the opportunity costs in their jurisdiction of interest. For example, in England, an NHS foundation trust may only want to focus on care provided in hospitals as their care jurisdiction of interest. As such, their costing perspective may ignore other care services within which opportunity costs might be observed (e.g. primary care). Such decision makers may therefore opt to ignore other relevant opportunity costs that are recommended to be included by independent reimbursement agencies such as NICE. There is a case to be made that focussing just on the decision makers’ perspective may not always be appropriate if, on the whole, it may lead to inefficient resource allocation across the whole care budget.

Based on NICE and other international reimbursement agencies’ guidance, CUA is preferred for HTA [3, 4]. A key rationale for using CUA is that it allows for comparable outputs in terms of economic evidence (i.e. cost per QALY estimates) for cross-care decision-making. Although the QALY framework is not perfect, with a key debate questioning the concept of ‘a QALY is a QALY’ that enables cross-comparability [123, 124], using a single outcome metric such as the QALY still has its advantages, albeit with the need for some suggested improvements [26, 125]. Nothing restricts the use of CUA within service evaluations. Given the advantages of using a single metric and the stance of many reimbursement agencies internationally, perhaps cost per QALY analysis should be given priority across all care resource-use decision-making (noting we make this case with a UK perspective specifically in mind). However, it should be noted that this perspective might differ depending on the care funding system in place, including social and private health insurance systems, rather than the tax-based system mainly used for the NHS. There have also been attempts and/or suggested frameworks to make CEA and QALYs applicable to more decision-making contexts. For example, equity concerns, which are a key consideration for many decision makers, can be accounted for within distributional CEA, for which there is a published tutorial [126], with case studies related to the UK Bowel Cancer Screening programme [127] and a rotavirus vaccination programme in Ethiopia [128]. Furthermore, methods associated with estimating a new evidence-based cost-effectiveness threshold for NICE [27] (relative to NICE’s current non-evidence-based £20,000–£30,000 per QALY threshold [3]) have prompted a range of other studies, debate [129,130,131] and associated methods for making CEA applicable to more healthcare decision-making contexts. For those unaware of cost-effectiveness thresholds, McCabe et al. [102] (specific to NICE) and Culyer [101] (more general concept) outline what they are and their uses. These methods for estimating cost-effectiveness thresholds are based on the concept of health opportunity costs, i.e. the health benefits that could have been achieved had the resources been used elsewhere in the healthcare system [132]. These methods incorporate wider opportunity cost concerns within CEA, QALYs and even disability-adjusted life years for use in low- and middle-income countries [133]. New approaches born from these methods include estimating social variation in the health effects of changes in healthcare expenditure [134]. Relatedly, Lomas [135] suggests a framework for incorporating affordability concerns alongside cost-effectiveness estimates, highlighting with an example that using a BIA alongside a CEA does not deal with such concerns. The “cost-effective but unaffordable paradox” [132] has been discussed in priority setting for global health programmes [136] but also in the context of high-income countries (e.g. UK and USA) [137, 138], which has relevance for local and national decision makers with finite budgets.
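As a small worked example of how a cost-effectiveness threshold can be read as a health opportunity cost, the Python sketch below computes an ICER and the corresponding net health benefit (QALYs gained minus the QALYs expected to be displaced elsewhere) at the two ends of the £20,000–£30,000 per QALY range. The incremental cost and QALY values are hypothetical, the function names are illustrative, and the sketch assumes the intervention is both more costly and more effective than its comparator.

```python
def icer(delta_cost, delta_qaly):
    """Incremental cost-effectiveness ratio (cost per QALY gained)."""
    if delta_qaly == 0:
        raise ValueError("ICER is undefined when incremental QALYs are zero")
    return delta_cost / delta_qaly


def net_health_benefit(delta_cost, delta_qaly, threshold):
    """QALYs gained minus QALYs forgone elsewhere, where the threshold is
    interpreted as the cost of displacing one QALY in the care system."""
    return delta_qaly - delta_cost / threshold


# Hypothetical per-person increments versus the comparator.
delta_cost, delta_qaly = 1200.0, 0.05

print(f"ICER: {icer(delta_cost, delta_qaly):.0f} per QALY")
for threshold in (20000.0, 30000.0):
    nhb = net_health_benefit(delta_cost, delta_qaly, threshold)
    verdict = "cost-effective" if nhb > 0 else "not cost-effective"
    print(f"Threshold {threshold:.0f}: net health benefit {nhb:+.3f} QALYs ({verdict})")
```

With these illustrative inputs the conclusion changes between the two thresholds, which is one reason the choice and evidence base of the threshold matter to the decision.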

Despite the case for using a single metric and advances made in its potential use (with limitations), cost per QALY is rarely used within service evaluations. We suggest two key reasons: (1) the study design and data collection methodology means it is too difficult or not possible to collect the data to inform the CUA or (2) the CUA is not of interest to the decision maker for various reasons, including not understanding how to interpret QALY gains and associated ICERs. Detsky and Laupacis [139], in their paper ‘Relevance of Cost-effectiveness Analysis to Clinicians and Policy Makers’, suggest that: “In addition to the problems of looking at cost-effectiveness ratios individually, interpreting those ratios can be difficult for clinicians and decision-makers. It is not easy to understand what a QALY is” (p. 223). How best to explain the QALY including how it is (or could be) used to inform decision-making is certainly an area of interest. An HTA report on ‘The use of economic evaluations in NHS decision-making’ by Williams et al. [51] suggests: “Committee [including local] members raised concerns about lack of understanding of the economic analysis but felt that a single measure of benefit, e.g. the quality-adjusted life-year, was useful in allowing comparison of disparate health interventions and in providing a benchmark for later decisions” (p. iii). This report suggests the QALY could have a place in local decision-making if stakeholders better understood its construct and purpose. Within a service evaluation context, there is no guidance to suggest you have to, or how to, conduct a CUA, meaning it can be overlooked or avoided for the right or wrong reasons.

For service evaluations, if CUA is not desirable and/or possible, then other forms of VfM analyses may offer potential alternatives. Arguably, producing any form of economic evidence to inform decision-making is better than no evidence, provided the analysis and outcomes have been conducted and reported ‘appropriately’ (e.g. appropriate study design, statistical analysis, and uncertainty and opportunity cost reporting options). An issue seems to stem from the lack of guidance and monitoring of the use of VfM analyses when informing localised and even national decision-making as part of service evaluations, whereas such guidance and monitoring are more common for research. For example, NICE’s evidence review groups provide independent reviews of evidence used to inform decision-making [117, 118], and the NIHR has an independent peer-review process pre-funding and at the reporting stage associated with its funding programme journals [140]. Evidence used by some decision makers and associated methods may not be properly peer reviewed, thus allowing commissioners and decision makers to base their decisions on a multitude of evidence of varying quality. It should be noted that such peer-review processes themselves can be time consuming and costly, and thus may not be a practical or cost-efficient option. However, checklists such as those by Drummond et al. [26] and the Consolidated Health Economic Evaluation Reporting Standards (CHEERS) [141] still have a place alongside VfM analyses for service evaluations, as these standards should still enable cross-comparable evaluations with some suggested/recommended methods included.

There are practical examples as part of the debate for [142] (here focussed on the Cancer Drugs Fund) and against [143] using RCTs, the ‘against’ here being a comedic look at using RCTs to assess parachute use to prevent death and major trauma when jumping from aircraft. The use of RCTs is debated relative to other options including non-randomised, real-world, and observational data either with or without a comparison group, and using historical controls [13]. Further work should be conducted on how VfM methodology can be adapted to deal with these different study designs and data sources. For example, an economic model by Franklin et al. [96] combined treatment effect evidence from an interrupted time-series analysis using routine data with modelling methodology to conduct a CUA. Although this example represents one potential solution and relied on a natural experiment design to conduct the interrupted time-series analysis, further work is required to examine how to combine existing statistical methods for determining treatment effects in observational [90] and real-world [91] data with VfM methodology.

Multi-criteria decision-making (MCDM) and multiple-criteria decision analysis (MCDA) could deal with some limitations associated with single evaluation-based approaches when informing decision makers who require a range of information (not just related to VfM analyses) [144, 145], with good practice guidance of emerging methodologies [146]. Edwards and McIntosh [35] discuss a method called “programme budgeting and marginal analysis” for economic evaluation and prioritisation between public health interventions. They suggest programme budgeting and marginal analysis is an example of multiple-criteria decision analysis and describe steps for its use, but as we are not familiar with the approach, we suggest the interested reader refer to Edwards and McIntosh [35]. Some authors have called for the application of realist evaluation methodologies [147] to better explain cost-effectiveness mechanisms within more “explanatory economic evaluations”. Anderson and Hardwick [147] describe the premise, comparing aspects of both realist evaluations and economic evaluations. Although we approve of the idea of more explanatory economic evaluations, we believe the practical application and understanding of such methodology to inform decision-making is in its early days.

It is important to note that even as part of NICE-based decision-making, there is a range of evidence produced (e.g. via non-randomised studies) that requires appropriate methodologies that are not always taken into account. A review of NICE appraisals of pharmaceuticals (2000–16) by Anderson et al. [148] found variations in establishing comparative clinical effectiveness. Of 489 individual pharmaceutical technologies assessed by NICE, 22 (4%) used non-RCT data to estimate comparative clinical effectiveness. Of these, the methods for establishing external controls were published trials in 13 (59%), observational data in 6 (27%), expert opinion in 2 (9%) and a responder vs non-responder analysis in 1 (5%). Only five (23%) used a regression model to adjust for covariates, indicating that fundamental statistical methodology is missing even from evidence presented to NICE. Interestingly, the authors did not observe a notable difference in the proportion of pharmaceutical technologies that received a positive recommendation from NICE based on RCT or non-RCT data (83% vs 86%). This suggests that even NICE recognises the need to use evidence from non-randomised study designs to inform decision-making, although the quality of such evidence still requires substantial scrutiny and appropriate statistical methodology. For an example of an HTA that uses considerable evidence from single-arm trials using statistical comparator techniques, see Llewellyn et al. [149].

10 Conclusions

The time and budgetary restrictions placed on decision makers might mean that service evaluations are required as a vehicle for producing VfM evidence. As such evidence tends not to be peer reviewed, and as there is no formal guidance related to VfM analyses for service evaluations, there is the opportunity for sub-optimal analyses to be carried out. Although NICE and other guidance specific to economic evaluations might not perfectly fit these analyses to inform all decision makers, there are some fundamental aspects that should be taken into account including study design, data collection methods and sources, statistical methods and reporting standards. The use of study designs and statistical methods to account for confounding factors and potential biases, and methods to control and report on uncertainty around estimates and opportunity costs, are important aspects to consider. In terms of costs, even if considered outside the scope of the decision maker, related future costs should be included in the evaluation alongside intervention costs to account for potential opportunity costs in a care system that obtains its funding from the same overall care budget. In terms of relevant outcomes and the associated VfM method, although VfM analyses other than CUA might be considered more appropriate or practical to use, CUA should be given priority. Alternative methods could then be rationalised, but still reported to the current standards expected from using the CHEERS checklist. Accounting for the time horizon of the decision problem is also important, which for longer time horizons could be achieved using statistical and/or modelling-based methods. However, it is important to note that for some decision makers, the time horizon of interest may be more immediate short-term gains (i.e. over 1 year) than longer-term planning. We suggest that the time horizon over which all costs and effects relevant to the decision problem occur should be considered for the evaluation, with estimates reported over a relevant short (e.g. 1 year) and long term (e.g. 20 years). Value of information methods can then be used to monetise the decision uncertainty over the relevant time horizons.