Key Points for Decision Makers

Compared with their older counterparts, newer non-insulin antidiabetic drugs are cost effective for treating type 2 diabetes mellitus and show great promise in treating diabetes and its complications through indirect and direct effects. An overview of the recent findings in the cost-effectiveness literature and the underlying methodological choices in the decision-analytical models could aid decision makers in prioritisation.

This review found that as comparators, glucagon-like peptide-1 receptor agonists and sodium-glucose cotransporter-2 inhibitors were the favoured classes. Within these classes, semaglutide (injectable or oral) and empagliflozin were the favoured products, respectively. It is challenging to provide conclusions on the cost-effective option among these products due to different underlying methodological choices, sponsorship bias, and outdated information populating the model.

Decision makers face several difficulties when prioritising between the newer non-insulin antidiabetic drugs. The field of cost-effectiveness analyses in type 2 diabetes mellitus could benefit from using modelling practices, mainly treatment switch assumptions and risk equations, that better align with real-world practice and contemporary follow-up data for modelling treatment effects over time.

1 Introduction

Type 2 diabetes mellitus (T2DM) is a chronic progressive condition that poses a growing public health concern worldwide [1]. The estimated global total diabetes-related health expenditure for adults with diabetes will reach US dollars (US$) 1.03 trillion in 2030 and US$1.05 trillion in 2045 [1]. A UK model estimates that around 90% of the total healthcare costs for treating diabetes can be attributed to T2DM [2]. Furthermore, the costliest aspect of diabetes is the diabetes-related complications (DRCs), which carry great direct and indirect economic costs and a massive impact on health-related quality of life [3,4,5].

Several non-insulin antidiabetic drugs (NIADs) are used for the pharmacological treatment of T2DM, where the management of hyperglycaemia is central. The main NIAD classes are commonly divided into older NIADs (oNIADs; e.g. metformin [MET], sulphonylurea [SU], and thiazolidinediones), and newer NIADs (nNIADs; e.g. dipeptidyl peptidase-4 [DPP4] inhibitors, glucagon-like peptide-1 [GLP1] receptor agonists, sodium-glucose cotransporter-2 [SGLT2] inhibitors) [6].

In 2018, a paradigm shift occurred when the treatment guidelines in the consensus reports from the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD) [7,8,9,10,11,12,13,14,15,16] began to increasingly emphasise products from the GLP1 and SGLT2 classes of pharmaceuticals. This emphasis was based on cardiovascular outcome trials, in which these products were shown to have a protective effect against cardiorenal DRCs while treating hyperglycaemia and promoting weight loss [10]. Later updates have led to the recommendation of even earlier use of GLP1 or SGLT2 products, independent of HbA1c or first-line medication, if the patient is at high risk for atherosclerotic cardiovascular disease (ASCVD), chronic kidney disease (CKD), or heart failure (HF) [14].

Prevention or reduction of the occurrence of DRCs could lead to substantial improvements in the quality of life for patients with T2DM and a reduction in the increasing economic burden of the T2DM pandemic [3]. It thus becomes increasingly important to evaluate the value for money of the nNIADs [17]. One of the most commonly used methods for evaluating the cost effectiveness of diabetes treatment is the decision analytical model (DAM). Many factors complicate predicting how diabetes progresses over time, and DAMs can incorporate multiple sources of evidence to estimate how interventions differ over a long time horizon. Using many sources also allows DAMs to compensate for the short durations of clinical trials [18].

The ADA issued guidelines for diabetes modelling in 2004, highlighting five different factors that complicate diabetes modelling: the long time horizon of disease progression, the involvement of multiple organ systems, the use of several types of medications that affect different outcomes, the vast array of complications that differ in terms of costs and how they affect quality of life, and the difficulties related to diagnosis [19]. These guidelines suggest that confidence in the models and their reliability can increase if transparency, validation, and the inclusion of different types of uncertainty are considered when constructing them. Adherence to these guidelines was investigated in a review from 2015, where it was found to be lacking but improving [20]. Methodological aspects of the models have also been reviewed [21,22,23], and Asche et al. have commented that clinically inconsequential changes in clinical parameters are given too much weight in DAMs and that time horizons in the models are too long [24].

One significant barrier to using nNIADs rather than oNIADs is that their price has consistently been much higher [6, 10]. Despite the higher acquisition cost [11, 13,14,15], the literature on cost effectiveness has favoured the nNIADs [25,26,27,28,29,30,31] for second-line treatment. However, it remains unclear which of the nNIADs is the cost-effective choice, for which patients, and under which model assumptions [25]. An overview of the underlying methodological choices in the DAMs used to compare these nNIADs can give decision makers a better understanding of the economic results on which they base their choice of an appropriate nNIAD. We therefore aimed to provide an overview of the economic outcomes and differences in methodological choices, trends, and model assumptions by conducting a systematic review of the literature on cost-effectiveness analyses (CEAs) using DAMs to compare nNIADs against other nNIADs for the treatment of patients with T2DM.

2 Methods

This systematic review was conducted in 2021, updated in 2022, and adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [32].

2.1 Eligibility Criteria

2.1.1 Types of Studies

Studies were eligible to be included if they were CEAs or cost-utility analyses that used any DAM to compare one or several nNIADs as interventions for the treatment of T2DM. The studies must have compared at least one nNIAD with another, at either the class level or the product level, and either across classes or within the same class (e.g. GLP1 class vs. SGLT2 class, or GLP1 product vs. SGLT2 product, such as semaglutide vs. empagliflozin).

Editorials, comments, conference abstracts, protocols, reviews, and studies that were not original research articles published in English were excluded.

2.1.2 Types of Intervention

The primary aim of the interventions in the CEAs must be managing the hyperglycaemic aspect of T2DM. The therapy in the intervention can be nNIAD monotherapy or combination therapy of nNIADs with oNIADs, other nNIADs, or insulin (INS).

2.1.3 Types of Comparators

Eligible comparators include nNIADs of the SGLT2, GLP1, or DPP4 classes on the product level or aggregated class-level representations based on the pooling of data from the treatment effect of several products.

2.2 Information Sources and Literature Search Strategy

Data were collected by searching the PubMed, EMBASE, and EconLit databases. The initial search period was from 1 January 2018 to 8 October 2020. The search was updated twice and the final search date was 15 November 2022. The search was limited to start from 2018 to include studies from the year the 2018 ADA/EASD consensus report was published [10]. The literature search was first performed in PubMed using a combination of Medical Subject Heading (MeSH) terms and free-text keywords, and subsequently adapted to EMBASE and EconLit. In addition, reference lists in eligible studies were scanned to identify additional relevant articles. The entire search strategy is available in Online Resource Appendix A1.

2.3 Selection Process

2.3.1 Screening Process

First, titles and abstracts were screened for relevance by two authors independently (HVBL and EPJ), erring on the side of inclusion and blinding the decision process using the browser application Rayyan [33]. Second, HVBL and EPJ accessed the full text and reviewed the studies in depth for final inclusion based on the predefined eligibility criteria.

2.4 Data Collected and the Collection Process

A spreadsheet for collecting data from the studies was developed and tested on all studies to ensure the validity of the extraction sheet. After an iterative process, the testing resulted in a standardised sheet for data collection. Data were extracted under three main categories: (1) basic study characteristics, including country of study, funding source, analysis perspective, time horizon, model type and name, subgroups analysed, and background medication and combination therapy of the modelled cohort; (2) main outcomes of the economic evaluation and sensitivity analysis (SA), such as cost, effect measure, incremental cost-effectiveness ratio (ICER) and currency, the key drivers of results, and types and results of SA; and (3) modelling parameters and assumptions, including the source of baseline characteristics of the modelled cohort, patient subgroup, treatment effect, adverse events (AEs), DRCs, risk equations, treatment switch, and treatment switch type. The term ‘main comparator’ was used when a comparator in the study was easily discernible as the one against which all other drugs were compared.

2.5 Quality Assessment of the Included Studies

The overall quality of the included studies was assessed using Drummond’s 10-point checklist for assessing economic evaluations [17], which consists of 10 essential questions with supporting sub-questions. HVBL assessed all studies and EPJ assessed a random sample of 10 studies. Any disagreement was resolved through consensus, while major disagreement resulted in EPJ assessing more studies until consensus was achieved. The answers to the 10 main questions are presented in Table 4, along with a summary of the main results. Each question was answered using one of the following five responses: ‘Yes’ (adequate), ‘Partial yes’ (partially adequate), ‘Unclear’ (cannot tell), ‘No’ (not adequate), and ‘Not applicable’. For the checklist questions that were subjective in nature, the ADA treatment guidelines [10, 11, 13,14,15] were used as a reference regarding the adequacy of the answers.

3 Results

3.1 Identified Studies

From the initial search, 890 unique studies were identified, of which 807 were removed after title and abstract screening, leaving 83 full-text articles. After reviewing the full texts, 50 studies met the inclusion criteria and were included for data extraction (see Fig. 1).

Fig. 1 Flow-chart for study selection, adapted from the PRISMA guidelines for reporting systematic reviews [32]. DAM decision-analytical model, nNIAD newer non-insulin antidiabetic drug

3.2 Basic Characteristics

The results of extracted data from included studies can be found in Table 1, which summarises the general characteristics of the studies. The majority of studies (n = 30) were conducted in Europe [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63], of which many were conducted in the UK (n = 9) [37, 38, 40, 45, 46, 48, 51, 57, 62], and one-third of the studies were conducted in North America (n = 10) [64,65,66,67,68,69,70,71,72,73] and Asia (n = 8) [74,75,76,77,78,79,80,81]. Two studies were conducted outside these regions: one each from Colombia [82] and Iran [83].

Table 1 General characteristics of the studies

Only 9 of 50 studies were not sponsored by the industry [64, 71, 74,75,76, 78,79,80, 83]. Novo Nordisk funded the most studies (n = 23) [35, 37,38,39, 41,42,43,44, 47,48,49, 51,52,53, 57, 60,61,62,63, 68, 70, 73, 81], followed by Boehringer Ingelheim (n = 11) [40, 45, 46, 55, 56, 58, 59, 67, 69, 72, 77]. Of the remaining studies, five were funded by AstraZeneca [34, 50, 54, 65, 82] and one each by Eli Lilly [36] and Janssen Scientific Affairs [66].

The most commonly used model (n = 24) [36,37,38,39,40,41,42,43, 45,46,47,48,49, 51,52,53, 56,57,58,59, 61,62,63, 77] was IQVIA's CORE Diabetes Model (CDM) [84, 85], followed by the Cardiff Diabetes Model [86] (n = 5) [34, 50, 54, 75, 82], the Swedish Institute for Health Economics Cohort Model for T2DM (IHE-DCM) [87] (n = 5) [35, 44, 60, 68, 81], the UK Prospective Diabetes Study (UKPDS) Outcomes Model 2 (UKPDS-OM2) [88] (n = 3) [64, 76, 79], the Chinese Outcomes Model for T2DM (COMT) [89] (n = 2) [74, 78], and one study each with the Economic and Health Outcomes Model of T2DM (ECHO-T2DM) [90] and the Discretely Integrated Condition Event platform [91]. One study used both the IHE-DCM and the ECHO-T2DM to produce and compare their results [73]. Eight studies did not state the model name, of which three studies [71, 80, 83] used Markov state transition models, three used individual patient simulations [55, 67, 72], and one each used a state transition cohort model with a competing risk approach [70] and a decision tree [65].

The majority of studies conducted analyses using the payer’s perspective (n = 31) [34, 36,37,38,39,40,41,42, 45,46,47,48,49,50,51,52, 55, 57, 59, 61,62,63,64,65,66,67, 69, 70, 72, 75, 77]. Less frequently used were a healthcare sector perspective (n = 10) [56, 58, 59, 71, 74, 76, 78,79,80,81] and a societal perspective (n = 6) [35, 43, 44, 53, 54, 68]. Two studies used both a payer’s perspective and a societal perspective [60, 73], while the perspective employed by one study was unclear [82].

Few studies (n = 3) had a time horizon of 5 years or less [65, 71, 82]. The majority had a time horizon of either 40 years (n = 12) [34,35,36, 44, 50, 54, 60, 68, 73, 75, 79, 81] or 50 years (n = 21) [37,38,39,40,41,42, 45,46,47,48,49, 51, 52, 56, 58, 59, 61,62,63, 76, 77], while some studies (n = 11) reported a lifetime horizon [43, 53, 55, 57, 64, 67, 69, 70, 72, 74, 78]. The remainder had a time horizon of 30 years [66] or 10 years [80, 83].

The most frequently used background medication for the modelled cohorts at baseline was MET (n = 25) [35, 37, 38, 42, 46, 50, 52, 56,57,58,59, 62, 65, 66, 68, 69, 72,73,74,75,76, 78, 79, 81, 82], while most of the remaining studies (n = 23) reported MET combined with a range of other classes of medications [34, 36, 39,40,41, 43,44,45, 47, 48, 51, 53,54,55, 60, 61, 63, 64, 67, 70, 71, 77, 80, 83]. Sometimes the background medication was described as ‘Standard of Care’ (SoC) or ‘1–2 oral antidiabetic drugs’. Two studies had unclear reporting on the background medication [71, 80]. With regard to combination therapy, the largest group of studies (n = 22) involved dual or triple therapy, while some studies (n = 19) had dual therapy only and the remaining eight studies had triple therapy or above [36, 54, 59, 69, 72, 75, 76, 83]. In one study, the background medication was unclear [80].

3.3 Cost-Effectiveness Results and Uncertainty

This section first describes how the comparisons between the classes are presented and then the methodological choices regarding the CEAs. An overview can be found in Table 2.

Table 2 Cost-effectiveness and uncertainty results

3.3.1 Overview of Comparisons between the Newer Non-insulin Antidiabetic Drugs

In the included studies, the GLP1 and SGLT2 classes were the main comparators in 31 and 16 studies, respectively, while one study used DPP4 exclusively [78]. The remaining two studies compared alternatives where it was difficult to discern which was the main comparator [75, 76]. No studies evaluated the cost effectiveness of combining the two most frequently compared classes, SGLT2 and GLP1. The dosage of one or more comparators was mentioned in 35 of 50 studies. When a single product was represented as two different comparators with different dosages, the comparator with the higher dosage was always favoured. The results of the comparisons are presented in three sections based on the most frequent comparisons between classes of nNIADs: (1) GLP1 compared with GLP1 or another non-SGLT2; (2) SGLT2 compared with SGLT2 or another non-GLP1; and (3) GLP1 compared with SGLT2. The study that compared different products from the DPP4 class found alogliptin to be cost effective [78]. To simplify the descriptions of the comparators in Sects. 3.3.2–3.3.4, the nNIAD mentioned first is considered the main comparator unless otherwise stated. The results of the comparisons are briefly summarised below. Table 2 provides further details on the comparisons and the economic results in general.

3.3.2 Glucagon-Like Peptide-1 (GLP1) Compared with GLP1 or Another Non-Sodium-Glucose Cotransporter-2 (SGLT2)

Comparisons of products within the GLP1 class occurred in 14 studies, and comparisons of the GLP1 class of products against non-SGLT2 products occurred in six studies.

In 10 studies, injectable semaglutide was found to be primarily dominant, or otherwise cost effective, against other GLP1 products [39, 42, 43, 45, 49, 53, 63, 69, 80, 82]. In the one study where oral semaglutide was compared with injectable semaglutide, oral semaglutide was dominant against it and all other comparators [70]. The three remaining comparisons were between other GLP1 products and gave conflicting results [36, 50, 83]. For the six comparisons with GLP1 and other classes as comparators, injectable semaglutide was dominant compared with dulaglutide and sitagliptin [47], and cost effective compared with INS glargine [43]. Exenatide was found to be cost effective against INS glargine and liraglutide [34], while liraglutide was cost effective against sitagliptin [37]. IDegLira, a combination of liraglutide and INS degludec, was categorised as an nNIAD of the GLP1 class if compared with nNIADs or similar combinations of nNIADs and INS. IDegLira and liraglutide, combined with INS, were found to be cost effective or dominant compared with other GLP1 products combined with INS, GLP1 as a class, and multiple forms of INS treatment [35, 39].

3.3.3 SGLT2 Compared with SGLT2 or Another Non-GLP1

Comparisons of products within the SGLT2 class occurred in four studies, and SGLT2 products were compared with non-GLP1 products from other classes in six studies.

When only SGLT2 products were compared, empagliflozin was found to be cost effective against dapagliflozin [55, 67] and dominant against canagliflozin [67], while canagliflozin was dominant against dapagliflozin in two studies [66, 74]. For the six comparisons with SGLT2 and other classes, empagliflozin was found to be cost effective against sitagliptin [69] and saxagliptin [40], while dapagliflozin was found to be dominant in one study against the DPP4 class and cost effective in another [54, 82]. Finally, dapagliflozin in combination with MET was the dominant strategy in a study comparing combinations of dapagliflozin and saxagliptin [76], and one class-level comparison found SGLT2 to be cost effective against DPP4 [80].

3.3.4 GLP1 Compared with SGLT2

Direct comparison of GLP1 and SGLT2 occurred in 19 studies. Aggregating the results to a class-level perspective, GLP1 was cost effective against SGLT2 in nine comparisons and not cost effective in three, while SGLT2 dominated GLP1 in six comparisons. Additionally, one study found SGLT2 as a third-line choice to be the cost-effective option in a set of comparisons where introducing GLP1 as a third-line choice was dominated.

From the product-level perspective, semaglutide (oral or injectable) and empagliflozin were the most frequently used main comparators. Oral semaglutide was found to be cost effective in five of seven comparisons with empagliflozin [51, 53, 58, 60, 61, 63, 64] but was not cost effective in two comparisons [58, 64]. Injectable semaglutide was found to be cost effective in all comparisons with empagliflozin [49, 57, 59] except one [59]. Injectable semaglutide was also found to be cost effective against canagliflozin [73]. Empagliflozin dominated liraglutide in all their comparisons [45, 56, 72, 77]. When empagliflozin was compared with injectable semaglutide and oral semaglutide, it dominated the latter [71] but was not cost effective against the former [46]. Dapagliflozin was compared with liraglutide once and dominated it [65]. One study compared injectable semaglutide and empagliflozin [71] with no easily discernible main comparator and found the former to be cost effective.

3.3.5 Willingness to Pay and Incremental Cost-Effectiveness Ratio (ICER) Estimates

The most commonly used willingness-to-pay (WTP) threshold (n = 21) was similar to that used in the UK (GBP20,000–30,000 per quality-adjusted life-year [QALY], or the equivalent value in Euros or local currency) [34, 36,37,38,39,40, 45,46,47,48,49,50,51,52,53,54,55, 57, 61,62,63], followed by the World Health Organisation (WHO)-recommended WTP threshold of one to three times the gross domestic product per capita (n = 14) [41, 56, 58, 59, 74,75,76,77,78,79,80,81,82,83]. The remaining studies used a North American WTP threshold (n = 9) [64,65,66,67,68,69,70,71, 73] or a Swedish threshold (n = 3) [35, 60], and one study each used thresholds specific to The Netherlands [43] and Slovakia [42], while the last study did not state a threshold [72].

A total of 116 ICER estimates for base cases and reported scenarios were extracted from the included studies. Most of these ICERs were in the northeast quadrant of an incremental cost-effectiveness plane (n = 65; higher costs and effects), while the remaining were located in the southeast quadrant (n = 54; lower costs, higher effects), except one that was found in the southwest quadrant (lower costs, lower effects). When comparing the base case and the scatterplot resulting from the probabilistic sensitivity analysis (PSA), there was a clear consensus about the cost effectiveness and location of the estimates, indicating ICER estimates that were robust to changes in parameters chosen in the respective analyses.
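
As a minimal sketch of how an incremental result is placed on the cost-effectiveness plane and judged against a WTP threshold, the Python function below applies the standard decision rules. The function name `classify_icer` and all numbers are hypothetical illustrations and are not taken from any of the included studies.

```python
from typing import Tuple


def classify_icer(delta_cost: float, delta_effect: float, wtp: float) -> Tuple[str, float, str]:
    """Return (quadrant, ICER, verdict) for an intervention versus its comparator.

    delta_cost   : incremental cost (intervention minus comparator)
    delta_effect : incremental effect, e.g. QALYs (intervention minus comparator)
    wtp          : willingness-to-pay threshold per unit of effect
    """
    if delta_effect == 0:
        raise ValueError("The ICER is undefined when the incremental effect is zero")

    icer = delta_cost / delta_effect

    # Quadrant of the incremental cost-effectiveness plane.
    # Assumption of this sketch: equal costs are grouped with the
    # "south" quadrants for a gain and the "north" quadrants for a loss.
    if delta_effect > 0:
        quadrant = "northeast" if delta_cost > 0 else "southeast"
    else:
        quadrant = "northwest" if delta_cost >= 0 else "southwest"

    if quadrant == "southeast":
        verdict = "dominant"            # more effect at no higher cost
    elif quadrant == "northwest":
        verdict = "dominated"           # less effect at no lower cost
    elif quadrant == "northeast":
        # Extra effect must be bought at or below the threshold.
        verdict = "cost effective" if icer <= wtp else "not cost effective"
    else:
        # Southwest: savings per unit of effect forgone must reach the threshold.
        verdict = "cost effective" if icer >= wtp else "not cost effective"

    return quadrant, icer, verdict


# Hypothetical example: 5000 extra cost for 0.25 extra QALYs, threshold 30000 per QALY.
print(classify_icer(5000.0, 0.25, 30000.0))
```

In the southeast and northwest quadrants the ICER is negative and is not usually interpreted as a ratio, which is why the sketch reports dominance instead of comparing it with the threshold.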

3.3.6 Effect Measures and Key Drivers

All studies used QALYs and life-years (LYs) as effect measures, except for five studies [50, 60, 65, 80, 83] that only used QALYs. The majority of the included studies emphasised presenting the DRC output of the models and how this output affected the results, while three studies [50, 65, 76] did not report DRC output. Among the studies that reported DRC output, 19 reported both DRC incidence and the time until onset [36,37,38,39, 41,42,43, 47,48,49, 51,52,53, 57, 61,62,63, 66, 81], while 24 only reported incidence [34, 35, 40, 44, 45, 54,55,56, 58, 60, 64, 67,68,69,70, 72,73,74,75, 77,78,79,80, 82]. Two studies focused explicitly on the effect of the treatments on CVD-related DRCs, with one including hospitalisation for HF [46] and the other including CVD-free LYs [69]. One study included event-free survival as an effect measure [70].

The reporting of key drivers was based on an explicit description of the key drivers of the results of the economic analysis, taken from a dedicated section. If not described explicitly, the authors interpreted the key drivers as the results reported as most important for their results. The authors observed a general tendency to emphasise the effect of the nNIADs on the time until the onset of DRCs, the reduction of the cumulative incidence of DRCs, or both, as being key in achieving cost effectiveness over the comparators. It was not possible to discern whether key drivers of the results were reported in nine studies [50, 59, 71, 75, 76, 79, 80, 82, 83].

3.3.7 Sensitivity Analysis

All included studies conducted both deterministic sensitivity analysis (DSA) and PSA. The included studies either directly reported parameters that had the greatest impact on the ICER or supplied graphs or tables from which it was possible to extract the information. We extracted information on the parameters that affected the results most, second most, and third most, according to the degree to which the change in the parameter affected the ICER. These parameters were further grouped into broad categories. Among these categories, the most frequently occurring first-, second-, and third-most sensitive parameters were changes to the time horizon of the analysis, changes to the treatment effect of the comparators, and differing assumptions regarding the cost of the comparators, respectively. Four studies [47, 50, 54, 56] reported that the conclusion of dominance remained unchanged (see Table 2 for more details).
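
The ranking of parameters by their effect on the ICER can be illustrated with a one-way DSA on a toy model. The sketch below is an assumption-laden illustration: the model `toy_icer`, its parameter names, the base-case values, and the low/high ranges are all invented here and do not come from any included study or from the diabetes models named in this review.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def toy_icer(p: Params) -> float:
    """Hypothetical toy model: incremental cost divided by incremental QALYs."""
    delta_cost = (p["extra_drug_cost_per_year"] * p["time_horizon_years"]
                  - p["complication_savings"])
    delta_qaly = p["qaly_gain_per_year"] * p["time_horizon_years"]
    return delta_cost / delta_qaly


def one_way_dsa(model: Callable[[Params], float],
                base: Params,
                ranges: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float, float]]:
    """Vary one parameter at a time and rank parameters by the width of the ICER range."""
    rows = []
    for name, (low, high) in ranges.items():
        # All other parameters stay at their base-case values.
        icers = [model({**base, name: value}) for value in (low, high)]
        rows.append((name, min(icers), max(icers)))
    # Widest ICER range first, as in a tornado diagram.
    return sorted(rows, key=lambda row: row[2] - row[1], reverse=True)


# Invented base case and ranges, for illustration only.
base = {
    "extra_drug_cost_per_year": 1000.0,
    "time_horizon_years": 40.0,
    "complication_savings": 10000.0,
    "qaly_gain_per_year": 0.05,
}
ranges = {
    "extra_drug_cost_per_year": (800.0, 1200.0),
    "time_horizon_years": (10.0, 50.0),
    "qaly_gain_per_year": (0.03, 0.07),
}

ranking = one_way_dsa(toy_icer, base, ranges)
for name, low, high in ranking:
    print(f"{name}: ICER range {low:,.0f} to {high:,.0f}")
```

With these invented inputs the time horizon produces the widest ICER range, but the order depends entirely on the chosen ranges, so the sketch shows the mechanics of the ranking rather than reproducing the review's findings.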

3.4 Model Parameters and Assumptions

This section presents the key model parameters and assumptions from the included studies, along with notable differences between the studies. A detailed presentation can be found in Table 3.

Table 3 Model parameters

3.4.1 Source of Baseline Characteristics of the Modelled Cohorts

The baseline characteristics of the modelled cohorts were sourced from randomised controlled trials (RCTs) in most studies (42/50), some of which focused on cardiovascular outcomes. Real-world data from observational studies or data from other CEAs were only used in eight studies [39, 50, 64, 65, 69, 72, 80, 83]. Baseline tables describing the characteristics of the modelled cohort were unavailable in 10 studies [38, 39, 49, 50, 67, 71, 74, 78, 80, 83]. The characteristics of the cohorts in the studies were most frequently sourced from the SUSTAIN 7 [92] (n = 10) and PIONEER 2 (n = 10) trials [93], followed by EMPA-REG OUTCOME [94] (n = 7), and PIONEER 3 (n = 5) [95]. These trials tested the efficacy of injectable semaglutide, oral semaglutide, and empagliflozin.

However, there was considerable variation in the patient eligibility criteria across these trials. For example, SUSTAIN 7 excluded patients with HF, CKD, and retinopathy, while PIONEER 2 excluded patients with renal impairment, retinopathy, and pancreatitis, and EMPA-REG OUTCOME only included patients with established CVD (see Table 3 for more details on which trials were used as the basis for the modelled cohorts).

3.4.2 Subgroup Analysis

Explicit subgroup analysis was performed in 11 studies. Four studies analysed a subgroup with body mass index (BMI) ≥30 [41,42,43, 53]; two of these studies performed their base-case analysis on a cohort with BMI ≥35 [41, 42], while the other two performed a separate subgroup analysis on individuals with BMI ≥30 [43, 53]. A subgroup analysis on individuals with and without a history of DRCs, primarily CVD, was performed in four studies [69, 71, 72, 80]. One study used four different subgroups from the EXTRA study, and the two remaining studies used a cohort with different characteristics than the base-case cohort to test their results [38, 75].

3.4.3 Model Inputs: Treatment Effect and Adverse Events

The treatment effect and AEs were sourced from multiple sources, with RCTs being the most common (n = 36), followed by network meta-analyses (n = 16), indirect treatment comparisons (n = 8), or other CEAs (n = 6). The studies frequently used treatment effects from the SUSTAIN 7 (n = 9), PIONEER 2 (n = 9), EMPA-REG OUTCOME (n = 8), and PIONEER 3 (n = 6) trials.

The most frequently used treatment effects were changes in HbA1c (n = 47), systolic blood pressure (n = 41), total cholesterol (n = 33), and high-density lipoprotein (n = 33). Similar treatment effects were used between the studies, although those with SGLT2 as the main comparator had a broader range of treatment effects than those with GLP1 as the main comparator. Five studies with empagliflozin as the main comparator [46, 55, 56, 67, 72] and four studies with oral semaglutide [60, 64, 70, 81] used treatment effects specific to the reduction of CVD as an additional separate part of the treatment effect parameters. One study comparing the DPP4 class of products also included CVD-related effects [71]. Studies with GLP1 as the main comparator focused more on hyperglycaemia than those with SGLT2.

Most studies included drug-related AEs, and the most frequently used were variations of hypoglycaemic events: severe (n = 35), non-severe (n = 33), severe nocturnal (n = 13), and non-severe nocturnal (n = 12). Only one study, with GLP1 as the main comparator, included diabetic ketoacidosis [52]. For more information on treatment and adverse effects, see Table 3; for the sources of model input, see Online Resource Appendix A2.

3.4.4 Key Model Assumptions

Two key model assumptions used in the included studies are reported in this section: (1) the risk equations used to predict the development of DRCs over time; and (2) the time until, and conditions leading to, the treatment switch. There were notable differences in how these assumptions were applied in the studies (see Table 3 for details).

The UKPDS 68 [96] and 82 [88] risk equations were the most frequently used. However, there were differences in which risk equations were used in the base-case analysis, depending on the class of the main comparator. Among the 27 studies using UKPDS 68 in the base-case scenario, 20 studies had GLP1 as the main comparator [34, 37,38,39, 41, 42, 47, 48, 50,51,52,53, 57,58,59, 61,62,63, 68, 81] and six studies had SGLT2 as the main comparator [45, 46, 54, 66, 72, 82], while one study [75] had no main comparator. Of the 20 studies using UKPDS 82 in the base case, nine had SGLT2 as the main comparator [40, 45, 46, 56, 66, 67, 69, 72, 77], 10 had GLP1 as the main comparator [35, 39, 58,59,60, 64, 68, 73, 79, 81], and one study [76] had no main comparator. In eight of the studies mentioned above, both risk equations were used for the base case; half of these studies had GLP1 as the main comparator [58, 59, 68, 81] and the other half had SGLT2 as the main comparator [45, 46, 66, 72]. GLP1 was the main comparator in all but one of the 18 studies that used these risk equations in the SA, with three using UKPDS 68 [60, 68, 73], 13 using UKPDS 82 [37,38,39, 41, 47, 48, 51, 53, 57, 61,62,63, 82], and two using both [43, 49]. The exception was a single study with SGLT2 as the main comparator, which used UKPDS 82 in the SA [82].

Explicit reasoning for the choice of risk equation was only present in a few cases. Three studies with SGLT2 as the main comparator cited model fit as the reason [40, 45, 56], while eight GLP1 studies cited advice from model proprietors as the reason [37, 38, 47, 49, 57, 61,62,63]. Different aspects of the risk equations were sometimes used simultaneously. For example, in the study by Neslusan et al. [66], the change in HbA1c, systolic blood pressure (SBP), and lipids was modelled using UKPDS 68, macrovascular complications were modelled using UKPDS 82, and microvascular complications were modelled using a combination of four different risk equations [97,98,99,100]. Six studies either did not use risk equations or had unclear reporting of their use [55, 65, 71, 78, 80, 83]. For details on the risk equations used, see Table 3.

Treatment switch occurred either at a predefined time after treatment initiation of the main comparators [36,37,38,39, 41,42,43, 47, 49, 62, 68,69,70, 72, 79, 81], when the HbA1c levels of the cohort reached a certain threshold [34, 35, 40, 46, 48, 50,51,52, 54, 57,58,59,60, 62,63,64, 66, 73, 75, 82], or under a combination of the two [44, 45, 53, 56, 77]. These variations in treatment switch occurred in 16, 20, and 5 studies, respectively. The time to treatment switch was shorter for studies with GLP1 as the main comparator (n = 20, mean = 3.95 years, standard deviation = 2.7 years) than for the SGLT2 studies (n = 3, mean = 10 years, standard deviation = 2.6 years). At the point of treatment switch, the comparators were either discontinued and replaced with a different medication, often INS [34,35,36,37,38,39,40,41,42,43,44,45, 47,48,49,50,51, 53, 54, 56, 57, 60,61,62,63, 66, 68, 70, 73, 76, 77, 81, 82], continued while different medications could be added [46, 58, 59, 64, 69, 72], or continued for one step and then discontinued at a later step [52, 75], as reported in 33, 6, and 2 studies, respectively. Studies with GLP1 as the main comparator had the comparators discontinued 84% of the time, while those with SGLT2 had the comparators discontinued 47% of the time. Nine studies did not report a treatment switch [55, 65, 67, 74, 76].
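
The three types of switch trigger described above (a predefined time, an HbA1c threshold, or a combination of the two) can be sketched as a simple rule evaluated once per model year. Everything in this sketch is an illustrative assumption: the names `SwitchRule` and `year_of_switch`, the linear upward drift in HbA1c, and the numerical inputs are not taken from the included studies or from any of the named diabetes models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SwitchRule:
    max_years: Optional[int] = None          # switch after this many years on treatment
    hba1c_threshold: Optional[float] = None  # switch once HbA1c (%) reaches this level


def year_of_switch(rule: SwitchRule,
                   baseline_hba1c: float,
                   initial_drop: float,
                   annual_drift: float,
                   horizon_years: int) -> Optional[int]:
    """Return the first model year in which the rule triggers, or None if it never does.

    Assumption of this sketch: HbA1c falls by `initial_drop` at treatment start
    and then rises linearly by `annual_drift` percentage points per year.
    """
    for year in range(1, horizon_years + 1):
        hba1c = baseline_hba1c - initial_drop + annual_drift * year
        time_trigger = rule.max_years is not None and year >= rule.max_years
        hba1c_trigger = (rule.hba1c_threshold is not None
                         and hba1c >= rule.hba1c_threshold)
        # With both criteria set, whichever is met first triggers the switch.
        if time_trigger or hba1c_trigger:
            return year
    return None


# Hypothetical cohort: baseline HbA1c 8.0%, 1.0-point initial drop, 0.25 points drift per year.
time_based = year_of_switch(SwitchRule(max_years=3), 8.0, 1.0, 0.25, 40)
threshold_based = year_of_switch(SwitchRule(hba1c_threshold=7.5), 8.0, 1.0, 0.25, 40)
combined = year_of_switch(SwitchRule(max_years=3, hba1c_threshold=7.5), 8.0, 1.0, 0.25, 40)
print(time_based, threshold_based, combined)
```

Because the year of switch determines how long the treatment effect and drug costs of the main comparator are applied, the choice of rule can move the ICER even when all other inputs are identical.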

3.4.5 Model Outputs—Diabetes-Related Complications

In the conducted CEAs, the most common output related to DRCs was a summary of incidence or onset, most frequently both. Only two studies did not include DRCs as an output [53, 65], while the majority had similar items within the category. Overall, the difference in DRC outputs between the studies was small when present. SGLT2 studies reported HF and cardiovascular death more often and had more varied outputs relating to nephropathy, while GLP1 studies had more varied outputs regarding hypoglycaemia and retinopathy (see Table 3 for all extracted data).

The following categories represented the DRCs: CVD, nephropathy, neuropathy, retinopathy, AEs, and others. The most-reported DRC category was CVD, represented by stroke, myocardial infarction, HF or congestive HF, peripheral vascular disease, angina, ischaemic heart disease, CV death, composite major adverse cardiac events, and transient ischaemic attack. The nephropathic complications were represented by microalbuminuria, end-stage renal disease, haemodialysis, renal transplant, peritoneal dialysis, gross proteinuria, macroalbuminuria, composite renal outcome, renal injury, renal failure, progression of albuminuria, nephropathy, and acute kidney injury. Similar to the treatment effects, AEs as the model output were almost exclusively hypoglycaemia, divided into subcategories of severe and non-severe, and sometimes further divided into non-nocturnal and nocturnal. AEs associated with certain classes were also present, such as genitourinary infection for the SGLT2 class and gastrointestinal issues for GLP1. The neuropathy group consisted of amputation, ulcer, and neuropathy. Macular oedema, blindness, cataract, vision loss, background and proliferative retinopathy, and non-proliferative retinopathy were included in the retinopathy group, whereas laser treatment, gangrene, and revascularisation were included in the ‘others’ category.

3.5 Quality of the Included Studies

The answers to the essential questions of the quality assessment checklist are provided in Table 4. Answers to questions 1, 3, 6, 7, and 8 were adequate in almost all studies; however, answers to questions 2, 4, 5, 9, and 10 had some inadequacies. The description of comparators (question 2) was frequently only partially adequate because comparators relevant to the research question were omitted (e.g. GLP1 products were compared only with GLP1 products and no SGLT2 products). Additionally, most studies lacked a societal perspective in their analyses (question 4). The accuracy of cost and consequence measurement (question 5) was often unclear due to the use of data that may not represent current treatment practice (i.e. the UKPDS risk equations) and the frequent use of short times to treatment switch, which may not reflect modern treatment practice. The heterogeneity of patient groups was rarely explored when the studies addressed uncertainty (question 9). Finally, the discussion sections (question 10) had inadequacies, mainly regarding comparison with other studies and generalisability.

Table 4 Quality assessment via Drummond's 10-point checklist

4 Discussion

4.1 Main Findings

This systematic review aimed to provide an overview of the cost-effectiveness outcomes and the underlying methodological choices of the most recent studies published. In the included studies, SGLT2 and GLP1 products were the comparators of choice among the nNIADs. Empagliflozin and semaglutide (oral and injectable) were the most frequently used main comparators and were found to be cost effective compared with other products within their respective classes. This aligns with the recent ADA reports recommending using products from the SGLT2 and GLP1 classes based on the patient’s risk level and treatment goals [15, 16]. However, only 38% of studies compared SGLT2 and GLP1 directly. The lack of direct comparisons could be due to a lack of direct comparison data, although several of the included studies have performed indirect treatment comparisons or network meta-analyses, or used existing comparisons from the literature.

The cost-effectiveness findings can be viewed at both a class and a product level. At the class level, neither class was consistently favoured over the other, but GLP1 never dominated SGLT2. On the other hand, SGLT2 dominated GLP1 in six comparisons. The dominance of SGLT2 over GLP1 seems limited to the older GLP1 products, as only one of the dominant results was between empagliflozin and the newer line of GLP1 products, oral semaglutide [46]. At the product level, semaglutide (oral or injectable) was more frequently the main comparator against empagliflozin (n = 11) than vice versa (n = 2) and was cost effective in the majority of these comparisons.
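For readers less familiar with the terminology, the distinction between 'dominant' and 'cost effective' can be sketched as follows. This is a generic illustration of the standard incremental cost-effectiveness ratio (ICER) decision rule and not the method of any particular included study. The function name `classify`, the willingness-to-pay threshold, and all cost and quality-adjusted life-year (QALY) values are hypothetical.

```python
def classify(cost_a: float, qaly_a: float,
             cost_b: float, qaly_b: float, wtp: float) -> str:
    """Classify strategy A against strategy B.

    `wtp` is a hypothetical willingness-to-pay threshold per QALY gained.
    """
    d_cost = cost_a - cost_b  # incremental cost of A versus B
    d_qaly = qaly_a - qaly_b  # incremental QALYs of A versus B

    if d_cost == 0 and d_qaly == 0:
        return "equivalent"
    # Dominant: A is no more costly and no less effective than B,
    # and strictly better on at least one of the two dimensions.
    if d_cost <= 0 and d_qaly >= 0:
        return "dominant"
    # Dominated: the mirror image of the case above.
    if d_cost >= 0 and d_qaly <= 0:
        return "dominated"

    icer = d_cost / d_qaly
    if d_qaly > 0:
        # A costs more and yields more: acceptable if the ICER is below the threshold.
        return "cost effective" if icer <= wtp else "not cost effective"
    # A costs less and yields less: acceptable if the savings per QALY
    # forgone are at least as large as the threshold.
    return "cost effective" if icer >= wtp else "not cost effective"


# Hypothetical lifetime costs and QALYs per patient.
print(classify(9000, 8.2, 10000, 8.0, wtp=30000))   # cheaper and more effective
print(classify(12000, 8.1, 10000, 8.0, wtp=30000))  # ICER of about 20,000 per QALY
print(classify(15000, 8.1, 10000, 8.0, wtp=30000))  # ICER of about 50,000 per QALY
```

A dominant result therefore does not depend on the willingness-to-pay threshold, whereas a cost-effective result does. This is one reason why results from settings with different thresholds are difficult to compare directly.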

Sponsorship bias may be inferred from the fact that all industry-funded studies found their product to be cost effective or dominating, except for two studies that compared their competitors’ product against their own (i.e. injectable and oral semaglutide were compared with empagliflozin) and found the competitor to be not cost effective [58, 59]. Of the non-industry-sponsored studies in which SGLT2 and GLP1 were compared, one found oral semaglutide to be cost effective against all comparators except empagliflozin [64], whereas the other found injectable semaglutide to be cost effective against empagliflozin [71].

There were considerable regional differences in the choice of main comparators, with Europe favouring GLP1, the Asian region favouring SGLT2 slightly more, and the North American region seemingly favouring both equally. Most included studies were conducted in Europe, possibly explained by its increased focus on prioritising healthcare resources. However, Asia and North America may benefit more from increased prioritisation of healthcare resources, as the current and projected incidence of T2DM is much higher than in the European region [1].

It can be seen as a shortcoming that only 11 studies explicitly defined the subgroups they modelled. Presenting results for different subgroups might help differentiate SGLT2 and GLP1 products more. Emphasising the differences in their respective effects might be especially important since the Danish Medicines Council has adopted the simplified assumption that their effects are equivalent [102]. The modelled cohorts were almost exclusively in a broad category described as patients with T2DM uncontrolled on one or several first-line medications. Most studies with empagliflozin as the main comparator based their cohort on the EMPA-REG OUTCOME trial, which only included patients with established CVD, and these studies focused more on explicitly incorporating the reduction of cardiorenal DRCs as separate treatment-effect parameters. Including the effect as separate parameters could lead to more accurate models since GLP1 and SGLT2 have demonstrated a reduction of DRCs that occurs independently of the antihyperglycaemic effect [15]. However, one of the included studies argued that including both the direct and indirect effects carries the risk of double-counting, which could overestimate the effect [61]. The majority of the included studies focused on translating the differences between the comparators in terms of how they changed common physiological markers (HbA1c, SBP, cholesterol, etc.) into differences in DRCs over time. Some have critiqued the approach of translating these frequently minor differences into large effects over a time horizon that might be too long [25].

Regarding differences in the inclusion of AEs, this review highlighted the omission of diabetic ketoacidosis as an AE when SGLT2 was a comparator. It is a rare but well-established and potentially deadly AE, especially for INS-dependent T2DM [15, 94, 103,104,105,106,107], yet only one study [52], which had GLP1 as the main comparator, included it. These differences between the models, which in some cases seem to depend on the class of the main comparator, are problematic, and decision makers need to pay special attention to them, as they could greatly affect the results of the economic analysis.

The DAMs used in the health economic evaluations were most often (48%) based on the CDM from IQVIA™ [84, 85]. Still, the high representation of this model does not necessarily reflect its superiority compared with other models available. Studies with GLP1 as the main comparator used this model 61% of the time, while those with SGLT2 used it 33% of the time. The Mount Hood meetings exemplify that there are several complex simulation models specific to diabetes, each with its own strengths and weaknesses [108,109,110,111].

Development of the physiological parameters over time in the context of the disease is primarily governed by risk equations that represent assumptions about the progression of the disease. The present review focused on which risk equations were used and whether they were used for the base-case analysis or the SA. The most commonly used risk equations were UKPDS 68 and 82. Both are based on the UKPDS study [96], which ran from 1977 to 1997, while UKPDS 82 additionally uses 10 years of follow-up beyond that [88]. While these risk equations are currently widely used, contemporary studies may be able to produce more accurate risk equations that better reflect modern treatment practices. The choice of risk equation seemed connected to the class of the main comparator, and the reasoning behind the choice was rarely explicit in the studies. The GLP1 studies that were explicit in their reasoning cited recommendations from the model proprietors (i.e. of the CDM). The authors of this paper could not find any such recommendations in the model validation paper, where the UKPDS 82 has been incorporated since 2014 [88]. The three SGLT2 studies cited model fit and made no reference to model proprietor recommendations, even though they also used the CDM. We recommend that the reasoning behind the choice of risk equations is always stated explicitly, as it may be difficult to explain why using UKPDS 68 was the better choice since the UKPDS 82 risk engine is an updated and improved version based on a larger dataset over an extended period, with more significant predictors and event types, and, in general, predicts fewer DRC events [88].

Another key assumption relates to the time from initiation of the comparators in the study until they are switched out, supplemented with additional medication, or replaced by rescue therapy. Studies with GLP1 as the main comparator and a predefined time to switch often cited a report stating that the mean treatment duration of GLP1 was 29.35 months [112]. However, treatment guidelines known to the authors of this paper recommend lifelong treatment with the nNIADs, stopping only if their use is contraindicated. Models that use a short preset time to treatment switch, after which the comparator is replaced with INS, might not be representative of real-world practice, and the assumption might misinform decision makers about the long-term consequences of utilising that comparator.

The quality of the included studies was, in large part, adequate. The inadequate answers concerned the omission of relevant comparators, the lack of a societal perspective, and the accuracy of measuring costs and consequences. These quality issues may reflect limitations of the available data rather than a lack of consideration from the authors of the included studies. However, the methods used in other studies were rarely discussed, except in some recent studies [58, 59, 61,62,63, 72, 73], where the authors directly commented on the advantages and disadvantages of their own and others’ methods.

4.2 Connecting Main Findings to Other Studies

Other reviews in this area find that SGLT2 and GLP1 appear to be cost effective for patients with T2DM uncontrolled on MET and other background oNIADs. Ruan et al. [30] focused on DPP4 and found it to be cost effective as a second-line treatment compared with SU and INS, but not SGLT2 and GLP1. Bagepally et al. [28] found GLP1 to be cost effective compared with SU, DPP4, and TZD in high-income countries, while Bagepally et al. [29] found SGLT2 to be cost effective against SU, but not DPP4, in high-income countries. Zozaya et al. [31], Rahman et al. [26], and Yoshida et al. [27] focused on SGLT2 and found the class to be cost effective against SU, TZD, and DPP4. Additionally, Yoshida et al. found SGLT2 to be cost effective against α-glucosidase inhibitors, INS, and SoC. However, Yoshida et al. and Zozaya et al. cautioned that the heterogeneity of the studies included in their reviews made it challenging to determine in which treatment scenarios nNIADs were most cost effective.

The lack of CEAs or systematic reviews focusing on DPP4 seemingly conflicts with a global report regarding NIAD market shares that found DPP4s to be the most commonly used nNIAD, followed by GLP1 and SGLT2 [113].

4.3 Limitations of the Study

The present review has several limitations. The literature search was restricted to a brief period, and it could be argued that not including non-DAM methodologies is a limitation as valuable information was lost, even if DAMs are considered the best tool to inform decision makers. It could also be argued that systematic reviews should always strive to provide a unified conclusion based on the literature instead of a descriptive overview and broad qualitative synthesis of the results. However, the diversity of methodological choices and settings, together with the high risk of sponsorship bias in the studies, makes any conclusion on which nNIADs are cost effective challenging for decision makers to interpret. Only including studies that compare nNIADs with other nNIADs was also a limitation. Previous reviews conclude that nNIADs were cost effective compared with oNIADs; however, given the high cost of nNIADs, this is probably limited to second-line treatment. One study [114] has estimated that SGLT2 and GLP1 need a price reduction of 70% to be cost effective compared with MET. The choice of quality assessment checklist in the present study could be considered a limitation as it was not model-specific; however, model specifics are reported as part of the main data extraction. An additional limitation is the exclusion of non-English-language papers, since this could have excluded many papers from the Asian region. Finally, it is a limitation of this review that we included all published papers as separate studies without discussing whether some papers should be interpreted as adaptations of the same modelling study to different settings. This distinction is not trivial, however, and it underlines the conclusion that simply counting the number of published papers in favour of a certain conclusion about cost effectiveness does not necessarily count as ‘more evidence’ of this finding.

4.4 Strengths of the Study

The present review has some key strengths. Extracting a broad range of information about the assumptions and the input and output parameters of DAMs can provide valuable insight into the underlying methodological choices. In a field as complicated as modelling the cost effectiveness of T2DM treatment, it might be beneficial to go beyond only reporting a combination of the economic outcomes and using a checklist that mainly evaluates the included papers’ quality on an ordinal scale of high to low quality. More information for decision makers might provide a better foundation for making decisions.

4.5 Suggestions for Future Research

More studies should be conducted directly comparing the products of the SGLT2 and GLP1 classes. Additionally, the combination of SGLT2 and GLP1 as a single treatment intervention was not examined in any of the included studies, but combining these two highly effective drugs using different biological pathways [15] may be worth exploring. However, the most benefit could be gained from conducting more studies for middle- to low-income countries due to their higher prevalence and incidence of T2DM. If the Danish Medicines Council is right in its assumption that SGLT2 and GLP1 can be seen as equivalent in effect [102], SGLT2 is likely to be a good choice for these countries.

However, drawing conclusions about the cost effectiveness of nNIADs for treating T2DM must be approached cautiously due to the differences between the studies, lack of transparency, and the sponsorship bias outlined in the present and previous reviews. We advise future reviews aiming to draw such conclusions to take this into account and to specify in what context their conclusions apply. For systematic reviews to be better able to derive conclusions, we suggest establishing better practices and consensus for conducting CEAs evaluating the cost effectiveness of T2DM treatments with a DAM. The central clinically relevant model assumptions, parameters, and comparators must be included, and the explicit reasoning for choices must be provided and presented as clearly and transparently as possible. Implementing a base-case cohort in DAMs, based on a representative sample of the population large enough to examine the effect on different subgroups, could greatly benefit the ability of these studies to reflect the consequences of more usage of the nNIADs in a real-world context, in contrast to the highly selected populations of the RCTs used in the included studies. Finally, more observational research is needed into the different aspects of the treatment switch assumption regarding the time until INS is added to, or replaces, the nNIADs and the proportion of addition compared with substitution. Involving clinicians could aid in addressing many of these issues, especially in clarifying and validating the underlying methodological choices in terms of how they reflect and can be transferred to a real-world context and understood by decision makers.

5 Conclusions

There are several challenges associated with drawing conclusions about which of the nNIADs is the cost-effective option for the second-line treatment of T2DM. First, the data used to model disease progression may not reflect modern treatment practice. Second, the reasoning behind key methodological choices, some of which may conflict with treatment guidelines, is often lacking. Third, the assumptions regarding the time until changes to the comparators may not be representative of current practice. Finally, some methodological differences seem to depend on the class of the main comparator in the analysis, and sponsorship bias may also affect results.