Cardiovascular risk prediction in type 2 diabetes: a comparison of 22 risk scores in primary care settings

Aims/hypothesis We aimed to compare the performance of risk prediction scores for CVD (i.e., coronary heart disease and stroke), and a broader definition of CVD including atrial fibrillation and heart failure (CVD+), in individuals with type 2 diabetes. Methods Scores were identified through a literature review and were included irrespective of the type of predicted cardiovascular outcome or the inclusion of individuals with type 2 diabetes. Performance was assessed in a contemporary, representative sample of 168,871 UK-based individuals with type 2 diabetes (age ≥18 years without pre-existing CVD+). Missing observations were addressed using multiple imputation. Results We evaluated 22 scores: 13 derived in the general population and nine in individuals with type 2 diabetes. The Systemic Coronary Risk Evaluation (SCORE) CVD rule derived in the general population performed best for both CVD (C statistic 0.67 [95% CI 0.67, 0.67]) and CVD+ (C statistic 0.69 [95% CI 0.69, 0.70]). The C statistic of the remaining scores ranged from 0.62 to 0.67 for CVD, and from 0.64 to 0.69 for CVD+. Calibration slopes (1 indicates perfect calibration) ranged from 0.38 (95% CI 0.37, 0.39) to 0.74 (95% CI 0.72, 0.76) for CVD, and from 0.41 (95% CI 0.40, 0.42) to 0.88 (95% CI 0.86, 0.90) for CVD+. A simple recalibration process considerably improved the performance of the scores, with calibration slopes now ranging between 0.96 and 1.04 for CVD. Scores with more predictors did not outperform scores with fewer predictors: for CVD+, QRISK3 (19 variables) had a C statistic of 0.68 (95% CI 0.68, 0.69), compared with SCORE CVD (six variables) which had a C statistic of 0.69 (95% CI 0.69, 0.70). Scores specific to individuals with diabetes did not discriminate better than scores derived in the general population: the UK Prospective Diabetes Study (UKPDS) scores performed significantly worse than SCORE CVD (p value <0.001). Conclusions/interpretation CVD risk prediction scores could not accurately identify individuals with type 2 diabetes who experienced a CVD event in the 10 years of follow-up. All 22 evaluated models had a comparable and modest discriminative ability. Graphical abstract Supplementary Information The online version contains peer-reviewed but unedited supplementary material available at 10.1007/s00125-021-05640-y.


Introduction
CVD treatment initiation and intensification in clinical practice are guided by risk prediction algorithms. The UK National Institute for Health and Care Excellence (NICE) guidelines pragmatically recommend the use of the QRISK2 risk prediction tool in people with and without diabetes. The American College of Cardiology/American Heart Association (ACC/AHA) recommends estimating the 10 year risk of CVD using the Atherosclerotic Cardiovascular Disease (ASCVD) risk score [1]. Contrary to this, the European Society of Cardiology (ESC) does not recommend a specific CVD risk prediction tool, and instead stratifies individuals into three categories based on risk factors including: presence of target organ damage, number of risk factors, diabetes duration and age [2].
Despite major advances in treatment, people with type 2 diabetes remain at high risk of CVD, the main cause of morbidity and mortality in this population [3]. There is, however, considerable heterogeneity in risk [4], supporting the need for risk-stratified management. With over 300 published CVD risk prediction tools [5], many of which have not been validated in individuals with type 2 diabetes, nor directly compared within the same patient population, it is unclear which CVD score performs best in people with diabetes. Previous comparisons only partially addressed this question, due to focusing on non-representative individuals with diabetes enrolled in drug trials [6], focusing on a relatively short follow-up [7] or using a modest sample of individuals [8], and with all focusing on a small subset of available scores, without exploring performance to predict CVD outcomes more relevant for those individuals. Quite apart from the greater CVD risk, even at a given level of individual risk factors, it is evident that the initial presentation of CVD in individuals with diabetes differs from that of the general population, with greater representation of heart failure (HF) and of peripheral artery disease (PAD), while haemorrhagic strokes are less frequent [9]. General population scores, and indeed many designed for people with diabetes, have focused largely on the prediction of CHD and stroke only.
Our aim was to inform the use of risk scores in clinical practice by quantifying the validity of existing risk scores in predicting standard CVD (CHD, stroke, PAD), as well as a broader definition of major CVD outcomes (CVD+) that includes HF and atrial fibrillation (AF), as these are more frequent outcomes in diabetic populations [10,11]. Additionally, we explored the scores' predictive performance against individual disease types: stroke, CHD, AF and HF. We also compared the performance of both bespoke CVD risk scores for individuals with diabetes and CVD risk scores for the general population. The latter are preferred for clinical practice as a single tool is simpler to deploy. While it is assumed that diabetes-specific scores may perform better in people with diabetes, no formal comparison with general population scores has been undertaken before. We first performed a literature review to identify CVD risk prediction scores, and subsequently validated these in a large UK-based electronic health records (EHR) dataset. We also performed key subgroup analyses, stratifying by sex, age, CVD history and treatment.

Methods
Literature review A literature search for CVD risk assessment tools was performed using MEDLINE [12]. The search strategy focused on key words including 'CVD', 'type 2 diabetes', 'risk assessment' or 'risk score' and names of known risk scores. Please see the electronic supplementary material (ESM) Methods section and ESM Fig. 1  Individuals with type 2 diabetes in this dataset were identified based on a CALIBER phenotyping algorithm (https:// www.caliberresearch.org/portal/phenotypes), harmonising and combining data from general practitioner (GP) records and HES; see ESM Table 1. The CALIBER phenotyping algorithms have been extensively validated, as described previously [14]. As primary care registration is close to universal in the UK and free at the point of delivery, with reimbursements based on correct entry of diagnostic codes, type 2 diabetes case ascertainment based on UK primary care data can be considered as highly accurate [15].
Cardiovascular outcomes Individuals were followed up from their initial type 2 diabetes diagnosis until their first cardiovascular event, death, end of study (5 February 2018) or 10 year follow-up landmark, whichever occurred first. Individuals with a previous record of AF were excluded due to the inability to differentiate between ongoing vs recurrent AF events in EHR. Individuals with any other pre-existing CVD event were excluded from the main analyses, and considered in subsequent subgroup analyses of performance in participants with pre-existing CVD at the time of diagnosis.
A CVD event was defined as the first occurrence of fatal or non-fatal myocardial infarction (MI), sudden cardiac death, ischaemic heart disease, fatal or non-fatal stroke, or PAD since diagnosis of type 2 diabetes. We additionally defined CVD+ as including HF and/or AF: 'CVD + AF + HF'. Additionally, we explored performance against individual CVD components: CHD, stroke, AF and HF. Detailed CALIBER [13] endpoint definitions are provided in ESM Table 5.
Statistical analysis Models were evaluated on discrimination (using Harrell's C statistic [16]) and calibration (calibrationin-the-large [CIL] and calibration slope [CS] [17]); see ESM Methods. We note that when predicting the occurrence of a binary outcome (such as a disease) at a single moment in time, the C statistic is identical to the area under a receiver operating characteristic (ROC) curve [16]. The C statistic varies from 1.0 (perfect discrimination) to 0.5 (random chance). It has been suggested that a C statistic below 0.70 indicates inadequate discrimination, between 0.70 and 0.80 acceptable discrimination, and between 0.80 and 0.90 excellent model discrimination [18]. Missing data were addressed using multiple imputation, and, for comparison's sake, the results were compared with those obtained from a complete-case deletion dataset. Models were evaluated both before and after recalibration, a process whereby a model's intercept and slope are updated to adapt a risk score to a different local setting, a similar but distinct outcome or both. Here, the risk scores were independently recalibrated to predict all six of the CVD endpoints described. To prevent model overfitting, recalibration was performed in a 10% (16,887) independent training sample, which is an ample sample size to estimate the two coefficients (the intercept and slope) necessary for model recalibration. The remaining 90% (151,984) of the dataset was used to compare like-with-like model performance of the uncalibrated and recalibrated models. Subgroup performance was explored for CVD history, sex, age and statin usage at the time of diagnosis. The discriminative ability of two models was formally compared by testing the difference in C statistics using the test data. The net reclassification index (NRI) for CVD was calculated using the test data (after model recalibration in the training data) to compare Systemic Coronary Risk Evaluation (SCORE) CVD against the following scores: QRISK2, QRISK3 and Cardiovascular Health Study (CHS) Basic.

Literature review
All the risk scores incorporated classic CVD risk factors, such as age, sex, blood pressure and smoking status. Twenty risk scores included information about lipids. The scores that included a proportion of individuals with diabetes typically included type 2 diabetes (presence/absence) as a predictor, but did not include diabetes-specific risk factors such as diabetes duration and glycaemic status (which are often used in diabetes-specific scores). The total number of predictors in these risk prediction models ranged from six (SCORE [30]) to 19 (QRISK3 [33]) (ESM Fig. 2, ESM Table 10).
Individual characteristics and 10 year CVD outcomes The baseline characteristics of the individuals are presented in Table 1 and ESM Tables 2, 6, 7. The mean age was 59.  Table 11).
Predicting cardiovascular risk in individuals with type 2 diabetes Results obtained from the complete case-analyses (see ESM Results) were similar to results from the multipleimputation analysis. Nevertheless, because the complete-case analysis slightly overestimated model performance ( Table 14). Models designed to predict stroke and/or HF did not substantially underperform compared with CVDderived models. The scores almost uniformly underestimated the risk of CVD+ (CS: from 0.41 to 0.88, CIL: from −1.50 to 2.69) (ESM Table 14 (Fig. 2). Despite observing reasonable external calibration, models had more difficulty discriminating between individuals who experienced an event within 10 years of follow-up and those who remained event free: the C statistic ranged from 0.62 to 0.67 (95% CI 0.67, 0.67) for SCORE CVD (Fig. 3). Similar patterns of discrimination were observed when predicting CVD+, with this combined endpoint showing a minimally improved C statistic (from 0.64 to 0.69) compared with CVD, and again with SCORE CVD having the largest C statistic (0.69 [95% CI 0.69, 0.70]). Testing for the pairwise difference in C statistics (ESM Fig. 7) indicated that SCORE CVD outperformed all other scores aside from the ASCVD, Finrisk CVD and SCORE CHD. SCORE CVD also performed better than the nine diabetes-specific scores (Fig. 3, ESM Fig. 7). A net reclassification comparison (applied after model recalibration, see below) showed that  Q1 and Q3 refer to lower and upper quartiles, accordingly a The denominator for smoking status is 129,526 individuals, after excluding individuals with missing information SCORE CVD performed slightly better than QRISK2 and QRISK3 by assigning a lower risk to individuals who did not experience CVD in the available 10 years of follow-up ( Table 2, ESM Table 16). We observed that scores with more than ten predictors did not necessarily outperform scores with fewer variables: QRISK3 (19 variables) for CVD+ had a C statistic of 0.68 (95% CI 0.68, 0.69), compared with a C statistic of 0.69 (95% CI 0.69, 0.70) for SCORE CVD (six variables) and a C statistic of 0.69 (95% CI 0.69, 0.69) for the Framingham 1998 score (seven variables). Similar results were obtained for the CVDonly outcome. The scores derived from individuals with diabetes did not outperform scores derived in a population of non-diabetic individuals (Fig. 3).

Predicting individual CVD endpoints and model recalibration
We additionally evaluated the ability of these 22 rules to predict individual CVD components: CHD, stroke, AF and HF. Here, we observed that, similar to CVD, predictions for CHD were slightly overestimated (ESM Figs. 3,5). This overestimating was more severe when using these models to predict stroke, AF and HF (ESM Figs. 4,6). Nevertheless, when considering AF and HF, the discriminative ability of these models slightly improved (C statistic ≥0.70) relative to CHD and CVD (ESM Figs. [8][9][10][11]. Recalibrating the 22 models using the 10% training dataset considerably improved calibration (CS: from 0.96 to 1.04) (ESM Figs. 12-14, ESM Tables 13, 15), with most rules showing near-perfect calibration in the test data. Given that most of these 22 rules were not designed to predict stroke, AF or HF, it was somewhat surprising to see that recalibration markedly improved performance for these endpoints as well, (Fig. 4). For example, after recalibration, QRISK3 could predict HF risk (CS: 1.09 [95% CI 1.02, 1.17]) and AF risk (CS: 1.08 [95% CI 1.00, 1.16]) remarkably well (Fig. 5).
Subgroup analyses Next, in individuals with type 2 diabetes without CVD+ at baseline, we explored the discriminative ability of these CVD scores in subgroup analyses of age, sex and statin usage (Fig. 6, ESM Fig. 15). Subgroup changes in performance were shared across the various scores, where discriminative ability was lower for men and statin naive and older individuals (significance interaction tests indicated by a solid line in Fig. 6, ESM Tables 17-19).
We additionally performed similar subgroup analyses for individuals with type 2 diabetes irrespective of their baseline

Discussion
We validated 22 cardiovascular risk scores for their ability to predict a range of macrovascular endpoints in a cohort of 168,871 people with type 2 diabetes. We report several unique findings. First, with discriminative abilities below 0.70 (C statistic), the scores performed universally poorly, which was compounded in individuals with diabetes and established CVD at baseline (C statistics close to 0.50). Second, diabetesspecific scores did not appear to be superior to scores derived for the general population, and in fact were outperformed by, for example, the general population SCORE CVD rule. Third, scores with many additional features did not outperform those with fewer and more readily available (in primary care) predictors. Finally, a simple recalibration step markedly improved score performance, repurposing scores intended to predict any CVD or CHD to accurately predict stroke, AF and HF risk (see Fig. 4). We externally evaluated two risk prediction scores widely used in the UK (QRISK2 and QRISK3), which had good discriminatory ability in the general population (C statistics for QRISK2 of 0.82 in women and 0.79 in men, and for QRISK3 0.88 in women and 0.86 in men), and found considerable attenuations in their discriminative ability when applied in individuals with type 2 diabetes (e.g., C statistic 0.66 [95% CI 0.66, 0.67] for QRISK2). This poor performance may be somewhat surprising given that the QRISK scores were derived in a similar, but independent, sample of English    .001 for CVD and CVD+), we note that the magnitude of the difference in the C statistic was small and is unlikely to have clinical implications. This may be better appreciated when considering the net reclassification analysis: comparing SCORE CVD with QRISK2 (non-event NRI 0.009, ESM Table 16) and with QRISK3 (non-event NRI 0.017, Table 2) resulted in very modest improvements. The predictive performance of the risk scores was markedly poorer in individuals with pre-existing CVD at the time of type 2 diabetes diagnosis (C statistic ranged from 0.50 to 0.54 for the CVD outcome). Most of the prediction models were developed in individuals without clinical manifestations of CVD and were not validated in people with established CVD. Moreover, these risk scores lack predictors that are of particular importance to individuals with established disease, such as time since the first diagnosis of CVD, history of CVD and renal function [36]. The improved performance of the RECODE score (C statistic >0.70), when considering all Estimates are based on imputed data. Depicted performance is based on 90% of the data used for external validation, independent of the 10% hold-out sample used to recalibrate the models. The observed 10 year risk (y axes) is plotted against the mean predicted 10 year risk (x axes) within groups defined by quintiles of predicted risk participants with type 2 diabetes irrespective of their history of CVD, is likely related to its development in participants with diabetes with similar mixed histories of CVD. Here, we show that the inclusion of people with a mixed history of disease at the time of diabetes diagnosis, combined with appropriate modelling choices, may improve (instead of worsening) score performance.
The necessity for a diabetes-specific CVD score has often been discussed [7] and revolves around the need to account for excess risk unexplained by conventional risk factors, and the desire to include diabetes-specific variables such as HbA 1c and diabetes duration, which are known observationally to predict CVD risk [37]. It has been suggested that a diabetesspecific score can better deal with exposure and outcome associations specific to individuals with type 2 diabetes [25]. Despite these arguments, we did not observe a general benefit of diabetes-specific rules (including diabetes-specific variables) compared with scores derived in samples with a mixture of individuals with type 2 diabetes and the general population. This suggests that the presence of other risk factors which may often correlate among themselves, such as diabetes duration and HbA 1c , does not markedly contribute to model performance. Similarly, we note that despite comparing risk prediction tools derived over the course of more than three decades, during which health and healthcare have generally improved, the almost uniform performance of these scores in the present contemporary sample of individuals with type 2 diabetes illustrates that these healthcare changes have not affected the external performance of these models.
Due to the inherent limitations of EHR data, some predictor variables were infrequently measured (ESM Table 3), which we attempted to address through multiple imputation. Possibly this reliance on imputed data biased results; however, we did not observe a meaningful difference in performance between complex models such as QRISK3 (requiring 19 variables) and more straightforward models such as SCORE CVD and CHD (requiring only six variables). Furthermore, while medical history and prescription data could be readily extracted from prior to the time of type 2 diabetes diagnosis, measured risk factors, such as blood pressure, were extracted  using a window of 12 months before and 1 week after diagnosis. While this does reflect data availability in a real-world setting, medical professionals intending to use the risk prediction tools will likely actively measure key variables, especially if these are readily obtained, such as BMI and blood pressure. Thus, we may have underestimated the true performance of these risk prediction scores in a more ideal setting. Performing analyses in primary care records, where such risk scores are deployed in practice, provides an appropriate platform for validation studies. We acknowledge that such data involve issues with missing data and coding errors, but this better reflects their 'true' value in contrast to the more artificial situation of a research cohort study or clinical trial. Furthermore, the use of real-life clinical data takes a 'whole population' approach, whereas cohorts and trials apply restrictions to entry.
The calibration (agreement between observed and predicted risk) was generally reasonable for all scores and could readily be improved by recalibrating the models in an independent training set. This recalibration was also successful in repurposing models to predict endpoints outside their intended use. Here, we reiterate that all performance metrics, discrimination (C statistic) and calibration, were estimated in an independent test dataset fairly assessing performance without the over-optimism observed when calculating these metrics in the same training data used to recalibrate the scores (or when deriving a model de novo). The near-optimal calibration additionally highlights that the recalibrated models were not overfitted and utilised a sufficiently large training sample size, which would otherwise result in over-or under-estimating the true risk in an independent test dataset. This is perhaps most clearly shown by comparing the calibration plots presented by van der Leeuw and colleagues [8] derived in a small sample of 584 individuals with type 2 diabetes with the calibration observed in the current analysis using more than 16,000 individuals with diabetes (Figs 2, 4, 5). Previous studies have typically focused on any CVD outcome, or individual endpoints such as CHD and any stroke. Here, we show that such models can be used to identify individuals with type 2 diabetes at increased risk for the composite endpoint CVD + HF + AF (since AF and HF occur much more frequently in these individuals [38]). While we showed reasonable out-of-the-box calibration, recalibration improved performance to near-perfect agreement and we propose that recalibration is more frequently considered before applying any model to local settings. Given the modest sample size (a few hundred cases) required to accurately recalibrate a model [39], combined with the increased availability of EHR data, such recalibration could be readily applied by healthcare commissioners at a local level. The adjustment of prediction models to local settings, so-called model updating (recalibration), typically requires a fraction of the time and data (often 100-200 cases should be sufficient [39]), and provides an attractive and efficient alternative to the derivation of a completely new model for each local setting, particularly if one also considers the need for independent replication data to fairly assess model performance. Furthermore, reclassification analyses similar to the ones presented here presuppose that the models are reasonably calibrated. To facilitate model recalibration to local clinical settings, we have appended a straightforward computer application (https://gitlab.com/cvd_in_t2dm/ recalibration), which we are committed to support and adjust pending local requirements.
In summary, we have shown that CVD risk scores derived in the general population performed worse in people with type 2 diabetes. CVD risk scores derived in individuals with diabetes did not in general perform better, emphasising the difficulties of accurately predicting CVD in a relatively high-risk population. The performance of the scores was also similar for the wider outcome definition of CVD, which includes HF and AF which occur more frequently in individuals with diabetes.
Acknowledgements This study was carried out as part of the CALIBER © programme (https://www.ucl.ac.uk/health-informatics/caliber). CALIBER, led from the UCL Institute of Health Informatics, is a research resource consisting of anonymised, coded variables extracted from linked electronic health records, methods and tools, specialised infrastructure, and training and support. This study is based in part on data from the Clinical Practice Research Datalink obtained under license from the UK Medicines and Healthcare products Regulatory Agency. The data are provided by patients and collected by the NHS as part of their care and support. The interpretation and conclusions contained in this study are those of the authors alone. Copyright © (2020), re-used with the permission of the Health & Social Care Information Centre. All rights reserved.
This study and its results have been published previously in medRxiv (https://doi.org/10.1101/2020.10.08.20209015) and an abstract was presented at the ESC Congress in August 2021.

Data availability
The data that support the findings of this study are available from the Clinical Practice Research Datalink (https://www.cprd. com/), but restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. HDR UK CALIBER Phenotype Library: https://portal. caliberresearch.org/.

Funding
KD is supported by a PhD studentship from the National Productivity Investment Fund-MRC Doctoral Training Programme (grant no. MR/S502522/1). FWA is supported by UCL Hospitals NIHR Biomedical Research Centre. JG is supported by BHF grant FS/17/70/ 33482. AFS is supported by BHF grant PG/18/5033837 and the UCL BHF Research Accelerator AA/18/6/34223. NC is supported by MRC Unit grant MRC_UU_00019/1.