Background

Chronic kidney disease (CKD) presents a substantial burden of disease worldwide [1–4], with an increasing number of people being diagnosed [5, 6]. A 2010 study of 2.8 million UK adults reported a 5.9 % prevalence of stage 3–5 CKD [7]. In the UK, costs related to CKD care in 2009–2010 were estimated at around £1.45 billion (1.3 % of the National Health Service (NHS) budget) [8] – costs that are set to rise steeply [6, 8].

Early detection of CKD, and identification of patients at increased risk of developing CKD, can improve care by guiding preventive measures to slow disease progression, initiating timely referral to nephrology care, and supporting better allocation of resources [9]. Yet, despite worldwide efforts to improve detection [10], CKD often remains undiagnosed in its early stages [5]. Currently, most CKD clinical surveillance relies on estimated Glomerular Filtration Rate (eGFR) from serum creatinine testing [10]. In the UK, national clinical practice guidelines recommend systematic monitoring, in the primary care setting, of eGFR in patients with CKD risk factors (i.e. diabetes, hypertension, cardiovascular diseases, or use of particular medications) [11]. In addition, eGFR has been calculated routinely in UK NHS laboratories since 2006, where at least age, sex and creatinine variables are available – so CKD may be picked up in a variety of clinical contexts. Nevertheless, the value of universal clinical/opportunistic screening for CKD remains unclear [12].

Risk prediction models can extend the clinical screening toolkit from measured to predicted disease, affording more timely intervention, for example, to reduce risk factors [13]. Several models have been developed to predict CKD onset, but most have not been validated outside the setting in which they were developed [14, 15]. Therefore, the portability of these models to other populations, risk environments and healthcare settings has yet to be demonstrated. Furthermore, comprehensive head-to-head comparisons of these purportedly alternative models are lacking in the literature [14–16]. Only one comparison of two CKD prediction models, in a small cohort, has been published to date [17].

The aim of this study was to externally validate and compare the performance of previously published models for predicting 5-year CKD risk using routine healthcare records from a UK population with well-studied, high quality electronic health records.

Methods

Reporting

The reporting of this external validation study follows the TRIPOD statement [18, 19], a set of recommendations for the reporting of studies describing the development, validation, or updating of prediction models.

Literature review

Two recent systematic reviews identified prediction models on CKD onset and CKD progression [14, 15]. From these reviews, we selected models predicting CKD onset that could be used in primary care. Models were excluded if (1) they were developed for a specific subpopulation (e.g. HIV patients [20]); (2) the covariate coefficients and regression formula were not reported in the original study; or (3) they had more than one predictor not routinely collected in UK primary care (more than one predictor for which we had > 70 % missing data in our dataset).

Where available, we included simplified scoring systems accompanying the included prediction models. Such systems typically produce an integer score for each patient, where higher scores represent higher predicted risk but there is no relationship with absolute risk.

Validation cohort

Outcome

The outcome of interest was onset of CKD within 5 years. Existing models employ various definitions of CKD [14, 15]. For our study, we followed international guidelines [21] and considered a recent study [7] reporting UK CKD prevalence based on primary care records. We defined CKD as (1) the presence of at least two consecutive eGFR values below 60 mL/min/1.73 m2, as calculated with the Modification of Diet in Renal Disease (MDRD) formula [22], over a period of 3 months or longer; or (2) the presence of a CKD Stage 3–5 diagnostic code.
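The outcome definition above can be illustrated with a minimal Python sketch (the study itself used R). It assumes the 4-variable MDRD equation with the IDMS-traceable constant 175 and serum creatinine in mg/dL; the paper does not state which variant of the equation was applied, and the 90-day span used for "3 months", the function names and the input layout are all assumptions made for illustration.

```python
from datetime import date

def egfr_mdrd(creatinine_mg_dl, age_years, female, black=False):
    """4-variable MDRD eGFR in mL/min/1.73 m2 (assumed IDMS-traceable form).

    Creatinine reported in umol/L can be converted by dividing by 88.4.
    """
    egfr = 175.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr

def ckd_onset_date(measurements, threshold=60.0, min_days=90):
    """Date at which a run of consecutive eGFR values below `threshold`
    first spans at least `min_days`; None if that never happens.

    measurements: list of (date, egfr) tuples in any order.
    """
    run_start = None
    for when, egfr in sorted(measurements):
        if egfr < threshold:
            if run_start is None:
                run_start = when          # first impaired value of a new run
            elif (when - run_start).days >= min_days:
                return when               # impairment sustained long enough
        else:
            run_start = None              # a normal value breaks the run
    return None
```

A diagnostic code for CKD stage 3–5, the second arm of the definition, would be checked separately and combined with this result using a logical OR.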

We were unable to incorporate albumin-creatinine ratio (ACR, a predictor of kidney damage [23] noted in international guidelines [21]) because ACR data are available only for selected groups of patients at risk of CKD, such as those on diabetes care pathways.

Data source

We used linked, anonymised data from the Salford Integrated Record (SIR) up to the end of 2014. SIR is an electronic health record (EHR) that has been overlain on primary and secondary care clinical information systems for over 10 years in the city of Salford (population 234 k) – an early-adopter site of healthcare IT in the UK. SIR includes patient records submitted by all 53 primary care providers and the one secondary care provider for this population, stored as Read codes versions 2 and 3 [24]. The data cover all primary care, some of secondary care – focused on long-term conditions management – and all results from biochemical testing across primary and secondary care.

Study population

Salford is a relatively deprived population with a high burden of disease, where the EHR data have been used extensively to study the population’s health and care. Like all English localities, Salford’s primary care is measured and remunerated under the Quality and Outcomes Framework, including counts of the mean number of conditions per registered patient, where Salford falls in the 61st centile [25].

We included all adults (aged 18 years or older) registered with a Salford practice with at least one record in SIR between April 1, 2009, and March 31, 2010 – the financial year. We looked at the financial rather than calendar year to take account of the Quality and Outcomes Framework, which might have influenced the quality of data recorded by GPs [26, 27]. For all retrieved patients the entry date was the date of the first record in the financial year 2009. Included patients were followed until December 31, 2014, or censored when they moved outside of Salford or died.

We excluded patients with CKD stage 3–5 before study entry, which was determined by diagnostic codes and eGFR measurements (following our definition of CKD onset).

We also defined a cohort of patients with complete follow-up data, consisting of patients who either developed CKD in the study period or had at least 5 years of follow-up. We used this cohort to validate models derived with logistic regression, which requires complete follow-up data.

Predictors and missing data

We used Read codes retrieved from clinicalcodes.org [28] to extract clinical and laboratory variables from the SIR database. Clinicalcodes.org is a repository of Read codes used in previously published articles; we used Read codes from five studies [29–33] (see Additional file 1 for the full list of adopted Read codes). For comorbidities, such as hypertension and peripheral vascular disease, we identified any related diagnostic Read code before the patient’s study entry date. If the type of diabetes was not specified in the diagnostic code or contradicting codes were present (i.e. diabetes type 1 and type 2 for the same patient), we assigned ‘type 1’ to patients with the first diabetes code before 35 years of age, and ‘type 2’ to all other diabetes patients. For medications, such as nonsteroidal anti-inflammatory drugs or antihypertensive medications, we looked for at least two prescriptions in the 6 months prior to entry date. Finally, for laboratory tests, we selected the most recent result within 12 months before the entry date.
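The extraction rules in this paragraph can be sketched as follows. This is an illustrative Python sketch, not the study's R code: the function names, the input layouts, the day counts used for "6 months" and "12 months", and the choice to exclude records dated on the entry date itself are assumptions.

```python
from datetime import date, timedelta

def diabetes_type(recorded_types, age_at_first_code):
    """Resolve diabetes type from the set of types found in diagnostic codes.

    recorded_types: e.g. {"type 1"}, {"type 1", "type 2"} (contradicting)
    or an empty set (type not specified).
    """
    if len(recorded_types) == 1:
        return next(iter(recorded_types))
    # Unspecified or contradicting codes: fall back on age at first code.
    return "type 1" if age_at_first_code < 35 else "type 2"

def on_medication(prescription_dates, entry_date, window_days=183, min_count=2):
    """True if at least `min_count` prescriptions fall in the window before entry."""
    start = entry_date - timedelta(days=window_days)
    in_window = [d for d in prescription_dates if start <= d < entry_date]
    return len(in_window) >= min_count

def latest_lab(results, entry_date, window_days=365):
    """Most recent (date, value) result in the window before entry, else None."""
    start = entry_date - timedelta(days=window_days)
    eligible = [(d, v) for d, v in results if start <= d < entry_date]
    return max(eligible)[1] if eligible else None
```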

Since more than 90 % of the population in Salford is of White British ethnicity [34], we considered patients without a recorded ethnicity code as White British. We imputed values for predictors using multiple imputation by chained equations with 10 iterations to minimise the effect of selectively ignoring those with any missing data (using the mice package in R [35]).

Data analysis

We implemented the logistic and Cox proportional hazards (CPH) regression models using the published coefficients and the intercept or baseline hazard provided. For the QKidney models [36] we used the information from svn.clinrisk.co.uk/opensource/qkidney – a web-based calculator written in C (re-coded in R language as per Additional file 2). For simplified scoring systems, we computed the total simplified score for each patient in our dataset. In addition, if the original model was a logistic regression and the intercept was not reported, we estimated it from information about CKD prevalence and predictors’ summary measures (mean for continuous variables and prevalence for binary variables) in the development population.
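As a minimal sketch of how published coefficients translate into predicted risks, the Python functions below apply a logistic model and a Cox model to one patient. The intercept estimate is one plausible approximation, assumed here because the paper does not give its exact formula: it chooses the intercept so that a patient with average predictor values has a predicted risk equal to the development-population prevalence. All names and example values are hypothetical.

```python
import math

def logistic_risk(intercept, coefs, x):
    """Predicted probability from a logistic model with published coefficients."""
    lp = intercept + sum(coefs[k] * x[k] for k in coefs)
    return 1.0 / (1.0 + math.exp(-lp))

def cox_risk(baseline_survival, coefs, x, means=None):
    """Risk by the horizon t of a Cox model: 1 - S0(t) ** exp(lp).

    `means` optionally centres predictors on the development-population
    means, as many published baseline survival values assume.
    """
    means = means or {}
    lp = sum(coefs[k] * (x[k] - means.get(k, 0.0)) for k in coefs)
    return 1.0 - baseline_survival ** math.exp(lp)

def estimate_intercept(prevalence, coefs, summary):
    """Approximate an unreported intercept (assumed method, see lead-in).

    summary: mean of each continuous predictor and prevalence of each
    binary predictor in the development population.
    """
    logit_prev = math.log(prevalence / (1.0 - prevalence))
    return logit_prev - sum(coefs[k] * summary[k] for k in coefs)
```

For example, with hypothetical coefficients `{"age": 0.05, "diabetes": 0.7}`, development means `{"age": 50, "diabetes": 0.1}` and a prevalence of 0.1, `estimate_intercept` returns the value at which `logistic_risk` evaluated at those means is 0.1.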

We assessed the performance of the models and the associated simplified scoring systems in terms of discrimination and calibration. Discrimination is the ability of a model to distinguish between patients who do or do not develop CKD. Discrimination was assessed by calculating the area under the receiver operating characteristic curve (AUC) and Harrell’s c-index [37–39]; 95 % confidence intervals (CIs) for the AUC and c-index were calculated from 500 bootstrap iterations. We evaluated calibration by calculating the mean absolute prediction error (MAPE), calibration slope, and by calibration plots. MAPE is the average difference in predicted and observed onset of CKD, expressed by a number between 0 and 1, with values closer to 0 indicating better performance [40] (see Additional file 3 for details). Calibration slopes are regression slopes of linear predictors fitted to the external validation dataset [41]. The optimal value is 1, with values smaller than 1 reflecting overfitting of the model. Calibration plots compare mean predicted risks with mean observed outcomes for subgroups with similar predicted risks. A model is considered to be well calibrated if the plot follows the 45° line from the lower left corner to the upper right corner of the plot. In our analysis, we created calibration plots using the R package PredictABEL [42].
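The measures described above can be sketched in plain Python for a binary outcome. This is not the study's R code, and several details are assumptions: MAPE is taken as the mean absolute difference between predicted probability and observed outcome (the exact definition is in Additional file 3), the bootstrap interval is a simple percentile interval, and the calibration slope is obtained by a Newton-Raphson fit of a logistic regression of the outcome on the linear predictor. Harrell's c-index for censored data is not covered.

```python
import math
import random

def auc(y, p):
    """Probability that a random case is ranked above a random non-case
    (ties count one half). Quadratic in sample size; fine for a sketch."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def bootstrap_ci(y, p, n_boot=500, seed=1):
    """Percentile 95 % CI for the AUC from `n_boot` resamples of patients."""
    rng = random.Random(seed)
    idx = range(len(y))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [y[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # a resample needs cases and non-cases
            stats.append(auc(ys, [p[i] for i in sample]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

def mape(y, p):
    """Mean absolute difference between prediction and outcome (assumed form)."""
    return sum(abs(pi - yi) for yi, pi in zip(y, p)) / len(y)

def calibration_slope(y, lp, n_iter=25):
    """Slope b from fitting logit(P(y = 1)) = a + b * lp by Newton-Raphson."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, lp):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu              # gradient with respect to a
            g1 += (yi - mu) * xi       # gradient with respect to b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b
```

A slope below 1 indicates predictions that are too extreme for the validation population, which is how the slopes reported in the Results should be read.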

For the simplified scoring systems, we compared sensitivity, specificity and positive predictive value (PPV) obtained by using the decision-making threshold that was reported in the original publications, as well as using the optimal threshold for our study population as calculated with Youden’s method [43]. If a study did not present any risk score or we could not use the proposed simplified score because of more than one missing predictor in our dataset, sensitivity, specificity and PPV were evaluated for the full model instead.
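A minimal Python sketch of the threshold analysis is given below, assuming that a patient is flagged as high risk when the score is greater than or equal to the threshold and that Youden's method selects the threshold maximising sensitivity + specificity - 1. The function names are hypothetical and the study's own implementation was in R.

```python
def sens_spec_ppv(y, score, threshold):
    """Sensitivity, specificity and PPV when flagging score >= threshold."""
    tp = sum(1 for yi, s in zip(y, score) if s >= threshold and yi == 1)
    fp = sum(1 for yi, s in zip(y, score) if s >= threshold and yi == 0)
    fn = sum(1 for yi, s in zip(y, score) if s < threshold and yi == 1)
    tn = sum(1 for yi, s in zip(y, score) if s < threshold and yi == 0)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sens, spec, ppv

def youden_threshold(y, score):
    """Observed score value that maximises Youden's J = sens + spec - 1."""
    def j(threshold):
        sens, spec, _ = sens_spec_ppv(y, score, threshold)
        return sens + spec - 1.0
    return max(sorted(set(score)), key=j)
```

Because PPV depends on outcome prevalence, a threshold that is optimal by Youden's J can still yield a low PPV in a population where CKD onset is uncommon.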

To interpret the performance of included models we used the framework for external validation from Debray et al. [44]. Therefore, we assessed the extent to which the case-mix of the development datasets and our validation dataset were similar, by comparing the mean linear predictor of models in the two cohorts. Since individual patient data of the development datasets were not publicly available, the mean linear predictor was calculated as the sum of the intercept and the product of model coefficients and predictors’ prevalence (for binary variables) or mean (for continuous variables) provided in the summary statistics of original studies. In order to assess how accurate the mean linear predictor calculation based on the summary statistics was, in our validation dataset we also calculated the mean linear predictor by calculating the mean and standard deviation (SD) of the linear predictor from the individual patient data.
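The two ways of obtaining the mean linear predictor can be compared with a short Python sketch (hypothetical names and values; not the study's R code). For a model whose linear predictor is a plain weighted sum of predictors, the two calculations agree exactly, so any discrepancy in practice would come from non-linear terms, interactions, or rounding in the published summary statistics.

```python
from statistics import mean, stdev

def linear_predictor(intercept, coefs, x):
    """Intercept plus the coefficient-weighted sum of predictor values."""
    return intercept + sum(coefs[k] * x[k] for k in coefs)

def mean_lp_from_summary(intercept, coefs, summary):
    """Mean linear predictor from predictor means and prevalences only."""
    return linear_predictor(intercept, coefs, summary)

def lp_distribution(intercept, coefs, patients):
    """Mean and SD of the linear predictor from individual patient data."""
    lps = [linear_predictor(intercept, coefs, x) for x in patients]
    return mean(lps), stdev(lps)
```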

Finally, to evaluate the clinical impact of implementing the models in practice as screening tools, we performed two analyses. First, we performed decision curve analysis evaluating how different threshold probabilities alter the false-positive and false-negative rate expressed in terms of net benefit [45]. When carrying out a head-to-head comparison of different prediction models on the same population, the interpretation is straightforward – at each clinically relevant probability threshold, the model that has the highest net benefit is preferable. Models are also compared to the extreme choices of designating all and no patients at high risk of developing CKD. Second, for each model, we evaluated the potential implementation of a CKD prevention high-risk approach [46] based on the model’s prediction by calculating the proportion of observed CKD cases in our dataset within the highest tenth of predicted risk (i.e. the 10 % of patients with highest predicted risks).
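Both analyses can be sketched in Python under a few assumptions: net benefit is computed with the standard decision-curve formula, TP/n - (FP/n) × pt/(1 - pt), patients are labelled high risk when their predicted risk is at least the threshold pt, and the "highest tenth" is taken as the top floor(n/10) patients by predicted risk. Function names are hypothetical.

```python
def net_benefit(y, p, pt):
    """Net benefit of labelling patients with predicted risk >= pt as high risk."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 0)
    return tp / n - (fp / n) * pt / (1.0 - pt)

def net_benefit_all(y, pt):
    """Net benefit of labelling every patient as high risk."""
    prev = sum(y) / len(y)
    return prev - (1.0 - prev) * pt / (1.0 - pt)

def cases_in_top_tenth(y, p):
    """Share of all observed cases found among the tenth of patients
    with the highest predicted risks."""
    order = sorted(range(len(y)), key=lambda i: p[i], reverse=True)
    k = max(1, len(y) // 10)
    return sum(y[i] for i in order[:k]) / sum(y)
```

Labelling no patient as high risk has a net benefit of zero at every threshold, which is the second reference strategy a decision curve is compared against.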

Data manipulation and statistical analyses were performed using R software (www.r-project.org).

Sensitivity analyses

We performed several sensitivity analyses. First, since the risk of developing CKD in the asymptomatic general population is low [47], we also validated each of the models in patients with established CKD risk factors at entry date. Following the UK National Institute for Health and Care Excellence (NICE) guidelines on early detection of CKD [11], these risk factors were use of calcineurin inhibitor drugs, lithium, or nonsteroidal anti-inflammatory drugs; diabetes mellitus; hypertension; acute kidney injury in the previous 2 years; history of cardiovascular disease, renal calculi or prostatic hypertrophy, systemic lupus erythematosus, or haematuria; and family history of kidney disease. Second, as most models in our study used a single measurement of renal impairment to define CKD, we repeated the analysis using a more inclusive definition of CKD onset: the presence of a CKD 3–5 diagnostic code or a single eGFR measurement below 60 mL/min/1.73 m2. Third, we considered patients who died during follow-up as if they developed CKD, because mortality is frequently attributable to CKD and most risk prediction models do not account for death as a competing risk. Fourth, we calculated eGFR by using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula [48] and repeated our main analysis (i.e. CKD defined as impaired eGFR for at least 3 months or CKD 3–5 diagnostic code). Fifth, we repeated our main analysis by using a prediction horizon of 4 instead of 5 years. Finally, we repeated the analyses omitting individuals with any missing observation.
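For the fourth sensitivity analysis, the CKD-EPI equation can be sketched in Python as below. The sketch assumes the 2009 creatinine-based CKD-EPI equation with serum creatinine in mg/dL; whether this matches the exact variant used in the study is an assumption, and the function name is hypothetical.

```python
def egfr_ckd_epi(creatinine_mg_dl, age_years, female, black=False):
    """2009 CKD-EPI creatinine equation, eGFR in mL/min/1.73 m2.

    Creatinine reported in umol/L can be converted by dividing by 88.4.
    """
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = creatinine_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr
```

Substituting this function for the MDRD calculation, while keeping the rest of the outcome definition unchanged, reproduces the structure of this sensitivity analysis.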

Results

CKD prediction models included for external validation

Figure 1 depicts the model inclusion process. Of the 29 models identified by Collins et al. [14] and Echouffo-Tcheugui and Kengne [15], 18 were developed with the aim of predicting CKD onset. We excluded three models because of incomplete reporting of regression models (regression coefficients not fully reported) in the original paper [49] and one model because it was developed in a specific sub-population (namely HIV patients) [20]. We excluded a further seven models for which we had more than one missing predictor in our dataset, including missing data for eGFR, urinary excretion, and C-reactive protein [50]; missing post-prandial glucose, proteinuria and uric acid [51]; missing eGFR and quantitative albuminuria [52], and finally, we excluded two models because of missing eGFR and low levels of high-density lipoprotein cholesterol [52, 53], respectively. The final set consisted of seven models (five logistic regression models and two CPH regression models) and five simplified scoring systems [36, 51–56]. Table 1 describes the details of the included models, and Additional file 3: Tables S1, S2 and S3 provide the population characteristics of the development datasets, the regression coefficients, and the simplified scoring systems.

Fig. 1 Procedure to identify and select CKD prediction models

Table 1 Details of studies developing CKD prediction models that were included for external validation

All models were developed outside the UK, with the exception of QKidney® [36] (www.qkidney.org), which was developed on a large population from England and Wales selected from general practices using the EMIS EHR. All included models used a different definition of CKD, but the majority used an older definition based only on one impaired eGFR measurement. Time horizons in original papers were different to our 5-year definition, with the exception of QKidney® [36], which, however, allowed other time horizon options (1-, 2-, 3- and 4-year). For three models, the prediction time horizon was not specified [54–56]. However, we could derive from study duration and data collection procedures in the original publications that the time horizons were 1 [56], 2 [54] and 9 [54] years, respectively. For the remaining models, the reported time horizons were between 4 and 10 years [51, 52, 54].

Predictors included in the models were largely based on known CKD risk factors (hypertension, diabetes mellitus, or history of cardiovascular disease). The only biomarkers included were systolic and diastolic blood pressure, and body mass index. Multiple imputation of missing values was applied to these variables, along with deprivation, haemoglobin (i.e. to calculate presence of anaemia) and smoking. In these predictors, missing values ranged from 1.8 % to 70.0 %, with a median value of 46.0 %. Conversely, we excluded proteinuria as a predictor from our analyses due to 94.6 % missing data (Table 2); therefore, the models by Bang et al. [54] and Kwon et al. [55] had one missing predictor. Finally, three of the included models, which derived a simplified scoring system [53, 55, 57], did not report the intercept of their underpinning logistic regression model, and therefore we estimated the intercepts from the prevalence of CKD and predictors’ summary statistics in the original studies.

Table 2 Patients with complete and incomplete follow-up data stratified for CKD onset; values are numbers (%) unless indicated otherwise

Study population characteristics

Figure 2 shows the cohort selection process. There were 187,533 adult patients with at least one record in the financial year 2009 in our database, of which 178,399 remained after applying our exclusion criteria, with 6941 patients (3.9 %) that died before developing CKD. There were 162,653 patients (91.2 %) who had complete follow-up data. Overall, there were 6038 incident cases of CKD during the study period. Tables 2 and 3 describe the characteristics of cohorts with complete and incomplete follow-up.

Fig. 2 Cohort selection

Table 3 Prevalence of CKD risk factors (as expressed in NICE guidelines) stratified for CKD onset; values are numbers (%) unless indicated otherwise

External validation

Table 4 presents the results of the external validation, namely discrimination and calibration. AUC values ranged from 0.892 (95 % CI, 0.888–0.985) to 0.910 (95 % CI, 0.907–0.913) for patients with complete follow-up data, and the c-index values for the two CPH models on the full cohort were 0.888 (95 % CI, 0.885–0.892) [51] and 0.900 (95 % CI, 0.897–0.903) [36], respectively. Simplified scores showed similar performance to the models from which they were derived. MAPE was below 0.1 for all models, with the only exception of Thakkinstian et al. [56], for which the MAPE was 0.179 (standard deviation (SD), 0.161). Calibration plots (Fig. 3) and related calibration slopes (Table 4) on the complete follow-up data showed a similar pattern to the MAPE analysis. The model by Thakkinstian et al. [56] again tended to over-predict risk, with a calibration slope of 0.44 (95 % CI, 0.43–0.45). Conversely, the only models that were well-calibrated to our population were the ones by Bang et al. [54] and QKidney® [36] with calibration slope values of 0.97 (95 % CI, 0.96–0.98) and 1.02 (95 % CI, 1.01–1.04), respectively. All other models over-predicted risks (i.e. calibration slopes ranging between 0.53 [95 % CI, 0.52–0.53] and 0.68 [95 % CI, 0.67–0.69]), with the exception of the model by Kshirsagar et al. [53], which predicted lower risk and had a calibration slope of 1.74 (95 % CI, 1.72–1.76).

Table 4 Discrimination, MAPE and calibration slopes of included models in patients with complete follow-up data (all models and risk scores) and in the full validation cohort (Cox proportional hazards regression models only)

Fig. 3 Calibration plot of predicted and observed risk for the cohort of patients with complete follow-up. At the bottom, a rug plot in the form of a histogram shows the distribution of the predicted values

Table 5 reports the PPV, sensitivity and specificity for each of the simplified scoring systems. In this analysis we included the full QKidney® [36] model as it does not have an associated simplified scoring system. We also included the full O’Seaghdha et al. [52] model because we could not implement their scoring system: multiple predictors had 70 % or more missing values in our dataset. For two scoring systems (Chien et al. [51] and Thakkinstian et al. [56]), the best threshold in our population was different from the threshold proposed in the development study. For QKidney® [36] and O’Seaghdha et al. [52], for which no threshold was reported in the development study, the optimal threshold in our population was 0.017 (SD, 0.002) and 0.086 (SD, 0.010), respectively. The models performed similarly, with a PPV, sensitivity and specificity of approximately 0.145, 0.86 and 0.80, respectively.

Table 5 Positive predictive value, sensitivity and specificity for simplified scoring systems when applying to the threshold that was proposed in the development study and best threshold on our dataset, calculated using the Youden’s method [43]

The distributions of the linear predictors in the development datasets and the validation dataset, calculated as proposed by Debray et al. [44], are shown in Table 6. For all models, the mean of the linear predictor in the validation dataset was lower than in the development datasets: we found mean differences between 0.2 and 0.6, except for the model of Thakkinstian et al. [56], which had a difference of 1.5. There were few differences between the mean linear predictors computed on our dataset using summary statistics compared with individual patient data.

Table 6 Mean linear predictor, calculated in development datasets and in our validation dataset (patients with complete follow-up data only)

The threshold probability associated with the highest tenth of predicted risk varied from 0.0692 for QKidney® [36] to 0.4256 for the model developed by Thakkinstian et al. [56]. When applying these thresholds to select the 10 % of patients with highest predicted risks, QKidney® [36] identified 64.5 % of all patients that developed CKD during the study period. Proportions for the other models ranged from 48.0 % for the model from Thakkinstian et al. [56] to 64.0 % for the model of O’Seaghdha et al. [52].

Decision curves for the cohort of patients with complete follow-up are presented in Fig. 4. The models by Bang et al. [54] and QKidney® [36] had the best performance. At predicted probability thresholds lower than 0.5, their net benefit was greater than all other models and greater than strategies labelling all patients at high risk (black line) or none at high risk (grey line). For predicted probability thresholds greater than 0.5, Bang et al. [54] and QKidney® [36] were equivalent to the choice of not labelling any patient as high CKD risk (grey line).

Fig. 4 Decision curve analysis for the cohort of patients with complete follow-up

Sensitivity analyses

The sensitivity analysis conducted on patients with CKD risk factors showed comparable calibration and MAPE (Bang et al. [54] and QKidney® [36] were the only well-calibrated models), with an overall decrease in discrimination of about 0.1 (Additional file 3: Table S4) compared to our main analysis. Specifically, AUC values on patients with complete follow-up ranged from 0.756 (95 % CI, 0.749–0.762) to 0.801 (95 % CI, 0.795–0.808), while the c-index values for the two Cox regression models were 0.755 (95 % CI, 0.749–0.761) [51] and 0.775 (95 % CI, 0.769–0.781) [36], respectively. The performance of the simplified scoring systems was worse compared to the models from which they were derived.

The sensitivity analysis in which CKD was defined by the presence of only one eGFR measurement lower than 60 mL/min/1.73 m2 or a diagnostic code for CKD 3–5 led to a higher prevalence of CKD onset (5.2 %, n = 8854), with an overall predictive model performance that slightly decreased (Additional file 3: Table S5), especially in terms of calibration. CKD onset prevalence was also higher (3.9 %, n = 6988) when we calculated eGFR by using the CKD-EPI formula, with an increase in absolute numbers of approximately 1000 cases and an average age in this group of 76 years (SD, 8.1). Overall performance was similar to our main analysis, and only the model by Bang et al. [54] was well-calibrated in this sensitivity analysis (Additional file 3: Table S8). As expected, we witnessed an increase in CKD onset prevalence (7.6 %, n = 13,652) when we counted patients that died during follow-up as if they developed CKD; however, that did not lead to changes in discriminative performance of the models (Additional file 3: Table S6). Conversely, calibration improved for all models that were over-predicting CKD in our main analysis. In the analysis restricted to patients with complete data on all predictors we found an overall decrease in performance of about 0.08 for AUCs and c-index (Additional file 3: Table S7), while the sensitivity analysis that used a 4-year time horizon showed similar discriminative performance to our main analysis, but worse calibration for all models except QKidney® (Additional file 3: Table S9).

Discussion

We externally validated and compared seven published models for the prediction of CKD onset [14, 15], using a recent 5-year window with well-studied EHR data, typical of UK NHS primary care and chronic disease management. All models discriminated well between patients who developed CKD and those who did not. Five models had an associated simplified scoring system, each of which had a similar performance to its parent model. Only two models were well-calibrated to the risk levels in our population [36, 54]. The 10 % of patients with the highest predicted risks accounted for 48.0 % to 64.5 % of all patients who developed CKD.

Two key strengths of this study are (1) its large sample size and (2) its cohort being based on a geographically-defined population rather than tied to a particular EHR, which minimizes selection bias at enrolment. In addition, whilst five out of seven models had already been externally validated [17, 36, 51, 54, 55, 58] and two had been mutually compared [17], our study is the first comprehensive head-to-head comparison of multiple CKD prediction models on a large independent population.

Three previous UK-based studies [36, 58, 59] have externally validated QKidney® [36] and reported a c-statistic of 0.87, good calibration and similar proportions of identified CKD cases among the 10 % of patients with highest predicted risks. Although each study externally validated QKidney® [36] using UK primary care EHR data, our study extended the validation. Collins et al. [59] and Hippisley-Cox and Coupland [36, 58] adopted the same inclusion criteria as in the original development study [36] (i.e. patients aged between 35 and 74 years), CKD definition (i.e. eGFR measurement <45 mL/min/1.73 m2, kidney transplant, dialysis, nephropathy diagnosis and proteinuria) and stratification by sex. However, the present study included all adults (aged 18 years and over) and used a more robust definition of the outcome.

A previous study compared the models from Chien et al. [51] and Thakkinstian et al. [56] in mixed-ancestry South Africans [17]. That study found that these models underestimated CKD risk in its population, while in our external validation both models over-predicted CKD risk. A likely explanation is the difference in CKD onset prevalence between the development cohorts, the cohort from the Mogue et al. [17] dataset, and our cohort. Specifically, the study population from Mogue et al. [17] had a much higher prevalence of CKD cases than these development cohorts, while our study population had a lower prevalence.

The included prediction models and simplified scoring systems had remarkably good discriminative ability in our dataset, with better performance than in most of the original studies. This is, on the one hand, surprising because models usually perform similarly or worse in external validation. On the other hand, we used a more robust definition of CKD, requiring impaired eGFR (eGFR < 60 mL/min/1.73 m2) for at least 3 months, rather than the one used in most of the original studies [51–55], which looks at CKD measurements in isolation. The latter definition inflates incidence of CKD diagnosis [60] and therefore leads to a poorer signal-to-noise ratio and a decrease in model performance [61], as shown in our sensitivity analysis (Additional file 3: Table S5). Another advantage of our definition, which is based on the international Kidney Disease: Improving Global Outcomes (KDIGO) guidelines [62], is that it is closer to the definition of CKD currently used in UK clinical practice. Along the same lines, we used the MDRD formula to calculate eGFR, which is currently used in UK clinical practice. We also performed a sensitivity analysis to investigate whether using the CKD-EPI formula [48] would have led to different results, which confirmed the findings from Carter et al. [63] that the CKD-EPI formula calculates lower eGFR values than the MDRD formula for older patients.

In the complete case analysis, and in the analysis restricted to patients with established CKD risk factors, there was a marked decrease in discriminative performance. In both cases, in addition to the decrease in sample size, a plausible explanation is that these analyses increased the differences in case-mix between development and validation datasets. The complete case analysis considers only patients without missing predictors, who are more likely to have had healthcare contacts related to their disease. As in the cohort with established CKD risk factors, this excludes a large group of healthy patients, and thus leads to a quite different population from the one for which the models were developed. Based on our findings it seems that a different model is needed for patients with established CKD risk factors. Such a model could use other information that is not routinely available in the majority of the low-risk population, like creatinine levels.

We observed an over-prediction of CKD risk by the majority of models, which can be explained largely by differences in case-mix between our validation cohort and the development populations. First, the incidence of CKD in most development datasets was higher than in our validation cohort. As a consequence, the baseline CKD risks calculated (i.e. model intercepts) in the development datasets were too high for our population. Furthermore, as the mean linear predictor analysis showed, our population appeared to be healthier (i.e. lower mean predictor values) than the populations used in the development studies. We also found, in some models, unexpectedly large coefficients for some covariates. For example, three of the included models [53–55] had coefficients for covariates such as anaemia or peripheral vascular disease that were comparable to or larger than those of more well-established CKD risk factors like diabetes or hypertension. Finally, another possible explanation of the models’ poor calibration is that, for some predictors, we adopted definitions in accordance with those used in the NHS, which differ slightly from the definitions in the original studies.

No calibration problems were found for the models by Bang et al. [54] and QKidney® [36]. However, we left out an important predictor from the model by Bang et al. [54], proteinuria, because it was missing from our dataset. Because the model is well calibrated now, we expect that it would have over-predicted risks if proteinuria had been present. QKidney® [36] was originally developed in the UK primary care (England and Wales), and it was the only model for which the analysed time horizon (5 years) was the same as in the development paper. Therefore, a good calibration was expected. This was confirmed by the fact that we obtained similar mean linear predictors in our dataset to the ones reported in the original development study (Table 6).

Overall, the only model that could conceivably be applied in our population without recalibration was QKidney® [36]. QKidney® consistently outperformed all the other models in terms of both discrimination and calibration, and its performance is comparable to that reported in existing validation studies [36, 58, 59]. The model could be used via the web calculator (www.qkidney.org) or integrated directly into EHRs.

From a methodological perspective, there is room for improvement in CKD prediction modelling. First, future studies should consider using the CKD definition provided by the international KDIGO guidelines [62]. This definition should also be used to re-estimate the CKD risk prediction models already available. Second, none of the models included in our analysis accounted for death as a competing risk. We recommend that authors of future models use appropriate methodologies [64, 65] to do so. Third, authors should take advantage of the new opportunities offered by EHR databases to develop and validate future CKD prediction models [66]. In particular, besides providing access to larger samples and more predictors, EHRs make it possible to observe repeated measurements and to account for changes over time in patients’ relevant conditions and biomarkers [66, 67]. This is particularly important in CKD, where comorbidities and biomarkers like creatinine play a key role.
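Why death as a competing risk matters can be shown with a minimal sketch. It is not necessarily the methodology of [64, 65]: it compares a naive estimate (one minus the Kaplan-Meier estimate, with deaths treated as censoring) against the Aalen-Johansen cumulative incidence estimator on invented follow-up data. All names and values below are hypothetical.

```python
CENSORED, CKD, DEATH = 0, 1, 2


def cumulative_incidence(records, horizon):
    """Return (naive, competing_risk) estimates of CKD incidence by `horizon`.

    records: list of (time, status) with status CENSORED, CKD or DEATH.
    naive: 1 - Kaplan-Meier, treating deaths as censored observations.
    competing_risk: Aalen-Johansen cumulative incidence of CKD.
    """
    event_times = sorted({t for t, s in records if s != CENSORED and t <= horizon})
    km_ckd_free = 1.0   # Kaplan-Meier CKD-free probability (deaths censored)
    event_free = 1.0    # probability of neither CKD nor death so far
    cif_ckd = 0.0       # Aalen-Johansen cumulative incidence of CKD
    for t in event_times:
        at_risk = sum(1 for u, _ in records if u >= t)
        d_ckd = sum(1 for u, s in records if u == t and s == CKD)
        d_death = sum(1 for u, s in records if u == t and s == DEATH)
        # CKD can only occur in patients still alive and CKD-free just before t
        cif_ckd += event_free * d_ckd / at_risk
        event_free *= 1.0 - (d_ckd + d_death) / at_risk
        km_ckd_free *= 1.0 - d_ckd / at_risk
    return 1.0 - km_ckd_free, cif_ckd


# Hypothetical follow-up of six patients: (years of follow-up, status)
toy_records = [
    (1, DEATH), (2, CKD), (3, DEATH), (4, CKD), (5, CENSORED), (5, CENSORED),
]

naive, competing_risk = cumulative_incidence(toy_records, horizon=5)
print(f"naive 1 - KM: {naive:.3f}, Aalen-Johansen: {competing_risk:.3f}")
```

In this toy dataset two of six patients develop CKD, so the Aalen-Johansen estimate is 1/3, whereas the naive estimate is 7/15 because it implicitly assumes that patients who died could still have developed CKD. The gap widens as mortality increases, which is relevant in older populations with multiple comorbidities.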

Our study has several limitations. First, we excluded 11 models identified from the two reviews [14, 15] because they included variables not present in our data. However, these models were qualitatively less applicable to our prediction population/context than those included. Second, we removed proteinuria from the models by Bang et al. [54] and Kwon et al. [55] because proteinuria was rarely available for patients in our dataset, and this has likely impaired the estimated performance of these models. Third, we could not reproduce the exact KDIGO definition of CKD because ACR is not routinely collected in UK primary care. Again, these limitations are unlikely to influence the implications of our findings for current practice. Finally, we had missing values for ethnicity and treated patients for whom no ethnicity information was recorded as if they were of White British ethnicity. Poor recording of ethnicity is an acknowledged issue in the NHS [68]. However, because of the regional nature of our data, which cover only the city of Salford (England, UK), where more than 90 % of the population is White [34], we believe that this did not affect our findings.

Conclusion

To conclude, we have provided an independent, external validation of CKD prediction models with data that will soon be available in most parts of the UK. All included models had good discriminative performance, but most of them were poorly calibrated. Although no model was ideal, QKidney® [36] performed best, and could support a high-risk approach to CKD prevention in primary care. This study underlines the need for ongoing (re)calibration of clinical prediction models in their contexts of use.

Abbreviations

ACR, albumin-creatinine ratio; AUC, area under the receiver operating characteristic curve; CKD, chronic kidney disease; CKD-EPI, Chronic Kidney Disease Epidemiology Collaboration; CPH, Cox proportional hazards; eGFR, estimated glomerular filtration rate; EHR, electronic health record; KDIGO, Kidney Disease: Improving Global Outcomes; MAPE, mean absolute prediction error; MDRD, Modification of Diet in Renal Disease; PPV, positive predictive value; SD, standard deviation; SIR, Salford Integrated Record