Background

Diabetes mellitus, and type 2 diabetes in particular, is a growing epidemic worldwide, with developing countries currently paying the highest toll [1]. In 2013 there were approximately 382 million individuals with type 2 diabetes, and this number is projected to surge to approximately 592 million by 2035 [1]. This rapid rise will impose an even greater and more profound burden that developing countries are not equipped to handle. Type 2 diabetes in developing countries is further characterized by a low detection rate, with a high proportion of affected people remaining undiagnosed. Strategies are therefore needed for early detection and risk stratification, so that treatment measures can be implemented to prevent the onset or delay the progression of related complications.

The use of multivariable risk prediction models has been advocated as a practical and potentially affordable approach for improving the detection of undiagnosed diabetes. Accordingly, guidelines, including those of the International Diabetes Federation, increasingly promote the use of reliable, simple and practical risk scoring systems or questionnaires and their derivatives for diabetes risk screening around the world [2, 3]. During the last two decades, numerous diabetes prediction models have been developed. However, only a few have been externally validated, and generally not in developing countries [4, 5]. Consequently, many developing countries have to rely on prediction models developed in other populations and not necessarily validated in their context. Differences in case-mix across populations, inherent to the development of models, can however severely affect the applicability of a model in different settings [6, 7].

This study aimed to validate and compare the performance of selected common models for predicting prevalent undiagnosed diabetes, based on non-invasively measured predictors, in mixed-ancestry South Africans.

Methods

Study population and design

The Cape Town Bellville-South study data served as the basis for model validation [8]. Bellville-South is located within the northern suburbs of Cape Town, South Africa, and is traditionally a mixed-ancestry township formed in the late 1950s. According to the 2001 population census, its population stands at approximately 26,758, of whom 80.48 % (21,536) are mixed-ancestry individuals [22]. The study was approved by the Ethics Committee of the Cape Peninsula University of Technology (CPUT/HW-REC 2008/002 and CPUT/HW-REC 2010) and Stellenbosch University (N09/05/146).

The Bellville South Study was a cross-sectional study conducted from mid-January 2008 to March 2009 (cohort 1) and from January 2011 to November 2011 (cohort 2). The target population was individuals aged ≥ 35 years. Using a map of Bellville South obtained from the Bellville municipality, random sampling proceeded as follows: first, the area was divided into six strata; second, within each stratum the streets were classified as short (≤22 houses), medium (23–40 houses) or long (>40 houses) based on the number of houses. Two streets of each length category were randomly selected from each stratum. Where the number of houses was too small, an additional short or medium street was randomly selected and added to that stratum. The result was a total of 16 short streets (approximately 190 houses), 15 medium streets (approximately 410 houses) and 12 long streets (approximately 400 houses). From the selected streets, all household members meeting the selection criteria were invited to participate in the study. One thousand subjects who met the criteria were approached and 642 participated in the study. In addition, community authorities requested that willing participants outside the random selection area should benefit from the study. Therefore volunteers (304 in 2008–2009 [cohort 1] and 308 in 2011 [cohort 2]) from the same community, who were not part of the randomly selected streets or did not meet the age criteria, were also included.

Recruitment strategy

Information regarding the project was disseminated to residents through the local radio station, community newspaper, brochures and fliers; the latter bore information about the project and were distributed through school children and taxis by the recruitment team. Additionally, a ‘road show’ strategy involving a celebrity from the same community who was living with diabetes was used, especially in the targeted streets. Recruited subjects were visited by the recruitment team the evening before participation and reminded of all the survey instructions. These included fasting overnight and abstaining from alcohol and from consuming any fluids on the morning of participation. Since the participants were required to bring in an early-morning mid-stream urine sample, they were provided with a sterile container as well as instructions on how to collect the sample. Furthermore, participants were encouraged to bring along their medical/clinic cards and/or any medication they were currently using.

Identification of prediction models

Existing prediction models were obtained from a systematic review by Brown et al. [9]. The search strategy from Brown’s paper was re-run in PubMed for the period up to April 2014 to identify possible new models, using the following search string, as per Brown et al.: ((“type 2 diabetes” OR “hyperglycaemia” OR “hyperglycemia”) AND (“risk scores”)). Only models developed to predict the presence of undiagnosed diabetes were selected. We focused on models developed using non-invasively measured predictors that were available in the Bellville-South cohort database. Models were excluded if they were developed for males and females separately.

Definition and measurement of outcome and predictors

The main outcome was newly diagnosed type 2 diabetes from the standard oral glucose tolerance test (OGTT), applying the World Health Organisation (WHO) criteria (i.e. fasting plasma glucose ≥ 7.0 mmol/L and/or 2 h plasma glucose ≥ 11.1 mmol/L) [10]. At the baseline evaluation conducted between 2008 and 2011, participants received a face-to-face interview administered by trained personnel to collect data on personal and family history of diabetes mellitus, cardiovascular disease (CVD) and treatments; habits including smoking, alcohol consumption, physical activity and diet; demographics and education.
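As a minimal illustration of how this outcome definition can be operationalised (the variable names below are hypothetical, not those of the study database):

```python
import pandas as pd

def flag_undiagnosed_diabetes(df: pd.DataFrame) -> pd.Series:
    """Flag prevalent undiagnosed diabetes from OGTT results using the WHO criteria:
    fasting plasma glucose >= 7.0 mmol/L and/or 2-h plasma glucose >= 11.1 mmol/L,
    restricted to participants without a known diabetes diagnosis."""
    meets_who_criteria = (df["fpg_mmol_l"] >= 7.0) | (df["glucose_2h_mmol_l"] >= 11.1)
    return meets_who_criteria & ~df["known_diabetes"]
```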

Clinical measurements included height, weight, hip and waist circumferences, and blood pressure (BP). BP was measured with a semi-automatic digital blood pressure monitor (Rossmax MJ90, USA) on the right arm, in a sitting position, after a 10-min rest. The lowest value from three consecutive measurements taken 5 min apart was used in the current analysis. Weight was determined to the nearest 0.1 kg on a Sunbeam EB710 digital bathroom scale, with each subject in light clothing, without shoes and socks. Height was measured to the nearest centimetre with a stadiometer, with subjects standing on a flat surface. Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in metres (kg/m2).

Blood samples were collected and processed for a wide range of biochemical markers. Plasma glucose was measured by enzymatic hexokinase method (Cobas 6000, Roche Diagnostics, USA). High density lipoprotein cholesterol (HDL-c) and triglycerides (TG) were estimated by enzymatic colorimetric methods (Cobas 6000, Roche Diagnostics, USA).

Assessment of model performance

The selected models were validated in their original form, in the overall data and in subsets, using their published equations without any recalibration. The predicted probability of undiagnosed diabetes for each participant was computed from the baseline measured predictors. Performance was expressed in terms of discrimination and calibration. Discrimination describes a model’s ability to distinguish those at high risk of diabetes from those at low risk [11]. Discrimination was assessed and compared using the concordance (C) statistic and non-parametric methods [12].
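As a minimal sketch (the study does not state which software was used; scikit-learn is assumed here purely for illustration), the C-statistic for a binary outcome can be computed as the area under the ROC curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def c_statistic(outcome: np.ndarray, predicted_risk: np.ndarray) -> float:
    """C-statistic: probability that a randomly chosen participant with undiagnosed
    diabetes receives a higher predicted risk than one without; equals the ROC AUC."""
    return roc_auc_score(outcome, predicted_risk)

# Example with toy data:
# c_statistic(np.array([0, 1, 1, 0]), np.array([0.10, 0.40, 0.35, 0.20]))  -> 1.0
```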

Calibration describes the agreement between the probability of the outcome of interest as estimated by the model and the observed outcome frequencies [13]. It was assessed graphically by plotting the predicted risk against the observed outcome rate. The agreement between the expected (E) and observed (O) rates (E/O ratio) was assessed overall and within pre-specified groups of participants. The 95 % confidence intervals for the E/O ratio were calculated assuming a Poisson distribution [14]. We also calculated 1) the Yates slope, which is the difference in mean predicted probability of type 2 diabetes between participants with and without prevalent undiagnosed diabetes, with higher values indicating better performance; and 2) the Brier score, which is the mean squared difference between predicted probability and actual outcome across participants, with values ranging from 0 for a perfect prediction model to 1 for no match between prediction and outcome [11, 13]. The optimal cut-off for maximising the potential effectiveness of each model was determined using Youden’s J statistic (Youden’s index) [15], with sensitivity, specificity and the percentage of correctly classified individuals determined for each threshold. The main analysis was done for the overall cohort and for subgroups defined by sex, age (<60 vs. ≥60 years) and BMI (<25 kg/m2 vs. ≥25 kg/m2).
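For illustration, the sketch below implements these measures as described above; the binary outcome and predicted risks are assumed to be NumPy arrays, and the Poisson-based confidence interval is an approximation that may differ in detail from the construction used in reference [14].

```python
import numpy as np

def expected_observed_ratio(outcome: np.ndarray, predicted_risk: np.ndarray):
    """E/O ratio: total expected cases (sum of predicted risks) over observed cases,
    with an approximate 95 % CI treating the observed count as Poisson."""
    expected, observed = predicted_risk.sum(), outcome.sum()
    ratio = expected / observed
    half_width = 1.96 / np.sqrt(observed)  # SE of log(observed count)
    return ratio, (ratio * np.exp(-half_width), ratio * np.exp(half_width))

def yates_slope(outcome: np.ndarray, predicted_risk: np.ndarray) -> float:
    """Difference in mean predicted risk between cases and non-cases (higher is better)."""
    return predicted_risk[outcome == 1].mean() - predicted_risk[outcome == 0].mean()

def brier_score(outcome: np.ndarray, predicted_risk: np.ndarray) -> float:
    """Mean squared difference between predicted risk and outcome (0 = perfect)."""
    return np.mean((predicted_risk - outcome) ** 2)

def youden_optimal_threshold(outcome: np.ndarray, predicted_risk: np.ndarray):
    """Threshold maximising Youden's J = sensitivity + specificity - 1."""
    best_threshold, best_j = None, -1.0
    for t in np.unique(predicted_risk):
        predicted_positive = predicted_risk >= t
        sensitivity = (predicted_positive & (outcome == 1)).sum() / (outcome == 1).sum()
        specificity = (~predicted_positive & (outcome == 0)).sum() / (outcome == 0).sum()
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j
```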

Sensitivity analysis

To improve performance and account for differences in diabetes prevalence between the development populations and the test population, models were recalibrated to the test-population-specific prevalence using intercept adjustment [16]. The correction factor is based on the mean predicted risk and the observed prevalence in the validation set, and equals the natural logarithm of the ratio of the odds of the mean observed prevalence to the odds of the mean predicted risk [16]. To assess the potential effect of complete-case analysis on model performance in validation, we also assessed the discrimination of each model across five datasets generated by multiple imputation of missing data.
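A minimal sketch of this intercept adjustment, assuming the predicted risks are available on the probability scale: the correction factor (the log odds of the observed prevalence minus the log odds of the mean predicted risk) is added to each participant's linear predictor before converting back to a probability.

```python
import numpy as np

def logit(p: np.ndarray) -> np.ndarray:
    return np.log(p / (1.0 - p))

def intercept_adjusted_risk(predicted_risk: np.ndarray, outcome: np.ndarray) -> np.ndarray:
    """Recalibrate predicted risks to the validation-sample prevalence.
    The correction factor is the log odds of the observed prevalence minus the
    log odds of the mean predicted risk; it is added to the linear predictor
    (i.e. the model intercept) and the result converted back to a probability."""
    correction = logit(outcome.mean()) - logit(predicted_risk.mean())
    adjusted_linear_predictor = logit(predicted_risk) + correction
    return 1.0 / (1.0 + np.exp(-adjusted_linear_predictor))
```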

Results

Identification of prediction models

Five non-invasive prevalent diabetes prediction models were selected for validation following the screening process: the Cambridge Risk Score [17], Kuwaiti Risk Score [18], Omani Diabetes Risk Score [19], Rotterdam Predictive Model 1 [20] and the simplified Finnish Diabetes Risk Score [21] (Fig. 1). Table 1 summarizes the models’ characteristics. All models included age as a predictor, while a range of other predictors were variably combined across models. These included: sex, BMI, use of antihypertensive medication, family history of diabetes, waist circumference, past or current smoking and the use of corticosteroids. Additional file 1: Table S1 contains the full equations for each of the models.
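To illustrate the general form shared by these scores, a logistic model combining non-invasive predictors, the sketch below uses entirely hypothetical coefficients; the actual published equations are provided in Additional file 1: Table S1.

```python
import numpy as np

def example_noninvasive_risk_score(age_years: float, bmi: float, male: int,
                                   antihypertensive_rx: int, family_history: int,
                                   current_or_past_smoker: int) -> float:
    """Illustrative only: hypothetical placeholder coefficients, not those of any of
    the five published models. Shows the shared logistic form
    p = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    linear_predictor = (-6.0                       # hypothetical intercept
                        + 0.05 * age_years         # hypothetical coefficients
                        + 0.08 * bmi
                        + 0.20 * male
                        + 0.50 * antihypertensive_rx
                        + 0.60 * family_history
                        + 0.25 * current_or_past_smoker)
    return 1.0 / (1.0 + np.exp(-linear_predictor))
```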

Fig. 1
figure 1

Flow diagram of selected studies

Table 1 Overview of the included prevalent diabetes risk prediction models and their performance for the original model and the intercept adjusted model

Participants’ characteristics

A total of 1256 participants were examined in the Bellville South studies, including 173 with a history of diagnosed diabetes who were excluded. A further 346 participants were excluded for missing data on predictors or the outcome variable. The final dataset therefore comprised 737 participants, of whom 580 (78.70 %) were female. In Additional file 2: Table S2, we compare the profile of participants in the final sample with that of participants excluded for missing data. Excluded participants comprised more men (27.2 % vs. 21.3 %, p = 0.012), were more likely to display a better lifestyle profile for alcohol intake (18.8 % vs. 28.1 %, p < 0.001) and smoking (31.8 % vs. 43.8 %, p < 0.001), and had a lower prevalence of family history of diabetes (all p ≤ 0.001), higher systolic blood pressure (126 vs. 123 mmHg, p = 0.009) and lower triglycerides (1.4 vs. 1.5 mmol/l, p = 0.043), although the absolute differences were mostly clinically trivial.

The baseline profile for men and women included in the study is described in Table 2. The mean baseline age was 51.2 years overall, and 53.5 and 52.1 years in men and women, respectively (p = 0.311). BMI (p < 0.001), waist circumference (p = 0.024) and fasting blood glucose (p = 0.036) were significantly higher in women, while smoking (p < 0.001) and alcohol consumption (p < 0.001) were more frequent among men.

Table 2 Comparison of characteristics between male and female participants with valid data

Prediction of prevalent undiagnosed diabetes in the overall sample

A total of 130 participants (17.6 %) had prevalent undiagnosed diabetes. This prevalence was similar in men and women (14 % vs. 18.6 %, p = 0.220) (Table 2). Table 1 and Additional file 1: Table S1 show the discrimination of the selected prediction models in their original form in the overall sample. Discrimination was modest-to-acceptable and similar between models, with C-statistics (95 % CI) ranging from 0.64 (0.59–0.69) for the Rotterdam model to 0.68 (0.63–0.73) for the Kuwaiti model (all p > 0.05 for C-statistic comparisons; Additional file 3: Table S3). At the total population level, the absolute risk of prevalent diabetes was acceptably estimated by the Omani model, overestimated by 81 % (9–152 %) by the Cambridge model, underestimated by 74 % (61–87 %) by the Finnish model, and marginally underestimated by the Kuwaiti and Rotterdam models (Table 1). The calibration curves are shown in Fig. 2 and supplemental Fig. 2. There was a systematic risk underestimation across the continuum of predicted probability by the Finnish and Rotterdam models, a selective upper-strata risk overestimation by the Cambridge and Omani models, and a combination of lower-strata risk underestimation and upper-strata risk overestimation by the Kuwaiti model. Comparison of the C-statistics from the development studies with the models’ performance in this population shows a drop in performance for all models. Other performance measures are shown in Table 1.

Fig. 2
figure 2

Calibration curves in the overall cohort for the models before (upper panel) and after the intercept adjustment (lower panel). A Cambridge Risk Score, B Kuwaiti Risk Score, C Omani Diabetes Risk Score, and D Simplified Finnish Diabetes Risk Score and E Rotterdam Predictive Model 1. Calibration describes the agreement between the probability of undiagnosed diabetes as estimated by the model and the recorded frequencies of the outcome. The ideal calibration is graphically represented by the dotted diagonal line at 45°. Participants are grouped into percentiles across increasing predicted risk. The vertical lines at the bottom of the graph depict the frequency distribution of the calibrated probabilities of diabetes. E/O, expected/observed ratio

Prediction of prevalent undiagnosed diabetes in subgroups

The performance of the original models across subgroups paralleled that in the overall dataset (Table 3). When comparing patterns of prediction across complementary subgroups, only isolated differences in performance were seen for individual subgroups, and these were not consistent across all performance measures. Estimates of C-statistics were broadly similar across complementary subgroups, except for the Omani and Finnish models across BMI subgroups, where lower estimates were always found in the overweight/obese subgroup. The pattern of the overall calibration (E/O) across complementary subgroups varied substantially across models. For instance, across gender subgroups, the overall diabetes risk was acceptably and equally predicted by the Omani model, equally underestimated by the Kuwaiti and Finnish models, equally overestimated by the Cambridge model, but acceptably estimated in men and underestimated in women by the Rotterdam model (Table 3). Other performance measures across subgroups are shown in Table 3.

Table 3 Discrimination and calibration statistics for diabetes risk model performance in subgroups of participants by gender, age and body mass index (BMI)

Performance of the intercept adjusted models

As expected, intercept adjustment yielded acceptable agreement between predicted and observed prevalent diabetes rates at the total population level. Perfect agreement across the continuum of predicted probability was also observed for the updated Rotterdam model. However, despite some attenuation, selective upper-strata risk overestimation remained apparent for the other models.

Model performance at the optimal threshold

The performance of the models at the optimal thresholds is shown in Table 1. As expected, the optimal threshold probability for our sample varied across models and, for the same model, between the original and intercept-adjusted versions. The sensitivity at the optimal threshold ranged from 61 % for the Kuwaiti model to 85 % for the Omani model, the specificity from 42 % (Omani model) to 65 % (Rotterdam model), and the proportion of participants correctly classified from 50 % (Omani model) to 64 % (Rotterdam model).

Model performance after multiple imputation of missing data

The discrimination (C-statistic) of the models across the five datasets obtained after multiple imputation of missing data was very similar: 0.69 (0.64–0.73) for the Cambridge model, 0.69 (0.65–0.74) for the Kuwaiti model, 0.65 (0.61–0.69) for the Omani model, 0.65 (0.60–0.69) for the Rotterdam model and 0.66 (0.62–0.70) for the Finnish model. These values were also very similar to those from the validation of the models on the dataset comprising only participants with complete data (Table 1).

Discussion

To our knowledge, this is the largest and most comprehensive validation study of prevalent diabetes prediction models in a sub-Saharan African population. In the Bellville South cohort, the selected existing prediction models based on non-invasively measured predictors had modest-to-acceptable discriminatory ability for prevalent undiagnosed diabetes, both overall and within subgroups. Simple intercept adjustment had a mixed effect on the calibration performance of the models, and none of the models was sufficiently better than the others to be uniquely recommended for use in this setting. At the optimal probability thresholds, the best performing model would correctly classify only about two-thirds of the population, indicating scope for further improving model performance in this setting.

Diabetes screening programs are essential to reducing the worldwide burden of complications in individuals with undiagnosed diabetes. In view of the large and continuously growing burden of diabetes, the Centers for Disease Control and Prevention strongly advocates diabetes screening programs. In its most recent guidelines for type 2 diabetes screening and diagnosis, the International Diabetes Federation recommended that each health service should decide on programs to detect undiagnosed diabetes based on the prevalence and the resources available in that region [3]. In areas with limited care resources, such as developing countries, detection programs are suggested to be opportunistic and limited to high-risk individuals. The World Health Organization African region promotes the screening of at-risk individuals in Africa in healthcare settings and at social gatherings [22]. Risk assessment scores are feasible and cost-effective and can be considered, but their applicability must be established, with the required measurements available in the area and the risk score validated in the target population.

With the exception of the Kuwaiti model [18], all the other models assessed in our study had previously been validated externally. The most extensively validated appeared to be the Cambridge model [17], with C-statistics ranging from 0.67 to 0.83 across validation studies [23–27]. With a C-statistic of 0.67 in the Bellville South dataset, the Cambridge model’s performance in this population fell at the bottom end of the range reported by other validation studies. Similarly, the Finnish model’s discrimination (C-statistic: 0.67) was comparable to the lower C-statistics from validation studies [23, 27, 28]. The Rotterdam model mirrored previous validation study results (0.64 vs. 0.63–0.65) [23, 27, 29], while the Omani model underperformed (C-statistic: 0.66) compared with the only validation study the authors are aware of (C-statistic: 0.72) [28].

Simple intercept adjustment improved the E/O ratios of all models. Although no model was deemed ready for immediate implementation, the Rotterdam Predictive Model 1 showed the greatest improvement in calibration following this adjustment. A 2012 review by Brown et al. [9] of 17 undiagnosed type 2 diabetes risk scores, which included all five models discussed here, determined that performance was not associated with the number of predictors in a model. Overall, validation studies showed a drop in model performance when a model was tested in a new population, with the Rotterdam model having the lowest range of validation performance compared with the other models. This was echoed in our results for the original Rotterdam model. The possible reasons for the drop in performance of diabetes prediction models in new populations, some of which apply to our study, have been extensively discussed elsewhere [30].

At the optimal probability threshold, the models tested in our study would at best correctly detect two-thirds of participants, with diagnostic performance mostly similar to that from published studies [25, 30]. This indicates the scope for improving the performance of diabetes prediction models in our setting, which could be done by adopting or developing models enriched with predictors that improve predictive accuracy. Such an approach, however, has to be balanced against the fact that the number of predictors and the complexity and cost of their measurement are severe limitations for uptake in routine practice [30]. What is probably needed most in resource-limited settings such as Africa is evidence confirming that the introduction of diabetes prediction models into routine practice will improve early detection of diabetes by healthcare practitioners and, in the long run, the outcomes of those diagnosed with diabetes.

The results of this study were strengthened by the diagnosis of diabetes based on the OGTT, thus limiting the risk of misclassification. The age distribution was wide, including the vast majority of the high-risk population. A potential limitation of the study was the exclusion of some risk scores because the necessary information was unavailable. The smaller number of males in the final dataset could have played a role in the performance of the models, owing to the significant difference between the genders in BMI, a predictor in four of the five models. No power estimation was done, in the absence of consensus methods for sample size estimation in model validation studies. However, studies have suggested that at least 100 events and 100 non-events are the minimum required for external validation studies [31]; these requirements were largely met in our main analysis. Our study participants comprised a subset of randomly selected individuals and a subset of self-selected participants from the same community. Because participant selection was not influenced by prior knowledge of the association between the study outcome and the predictors included in the tested models, a differential effect of the sampling strategy on the discriminatory performance of the models is very unlikely. The prevalence of screen-detected diabetes among our randomly selected participants alone has been estimated at 18.1 % [32], which is very close to the 17.6 % found in the combined sample, suggesting the absence of a differential effect on the calibration performance of the models. The number of participants with screen-detected diabetes in the combined sample was too small to permit reliable stratified analyses to investigate and confirm the assumptions above. Finally, a substantial number of participants were excluded from the main analyses because of missing data on predictors included in the models or on the status of prevalent undiagnosed diabetes. However, participants with complete data were mostly similar to those with missing data, particularly regarding the distribution of key predictors included in the models such as age, gender and measures of adiposity. Therefore, a differential effect of complete-case analysis on the validated model performance is very unlikely. Indeed, in sensitivity analyses, the discriminatory performance of the models was very similar across the multiply imputed datasets and not appreciably different from the performance based on complete-case analysis. Furthermore, the variables with a high frequency of missingness were those that are difficult to measure accurately in routine settings, such as family history of diabetes, and are therefore less suited for uncritical inclusion in models for predicting diabetes across settings [33, 34].

Conclusions

Our findings highlight that model performance, particularly calibration, varies across populations. This reduced performance can be explained by limited transportability arising from differences between the development and validation population characteristics, and by the effect that case-mix differences have on model performance. As no model has been developed in the mixed-ancestry population of South Africa, the selection of generalizable models for validation was limited. There is a great clinical need for a single, robust and convenient tool for identifying undiagnosed diabetes and predicting future diabetes more quickly and economically in this South African population. Through efficient application of model updating procedures, a final model could improve risk assessment specific to this community. In the absence of an acceptably validated model, the development of a dedicated model is possibly the best way forward.