Key Points

External validation of prediction models typically lacks context, which inhibits understanding of model performance.

Iterative pairwise external validation provides contextualised model performance across databases and across model complexity.

1 Introduction

External validation has been identified as an essential aspect of the development of clinical prediction models and a key part of the evidence-gathering process needed to create impactful models that are adopted in the clinic [1]. Currently, the majority of prediction models are not externally validated, and those that are validated are often poorly reported [2].

A major issue preventing the external validation of models is the lack of interoperability of healthcare databases [3]. There are two main problems to solve: databases use different coding systems (e.g. International Classification of Diseases, Tenth Revision and Systematized Nomenclature of Medicine—Clinical Terms) and have different structures [4]. A solution is to (1) convert each database into a common format to improve syntactic interoperability and (2) standardize to common vocabularies to improve semantic interoperability.

Standardization of the format and vocabulary of these databases allows for the development of standardized tools and a framework for conducting prediction research [5, 6]. Using these standard tools and conducting research according to open science principles [7] removes many difficulties associated with externally validating prediction models. Some challenges remain, including the interpretation of results in the context of the new database. Furthermore, important privacy concerns often need to be respected in the development process [8]. For example, many data owners are unable to share patient-level data, so any development process must be able to cater for this [9].

1.1 Performance Contextualization

Traditionally, a prediction model is trained on one database using predictors selected by domain experts, and this model is then validated on other databases [10, 11]. These models often consist of a limited number of predictors [12]. Recently, data-driven approaches have been used to leverage all the information in electronic health records (EHRs), which can result in models with many predictors. The question is, how do we decide whether the model works well in other databases? For this, the standard approach is to compare the discriminative performance and model calibration with the performance obtained on the training data [13,14,15]. Any performance drop could be because the model was too tuned to the training data to properly transport to unseen data, i.e. the model was overfit or needs recalibration. However, the performance achieved could also be similar to that of a model that is trained on that same database. In other words, the model performs as well as possible in the context of the available data in that database. We need a model development approach that provides this context. Furthermore, simpler models are preferred as they are easier to implement, and—as such—understanding the performance gain compared with the baseline of using only age and sex is valuable to contextualize the performance of the more complex model [16, 17].

In this article, we introduce iterative pairwise external validation (IPEV), a framework to better contextualise the performance of prediction models, and demonstrate its value when developing and validating a prediction model in a network of databases. The use case for this model is the prediction of the 1-year risk of heart failure (HF) following the initiation of a second-line drug to treat type 2 diabetes mellitus (T2DM). As described in detail in a literature review [18], the pathophysiological connection between the two diseases and their frequent adverse interactions should affect treatment choice [19]. The 2019 American Diabetes Association guidelines [20] recommend that patient treatments be stratified according to an established or high risk of HF. Specifically, the guidelines state that thiazolidinediones should be avoided in patients with HF and that sodium–glucose co-transporter-2 inhibitors are preferred in patients at high risk of HF. The guidelines appear to be trending towards a more personalized treatment strategy [21, 22]. As such, there is an opportunity to use risk prediction to further personalize treatment in the intermediate steps before treatment with insulin. This use case presents the opportunity to both evaluate IPEV and simultaneously create a potentially clinically impactful model.

2 Methods

2.1 Analysis Methods

2.1.1 Iterative Pairwise External Validation

IPEV is a new model development and validation framework that involves two steps. The first step is to create two models per database: a model with only age and sex as covariates, which serves as a baseline for what a simple model can achieve, and a more complex data-driven model that assesses the maximum achievable performance. The second step is to validate these models internally and then externally in each of the other databases. A diagram of this process can be seen in Fig. 1.
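To make the rotation concrete, the following is a minimal sketch of the IPEV loop in R. The study itself used the OHDSI PatientLevelPrediction package; the `fit_model` and `evaluate_auc` helpers and the pre-extracted database objects here are hypothetical placeholders, not functions from that package.

```r
# Minimal sketch of the IPEV rotation. fit_model() and evaluate_auc() are
# hypothetical helpers; the databases are assumed to be pre-extracted.
databases <- list(ccae = ccae_data, mdcd = mdcd_data, mdcr = mdcr_data,
                  optum_claims = optum_claims_data, optum_ehr = optum_ehr_data)

results <- list()
for (dev_name in names(databases)) {
  dev_data <- databases[[dev_name]]
  # Step 1: develop two models per database.
  baseline <- fit_model(dev_data, covariates = c("age", "sex"))  # baseline model
  complex  <- fit_model(dev_data, covariates = "all")            # data-driven model
  # Step 2: validate both models in every database (internal validation on the
  # development database, external validation on all others).
  for (val_name in names(databases)) {
    results[[paste(dev_name, val_name, sep = "->")]] <- c(
      baseline_auc = evaluate_auc(baseline, databases[[val_name]]),
      complex_auc  = evaluate_auc(complex,  databases[[val_name]]))
  }
}
```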

Fig. 1 Rotation of databases for model development and external validation in the iterative pairwise external validation method

2.1.2 Candidate Covariates

Two sets of covariates are used to develop models. One set consists of only age and sex and is used to create a baseline model. The other set is used to build a more complex data-driven model and consists of age, sex, and binary variables indicating the presence or absence of conditions recorded any time prior to the index date and of drugs and procedures recorded in the year prior to the index date. A binary variable is constructed for every condition, procedure, or drug observed in the patients' histories. For example, if a diagnosis of liver failure is recorded in a patient's medical record prior to the index date, we create a candidate binary variable named 'liver failure any time prior' that has a value of 1 for patients with a record of liver failure in their history and 0 otherwise.
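As an illustration, the following R sketch constructs such binary covariates from a long-format table of records (one row per patient per recorded code). The table and column names are illustrative, not the actual OMOP CDM extraction used in the study.

```r
library(dplyr)
library(tidyr)

# `records`: one row per patient per recorded code, with each patient's
# index_date already joined on (illustrative layout).
condition_covs <- records %>%
  filter(domain == "condition", record_date < index_date) %>%  # any time prior
  distinct(patient_id, concept_name) %>%
  mutate(value = 1L) %>%
  pivot_wider(names_from = concept_name, values_from = value,
              values_fill = 0L, names_prefix = "cond_any_prior_")

drug_proc_covs <- records %>%
  filter(domain %in% c("drug", "procedure"),
         record_date >= index_date - 365,                      # year prior only
         record_date < index_date) %>%
  distinct(patient_id, concept_name) %>%
  mutate(value = 1L) %>%
  pivot_wider(names_from = concept_name, values_from = value,
              values_fill = 0L, names_prefix = "year_prior_")
```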

The use of these two sets of covariates shows the achievable performance for a simple set of covariates that can then be used to assess any added value of a more complex model. This gives a context to the performance gains relative to the increased model complexity.

2.1.3 Evaluation Analysis

For performance analysis, we consider the area under the receiver operating characteristic curve (AUC) as a measure of discrimination. An AUC of 0.5 corresponds to a model randomly assigning risk, and an AUC of 1 corresponds to a model that can perfectly rank patients in terms of risk (it assigns higher risk to patients who will develop the outcome than to those who will not). For calibration assessment, we use calibration plots and visually assess whether calibration is sufficient.
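A minimal sketch of both evaluations in R, assuming a vector `predicted` of predicted risks and a vector `observed` of binary outcomes:

```r
library(pROC)

# Discrimination: AUC of predicted risks against observed binary outcomes.
auc_value <- auc(roc(observed, predicted))  # 0.5 = random, 1 = perfect ranking

# Calibration: mean predicted vs. observed risk within deciles of predicted risk.
decile <- cut(predicted,
              breaks = unique(quantile(predicted, probs = seq(0, 1, 0.1))),
              include.lowest = TRUE, labels = FALSE)
cal <- aggregate(cbind(predicted, observed) ~ decile,
                 data = data.frame(predicted, observed, decile), FUN = mean)
plot(cal$predicted, cal$observed, xlab = "Mean predicted risk",
     ylab = "Observed proportion with outcome", pch = 19)
abline(0, 1, lty = 2)  # points near the diagonal indicate good calibration
```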

2.2 Proof of Concept

Predicting the 1-year risk of developing HF following initiation of a second pharmaceutical treatment for T2DM was selected as a proof of concept. This case study could help inform treatment decisions by comparing an individual patient’s risk of HF with the known safety profiles of the different medications.

2.2.1 Data Sources

The analyses were performed across a network of five observational healthcare databases. All databases contained either claims or EHR data from the USA and were transformed into the Observational Medical Outcomes Partnership common data model (OMOP CDM), version 5 [23].

Table 1 describes the databases included in this study. The complete specification for the OMOP CDM, version 5, is available at https://ohdsi.github.io/CommonDataModel/cdm531.html.

Table 1 Database characteristics

2.2.2 Cohort Definitions

2.2.2.1 Target Cohort

The target population consisted of adult patients with T2DM who were treated with metformin and who became new users of a sulfonylurea, thiazolidinedione, dipeptidyl peptidase-4 inhibitor, glucagon-like peptide-1 receptor agonist, or sodium–glucose co-transporter-2 inhibitor. The index date was the first prescription of one of these secondary treatments. We required all subjects to have a T2DM diagnosis, based on the presence of a disease code, and use of metformin prior to the index date. Patients with HF or patients treated with insulin on or prior to the index date were excluded from the analysis. Patients were required to have been enrolled for at least 365 days before cohort entry.
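Expressed as data filters, the inclusion logic is roughly as follows. This is a hedged sketch with illustrative column names; the authoritative definition is the cohort specification in the study repository (linked in Sect. 2.2.3).

```r
library(dplyr)

# Sketch of the target cohort inclusion logic (illustrative column names).
target_cohort <- patients %>%
  filter(t2dm_dx_before_index,                # T2DM disease code prior to index
         metformin_before_index,              # prior metformin use
         age_at_index >= 18,                  # adult new users
         !hf_on_or_before_index,              # exclude prior heart failure
         !insulin_on_or_before_index,         # exclude prior insulin treatment
         days_enrolled_before_index >= 365)   # >= 365 days of prior enrolment
```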

2.2.3 Outcome Definitions

The outcome was defined using the presence of a diagnosis code of HF occurring for the first time in the patient’s history between 1 and 365 days post index.

The cohort definition is available at https://github.com/ohdsi-studies/PredictingHFinT2DM/tree/main/validation/inst/cohorts.

The study period contained data from 2000 to 2018. The exact period varies between the databases and is available in Table 1.

2.2.4 Covariates

In total, we derived around 39,000 candidate covariates. These included more than 26,000 conditions, 13,000 procedures and drugs, and demographic information.

2.2.5 Statistical Analysis

Model development followed the framework for the creation and validation of patient-level prediction models presented in Reps et al. [5]. We used a ‘train–test split’ method to perform internal validation. In each target population cohort, a random sample of 75% of the patients (‘training sample’) was used to develop the prediction model, and the remaining 25% of the patients (‘test sample’) was used to internally validate the prediction model developed.
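A simple version of this split in R is sketched below. The PatientLevelPrediction package performs the split internally; this is a plain random split on a data frame `cohort` for illustration.

```r
set.seed(2023)  # arbitrary seed so the split is reproducible
train_idx <- sample(seq_len(nrow(cohort)), size = floor(0.75 * nrow(cohort)))
train <- cohort[train_idx, ]   # 75% 'training sample': model development
test  <- cohort[-train_idx, ]  # 25% 'test sample': internal validation
```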

We used regularized logistic regression risk models with least absolute shrinkage and selection operator (LASSO) regularization. Regularization limits overfitting during model development by assigning a 'cost' to the inclusion of each variable: a variable must contribute more to model performance than this cost in order to be included. If this condition is not met, the coefficient of the covariate becomes 0, which eliminates the covariate from the model, providing built-in feature selection [24].
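As a sketch, LASSO logistic regression can be fitted with the glmnet package; glmnet is an illustrative stand-in for the implementation used inside PatientLevelPrediction. `X` is a (sparse) covariate matrix, `y` the binary outcome vector, and `X_new` an assumed matrix of new patients.

```r
library(glmnet)

fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 gives LASSO;
                                                        # cross-validation selects
                                                        # the penalty strength lambda
coefs <- coef(fit, s = "lambda.min")
selected <- rownames(coefs)[as.vector(coefs) != 0]      # covariates surviving the penalty
risk <- predict(fit, newx = X_new, s = "lambda.min", type = "response")  # predicted risks
```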

2.2.6 Open-Source Software

We used the PatientLevelPrediction R-package (version 4.0.1) and R (v4.0.2) to perform all analyses. All development analysis code and cohort definitions are available at https://github.com/ohdsi-studies/PredictingHFinT2DM. The validation package is available at https://github.com/ohdsi-studies/PredictingHFinT2DM/tree/main/validation.

3 Results

Across all databases, we selected 403,187 patients with T2DM initiating second-line treatment. Of these, 12,173 developed HF during the 1-year follow-up. Next, we performed patient-level prediction of HF. The number of patients and the AUCs are given in Table 2.

Table 2 Number of patients and internal validation performance per database

The AUC results, shown in Fig. 2, indicate reasonable performance. The main diagonal of the heatmaps indicates the internal validation; all other results are from external validation. The mean AUCs across internal and external validation were 0.78 (Commercial Claims and Encounters [CCAE]), 0.76 (IBM MarketScan® multi-state Medicaid database), 0.76 (IBM MarketScan® Medicare supplemental database [MDCR]), 0.78 (Optum Clinformatics), and 0.78 (Optum EHR). The best-performing models in terms of discrimination were developed in CCAE, Optum Clinformatics, and Optum EHR and appeared to be the most consistent across the external validations. The baseline models, consisting of only age and sex, performed markedly worse than the full data-driven models. For example, the data-driven model achieved an AUC of 0.78 versus 0.64 for the baseline model in CCAE, and 0.80 (data driven) versus 0.69 (baseline) in Optum Clinformatics.

Fig. 2 A heatmap of the area under the receiver operating characteristic curve (AUC) values across internal validation (values on the lead diagonal) and external validations of the developed prediction models. The colour scale runs from red (low discriminative ability) to green (high discriminative ability). The upper section details the performances for the data-driven model; the lower section details the same for the age and sex model. AUC area under the receiver operating characteristic curve, ccae Commercial Claims and Encounters, EHR electronic health records, mdcd Medicaid, mdcr Medicare

Of note, models externally validated in the MDCR dataset consistently outperformed the model that was developed there. This occurred for the data-driven model (internal 0.73), with the external validation of CCAE, Optum Clinformatics, and Optum EHR achieving 0.75, 0.76, and 0.74, respectively.

We assessed the calibration of the three models with the best discrimination (CCAE, Optum Clinformatics, and Optum EHR). The calibration results from these three models across the external validations are shown in Fig. 3. The models generally appear to be well calibrated.

Fig. 3 Internal and external calibration of the Optum EHR, Optum Clinformatics, and CCAE trained models. CCAE Commercial Claims and Encounters, EHR electronic health records, mdcd Medicaid, mdcr Medicare

Of the models produced, those developed in CCAE and Optum Clinformatics had the best discrimination. The CCAE model contained 195 covariates, compared with 413 for the Optum Clinformatics model, so it is preferred. The names and coefficients of the covariates in the CCAE model are available in Appendix 1 in the electronic supplementary material (ESM).

For the CCAE-developed model, demographic plots are provided in the ESM. These plots show the calibration of the model stratified by sex across age groups.

All results are available in a study application located at https://data.ohdsi.org/PredictingHFinT2DM/.

4 Discussion

This study demonstrates the use of IPEV for model development and external validation. External validation of a prediction model has traditionally lacked any contextual information on what the expected performance in the database should be. By including a baseline and data-driven model developed in each database, context can be added to the performance of a model externally validated in this database.

Recent improvements in database interoperability and standardisation of tools made it possible to utilise IPEV to develop and contextually validate models for predicting HF in T2DM. This contextual validation provides a more rigorous approach to model assessment. For example, where a model’s performance drops from training to external validation but achieves performance consistent with expectations in the external validation database, this then raises the question of what the difference is between the two databases. Similarly, if a model achieves a lower performance than expected in a new database, this can be interpreted as overfitting to training data.

The inclusion of a baseline model (using only age and sex covariates) in each training step provides context to the performance gain from increasing model complexity. By comparing the more complex model with this baseline model, a better assessment of complexity–performance trade-off can be made to analyse the potential for clinical implementation. If a large disparity in performance between these two models is observed, a parsimonious model (of around ten variables) could be created to attempt to bridge the gap between the performance of the complex model and the ease of implementation of the baseline model. The interpretation of the results is aided by the inclusion of a heatmap. This allows for easy visual inspection of performance across external validations. Once differences in performance across external validation have been demonstrated, it would be interesting to investigate the case mix of the cohorts in the database as well as the prevalence of the predictors to better understand these performance differences [25].

Considering the specific use case, the performance of the CCAE model developed in this paper suggests it could be used in treatment planning. This model had good discriminative performance that was consistent across external validations (AUC internal 0.78, external 0.75–0.79). There was a minor loss in discrimination for some of the external validations; for example, MDCR had the lowest AUC (0.75). This lower performance was in line with the internal validation of that database, and MDCR had the worst performance across all the external validations, suggesting it is a more difficult dataset in which to make predictions. One possible explanation is that the underlying case mix could make discrimination harder: patients in this database are generally older, so they may be more difficult to separate, and there is little to no overlap in the ages of patients between CCAE and MDCR. Another is that the lower number of patients might mean data are insufficient to provide a reliable estimate or to develop the optimal model. Performance in specific demographics is available in the accompanying Shiny application. The model showed reasonable calibration across internal and external validations, with some overestimation of risk for patients at higher risk. The Optum EHR external validation showed a larger miscalibration and could benefit from recalibration before implementation. When we compared the data-driven and baseline models, the baseline models achieved only moderate performance across all validations, typically 0.1–0.2 lower in AUC, demonstrating that the increase in complexity provided substantial performance gains. Age and sex alone were insufficient to accurately predict future HF, and more complex models are needed.

Calibration is important when using a model for clinical decision making, and this result highlights that our model likely requires recalibration when applied to case mixes that differ from the development database.

The model could be implemented at either the treatment facility or the health authority level. Using the previously discussed American Diabetes Association treatment guidelines, the use of a risk model to stratify patients can be impactful, and the evidence generated in this paper suggests that the CCAE-developed model could be a candidate for clinical use. If patients can be assessed on their risk of HF, their treatment can be personalised, helping to prevent medication switching or the addition of new medicines to treat HF when there are diabetes treatments with known beneficial HF effects. To our knowledge, this is the only model available in open source that can be used for this specific prediction problem.

This method is scalable and can be expanded to use more databases as they become available. An example is through the European Health Data and Evidence Network (EHDEN) project, which is currently standardizing 100 databases to the OMOP CDM. This network could be leveraged to provide context to the external validation of prediction models at an unprecedented scale. This would lead to improved models, stronger evidence, and a bigger clinical impact. When considering the case of a federated data network such as EHDEN, IPEV is particularly suitable. As privacy concerns prevent the sharing of patient-level data, a development and validation process that does not require this is necessary. IPEV incorporates ‘privacy by design’, whereby research can be performed by separate researchers at separate locations without sharing patient data. This is a major advantage as it maintains the ability to produce excellent and clinically impactful research without introducing any new privacy or security concerns. This means that the method can be used under the standard procedures of obtaining institutional review board approval, while maintaining data security and improving the quality of research without significantly burdening researchers.

A limitation of this method is that it does not use the full data available for training. There is evidence to suggest that combining data across databases can improve model performance; however, this requires researchers to share data, which conflicts with data privacy constraints. Methods such as federated learning are compatible with IPEV. A researcher particularly concerned with maximizing the performance of the developed model could combine n − 1 databases for development and test in the nth, rotating through the databases and leaving one out at a time, thereby increasing the data available for training while maintaining external validity, as sketched below.
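A hedged sketch of this leave-one-database-out variant, reusing the hypothetical `fit_model` and `evaluate_auc` helpers and the `databases` list from the earlier IPEV sketch, and assuming the databases can be pooled row-wise:

```r
# Pool n - 1 databases for development and hold out the nth for external
# validation, rotating through all databases.
loo_auc <- sapply(names(databases), function(held_out) {
  pooled <- do.call(rbind, databases[setdiff(names(databases), held_out)])
  model  <- fit_model(pooled, covariates = "all")  # develop on pooled n - 1 databases
  evaluate_auc(model, databases[[held_out]])       # externally validate on held-out database
})
```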

5 Conclusion

Using IPEV adds rigour to the model development process. The rotation of development through multiple databases provides context, allowing for a thorough analysis of performance. The inclusion of a baseline model in all modelling steps provides further context on the performance gains of increasing model complexity. In a new era of standardised data and analytics, IPEV provides a major opportunity to improve insight into, and trust in, prediction models on an unprecedented scale.