Individualised variable-interval risk-based screening for sight-threatening diabetic retinopathy: the Liverpool Risk Calculation Engine

Aims/hypothesis Individualised variable-interval risk-based screening offers better targeting and improved cost-effectiveness in screening for diabetic retinopathy. We developed a generalisable risk calculation engine (RCE) to assign personalised intervals linked to local population characteristics, and explored differences in assignment compared with current practice. Methods Data from 5 years of photographic screening and primary care for people with diabetes, screen negative at the first of > 1 episode, were combined in a purpose-built near-real-time warehouse. Covariates were selected from a dataset created using mixed qualitative/quantitative methods. Markov modelling predicted progression to screen-positive (referable diabetic retinopathy) against the local cohort history. Retinopathy grade informed baseline risk and multiple imputation dealt with missing data. Acceptable intervals (6, 12, 24 months) and risk threshold (2.5%) were established with patients and professional end users. Results Data were from 11,806 people with diabetes (46,525 episodes, 388 screen-positive). Covariates with sufficient predictive value were: duration of known disease, HbA1c, age, systolic BP and total cholesterol. Corrected AUC (95% CIs) were: 6 months 0.88 (0.83, 0.93), 12 months 0.90 (0.87, 0.93) and 24 months 0.91 (0.87, 0.94). Sensitivities/specificities for a 2.5% risk were: 6 months 0.61, 0.93, 12 months 0.67, 0.90 and 24 months 0.82, 0.81. Implementing individualised RCE-based intervals would reduce the proportion of people becoming screen-positive before the allocated screening date by > 50% and the number of episodes by 30%. Conclusions/interpretation The Liverpool RCE shows sufficient performance for a local introduction into practice before wider implementation, subject to external validation. This approach offers potential enhancements of screening in improved local applicability, targeting and cost-effectiveness. 
Electronic supplementary material The online version of this article (doi:10.1007/s00125-017-4386-0) contains peer-reviewed but unedited supplementary material, which is available to authorised users.

A stochastic process $X(t)$ satisfies the Markov property if

$$\Pr\{X(t_n) \le x_n \mid X(t),\ t \le t_{n-1}\} = \Pr\{X(t_n) \le x_n \mid X(t_{n-1})\} \qquad (1)$$

and, more generally, if $t_1 \le t_2 \le \cdots \le t_n$ then

$$\Pr\{X(t_n) \le x_n \mid X(t_1), \ldots, X(t_{n-1})\} = \Pr\{X(t_n) \le x_n \mid X(t_{n-1})\}. \qquad (2)$$

Let $P(t)$ be the transition probability matrix with entries $p_{ij}(t) = \Pr\{X(t) = j \mid X(0) = i\}$ for $i, j = 1, \ldots, k$. The model is specified in terms of transition intensities (or hazards or risks) [18,19]:

$$\lambda_{ij}(t) = \lim_{\delta t \to 0} \frac{\Pr\{X(t + \delta t) = j \mid X(t) = i\}}{\delta t}, \qquad i \ne j.$$

We also defined $\lambda_{ii}(t) = -\sum_{j \ne i} \lambda_{ij}(t)$, so we could define a $k \times k$ transition intensity matrix $Q(t)$ with entries $\lambda_{ij}(t)$ (see Equation 2 in the main manuscript). The simplest model assumes that $\lambda_{ij}(t) = \lambda_{ij}$ is independent of $t$, which implies transition intensities from an exponential distribution. In this case the process is stationary [16]: given two time points $s$ and $t$, the process depends only on the difference $t - s$.
We can consider a simple non-homogeneous Markov model [17] with a time-dependent intensity matrix of the form $Q(t) = Q_0\, g(t, \theta)$, where $Q_0$ is a fixed transition intensity matrix. In this case, $g(t, \theta)$ defines an operational time such that the process is time-homogeneous Markov. For a given $t$, let $u = \int_0^t g(s, \theta)\, \mathrm{d}s$ and define $\tilde{X}(u) = X(t)$; then $\tilde{X}(u)$ is a homogeneous Markov process with intensity matrix $Q_0$. In our model we consider Weibull transition intensities with shape parameter $\theta$, given by $g(t, \theta) = \theta t^{\theta - 1}$. The operational time in this case is $u = t^{\theta}$, which is a simple power transformation of the observed times (note that we recover the time-homogeneous model when $\theta = 1$). Following Kalbfleisch and Lawless [18], we estimate $\theta$ from the data by maximising the maximised model log-likelihood with respect to it using a simple line search. Introduction of the parameter $\theta$ increases the overall likelihood even after accounting for the increased model complexity, so it improves the overall quality of the fit. The relationship between transition intensities and transition probabilities is expressed in matrix form as [18,19]

$$P(t) = \exp(tQ),$$

which is the solution of the Kolmogorov equation $\mathrm{d}P(t) = P(t)Q\, \mathrm{d}t$.
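As a concrete illustration of the operational-time idea, the sketch below builds a small three-state intensity matrix and evaluates $P(t) = \exp(t^{\theta} Q_0)$ with a matrix exponential. The states, rates and shape parameter are illustrative assumptions, not the fitted ISDR values.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state generator Q0 (illustrative rates only, not fitted values).
# Off-diagonal entries are instantaneous transition rates; each row sums to zero.
Q0 = np.array([
    [-0.20,  0.15,  0.05],
    [ 0.10, -0.22,  0.12],
    [ 0.00,  0.00,  0.00],   # third state treated as absorbing in this toy example
])

def transition_probabilities(t, theta=1.3, Q=Q0):
    """P(t) under Weibull operational time u(t) = t**theta:
    P(t) = expm(u(t) * Q0); theta = 1 recovers the time-homogeneous model."""
    return expm((t ** theta) * Q)

P2 = transition_probabilities(2.0)
```

Because $Q_0$ is a proper generator, each row of the resulting $P(t)$ is a probability distribution over the three states.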
The equality $P(t) = \exp(tQ)$ provides estimates of the state transition probabilities. Note that the exponential operation is to be understood as a matrix operation, which is in general computationally complex. We assumed the entries of the matrix $Q$ to depend on $b$ functionally independent parameters, which might represent the baseline transitions of the model and/or regression parameters relating the instantaneous transition intensities to a set of covariates. If each individual under study has an associated covariate vector $z$ of $r$ risk factors, we assume the following form for the transition intensities of the Markov process:

$$\lambda_{ij}(z) = \exp\!\left(\beta_{ij}^{0} + \boldsymbol{\beta}_{ij}^{\mathsf{T}} z\right),$$

where $\beta_{ij}^{0}$ is the log of the baseline transition intensity (we assume the baseline corresponds to the mean of the covariates). The covariate vector can be time-dependent, representing the history of the risk factors in a time interval, or a suitable summary statistic of the history, or fixed at baseline.
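The log-linear covariate link can be sketched as follows; the baseline rate and coefficients are invented for illustration, and the covariates are assumed mean-centred so that a zero vector returns the baseline intensity.

```python
import numpy as np

# Sketch of a log-linear covariate link: lambda_ij(z) = exp(beta0 + beta' z),
# with z mean-centred so z = 0 gives the baseline intensity exp(beta0).
# All numbers here are illustrative assumptions, not fitted ISDR values.
beta0 = np.log(0.05)                  # log of the baseline transition intensity
beta = np.array([0.30, 0.01, 0.10])   # e.g. HbA1c, systolic BP, cholesterol effects

def intensity(z):
    """Instantaneous i -> j transition intensity for covariate vector z."""
    return np.exp(beta0 + beta @ z)

baseline = intensity(np.zeros(3))     # the hypothetical mean-covariate subject
```

Higher-risk covariate values (positive deviations from the mean with positive coefficients) multiply the baseline rate, which is how the model personalises progression risk.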
If we observe a random sample of $n$ individuals, with individual $i$ seen at times $t_{i,0}, t_{i,1}, \ldots, t_{i,m_i}$ in states $x_{i,0}, x_{i,1}, \ldots, x_{i,m_i}$, the likelihood is a product of transition probabilities:

$$L(Q) = \prod_{i=1}^{n} \prod_{j=1}^{m_i} p_{x_{i,j-1}\, x_{i,j}}\!\left(t_{i,j} - t_{i,j-1}\right).$$

Note that the transition probabilities in the likelihood depend (typically non-linearly) on the parameters through the Kolmogorov equation [18]. Maximisation of this likelihood function with respect to the parameters provides estimates of the entries of the transition intensity matrix.
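A minimal sketch of one individual's contribution to this panel-data likelihood, using an assumed toy generator and the observation pattern of Fig. 1 (states 2, 2, 1 seen at 1.5, 3.5 and 5 years):

```python
import numpy as np
from scipy.linalg import expm

def log_likelihood(Q, times, states):
    """Panel-data log-likelihood contribution of one individual: the sum of
    log transition probabilities p_{x_{j-1} x_j}(t_j - t_{j-1}), each entry
    taken from P(dt) = expm(dt * Q). States are 0-based indices."""
    ll = 0.0
    for j in range(1, len(times)):
        P = expm((times[j] - times[j - 1]) * Q)
        ll += np.log(P[states[j - 1], states[j]])
    return ll

# Illustrative generator (not fitted values) and the Fig. 1 pattern:
# states 2, 2, 1 (0-based: 1, 1, 0) observed at 1.5, 3.5 and 5 years.
Q = np.array([[-0.20,  0.15,  0.05],
              [ 0.10, -0.22,  0.12],
              [ 0.00,  0.00,  0.00]])
ll = log_likelihood(Q, [1.5, 3.5, 5.0], [1, 1, 0])
```

Only the states at the observation times enter the product; the exact transition times are never observed, which is precisely the interval censoring the matrix exponential handles.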
It should be noted that, in general, information on an individual's passage through the disease states will not be complete: we know an individual's status only at several points in time, as illustrated in Fig. 1 [19]. The model was fitted using the msm package in R, with additional code written by AE to interface the package with the multiple imputation functions and the covariate selection procedures (see next sections).

Imputation in the model fitting
Imputing missing values and then performing an ordinary analysis as if the imputed values were real measurements is usually better than excluding subjects with incomplete data. Problems with case-wise deletion of subjects with missing values include a reduction in sample size, an increase in the true standard error of parameter estimates, and biased parameter estimates if the data are not missing completely at random. Multiple imputation avoids these problems; however, methods for properly accounting for incomplete data can be complex.
Multiple imputation uses random draws from the conditional distribution of the target variable, given the other variables (e.g. we draw a value of HbA1c conditional on cholesterol, age at diagnosis, sex, etc.). Note that causal chains are not relevant, so we can use variables measured "in the future" (from either the same or a different point in a subject's history). Conditioning on outcomes is also possible in multiple imputation procedures, and it helps reduce the bias in parameter estimates; in our model, we conditioned on both screening times and states. To properly account for the variability due to unknown values, the imputation is repeated $M$ times, where $M \ge 3$ (for our model, $M = 10$). Each repetition results in a "completed" dataset that is analysed using the standard method. Parameter estimates are averaged over these multiple imputations to obtain better estimates than those from single imputation. The variance-covariance matrix of the averaged parameter estimates, adjusted for variability due to imputation, is estimated as [21,22]

$$T = \frac{1}{M} \sum_{i=1}^{M} V_i + \frac{M + 1}{M} B,$$

where $V_i$ is the ordinary complete-data estimate of the variance-covariance matrix for the model parameters from the $i$th imputation, and $B$ is the between-imputation sample covariance matrix, the diagonal entries of which are the ordinary sample variances of the $M$ parameter estimates.
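Rubin's combining rules described above can be sketched as follows; the parameter estimates and within-imputation covariances are made-up numbers standing in for the $M = 10$ model fits.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine M imputation-specific fits by Rubin's rules: average the
    estimates, then add the between-imputation covariance B (inflated by
    (M + 1) / M) to the mean within-imputation covariance."""
    est = np.asarray(estimates)            # shape (M, p)
    M = est.shape[0]
    pooled = est.mean(axis=0)
    v_bar = np.mean(np.asarray(variances), axis=0)
    centred = est - pooled
    B = centred.T @ centred / (M - 1)      # between-imputation covariance
    total = v_bar + (M + 1) / M * B
    return pooled, total

# Toy check with M = 10 imputations of a 2-parameter model (made-up numbers).
rng = np.random.default_rng(1)
ests = rng.normal([0.5, -0.2], 0.05, size=(10, 2))
vs = np.stack([np.eye(2) * 0.01] * 10)
pooled, total = pool_rubin(ests, vs)
```

The total variance can never fall below the average within-imputation variance: disagreement between imputations only ever inflates it, which is how the procedure propagates missing-data uncertainty into the standard errors.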
We used the aregImpute multiple imputation algorithm, implemented in the aregImpute function of the Hmisc package for R. In the following we give a necessarily brief description of the algorithm; for further details, see the documentation of the function and the previously referenced sources. aregImpute takes all aspects of uncertainty into account by using the bootstrap to approximate the drawing of predicted values, with a different bootstrap sample for each of the multiple imputations. aregImpute applies weighted predictive mean matching, so that no distributional assumptions are required. We used van Buuren's "Type 1" matching [21] to capture the right amount of uncertainty: predicted values for missing entries are computed using a regression fitted on the bootstrap sample, and donor observations are found by matching those predictions to the predictions for potential donors obtained from the regression fitted on the original sample of complete observations. When a predictor of the target variable is itself missing, it is first imputed from its last imputation when it was a target variable. A donor is defined as a complete observation whose predicted target is closest to the predicted value of the target among all complete observations.
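A stripped-down sketch of "Type 1" predictive mean matching for a single target variable, assuming a simple linear regression; aregImpute's weighting, flexible transformations and repeated imputations are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def pmm_impute(x_obs, y_obs, x_mis):
    """Toy 'Type 1' predictive mean matching: fit a line on a bootstrap
    sample of the complete cases, predict the missing targets with that
    fit, predict the donors with the fit from the original complete cases,
    and copy each closest donor's observed value."""
    n = len(x_obs)
    boot = rng.integers(0, n, n)
    cb = np.polyfit(x_obs[boot], y_obs[boot], 1)   # bootstrap-sample fit
    c0 = np.polyfit(x_obs, y_obs, 1)               # original-sample fit
    pred_mis = np.polyval(cb, x_mis)
    pred_don = np.polyval(c0, x_obs)
    donors = np.abs(pred_mis[:, None] - pred_don[None, :]).argmin(axis=1)
    return y_obs[donors]

x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.1, size=50)
imputed = pmm_impute(x, y, np.array([-1.0, 0.0, 1.0]))
```

Because every imputed value is copied from an actually observed donor, the method makes no distributional assumption about the target variable, and the bootstrap fit injects the parameter uncertainty that single-fit matching would miss.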

Covariate selection
Estimation and model selection were combined under a unified framework achieved by minimisation of

$$\Delta_i = \mathrm{AICc}_i - \min \mathrm{AICc},$$

where $\min \mathrm{AICc}$ is the minimum of the different AICc values. $\Delta_i$ was then interpreted as the information loss experienced if we used fitted model $g_i$ rather than the best fitting model. The covariates retained in the model were the ones that achieved a total rescaled AICc of 0.
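The rescaled criterion can be sketched as follows; the candidate log-likelihoods and parameter counts are invented for illustration, with $n$ set to the cohort size.

```python
import numpy as np

def aicc(log_lik, k, n):
    """Small-sample corrected Akaike information criterion:
    AIC + 2k(k + 1) / (n - k - 1)."""
    return 2 * k - 2 * log_lik + 2 * k * (k + 1) / (n - k - 1)

def deltas(aicc_values):
    """Rescaled AICc: information loss of each candidate model relative to
    the best fitting model, which scores delta = 0."""
    a = np.asarray(aicc_values, dtype=float)
    return a - a.min()

# Hypothetical candidate models (log-likelihood, no. of parameters); n = 11806.
models = [(-1500.0, 5), (-1498.5, 6), (-1499.8, 7)]
d = deltas([aicc(ll, k, 11806) for ll, k in models])
```

The second candidate gains enough likelihood to justify its extra parameter, so it sets the zero point; the others carry positive information loss.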

Model checking
Note that, because of the multiple imputation process, both bootstrapping and 4-fold cross-validation were applied to the "median" dataset (i.e. the dataset obtained by replacing all missing covariates with their medians over the ten imputation sets).
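The median-dataset construction can be sketched as follows; the array shapes and values are illustrative assumptions. Observed entries are identical across the completed datasets, so the element-wise median changes only the imputed cells.

```python
import numpy as np

def median_dataset(completed):
    """Element-wise median over M completed datasets.

    completed: array of shape (M, n, p); observed entries are identical in
    every completed dataset, so only imputed entries are affected."""
    return np.median(np.asarray(completed), axis=0)

# Toy example: one covariate entry is missing and receives three draws.
base = np.array([[1.0, 2.0],
                 [3.0, np.nan]])
completed = np.stack([base, base, base])
completed[:, 1, 1] = [4.0, 5.0, 9.0]   # three imputations of the missing cell
median_ds = median_dataset(completed)
```

This yields a single completed dataset on which resampling-based model checks can be run once, rather than separately for each imputation.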

Covariate selection and risk model
Covariates were centred at their means, so the baseline transitions refer to a hypothetical subject with mean covariate values: age at diagnosis, 61 years; time since diagnosis of diabetes, 6 years; HbA1c, 7.2% (55.5 mmol/mol); total cholesterol, 4.3 mmol/l; systolic blood pressure, 132 mmHg. As described above, missing entries were imputed using multiple imputation techniques. All covariates (including those that did not enter the final model) and the outcomes (screening times and associated states) were used in the imputation process. The numbers of missing entries requiring multiple imputation for the covariates used in the model fitting (see Table 1) were: disease duration, 994; HbA1c, 6311; systolic BP, 4651; total cholesterol, 7615; diastolic BP, 4643; eGFR, 8271; HDL cholesterol, 8303. The two baseline transition intensities to the screen-positive state (the rightmost column of the intensity matrix Q; see Equation (2)) are shown in ESM Figure 3.

Fig. 1 Example of panel (interval-censored) data as applied to screening data in the ISDR dataset. The process in this example is observed on three occasions, at times 1.5, 3.5 and 5 years, in states 2, 2 and 1, respectively. No transition is observed exactly (and state 3 is not observed at all).

Fig. 3 Transition intensity baselines to the screen-positive state from the non-deferable retinopathy, one eye (top) and non-deferable retinopathy, two eyes (bottom) states.