Abstract
The Net Reclassification Improvement (NRI) has become a popular metric for evaluating improvement in disease prediction models over the past years. The concept is relatively straightforward, but its usage and interpretation have varied across studies. Because no thresholds exist for judging the degree of improvement, many studies have relied solely on the statistical significance of the NRI estimate. However, recent studies recommend that statistical testing with the NRI should be avoided. We propose using confidence ellipses around the estimated values of the event and non-event NRIs, which might provide the best measure of variability around the point estimates. Our developments are illustrated using practical examples from the EPIC-Potsdam study.
Background
Risk prediction models have become a major focus of epidemiological research in recent years. Although a large number of prediction models exist, some of which have already been integrated into treatment strategies or health promotion programs, there is an ongoing effort to improve prediction models by adding new risk markers. For the evaluation of such model extensions, Pencina et al. proposed the Net Reclassification Improvement (NRI) in 2008 as a complement to the evaluation of discrimination, e.g. by comparing receiver operating characteristic curves [1]. The NRI is based on the proportions of correctly and incorrectly reclassified cases and non-cases when individuals are assigned to a priori defined risk categories according to their predicted risk under two nested models. Since its publication, the NRI has been used in a growing number of studies; however, its use, presentation, and interpretation are highly heterogeneous [2, 3]. Uncertainty remains especially with regard to testing the statistical significance of NRI estimates. Pencina et al. [4] noted that even small NRI values (< 0.01) may produce statistically significant p values, and Pepe et al. [5] point out that valid methods of inference for the NRI do not exist. In a recent review of NRI measures, Kerr et al. [2] raise concerns about the proposed test statistic and variance formula. This suggests that statistical testing should be avoided for the NRI. Confidence intervals, however, convey the precision of the estimates and are preferable, not only for the overall NRI but also for its components. The NRI components do not reflect an overall improvement but rather the improvement among cases and among non-cases separately. Therefore, our aim was to introduce a method for calculating a confidence ellipse around the two components of the NRI that reflects the precision of the estimates and can help interpret the magnitude and variability of the observed effects.
Definition of NRI
Extending a prediction model with additional risk factors usually changes the predicted risk of individual study participants. When predefined risk categories are used, this is reflected by upward and downward movements across these categories from the reference to the extended model. This reclassification is used to calculate the NRI, which considers the proportions of upward and downward movements separately for cases and non-cases [1]:
\[ \mathrm{NRI} = \left[ P(\text{up} \mid \text{case}) - P(\text{down} \mid \text{case}) \right] + \left[ P(\text{down} \mid \text{non-case}) - P(\text{up} \mid \text{non-case}) \right] = \mathrm{NRI}_{cases} + \mathrm{NRI}_{non\text{-}cases}. \tag{1} \]
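The corresponding standard errors of the NRI and its components were given by Pencina et al. [4]. Because the two components are estimated from disjoint groups, their variances add:
\[ \widehat{SE}(\mathrm{NRI}) = \sqrt{\widehat{SE}(\mathrm{NRI}_{cases})^{2} + \widehat{SE}(\mathrm{NRI}_{non\text{-}cases})^{2}}, \]
so the standard error of the overall NRI is dominated by the case component, since cases usually form a much smaller group.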
As such, the overall NRI is the sum of its two components, \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\), and a positive value may reflect improvement among cases, among non-cases, or both. Consequently, the overall measure by itself does not show among which group the improvement occurs. The predicted absolute risks are derived from regression models, typically logistic regression or Cox regression with the disease as the outcome variable.
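Definition (1) translates directly into arithmetic on reclassification counts. The following Python sketch is illustrative only: the `Reclassification` container, the `nri` function and the counts in the example are hypothetical and are not taken from the EPIC-Potsdam data.

```python
from dataclasses import dataclass


@dataclass
class Reclassification:
    """Movements across predefined risk categories (reference -> extended model).

    Hypothetical container for illustration; field names are not from the paper.
    """
    up_cases: int
    down_cases: int
    n_cases: int
    up_noncases: int
    down_noncases: int
    n_noncases: int


def nri(r: Reclassification) -> tuple:
    """Return (NRI_cases, NRI_non-cases, overall NRI)."""
    # Among cases, moving up is a correct reclassification, moving down an incorrect one.
    nri_cases = (r.up_cases - r.down_cases) / r.n_cases
    # Among non-cases the direction is reversed.
    nri_noncases = (r.down_noncases - r.up_noncases) / r.n_noncases
    # The overall NRI is the sum of the two components.
    return nri_cases, nri_noncases, nri_cases + nri_noncases


if __name__ == "__main__":
    # Hypothetical counts: 100 cases, 1000 non-cases.
    example = Reclassification(30, 10, 100, 50, 150, 1000)
    print(nri(example))  # approximately (0.2, 0.1, 0.3)
```

The two components are returned alongside their sum so that improvement among cases and among non-cases can be inspected separately, as argued above.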
Confidence ellipse for two components of NRI
Pencina et al. already suggested reporting CIs for the NRI and used the bootstrap method to construct them [4]. CIs are informative not only for the overall NRI but also for its components. As an alternative to the bootstrap, CIs can be calculated with a formula related to the construction of CIs for independent proportions according to Agresti [6]; this approach is applied in the following. Since the standard errors of the overall NRI and its components were defined above, the CIs are given by
\[ \mathrm{NRI} \pm z_{1 - \frac{\alpha}{2}} \cdot \widehat{SE}(\mathrm{NRI}), \]
with \(z_{1 - \frac{\alpha}{2}}\) as the \(\left(1 - \frac{\alpha}{2}\right)\)-quantile of the standard normal distribution. The CIs for \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\) can be calculated in the same way. Although the CIs of the two components can be interpreted individually, they again do not allow an easy interpretation in terms of the overall improvement. To overcome this problem, we propose a confidence ellipse, which allows \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\) to be evaluated in combination.
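A minimal sketch of such Wald-type intervals follows. It assumes that movements within each group (up, down, unchanged) follow a multinomial model, so that the variance of \(\hat{p}_{up} - \hat{p}_{down}\) is \((p_{up} + p_{down} - (p_{up} - p_{down})^2)/n\); this is one standard choice and may differ in detail from the formula used in [4]. The function names and input values are hypothetical.

```python
import math
from statistics import NormalDist


def component_se(p_up: float, p_down: float, n: int) -> float:
    """Standard error of p_up - p_down estimated within one group of size n.

    Assumes a multinomial model (up / down / unchanged) within the group:
    Var(p_up - p_down) = (p_up + p_down - (p_up - p_down)**2) / n.
    """
    return math.sqrt((p_up + p_down - (p_up - p_down) ** 2) / n)


def wald_ci(estimate: float, se: float, alpha: float = 0.05) -> tuple:
    """Normal-approximation interval: estimate +/- z_{1 - alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    # Hypothetical proportions, not the EPIC-Potsdam data.
    se_cases = component_se(p_up=0.30, p_down=0.10, n=100)  # about 0.06
    se_noncases = component_se(p_up=0.05, p_down=0.15, n=1000)
    # Cases and non-cases are independent samples, so the variances add.
    se_total = math.sqrt(se_cases ** 2 + se_noncases ** 2)
    print(wald_ci(0.20, se_cases))   # CI for NRI_cases
    print(wald_ci(0.30, se_total))   # CI for the overall NRI
```

With the large non-case group, `se_noncases` is much smaller than `se_cases`, which mirrors the observation that the case component dominates the overall standard error.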
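We introduce the following notation: let \(\theta = (\theta_{1}, \theta_{2})\) be the parameter consisting of the NRI components, i.e.
\[ \theta_{1} = \mathrm{NRI}_{cases} = P(\text{up} \mid \text{case}) - P(\text{down} \mid \text{case}), \qquad \theta_{2} = \mathrm{NRI}_{non\text{-}cases} = P(\text{down} \mid \text{non-case}) - P(\text{up} \mid \text{non-case}). \]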
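We define the following probabilities
\[ p_{1} = P(\text{up}, \text{case}), \quad p_{2} = P(\text{down}, \text{case}), \quad p_{3} = P(\text{up}, \text{non-case}), \quad p_{4} = P(\text{down}, \text{non-case}), \quad p_{5} = P(\text{case}) \]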
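and can write \(\theta\) as a function of these probabilities:
\[ \theta = g(\varvec{p}) = \left( \frac{p_{1} - p_{2}}{p_{5}}, \; \frac{p_{4} - p_{3}}{1 - p_{5}} \right). \]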
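Consequently, the maximum likelihood estimates of \(\theta_{1}\) and \(\theta_{2}\) are obtained from the relative frequencies \(\hat{p}_{j} = v_{j} /N\) (with \(N = N_{cases} + N_{{non{ - }cases}}\)) as follows:
\[ \hat{\theta}_{1} = \frac{\hat{p}_{1} - \hat{p}_{2}}{\hat{p}_{5}} = \frac{v_{1} - v_{2}}{v_{5}}, \qquad \hat{\theta}_{2} = \frac{\hat{p}_{4} - \hat{p}_{3}}{1 - \hat{p}_{5}} = \frac{v_{4} - v_{3}}{N - v_{5}}, \]
with the counts \(v_{j}\) defined in the following reclassification table: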
| | Up | Down | Total |
|---|---|---|---|
| Case | \(v_{1}\) | \(v_{2}\) | \(v_{5}\) |
| Non-case | \(v_{3}\) | \(v_{4}\) | \(N - v_{5}\) |
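The estimators can be checked against the direct conditional proportions with a few lines of Python. The function name `theta_hat` and the counts below are hypothetical; `v5` is the total number of cases, as in the table.

```python
def theta_hat(v1: int, v2: int, v3: int, v4: int, v5: int, n: int) -> tuple:
    """Estimate (theta_1, theta_2) = (NRI_cases, NRI_non-cases) from table counts.

    v1, v2: cases moving up / down; v3, v4: non-cases moving up / down;
    v5: number of cases; n: total sample size N.
    """
    # Joint relative frequencies p_hat_j = v_j / N.
    p1, p2, p3, p4, p5 = (v / n for v in (v1, v2, v3, v4, v5))
    theta1 = (p1 - p2) / p5        # equals (v1 - v2) / v5
    theta2 = (p4 - p3) / (1 - p5)  # equals (v4 - v3) / (n - v5)
    return theta1, theta2


if __name__ == "__main__":
    # Hypothetical counts: 100 cases and 1000 non-cases (N = 1100).
    t1, t2 = theta_hat(v1=30, v2=10, v3=50, v4=150, v5=100, n=1100)
    # The joint-frequency form agrees with the direct conditional form.
    assert abs(t1 - (30 - 10) / 100) < 1e-12
    assert abs(t2 - (150 - 50) / (1100 - 100)) < 1e-12
    print(t1, t2)
```

Working with joint frequencies over the whole sample, rather than within-group proportions, is what allows the multivariate central limit theorem to be applied to a single vector \(\hat{\varvec{p}}\).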
Applying the multivariate central limit theorem to the vector of relative frequencies \(\hat{\varvec{p}} = (\hat{p}_{1} ,\hat{p}_{2} , \hat{p}_{3} , \hat{p}_{4} , \hat{p}_{5} )^{T}\), we obtain that, for a large sample size \(N\), the distribution of \(\sqrt N (\hat{\varvec{p}} - \varvec{p})\) can be approximated by a five-dimensional normal distribution, i.e.,
\[ \sqrt{N} (\hat{\varvec{p}} - \varvec{p}) \xrightarrow{d} \mathcal{N}_{5}\left( \varvec{0}, A(\varvec{p}) \right). \tag{2} \]
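Here \(A(\varvec{p}) = (a_{jk})_{j,k = 1, \ldots, 5}\) is the covariance matrix of the limit distribution. It depends on the underlying probabilities \(p_{j}\). Since the events counted by \(\hat{p}_{1}, \ldots, \hat{p}_{4}\) are mutually exclusive, and the events counted by \(\hat{p}_{1}\) and \(\hat{p}_{2}\) are contained in the event "case" counted by \(\hat{p}_{5}\) while those counted by \(\hat{p}_{3}\) and \(\hat{p}_{4}\) are disjoint from it, its entries are
\[ a_{jk} = \begin{cases} p_{j}(1 - p_{j}), & j = k, \\ -p_{j} p_{k}, & j \ne k, \; j, k \le 4, \\ p_{m}(1 - p_{5}), & \max(j,k) = 5, \; m = \min(j,k) \in \{1, 2\}, \\ -p_{m} p_{5}, & \max(j,k) = 5, \; m = \min(j,k) \in \{3, 4\}. \end{cases} \]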
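With the help of the delta method, we can derive the asymptotic distribution of \(\hat{\theta } = (\hat{\theta }_{1} ,\hat{\theta }_{2} )\) from (2), using \(\hat{\theta } = g(\hat{\varvec{p}})\). The asymptotic covariance matrix of \(\hat{\theta }\) is obtained by multiplying \(A(\varvec{p})\) on the left by the Jacobian matrix \(J_{g}(\varvec{p})\) of partial derivatives of \(g\) and on the right by its transpose, \(W(\varvec{p}) = J_{g}(\varvec{p}) A(\varvec{p}) J_{g}(\varvec{p})^{T}\). This leads to
\[ \sqrt{N} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}_{2}\left( \varvec{0}, W(\varvec{p}) \right) \tag{3} \]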
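with \(W(\varvec{p}) = \begin{pmatrix} w_{1} & 0 \\ 0 & w_{2} \end{pmatrix}\) and
\[ w_{1} = \frac{p_{1} + p_{2}}{p_{5}^{2}} - \frac{(p_{1} - p_{2})^{2}}{p_{5}^{3}}, \qquad w_{2} = \frac{p_{3} + p_{4}}{(1 - p_{5})^{2}} - \frac{(p_{4} - p_{3})^{2}}{(1 - p_{5})^{3}}. \]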
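The asymptotic normality of \(\hat{\theta }\) implies that
\[ N (\hat{\theta} - \theta)^{T} W^{-1}(\hat{\varvec{p}}) (\hat{\theta} - \theta) \xrightarrow{d} \chi_{2}^{2}, \]
where \(\chi_{2}^{2}\) denotes the chi-squared distribution with two degrees of freedom and \(W^{-1}(\hat{\varvec{p}})\) is the inverse of the matrix \(W\) evaluated at the estimate \(\hat{\varvec{p}}\).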
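The delta-method step can be verified numerically. The sketch below assumes hypothetical values for \(\varvec{p}\). It builds \(A(\varvec{p})\) as the covariance matrix of the five indicator variables whose means are \(p_1, \ldots, p_5\), forms \(J_g A J_g^{T}\) with the Jacobian \(J_g\) of \(g\), and compares the result with the closed forms \(w_1 = (p_1+p_2)/p_5^2 - (p_1-p_2)^2/p_5^3\) and its non-case analogue. Those closed forms are derived here from the definitions above, so the check tests the consistency of that derivation rather than reproducing published code.

```python
def covariance_A(p):
    """Covariance matrix A(p) of the indicator vector behind p_hat_1..p_hat_5.

    Built from the six mutually exclusive outcomes (case/non-case x up/down/unchanged).
    """
    p1, p2, p3, p4, p5 = p
    outcomes = [
        (p1, (1, 0, 0, 0, 1)),                # case, moved up
        (p2, (0, 1, 0, 0, 1)),                # case, moved down
        (p5 - p1 - p2, (0, 0, 0, 0, 1)),      # case, unchanged
        (p3, (0, 0, 1, 0, 0)),                # non-case, moved up
        (p4, (0, 0, 0, 1, 0)),                # non-case, moved down
        (1 - p5 - p3 - p4, (0, 0, 0, 0, 0)),  # non-case, unchanged
    ]
    # A = E[X X^T] - p p^T
    return [[sum(q * x[j] * x[k] for q, x in outcomes) - p[j] * p[k]
             for k in range(5)] for j in range(5)]


def jacobian_g(p):
    """Partial derivatives of g(p) = ((p1 - p2)/p5, (p4 - p3)/(1 - p5))."""
    p1, p2, p3, p4, p5 = p
    return [
        [1 / p5, -1 / p5, 0.0, 0.0, -(p1 - p2) / p5 ** 2],
        [0.0, 0.0, -1 / (1 - p5), 1 / (1 - p5), (p4 - p3) / (1 - p5) ** 2],
    ]


def delta_method_W(p):
    """Asymptotic covariance W(p) = J A J^T as a 2 x 2 nested list."""
    a, jac = covariance_A(p), jacobian_g(p)
    return [[sum(jac[r][j] * a[j][k] * jac[s][k] for j in range(5) for k in range(5))
             for s in range(2)] for r in range(2)]


def w_closed_form(p):
    """Closed-form diagonal entries w1, w2 of W(p)."""
    p1, p2, p3, p4, p5 = p
    w1 = (p1 + p2) / p5 ** 2 - (p1 - p2) ** 2 / p5 ** 3
    w2 = (p3 + p4) / (1 - p5) ** 2 - (p4 - p3) ** 2 / (1 - p5) ** 3
    return w1, w2


if __name__ == "__main__":
    p = (0.03, 0.01, 0.05, 0.12, 0.10)  # hypothetical probabilities
    print(delta_method_W(p))  # off-diagonal entries are numerically zero
    print(w_closed_form(p))
```

The vanishing off-diagonal entries confirm that the estimates of the two components are asymptotically uncorrelated, which is what makes \(W(\varvec{p})\) diagonal.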
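Because \(W(\varvec{p})\) is diagonal, the asymptotic result (3) yields the following \(\left( {1 - \alpha } \right)\) confidence ellipse for \(\theta\):
\[ \left\{ \theta : N \left[ \frac{(\hat{\theta}_{1} - \theta_{1})^{2}}{w_{1}(\hat{\varvec{p}})} + \frac{(\hat{\theta}_{2} - \theta_{2})^{2}}{w_{2}(\hat{\varvec{p}})} \right] \le \chi_{2;1-\alpha}^{2} \right\}, \]
where \(\chi_{2;1-\alpha}^{2}\) is the \((1-\alpha)\)-quantile of the \(\chi_{2}^{2}\) distribution.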
Determining the confidence ellipse allows the precision of the NRI estimates for cases and non-cases to be assessed simultaneously.
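Checking whether a candidate pair \((\theta_1, \theta_2)\) lies inside the ellipse, or tracing its boundary for a plot, only requires \(\hat{\theta}\), the diagonal entries \(w_1(\hat{\varvec{p}})\) and \(w_2(\hat{\varvec{p}})\), and \(N\). The sketch below relies on the fact that the chi-squared distribution with two degrees of freedom has CDF \(1 - e^{-x/2}\), so its quantile is \(-2\ln(1 - \text{level})\). The function names and example values are hypothetical.

```python
import math


def chi2_2df_quantile(level: float) -> float:
    """Quantile of chi-squared with 2 df; its CDF is 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - level)


def ellipse_statistic(theta, theta_hat, w, n):
    """N * sum_i (theta_hat_i - theta_i)^2 / w_i, using the diagonal W(p_hat)."""
    return n * sum((th - t) ** 2 / wi for th, t, wi in zip(theta_hat, theta, w))


def in_ellipse(theta, theta_hat, w, n, level=0.95):
    """True if theta lies inside (or on) the confidence ellipse."""
    return ellipse_statistic(theta, theta_hat, w, n) <= chi2_2df_quantile(level)


def ellipse_boundary(theta_hat, w, n, level=0.95, k=200):
    """k points on the ellipse boundary, e.g. for plotting.

    Because W is diagonal, the ellipse is axis-aligned with
    semi-axes sqrt(chi2_quantile * w_i / N).
    """
    c = chi2_2df_quantile(level)
    a = math.sqrt(c * w[0] / n)
    b = math.sqrt(c * w[1] / n)
    return [(theta_hat[0] + a * math.cos(2 * math.pi * i / k),
             theta_hat[1] + b * math.sin(2 * math.pi * i / k)) for i in range(k)]


if __name__ == "__main__":
    # Hypothetical inputs (not from the paper).
    theta_hat, w, n = (0.2, 0.1), (3.6, 0.2), 1100
    print(round(chi2_2df_quantile(0.95), 4))        # 5.9915
    print(in_ellipse((0.2, 0.1), theta_hat, w, n))  # True: the estimate itself
    print(in_ellipse((0.0, 0.0), theta_hat, w, n))  # False
```

The axis-aligned parametrization is valid only because the off-diagonal entries of \(W\) vanish; with correlated components, a rotated ellipse would be needed.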
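Using the previous notation, the relationships \(\hat{p}_{up, cases} = \hat{p}_{1} /\hat{p}_{5}\), \(\hat{p}_{down, cases} = \hat{p}_{2} /\hat{p}_{5}\), \(\hat{p}_{{up, non{ - }cases}} = \hat{p}_{3} /(1 - \hat{p}_{5})\) and \(\hat{p}_{{down, non{ - }cases}} = \hat{p}_{4} /(1 - \hat{p}_{5})\), and \(N_{cases} = N \hat{p}_{5}\), the confidence ellipse can equivalently be written in terms of proportions within cases and within non-cases:
\[ \frac{N_{cases} (\hat{\theta}_{1} - \theta_{1})^{2}}{\hat{p}_{up, cases} + \hat{p}_{down, cases} - \hat{\theta}_{1}^{2}} + \frac{N_{non\text{-}cases} (\hat{\theta}_{2} - \theta_{2})^{2}}{\hat{p}_{up, non\text{-}cases} + \hat{p}_{down, non\text{-}cases} - \hat{\theta}_{2}^{2}} \le \chi_{2;1-\alpha}^{2}. \]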
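The two forms of the ellipse are algebraically identical, because \(N\hat{p}_5 = N_{cases}\) and \(w_1(\hat{\varvec{p}}) = (\hat{p}_{up,cases} + \hat{p}_{down,cases} - \hat{\theta}_1^2)/\hat{p}_5\), and analogously for non-cases. A short numerical check with hypothetical counts follows; it assumes the delta-method closed forms of \(w_1\) and \(w_2\), and the function names are illustrative.

```python
def stat_joint(theta, v, n):
    """Ellipse statistic via joint relative frequencies p_hat_j = v_j / N."""
    p1, p2, p3, p4, p5 = (x / n for x in v)
    t1, t2 = (p1 - p2) / p5, (p4 - p3) / (1 - p5)
    w1 = (p1 + p2) / p5 ** 2 - (p1 - p2) ** 2 / p5 ** 3
    w2 = (p3 + p4) / (1 - p5) ** 2 - (p4 - p3) ** 2 / (1 - p5) ** 3
    return n * ((t1 - theta[0]) ** 2 / w1 + (t2 - theta[1]) ** 2 / w2)


def stat_conditional(theta, v, n):
    """Same statistic via proportions within cases and within non-cases."""
    v1, v2, v3, v4, v5 = v
    n_cases, n_noncases = v5, n - v5
    up_c, down_c = v1 / n_cases, v2 / n_cases
    up_nc, down_nc = v3 / n_noncases, v4 / n_noncases
    t1, t2 = up_c - down_c, down_nc - up_nc
    return (n_cases * (t1 - theta[0]) ** 2 / (up_c + down_c - t1 ** 2)
            + n_noncases * (t2 - theta[1]) ** 2 / (up_nc + down_nc - t2 ** 2))


if __name__ == "__main__":
    v, n = (30, 10, 50, 150, 100), 1100  # hypothetical counts (v1..v5, N)
    point = (0.05, 0.08)                 # hypothetical candidate (theta_1, theta_2)
    print(stat_joint(point, v, n), stat_conditional(point, v, n))
```

The conditional form is often more convenient in practice, since it only needs the reclassification proportions that are usually reported separately for cases and non-cases.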
Empirical data
Study population
The European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study is a prospective cohort study that initially included 27,548 participants aged 35–65 years. Details of the recruitment and follow-up procedures have been described previously [7, 8]. Briefly, within a median follow-up time of 7 years, 849 of the 25,167 participants who were free of diabetes at baseline developed incident diabetes. On this basis, the German diabetes risk score (GDRS) was developed using Cox regression [9]. With the GDRS, the 5-year risk of developing type 2 diabetes can be calculated using information on lifestyle and anthropometric factors, diet and physical activity. It serves as the reference model in the present model comparison. We used data from 21,846 participants (727 cases) who also had information on family history of diabetes available. The extended model additionally included family history and was compared with the reference model. Table 1 shows the reclassification of cases and non-cases due to model extension based on 5 predefined risk categories.
Calculation of confidence intervals and confidence ellipses
Based on the asymptotic method, we determined 95 % CIs for \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\) (Fig. 1). Given the large number of non-cases, the estimate of \(\mathrm{NRI}_{non\text{-}cases}\) was, as expected, much more precise than that of \(\mathrm{NRI}_{cases}\). However, CIs for the single components do not allow both components to be evaluated in combination.
Therefore, we computed a confidence ellipse for \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\) to reflect the precision of both estimates jointly and to identify the region of values that are compatible with the data when both components are considered together. Figure 1 shows the CIs for the single components (vertical and horizontal lines) as well as the confidence ellipse; both were based on the five risk categories described above. When the CIs are constructed separately, \(\mathrm{NRI}_{cases}\) (0.0619) has a 95 % CI of 0.0219–0.1019, so the value 0.02 lies outside this interval, whereas \(\mathrm{NRI}_{non\text{-}cases}\) (0.0379) has a 95 % CI of 0.0318–0.0440, which includes the value 0.035. Using the two CIs separately would therefore lead to the conclusion that \(\mathrm{NRI}_{cases}\) is significantly higher than 0.02, while \(\mathrm{NRI}_{non\text{-}cases}\) is not significantly different from 0.035. However, the point (0.02, 0.035) lies inside the confidence ellipse. Thus, when both components are evaluated together, neither does \(\mathrm{NRI}_{cases}\) differ from 0.02 nor does \(\mathrm{NRI}_{non\text{-}cases}\) differ from 0.035. This example illustrates that evaluating the NRI components separately can lead to different decisions than evaluating them in combination by means of a confidence ellipse.
These results were based on the asymptotic method for both the calculation of CIs and of the confidence ellipse.
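The joint check in this example can be approximately reproduced from the rounded values reported above. The sketch assumes that the reported 95 % CIs are symmetric Wald intervals, backs out the standard errors from their half-widths, and uses the diagonal covariance structure, so that the ellipse statistic is the sum of the two squared standardized distances. Because the inputs are rounded to four decimals, the result is only approximate.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)      # about 1.96
CHI2_2_95 = -2.0 * math.log(0.05)    # 95% quantile of chi-squared with 2 df


def se_from_ci(lower, upper):
    """Back out a standard error from a symmetric 95% Wald interval (assumption)."""
    return (upper - lower) / (2 * Z)


# Reported estimates and 95% CIs (rounded to four decimals in the text).
nri_cases, ci_cases = 0.0619, (0.0219, 0.1019)
nri_noncases, ci_noncases = 0.0379, (0.0318, 0.0440)
se_cases = se_from_ci(*ci_cases)
se_noncases = se_from_ci(*ci_noncases)

point = (0.02, 0.035)  # the candidate values discussed above

# Marginal checks against each 95% CI separately.
outside_cases_ci = not (ci_cases[0] <= point[0] <= ci_cases[1])
inside_noncases_ci = ci_noncases[0] <= point[1] <= ci_noncases[1]

# Joint check: sum of squared standardized distances vs. the chi-squared quantile.
stat = (((nri_cases - point[0]) / se_cases) ** 2
        + ((nri_noncases - point[1]) / se_noncases) ** 2)

print(outside_cases_ci, inside_noncases_ci)  # True True
print(round(stat, 2), stat <= CHI2_2_95)     # 5.08 True
```

The statistic of roughly 5.08 stays below the 95 % quantile of about 5.99, in line with the conclusion that (0.02, 0.035) lies inside the ellipse even though 0.02 falls outside the marginal CI for \(\mathrm{NRI}_{cases}\).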
Discussion
The NRI is informative for evaluating improvements of prediction models, provided that the obvious limitations associated with the use of categories and cut-offs are taken into account. Because no established cut-offs exist that would allow an NRI value to be interpreted as meaningful from a clinical or public health point of view, many reclassification analyses have relied solely on significance testing.
As recommended in a recent review of NRI methods [3], it is preferable to investigate model improvement separately for cases and non-cases. A general framework for testing the two components of the overall NRI, \(\mathrm{NRI}_{cases}\) and \(\mathrm{NRI}_{non\text{-}cases}\), has previously been laid out by Pencina et al. [1]. However, a major drawback of examining single components in isolation is that the results cannot be interpreted in terms of the overall model improvement. We note that recent recommendations suggest not applying statistical testing at all [2, 3]. In line with these recommendations, our developments rely on confidence intervals. The confidence ellipse is particularly appealing because it reflects the two-dimensional nature of the problem. Our empirical example indicates that confidence ellipses can be useful for reflecting both the precision of the NRI estimates and the overall improvement.
The proposed confidence ellipse is also flexible: it can be applied to evaluate extensions of prediction models using equal or different weights, as well as thresholds of acceptable model improvement, for cases and non-cases, as already discussed by Greenland [10].
In conclusion, confidence ellipses might be particularly useful for evaluating overall or case- versus non-case-specific model improvement, because they allow different acceptable values of the NRI components to be evaluated jointly while also reflecting the precision of their estimates.
References
Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157–72.
Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk prediction instruments: a critical review. Epidemiology. 2014;25(1):114–21.
Leening MJG, Vedder MM, Witteman JCM, Pencina MJ, Steyerberg EW. Net Reclassification Improvement: computation, interpretation, and controversies: a literature review and clinician’s guide. Ann Intern Med. 2014;160(2):122–31.
Pencina MJ, D’Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11–21.
Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Stat Med. 2013;32(9):1467–82.
Agresti A. Categorical Data Analysis. Balding DJ, Bloomfield P, Cressie NAC, editors. Hoboken: Wiley; 2002.
Bergmann MM, Bussas U, Boeing H. Follow-up procedures in EPIC-Germany—data quality aspects. European prospective investigation into cancer and nutrition. Ann Nutr Metab. 1999;43(4):225–34.
Boeing H, Korfmann A, Bergmann MM. Recruitment procedures of EPIC-Germany. European Investigation into Cancer and Nutrition. Ann Nutr Metab. 1999;43(4):205–15.
Schulze MB, Hoffmann K, Boeing H, Linseisen J, Rohrmann S, Mohlig M, et al. An accurate risk score based on anthropometric, dietary, and lifestyle factors to predict the development of type 2 diabetes. Diabetes Care. 2007;30(3):510–5.
Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):199–206.
Acknowledgments
This work was supported in part by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.). The recruitment phase of the EPIC-Potsdam Study was supported by the Federal Ministry of Science, Germany (01 EA 9401) and the European Union (SOC 95201408 05F02). The follow-up of the EPIC-Potsdam Study was supported by German Cancer Aid (70-2488-Ha I) and the European Community (SOC 98200769 05F02). We thank Dr. Manuela Bergmann, who was responsible for the methodological and organisational work of data collection on exposures and outcomes; Wolfgang Fleischhauer, for his medical expertise in case ascertainment and contacts with the physicians; and Ellen Kohlsdorf, for data management.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
About this article
Cite this article
Mühlenbruch, K., Kuxhaus, O., Pencina, M.J. et al. A confidence ellipse for the Net Reclassification Improvement. Eur J Epidemiol 30, 299–304 (2015). https://doi.org/10.1007/s10654-015-0001-1
DOI: https://doi.org/10.1007/s10654-015-0001-1