Key messages

  • Selective reporting of association and non-standardisation of cut points make evidence synthesis difficult

  • There was a negative correlation between improved discrimination and the baseline performance of the Framingham Risk Score

  • No evidence of an association was found between poor reporting practice and the magnitude of the categorical net reclassification index

Background

Prediction models are not perfect [1] and often have methodological weaknesses, meaning that few are used in everyday practice [2]. The Framingham Risk Score (FRS) [3] is one of the exceptions because of its extensive validation [1]. It is not perfect, however, and researchers have attempted to improve prediction within the intermediate category [4]. Numerous other tests have been investigated in the hope of improving upon this base model [5]. This is also in keeping with the rising interest in novel biomarkers in medicine [6], particularly in the field of cardiovascular [7] and cancer [8] risk prediction. In search of a surrogate biomarker that detects subclinical disease, coronary calcium score (CACS) has been investigated for more than a decade [9], with some proposing screening with computed tomography (CT) in the general population [10]. The quality of both primary and secondary prognostic studies is also generally poor [11], with a few exceptions [6]. Thoracic calcium score is considered a relative of CACS where the Agatston method [12] is applied to the thoracic aorta [13]. CT coronary angiogram (CTCA) has established itself in the acute chest pain setting and is now being investigated as a tool for reclassifying cardiac risk based on luminal stenosis (and other characteristics) in the CONFIRM cohort [14]. These imaging biomarkers generate substantial interest and may add incremental value to traditional Framingham risk factors. Given the known methodological issues [15,16,17], we looked for differences between adequate and inadequate reporting practice.

Methods

This article is reported in line with Preferred Reporting Items for Systematic Reviews and Meta-Analyses [18]. The protocol for this review was registered on PROSPERO (2015:CRD42015023795) [19]. CLP searched MEDLINE, EMBASE, Web of Science and the Cochrane Central Register of Controlled Trials in July 2015. The bibliographies of all included studies were searched for further potential studies. Attempts were made to retrieve missing information in the included publications by contacting authors and grey literature was searched [20, 21]. Only full text publications were included. No language restrictions were applied. An update search was carried out in June 2016.

Screening Process & Study Selection

Two reviewers (CLP, NP) conducted title and abstract, and then full-text screening, against the inclusion criteria. A third reviewer (CH) was involved if disagreements were not resolved by consensus. We only included studies that examined Agatston or CACS, thoracic aorta calcium score (TACS) or CTCA as new predictors and used any iteration of FRS as a baseline model. We included any cohort study that reported the association of FRS with the defined predictors and cardiac endpoints and/or cardiovascular comorbidities. In addition, these studies had to report one of the following: summary statistics indicating incremental value of the predictor of interest in addition to the old model, such as difference in area under the receiver operating characteristics curve (Δ AUC) [22, 23], category-based net reclassification index (NRI) [24, 25], integrated discrimination improvement (IDI) [26], relative IDI (rIDI) [27] or other reclassification measures. Composite endpoints [28] were considered and studies using surrogate outcomes were excluded [29].

Data extraction

Two authors (CLP, NP) independently extracted data from the included studies, recording the first author, journal, publication year, outcome assessed, population evaluated and their inferences on whether the additional predictor improves prediction beyond the FRS. Publications were classified whenever possible as defined by Framingham/Wilson 1998, Framingham/Adult Treatment Panel III (ATP) 2002 and Framingham/D’Agostino 2008 [30, 31]. Original 95% CI, standard error, standard deviation or p-values of summary estimates of interest were extracted [32, 33]. If the Kaplan-Meier survival curve was available, any missing hazard ratio was estimated (by YW) [34]. Specifically, the standard of reporting effect sizes that signalled incremental prognostic value was evaluated. The choice of optimal cut points/thresholds was also examined, particularly in relation to effect sizes that indicate association [35]. Various methods of quantifying incremental prognostic value of an additional test have been described [36]. We focused on the reporting characteristics of multivariable regression, calibration, discrimination and reclassification [15]. For the documentation of multivariable regression, we determined its adequacy based on the availability of information on whether an additional predictor was significant at the p < 0.05 level, or the use of tests that penalised the inclusion of an additional predictor. For discrimination, we assessed the documentation of the baseline AUC of FRS and the Δ AUC as a result of an additional predictor of interest. The adequacy of reporting baseline AUC relied on accurate documentation of the FRS as originally published [15]. In brief, the calculation of FRS could be threatened by addition, deletion or modification of the original FRS items. Other aspects included whether coronary heart disease (CHD) was measured and whether the measured population was similar to the original FRS population.
For reclassification, all publications were searched for NRI calculation or results. The type of NRI was verified. We considered established categories (e.g. < 10%, 10–20%, > 20%) or any justified use of categories as appropriate to relevant data sets. The recommendations for reporting reclassification were taken from [37].
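
As an illustration of what a category-based NRI measures, the following is a minimal Python sketch using the established categories (< 10%, 10–20%, > 20%). The function names and the eight example subjects are hypothetical and are not taken from any included study; the sketch simply counts upward and downward movements between categories separately for events and non-events.

```python
def risk_category(risk, low=0.10, high=0.20):
    """Map a predicted risk to 0 (<10%), 1 (10-20%) or 2 (>20%)."""
    if risk < low:
        return 0
    if risk <= high:
        return 1
    return 2


def categorical_nri(old_risks, new_risks, events):
    """Return (event NRI, non-event NRI, combined NRI).

    Event NRI     = P(up | event)       - P(down | event)
    Non-event NRI = P(down | non-event) - P(up | non-event)
    """
    up = {True: 0, False: 0}
    down = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for old, new, event in zip(old_risks, new_risks, events):
        had_event = bool(event)
        total[had_event] += 1
        old_cat, new_cat = risk_category(old), risk_category(new)
        if new_cat > old_cat:
            up[had_event] += 1
        elif new_cat < old_cat:
            down[had_event] += 1
    nri_events = (up[True] - down[True]) / total[True]
    nri_nonevents = (down[False] - up[False]) / total[False]
    return nri_events, nri_nonevents, nri_events + nri_nonevents


# Hypothetical data: risks from the baseline model and the extended model.
old = [0.08, 0.15, 0.15, 0.25, 0.05, 0.15, 0.15, 0.22]
new = [0.12, 0.25, 0.15, 0.25, 0.04, 0.08, 0.22, 0.15]
events = [1, 1, 1, 1, 0, 0, 0, 0]

nri_e, nri_ne, nri_total = categorical_nri(old, new, events)
```

Reporting the event and non-event components alongside the combined value shows which group drives the overall NRI.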

Critical appraisal

We rated study quality using the Quality In Prognosis Studies (QUIPS) tool [38]. Two reviewers (CLP, NHP) conducted the quality assessment on the major aspects and two reviewers (JP, YHW) assessed the statistical aspect of the studies independently.

Data analysis

For studies where the 95% CI or standard error was not reported, a correlation coefficient of 0.3 between FRS and CACS/CTCA was used to allow estimation of the 95% CI for Δ AUC based on data from [39]. Numbers were displayed as exact numbers, medians or percentages. The alteration of the risk factors used to calculate FRS was assessed based on previously published items [15] with modifications. The items were scored ordinally as either yes, no or unclear. The 1998 iteration of the FRS model was scored against 18 items. The 2002 and 2008 iterations were scored against 15 items (the 3 diabetes-related items were discounted). The summation of the individual item scores indicated the overall level of alteration, which was dichotomised into a binary variable: minor and major alterations. The threshold for dichotomisation was based on the median number of items altered among the included studies. The summary of NRIs and AUCs was displayed as medians and interquartile ranges. NRIs and AUCs were subsequently split into two groups depending on the practice of reporting being either adequate or inadequate, or as equivalent binary groups. The aspects of reporting were based on previously published work on AUC [15] and NRI [37] with adaptations. The respective groups were then compared using the Wilcoxon sign rank test at significance level p < 0.05. Specifically, we were looking for any particular practice of reporting AUC or NRI leading to excessive claims of any additional predictor. All statistical analyses were carried out using STATA version 14.0 (StataCorp, College Station, Texas).
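
The text does not state which formula was used to obtain the 95% CI for Δ AUC from the assumed correlation. One common approach for two correlated AUCs is the variance-of-a-difference formula, SE(Δ) = sqrt(SE1² + SE2² − 2·r·SE1·SE2). The Python sketch below assumes that approach and plugs in r = 0.3 directly; the function name and the example AUCs and standard errors are hypothetical.

```python
import math


def delta_auc_ci(auc_base, se_base, auc_new, se_new, r=0.3, z=1.96):
    """Return (delta, se_delta, lower, upper) for the difference in AUC.

    Assumes the two AUC estimates are correlated with coefficient r and
    that a normal approximation is adequate for the confidence interval.
    """
    delta = auc_new - auc_base
    variance = se_base ** 2 + se_new ** 2 - 2 * r * se_base * se_new
    se_delta = math.sqrt(variance)
    return delta, se_delta, delta - z * se_delta, delta + z * se_delta


# Hypothetical inputs: baseline FRS AUC and AUC after adding a CT biomarker.
delta, se_delta, lower, upper = delta_auc_ci(0.68, 0.03, 0.74, 0.04)
```

A larger assumed correlation gives a smaller standard error for the difference, so the choice of r directly affects the width of the interval.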

Results

Included studies

Eight hundred and one unique hits were screened, leading to 35 studies (Fig. 1) encompassing 206,663 patients (men = 118,114, 55.1%) [4, 5, 9, 13, 14, 39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68]. All publications concluded that at least 1 imaging biomarker indicated either independent association with composite endpoints, improved discrimination or classification beyond traditional risk factors. However, there were reservations about TACS [13, 48] and some argued against the reclassification properties of CACS and CTCA [52, 59].

Fig. 1

The Preferred Reporting Items for Systematic Reviews and Meta-analysis Flow Diagram

Quality of included studies

The included studies were predominantly at low risk of bias with regard to study participation, measurement of prognostic factors, outcome measure and confounding factors. There was low to moderate risk of bias for statistical analysis because several studies selectively reported results and/or were not clear about the process of model building. The majority of studies were at high risk of attrition bias because there was a notable amount of missing data and/or the number of participants lost to follow-up was not accounted for. Figure 2 shows the overall bias assessment.

Fig. 2

Bias assessment of included studies using the Quality in Prognosis Studies tool

The types and calculation of Framingham risk score

Eleven studies (31.4%) adopted FRS 1998 [5, 39, 44, 46, 48, 52, 53, 59, 62, 66, 67], 13 studies (37.1%) adopted FRS 2002 [13, 40,41,42,43, 47, 49, 50, 54, 58, 63, 65, 68] and 3 studies adopted FRS 2008 [51, 56, 60]. Three studies used both FRS 1998 and 2002 [4, 9, 14]. Two studies did not specify the iteration of FRS used [55, 57]. According to previously published criteria [15], additions, deletions and modifications of risk factors are shown in Table 1. Six studies (17.1%) did not provide any mean estimate or a breakdown of different categories of FRS [41, 51, 55, 58, 67, 69]. The median number of items altered was 3. Using that as the threshold, twelve studies (34.3%) had major alterations [4, 9, 39, 41,42,43,44, 46, 51, 59, 67, 68] and 23 studies (65.7%) had minor alterations [5, 13, 14, 40, 45, 47,48,49,50, 52,53,54,55,56,57,58, 60,61,62,63,64,65,66]. Five of the 23 studies that had minor alterations did not have any components of FRS altered [49, 55, 56, 60, 66]. Of those 5 studies, two did not provide any information about the components of FRS [56, 60] and were given the benefit of the doubt; however, their findings should be interpreted with caution.

Table 1 Alteration of the risk factors used for the calculation of Framingham Risk Score in 35 eligible studies compared to the Framingham Risk Score 1998, 2002 and 2008

Thresholds and reporting of association

Odds ratio, relative risk, c-index and hazard ratio were used to indicate association of the imaging biomarker with outcomes. There was selective reporting of subgroups and of p values among the reported subgroups (Table 2). The reference groups of investigations were not consistent. In CTCA, the reference group could either be no disease or non-obstructive disease. In both TACS and CACS, the reference group was not always a score of zero. Additional file 1 displays the details regarding the thresholds of different types of investigation. The cut points that define respective categories were also variable.

Table 2 Selective reporting of association

Intended population for Framingham risk score

Four studies (11.4%) had an exclusively Caucasian population [5, 13, 52, 53]. Eleven studies (31.4%) had more than 10% non-Caucasian population [41, 48,49,50,51, 54, 57, 58, 61, 66, 68]. Twenty-two studies (62.9%) had not recorded ethnicity as a variable [4, 9, 14, 40, 42,43,44,45,46,47, 52, 53, 55, 56, 58,59,60, 62,63,64,65, 67]. Five studies (14.3%) had documented CHD at baseline [45, 57, 63, 65, 67]. Considering all the information, only 4 studies (11.4%) were identified as similar to the original Framingham population [5, 13, 52, 53].

Documentation of regression, discrimination & AUC analysis

Of the 35 studies, the majority appropriately reported multivariable regression (74.3%). Thirty-three studies reported AUC estimates for both FRS alone and the FRS with additional CT biomarkers, providing 76 such pairs of estimates. Appropriate documentation of AUC was not common practice (36.4%). The method used to compare receiver operating characteristics curves was not always described (39.4%). Only eight studies reported calibration (22.9%) [4, 48, 51,52,53,54,55, 68]. Table 3 shows the reporting of regression and discrimination. The AUC of FRS alone ranged from 0.53 to 0.77 (median = 0.68). The Δ AUC ranged from − 0.07 to 0.24 (median = 0.06). There was a moderate inverse correlation between the Δ AUC and the baseline FRS AUC (Spearman correlation coefficient, − 0.46, p < 0.0001). When the baseline FRS AUC performed well, the Δ AUC was relatively lower with the addition of a CT biomarker (Fig. 3).
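
For readers unfamiliar with the Spearman coefficient used here, the following minimal Python sketch computes it as the Pearson correlation of average ranks. The five (baseline AUC, Δ AUC) pairs are illustrative values chosen within the reported ranges; they are not the 76 pairs analysed in this review, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative pairs only (not study data).
baseline_auc = [0.53, 0.60, 0.68, 0.72, 0.77]
delta_auc = [0.24, 0.10, 0.06, 0.08, -0.07]
rho = spearman(baseline_auc, delta_auc)
```

A negative coefficient indicates that studies with a better-performing baseline model tended to report smaller gains in AUC.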

Table 3 Documentation of multivariable regression, calibration, discrimination and reclassification
Fig. 3

The correlation between difference in AUC and baseline Framingham Risk Score AUC

Table 4 shows the median AUC values and the Δ AUC when the data were classified according to features of design and analysis [15]. The baseline FRS AUC was higher in studies with minor alterations than in those with major alterations of the Framingham model (p = 0.0006). The improvement in AUC was greater in those with major alterations of the Framingham model (p = 0.015). Other factors that significantly affected the performance of AUC included the exploration of data analysis, reporting of calibration and validation, and multivariable and AUC documentation (all p < 0.05). The types of incremental value reported were associated with a difference in AUC performance, but this was only significant when a threshold of 2 was chosen. In the sample population, neither the measurement of CHD as an outcome nor the similarity of the population to the original Framingham cohort significantly altered the AUC performance.

Table 4 Median AUC values and ΔAUC according to different aspects of design and analysis

Documentation of reclassification and NRI analysis

Twenty-three studies reported NRI estimates and all had at least 2 cut-offs, with those that had 3 cut-offs producing the largest NRIs [59, 67]. The number of thresholds influenced the value of NRI [70]. The most commonly used type of NRI was categorical NRI (69.6%). The studies that reported calibration were the same as those documented in the last section. Table 3 shows the reporting of reclassification analysis and NRI. Complete reporting of reclassification analysis using a reclassification table or text was not common practice (31.4%). When reclassification analysis was done, just over half was considered appropriate (56.3%). The actual number of patients reclassified upwards or downwards to a different risk group was not documented. In conjunction with the documentation of NRI, the proportion of subjects being correctly reclassified was not always available (43.8%). Most studies subjectively drew strong conclusions from the NRI calculated (68.7%). The individual components of events and non-events and also their respective NRI components were not always available (at least 43.8%). Fifteen studies reported categorical NRI with 46 data points. The values of categorical combined NRI ranged from − 0.083 to 0.785 (median = 0.249). None of the aspects of reporting [37] was significantly related to the difference in the values of categorical combined NRI (Table 5).

Table 5 Median NRI values according to different aspects of design and analysis

Discussion

The majority of studies claimed improved discrimination and reclassification of the outlined CT biomarkers over the established Framingham model. For association, hazard ratios were commonly used but the variation in reporting practice hindered evidence synthesis. Although all studies used a similar baseline model for AUC analysis, the performance of FRS varied. There was a clear negative correlation between improved discrimination and baseline performance of FRS. In contrast, despite the poor reporting, there was no difference in the magnitude of categorical NRI between adequate and inadequate reporting practice.

Selective reporting of association is known and remains an issue [16]. We found that non-standardisation of thresholds and different reference groups across studies prohibit future meta-analysis. Here we substantiate this with 3 included studies. Chow et al. used non-obstructive coronary disease as the reference group for 3 different composite outcomes [65]. The same author then used no coronary artery disease as the reference group in the CONFIRM cohort [63]. In [66], we estimated hazard ratios from the Kaplan-Meier curves [34]. When non-obstructive coronary disease was the reference group, the estimated association of obstructive disease with the composite outcome was smaller (HR = 2.03, 95% CI 1.47–2.79) than when no coronary disease/normal was the reference (HR = 3.24, 95% CI 2.28–4.63) [66].

NRI records the movement of predicted risk from one category to another after the introduction of an additional test. However, it is only meaningful when the information about risk thresholds is available. The change in predicted risk could be correct or incorrect. In this population, the concept that the subject could be wrongly reclassified with an additional test was not clearly outlined. The combined NRI could have been driven predominantly by the event NRI, leading to overestimation. This could have been clarified by reporting the components of NRI, but this was not standard practice despite recommendation [17]. Deriving from concern about miscalibration, another recommendation was the regular reporting of calibration [26, 71, 72], with a graphical plot being the best assessment [73]. Calibration, however, was regularly overlooked. To counteract the issues with missing data, we encountered solutions such as Weibull extrapolation [5, 44, 47] and adjustment of risk cut-offs by the ratio of actual follow up. As a result, a substantial proportion of the included studies used non-standardised risk categories, although almost all justified them. A more definitive solution would be a move towards decision curve analysis [74]. Only 1 study provided information to allow adjustment using Kaplan-Meier estimates [54]. This is on a background of insufficient reporting on the handling of censoring. This adjustment should receive more attention, especially when censoring happened early on during follow-up [75]. There is currently no consensus on what is a large enough NRI. Overall, considering the uncertainty in NRI [17] and the small values of NRI, one should not draw strong conclusions from the use of NRI alone. Given the popularity of NRI in cardiovascular research [26], a framework of reporting NRI should be followed, for example that in [37].
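
The decision curve analysis mentioned above rests on the net benefit at a chosen threshold probability p_t, commonly defined as TP/n − (FP/n) · p_t/(1 − p_t). The Python sketch below assumes that standard definition; the function name and the eight example subjects are hypothetical and are not drawn from any included study.

```python
def net_benefit(predicted_risks, events, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    net benefit = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(events)
    true_pos = sum(
        1 for p, e in zip(predicted_risks, events) if p >= threshold and e
    )
    false_pos = sum(
        1 for p, e in zip(predicted_risks, events) if p >= threshold and not e
    )
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Hypothetical predicted 10-year risks and observed outcomes.
risks = [0.05, 0.12, 0.25, 0.30, 0.08, 0.22, 0.15, 0.40]
outcomes = [0, 0, 1, 1, 0, 0, 1, 1]

nb_at_20 = net_benefit(risks, outcomes, threshold=0.20)
```

Plotting the net benefit of the baseline and extended models across a range of thresholds gives the decision curve, which avoids the need to fix risk categories in advance.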

Discrimination as measured by AUC analysis is an established method of measuring incremental value [76]. It was reported in almost all the studies but adequate documentation was not common practice. In our study, reporting of calibration, validation and AUC documentation all influenced the values of AUC. The baseline Framingham model is an established score even considering its various iterations [1]. Big improvements in AUC were seen in cases where the baseline model performed badly. This echoes previous findings [15] but in a more defined population. As alluded to in [15], this phenomenon is similar to when a new drug is only effective when compared to an ineffective comparator drug [77]. In addition to the above, inadequate reporting practices were associated with inflated estimates.

Our investigations of AUC [15] and NRI [17] were not empirical and can only serve as an update in a different population. The assessment of thresholds was minimal compared with the previous investigation [17]. The harm of imaging using CT (radiation burden) was not explored. The focus was solely on the potential benefit. Studies that indicated association or only had a reclassification table could have been excluded because we focused on studies that had at least 1 summary estimate that indicated incremental value. We were unable to assess publication bias, because studies showing no improvement or worsening prediction may remain unpublished. The calculation of model coefficients can significantly impact the baseline AUC. There are different ways of calculating the model coefficients (re-estimation for the new population, using published coefficients or a point-based model) but these were not explored. The difference between using CHD or CVD as an outcome was investigated; however, the justification and transparency of reporting outcomes were not examined.

Association on its own is insufficient to substantiate incremental value [78] and large values are infrequent in biomarker research [79]. AUC analysis is seen as a good starting point and reclassification should follow rather than replace AUC analysis [76]. However, AUC analysis is not without fault [79]. Transparent reporting of NRI should be compulsory, for example through the use of a reclassification table [38], and readers should be aware of the controversies surrounding NRI [80]. The co-existence of a lack of increase in AUC and a positive NRI should alarm readers [81]. In general, the reporting in prognosis studies needs to be more robust [82,83,84,85]. Data should be made available for individual patient data meta-analysis.

Conclusion

Inconsistent thresholds, reference groups and selective reporting prohibit future evidence synthesis of associations. Inadequate documentation of discrimination, calibration and validation is widespread. The variable baseline performance and other aspects of reporting discrimination inflate potential incremental values. Reporting of reclassification is also insufficient but significant differences between adequate and inadequate reporting practice have not been identified.