Background

Heart failure (HF) is a major public health problem with high morbidity and mortality [1, 2]. Over half of HF patients have heart failure with preserved ejection fraction (HFpEF), which is characterized by elevated left ventricular filling pressure during exercise despite a normal ejection fraction [3]. As HFpEF is a syndrome with different underlying pathophysiologic mechanisms, detection is difficult with current guidelines [3, 4].

The use of natriuretic peptides (NPs), specifically brain natriuretic peptide (BNP) and N-terminal prohormone of brain natriuretic peptide (NT-proBNP), is advised in HFpEF guidelines [3, 5, 6]. Detection of HFpEF in the stable outpatient population is difficult since levels of NPs are usually low as opposed to patients with heart failure with reduced ejection fraction (HFrEF) [7]. Compared with cut-offs for NPs in current guidelines [3, 5, 6], up to one-third of all HFpEF outpatients have NP levels below the typical diagnostic thresholds and will be missed [8, 9]. This is alarming as obesity is a common comorbidity in HFpEF patients and is associated with lower NT-proBNP levels, which could lead to underdiagnosis of HFpEF [10,11,12].

Although the exact underlying mechanism of HFpEF is still unclear, diastolic dysfunction (DD) is an established precursor in the development of HFpEF [13, 14]. NPs are not included in the current guidelines for the detection of DD [15]. However, emerging evidence suggests that BNP is higher in patients with DD in comparison to adults without DD [7, 16, 17]. Moreover, it has been demonstrated that levels gradually rise in parallel to the severity of diastolic abnormalities as assessed by echocardiography [18, 19]. As NPs are secreted by the ventricular walls as a result of abnormal preload and afterload, and systemic inflammation is apparent in patients with DD, NPs could be useful in detecting DD [20, 21]. Nonetheless, NPs perform less well for the detection of early subclinical DD than for symptomatic DD [22].

Since early detection of DD and HFpEF in the non-acute setting is important for prevention and treatment strategies, a good diagnostic marker, such as NPs, is needed. However, a clear overview of the diagnostic performance of NPs for the detection of DD and HFpEF, in a non-acute setting, is currently lacking.

Therefore, this study aimed to systematically review and meta-analyse studies investigating the diagnostic performance of NPs for the detection of DD and HFpEF.

Methods

Data search

We performed a systematic review of PubMed and Embase.com from their inception to May 13, 2019 (SR and LS), according to the PRISMA-DTA statement [23]. The search terms ‘heart failure’ or ‘diastolic dysfunction’ were combined with general search terms ‘diagnostic performance’ or ‘markers’, as this broad search string was used for a set of systematic reviews describing a range of diagnostic markers (NPs, echo markers or biomarkers) (see Additional file 1). Reference lists of the identified articles were hand-searched for relevant publications. The protocol and search strategy were preregistered on PROSPERO (Registration number CRD42018065018). Because the protocol and search strategy were focussed on a broader research question, the identified studies were reported in three manuscripts focussed on echo parameters [24], biomarkers [25] and natriuretic peptides in this manuscript.

Patient and public involvement

Patients were not involved in the generation of this meta-analysis.

Study selection

Two reviewers independently screened titles, abstracts and full texts (SR/AJvB/MLH/JWJB). Studies were included if they (i) studied a diagnostic performance measure, (ii) studied the performance of NPs for the detection of DD and/or HFpEF, (iii) included a control population without DD or HFpEF or with HFrEF, (iv) had a cross-sectional study design (maximum follow-up 2 years) and (v) were written in English or Dutch. We excluded studies if they (i) studied the performance of the diagnostic marker for the detection of acute HF, (ii) were conducted in rare patient populations (e.g. beta-thalassemia, hypertrophic cardiomyopathy or infiltrative disorders) or (iii) used a single echo marker as reference standard. Inconsistencies in study selection were resolved through consensus with a third reviewer (AJvB/JWJB). The mean positive and negative proportion of agreement between the reviewers for title/abstract screening was 40 and 96%, respectively. For full-text screening, the mean positive and negative proportion of agreement was 55 and 60%, respectively. For further details on the literature search and inclusion and exclusion criteria, see eMethods in Additional file 1.
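The positive and negative proportions of agreement reported above can be illustrated with a minimal sketch. It assumes the standard definitions of specific agreement for two raters (twice the number of concordant decisions divided by twice that number plus the discordant decisions); the function name and the counts are hypothetical and are not taken from this review.

```python
def specific_agreement(both_include, only_a, only_b, both_exclude):
    """Return (positive, negative) proportions of specific agreement.

    both_include / both_exclude: records on which the two reviewers agree.
    only_a / only_b: records included by one reviewer but not the other.
    Assumes at least one record in each agreement calculation.
    """
    discordant = only_a + only_b
    positive = 2 * both_include / (2 * both_include + discordant)
    negative = 2 * both_exclude / (2 * both_exclude + discordant)
    return positive, negative


# Hypothetical screening counts, for illustration only.
pa, na = specific_agreement(both_include=30, only_a=10, only_b=10, both_exclude=150)
print(f"positive agreement: {pa:.2f}, negative agreement: {na:.2f}")
```

Because inclusions are rare at the title/abstract stage, a handful of discordant decisions weighs much more heavily on the positive than on the negative proportion, which is one reason the two values can differ widely.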

Data extraction

One reviewer (SR) extracted the data, including measures of study design, study population, number of participants, markers, diagnostic performance measures and demographics, which was appraised by a second reviewer (AJvB/JWJB).

Quality assessment

Two reviewers (SR/AJvB/JWJB) independently evaluated the quality of the included studies using the QUADAS-2 checklist [26]. This checklist provides a quality score on four domains: patient selection, index test, reference standard and patient flow and timing. Each domain received a low, high or unclear risk of bias or concerns regarding applicability. A domain was rated as high risk of bias when one of the two or two of the three support questions were answered in a negative manner. A domain was rated as low risk of bias when two of the three or all support questions were answered in a positive manner. A domain was rated as unclear when one of the support questions could not be answered due to lack of information in the study. Inconsistencies in the quality assessment were resolved through consensus with a third reviewer (AJvB/JWJB). The mean positive and negative proportion of agreement between reviewers was 91 and 79%, respectively. For further details on the quality assessment, see eMethods in Additional file 1.
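The rating rules described above can be written down as a short decision function. This is only a sketch of one possible reading: the answer labels (`"yes"`, `"no"`, `"unclear"`), the function name and, in particular, the choice to let an unanswerable question take precedence over the other rules are assumptions, since the text does not state how overlapping cases were handled.

```python
def rate_domain(answers):
    """Rate one QUADAS-2 domain from its two or three support questions.

    answers: list of "yes", "no" or "unclear" (hypothetical labels).
    Assumption: any unanswerable question makes the domain "unclear".
    """
    if len(answers) not in (2, 3):
        raise ValueError("expected two or three support questions")
    if any(a not in ("yes", "no", "unclear") for a in answers):
        raise ValueError("answers must be 'yes', 'no' or 'unclear'")
    if "unclear" in answers:
        return "unclear"
    negatives = sum(a == "no" for a in answers)
    # High risk: one of two, or two of three, questions answered negatively.
    if negatives >= len(answers) - 1:
        return "high"
    # Low risk: all questions, or two of three, answered positively.
    return "low"


print(rate_domain(["yes", "no"]))          # one of two negative
print(rate_domain(["yes", "yes", "no"]))   # two of three positive
```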

Data synthesis

Studies were meta-analysed using a random-effects model when two or more studies investigated the same diagnostic measure in similar study populations with similar control populations and reported similar diagnostic performance measures. The studies had to provide confidence intervals of this diagnostic performance measure or sufficient information (2 × 2 table) to compute these confidence intervals. Random-effects meta-analysis models were fitted to AUCs, or to sensitivities and specificities, and presented as forest plots for all studies and stratified for cross-sectional versus case-control studies. In subgroup analyses, we examined differences by study design (cross-sectional versus case-control), geographic location (European versus other studies), assays (Roche versus other for NT-proBNP and FEIA versus RIA for BNP) and decade of publication (2000–2009 versus 2010–2019). A subgroup effect statistic was calculated by means of a Wald test to determine differences between the respective subgroups, provided that both subgroups included a minimum of two studies [27]. In sensitivity analyses, we further examined heterogeneity by geographic location by including European studies only, or by study population by excluding studies with hospitalized patients. Heterogeneity was quantified using I2, with I2 > 50% considered substantial. Publication bias was evaluated by visual inspection of funnel plots. All analyses and plots were performed in RStudio 3.4.2 using the metafor package [28]. Trivariate generalized linear mixed models (GLMMs) were fitted in SAS Studio to meta-analyse PPVs and NPVs for all cross-sectional studies to account for differences in the prevalence of DD [29]. Percentage positive and negative agreement was calculated for title/abstract and full-text screening and quality assessment to determine the inter-rater reliability [30].
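To make the synthesis steps concrete, the sketch below pools study-level estimates with a random-effects model, derives I2 and compares two subgroups with a Wald-type z-test. It is written in plain Python rather than with the metafor package used for the actual analyses, and it uses the DerSimonian-Laird estimator of the between-study variance for simplicity; the text does not state which estimator was used, so this choice, all function names and all input values are assumptions for illustration.

```python
import math


def sensitivity_with_variance(tp, fn):
    """Sensitivity and its binomial variance from the diseased column of a 2 x 2 table."""
    n = tp + fn
    p = tp / n
    return p, p * (1 - p) / n


def dersimonian_laird(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "se": se,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "tau2": tau2, "i2": i2}


def wald_subgroup_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two pooled subgroup estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical AUCs and variances for two studies, for illustration only.
result = dersimonian_laird([0.6, 0.9], [0.001, 0.001])
print(result["pooled"], result["i2"])
```

With two equally precise but discordant studies, as in the example, the pooled estimate sits midway between them while I2 is far above the 50% threshold used here to flag substantial heterogeneity.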

Results

Search results

From 11,728 titles/abstracts, 352 full-text articles were screened and 51 studies were included in the data extraction (Additional file 3: Figure S1). Twenty-three studies reported the diagnostic performance for the detection of DD and 27 studies for the detection of HFpEF and one study for both. Two studies reported diagnostic performance of ANP [20, 31].

Study characteristics

Thirty studies were cross-sectional [18, 20, 22, 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] and twenty-one [31, 59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78] were case-control studies (Tables 1 and 2). Twenty-eight were performed in Europe [20, 32,33,34,35,36, 38, 40, 41, 43,44,45, 50, 52, 53, 55,56,57, 59, 63, 64, 66, 69, 73, 76,77,78], 11 in North America [22, 37, 39, 46, 49, 54, 65, 70, 72, 74, 75], 11 in Asia [31, 42, 47, 51, 58, 60,61,62, 67, 68, 71] and one was a multi-country study [48]. The control populations were either healthy controls (N = 14) [31, 35, 59, 60, 62, 65, 67,68,69, 71, 72, 75,76,77] or were recruited at the same department as the patient population, but without DD or HFpEF (N = 37) [32,33,34, 36,37,38,39,40,41,42, 58, 61, 63, 64, 66, 70]. The mean age of the DD and HFpEF populations ranged from 51 to 74 years and from 50.3 to 84.3, respectively. The mean age of the control populations ranged from 41.7 to 83.6 years. Five DD studies included patients with HFpEF [38, 43, 48, 73, 76], one DD study included patients with signs and symptoms [44], but did not exclude on LVEF, and eight DD studies only included patients with LVEF above 45 or 50% [18, 20, 45, 47, 50, 52, 53, 77]. The reference diagnoses used in the included studies varied from use of maximal velocity of the E-wave (e’) (N = 10) [43,44,45, 49, 50, 52, 54, 56, 57, 74] versus no tissue Doppler (TDI) measures (N = 14) [18, 20, 22, 38, 46,47,48, 51, 53, 55, 73, 76,77,78] for DD studies and, use of extensive criteria (N = 13) [31, 32, 34, 35, 38, 41, 59, 61, 63, 64, 66, 69, 72], signs and symptoms of HF in combination with LVEF (N = 11) [36, 40, 42, 58, 60, 62, 65, 67, 68, 70, 71], expert opinion (N = 1) [33] or catheterization (N = 3) [37, 39, 75] for HFpEF studies (Additional file 2: Table S1). Diagnostic performance of NPs for all studies for detection of either DD or HFpEF can be found in Additional file 2: Table S2.

Table 1 Study characteristics of the 28 included studies for the detection of heart failure with preserved ejection fraction
Table 2 Study characteristics of the 24 included studies for the detection of diastolic dysfunction

Quality scores

The risk of bias was high for the domains patient selection, index test and reference standard, in 76, 53 and 61% of the studies, respectively (Additional file 3: Figure S2). This high risk of bias was mainly caused by a case-control design (N = 21), use of threshold values of NPs that were not pre-specified (N = 25) and inappropriate reference standards (N = 27). Only 20% of the studies had high risk of bias for the domain flow and timing. The funnel plots for NPs with DD and HFpEF suggest some evidence of publication bias for smaller studies (Additional file 3: Figure S3).

Meta-analyses

For detection of DD, AUC values of five NT-proBNP studies and ten BNP studies showed a summary estimate of 0.77 (0.69–0.84; I2 = 84.6%) and of 0.80 (0.73–0.87; I2 = 87.0%), respectively (Fig. 1). The AUC values for detection of HFpEF versus controls without HFpEF (Fig. 1) of 13 NT-proBNP studies showed a summary estimate of 0.80 (0.74–0.87; I2 = 92.1%) and 0.79 (0.69–0.90; I2 = 79.2%) for five BNP studies. Subgroup analyses did not show differences between cross-sectional and case-control studies for BNP for detection of DD and HFpEF and for NT-proBNP for detection of HFpEF (p values > 0.17). A summary estimate for cross-sectional studies for NT-proBNP for detection of HFpEF was not feasible, as only two studies were cross-sectional [35, 42]. Excluding case-control studies for BNP for the detection of HFpEF only resulted in a summary estimate of 0.85 (0.76–0.95; I2 = 51.1%) (Fig. 1). For the detection of HFpEF in comparison to HFrEF, AUC values of seven NT-proBNP studies showed a summary estimate of 0.69 (0.66–0.72; I2 = 0%). Excluding case-control studies did not alter the summary estimates and heterogeneity: mean AUC of 0.69 (0.66–0.72; I2 = 0%).

Fig. 1 Meta-analysis of AUC values of NT-proBNP and BNP for the detection of DD with controls without DD or for HFpEF with controls without HFpEF

For the detection of DD, nine studies reported sensitivity and specificity for NT-proBNP with a mean of 62% (44–80%; I2 = 98.0%) and a mean of 77% (67–88%; I2 = 98.4%), respectively. Nine studies reported this for BNP with a mean sensitivity of 72% (59–85%; I2 = 91.1%) and a mean specificity of 78% (67–87%; I2 = 98.7%) (Fig. 2). Subgroup analyses showed differences in sensitivity and specificity between cross-sectional and case-control studies for BNP for detection of DD (p values < 0.05) with mean sensitivities of 77% (73–81%; I2 = 0%) and 53% (0–100%; I2 = 97.4%) and mean specificities of 72% (61–83%; I2 = 95.4%) and 96% (90–100%; I2 = 86.4%), respectively. Excluding case-control studies did not reduce heterogeneity for NT-proBNP for detection of DD with non-invasive measures as a reference standard, mean sensitivity 61% (41–81%; I2 = 97.6%) and mean specificity 76% (64–87%; I2 = 99.1%). However, exclusion of case-control studies resulted in a mean sensitivity of 77% (73–81%; I2 = 0%) and mean specificity of 72% (61–83%; I2 = 95.4%) for BNP with inconsistent effects on heterogeneity.

Fig. 2 Meta-analysis of sensitivity and specificity of NT-proBNP and BNP for the detection of DD with controls without DD

Sensitivity and specificity were reported for NT-proBNP for detection of HFpEF in ten studies with a mean of 69% (56–81%; I2 = 96.9%) and a mean of 85% (76–91%; I2 = 98.9%), respectively. Four studies reported sensitivity and specificity for BNP for detection of HFpEF with a mean of 68% (44–93%; I2 = 92.8%) and a mean of 78% (61–95%; I2 = 92.2%), respectively (Fig. 3). With only one cross-sectional study for NT-proBNP, summary estimates could not be computed. Subgroup analyses did not show differences between cross-sectional and case-control studies for BNP for the detection of HFpEF (p values > 0.1).

Fig. 3 Meta-analysis of sensitivity and specificity of NT-proBNP and BNP for the detection of HFpEF with controls without HFpEF

Subgroup analyses

Subgroup analyses did not show differences between European studies and studies from other countries (p values > 0.05), nor for BNP or NT-proBNP assay (p values > 0.07), nor for the decade of publication (2000–2009 versus 2010–2019) (p values > 0.08) (Additional file 2: Table S3).

Sensitivity analysis

Including only ambulatory patients (N = 3–12 studies) resulted in lower heterogeneity for BNP for detection of HFpEF with similar summary estimates: AUC of 0.75 (0.66–0.85; I2 = 69.7%).

Positive and negative predictive values

For detection of DD, eight cross-sectional studies reported PPV and NPV for NT-proBNP with a mean of 63% (34–92%) and a mean of 81% (74–88%), respectively (Fig. 4). Seven cross-sectional studies reported this for BNP with a mean PPV of 54% (23–85%) and a mean NPV of 90% (82–98%). With only two cross-sectional studies for both NPs, summary estimates for detection of HFpEF could not be computed. The two NT-proBNP studies for the detection of HFpEF showed inconsistent results, but remarkably, for BNP, the PPV was higher than NPV: around 90 and 70%, respectively.

Fig. 4 Meta-analysis of positive and negative predictive value of NT-proBNP and BNP for the detection of DD with controls without DD

Incremental diagnostic performance

Three studies reported incremental values of NPs on top of clinical models, but NPs did not improve the diagnostic performance of the model [34, 36, 72]. Two studies reported the diagnostic performance of clinical models including NPs, but did not report the diagnostic performance of the model without NPs [48, 56].

Discussion

Our study is the first systematic review and meta-analysis of NPs as a diagnostic marker for the detection of DD and HFpEF. The meta-analysis indicates a reasonable diagnostic performance for both NPs for detection of DD and HFpEF with AUC values around 0.80, although heterogeneity between studies was high. Heterogeneity was partly explained by the case-control design of half of the BNP studies for detection of HFpEF. For both NPs, sensitivity was lower than specificity: approximately 65% versus 80%, respectively. Both NPs have adequate ability to rule out DD with a NPV of approximately 80%. The ability of both NPs to prove DD is lower with a PPV of approximately 60%. The risk of bias was generally high for three of the four domains.

Our systematic review and meta-analysis has several strengths. Our study provides a comprehensive overview of the diagnostic performance of NPs for detection of HFpEF and DD. The systematic review and meta-analysis are performed according to the PRISMA-DTA statement and included a quality assessment [23, 26]. However, this systematic review also has some limitations, such as substantial heterogeneity due to the quality of the included studies. The heterogeneity of included studies can be due to spectrum bias, which is a bias introduced by different inclusion criteria resulting in different study populations. We therefore performed sensitivity analyses excluding studies with hospitalized patients, but this did not affect our results. We also performed subgroup analyses for other characteristics such as geographical region, assay and decade of publication, but we did not detect any differences. However, study settings also differed across included studies in other aspects such as sex and age of the study population for which we could not perform a meta-analysis as the groups became too small. The effect of spectrum bias on the reported diagnostic measures is hard to quantify as it is unclear if this would lead to over- or underestimation of the diagnostic measures. Another explanation for the heterogeneity of included studies could be the heterogeneous nature of the HFpEF syndrome.

Overall, included studies had a high risk of bias with half of the studies using a case-control design. This results in an overestimation of diagnostic performance, as the contrast in clinical characteristics between patient and control population is large. Moreover, in case-control studies, the control population does not accurately reflect the population suspected to have DD or HFpEF, for whom NPs will be used in clinical practice. This limits the applicability of the results to other studies and clinical practice. Restricting the meta-analyses to cross-sectional studies substantially reduced heterogeneity (I2 = 51.1%) in studies for BNP for detection of HFpEF, but the summary estimates remained similar. For the meta-analyses for PPV and NPV, case-control studies were already excluded because the trivariate GLMM relies on prevalence estimates.

This study showed that the diagnostic performance of NPs for detection of DD and HFpEF, versus no DD or HFpEF, is reasonable with summary AUC values around 0.80 for both NT-proBNP and BNP, while the diagnostic performance of NT-proBNP for detection of HFpEF versus HFrEF was lower: around 0.70. For both NPs for the detection of DD and HFpEF, specificity (~ 80%) was higher than sensitivity (~ 60–70%). The overall performance persisted in analyses excluding case-control studies. Therefore, both for detection of DD versus no DD or HFpEF versus no HFpEF, our results indicate that these measures seem to perform better for ruling out of DD or HFpEF than for making the diagnosis. However, the specific performance of NPs in clinical practice also depends on the clinical setting. The high percentage of false-negatives might be more severe for secondary or tertiary care, while in primary care, NPs may be more important to rule out DD or HFpEF, for which other diagnostic characteristics, such as NPV, are important. Our results indicate that NPs are useful to rule out DD or HFpEF in primary care with a low prevalence of these conditions but are less suitable to use to differentiate HFpEF from HFrEF.
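The dependence of rule-out and rule-in ability on prevalence follows directly from Bayes' theorem and can be illustrated with a short sketch. The sensitivity of 65% and specificity of 80% are the approximate summary values mentioned in this discussion; the prevalence values and the function name are hypothetical and serve only to contrast a low-prevalence setting with higher-prevalence settings.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical prevalences; sensitivity and specificity are rounded summary values.
for prevalence in (0.10, 0.30, 0.50):
    ppv, npv = predictive_values(0.65, 0.80, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

At a fixed sensitivity and specificity, NPV rises and PPV falls as prevalence decreases, which is the arithmetic behind the suggestion that NPs are better suited to ruling out than to ruling in disease in low-prevalence settings.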

NT-proBNP and BNP have equal capability for ruling out or ruling in DD, but NT-proBNP has a higher specificity for detection of HFpEF. In general, based on this meta-analysis, one NP is not clearly preferred over the other for the detection of DD and HFpEF. The ranges of cut-off values used in the included studies were wide. In comparison to the current guidelines, only six studies used the same cut-off value for NT-proBNP as the new HFA-PEFF diagnostic algorithm [3]. Therefore, we recommend using the cut-off values proposed by the current guidelines [3, 6].

For detection of DD, both NPs have a substantially lower PPV (~ 60%) than NPV (~ 85%). This means that both NPs are potentially better in ruling out DD than proving DD. Current guidelines for detection of DD do not include NPs as a diagnostic marker [15]. NPs are released as a consequence of volume overload, a characteristic that is absent in (asymptomatic) DD. Consequently, the positive predictive value for the detection of DD is low and will result in misclassification, and therefore, our findings are in line with guidelines for the diagnosis of DD, as NPs are not advised to diagnose DD. However, the guidelines could provide room for the use of NPs to rule out DD in certain settings [15]. For example, in patients with exertional dyspnoea, NPs have a very good ability to diagnose DD [18]. Our study also provides evidence that NPs might be useful to rule out DD in specific settings wherein a low prevalence of DD occurs, such as primary care. This approach could be suitable for screening patients at high risk of HFpEF in primary care such as diabetes patients.

For the detection of HFpEF, a trend towards a higher PPV than NPV is observed, suggesting that BNP could be useful for diagnosing HFpEF instead of ruling out HFpEF. This is in contrast with guidelines that propose to use NPs only for ruling out HFpEF [5, 6]. In acute settings, NPs perform well in distinguishing acute HF from non-cardiac dyspnoea, as NP levels are higher in patients with acute HF [79, 80]. In non-acute HFpEF patients, NP levels can be closer to normal than in the acute setting. This makes it more difficult to distinguish HFpEF from non-HFpEF patients based on NPs, especially in combination with common comorbidities that complicate the diagnosis further [10, 81]. Therefore, NPs should be used in combination with echocardiography for initial diagnosis of HFpEF, as guidelines recommend [3]. Evidence of the incremental value of NPs on top of clinical characteristics or echocardiography measures is limited, but important. We therefore recommend that future studies compute incremental values of NPs on top of clinical or echocardiographic diagnostic models for the detection of HFpEF, confirmed by catheterization. Furthermore, future studies should aim to reduce bias by using cross-sectional designs with pre-specified cut-off values, correct reference diagnoses and transparent patient population selection procedures. As our study shows the ability of NPs to rule out DD and as earlier recognition of HFpEF is key to prevent late diagnosis, future studies should focus on the possibilities of NPs in screening programmes in patients at risk in primary care, such as patients with type 2 diabetes, for detection of DD as a precursor of HFpEF.

Conclusion

This systematic review and meta-analysis of 51 studies shows that NPs have reasonable diagnostic performance for the detection of DD and HFpEF in a non-acute setting. NPs are useful to rule out DD but are not suited to rule in DD. NPs have value in the diagnosis of HFpEF, but not for ruling out HFpEF. However, NPs should be used in combination with echocardiography. As the risk of bias of the included studies is high and the sensitivity of NPs for detection of DD or HFpEF is low compared to specificity, the use of NPs alone to detect DD or HFpEF should be discouraged, as recommended by current guidelines. Nonetheless, the high NPV observed for both DD and HFpEF indicates they might be useful for screening of high-risk patients in primary care such as those with diabetes. For future research and guidelines, well-performed cross-sectional studies with pre-specified cut-off values for NPs are needed for unbiased estimates of diagnostic performance measures of NPs, especially for use in primary care.