Introduction

Cardiovascular magnetic resonance (CMR) imaging is the gold standard for the assessment of cardiac function and morphology [1, 2]. Left ventricular (LV) ejection fraction (EF) is the most established parameter for cardiac functional assessments in clinical routine and is used for the evaluation of disease severity, treatment follow-up and risk assessment for adverse events [3, 4].

To extract clinically relevant information such as LV mass, LV and right ventricular (RV) end-diastolic and end-systolic volume (EDV/ESV), stroke volume (SV) and EF, accurate post-processing of the cine CMR images is required. In daily clinical routine, post-processing is typically performed manually by delineating endocardial and epicardial LV borders as well as endocardial RV borders in all short-axis (SAX) slices covering the ventricles from atrioventricular ring to apex, in both end-diastolic and end-systolic phases. This task is time-consuming, tedious and subject to observer variability [2, 5, 6, 7]. Emerging post-processing software based on deep-learning algorithms using convolutional neural networks now offers a fully automated approach for LV and RV volume assessments and has recently become commercially available [8]. Initial evaluations of these automated approaches are promising [9], but they are often based on pre-selected cases with excellent image quality or ‘cropped’ data [9, 10], i.e. SAX stacks that were manually triaged to include only end-diastolic and end-systolic images effectively covering the ventricles before the automatic algorithm was applied. Importantly, most of the observer variability results from discrepancies in defining the most apical and basal SAX slices [11], a source of error that is masked by prior manual ‘cropping’ of SAX stacks.

It currently remains unknown whether fully automated quantification of biventricular morphology and function is feasible and accurate in clinical CMR routine. Accordingly, the aim of the present study was to evaluate the feasibility and accuracy of fully automated biventricular assessment of morphology and function in a variety of CMR data (neither pre-selected nor pre-processed) taken from a real-world database of a tertiary care CMR unit.

Methods

Study design

The study population consisted of 300 randomly selected patients referred to CMR within clinical routine care between 2016 and 2018. CMR imaging was performed on clinical 1.5 or 3 Tesla CMR scanners (Magnetom Symphony or Magnetom Skyra, Siemens Healthineers, Erlangen, Germany). Protocols were employed as appropriate for clinical routine, all of which included electrocardiogram (ECG)-gated balanced steady-state free precession (bSSFP) cine sequences for a SAX stack. Typical imaging parameters were as follows: 25 frames/cardiac cycle, pixel spacing 0.8 mm × 0.8 mm, 8 mm slice thickness as well as inter-slice gap, TE 1.5 ms, TR 3 ms. The study was approved by the Ethics Committee of the University Hospital Goettingen and complied with the Declaration of Helsinki. The Ethics Committee gave permission to waive informed consent for this retrospective analysis. Furthermore, agreement was assessed between the fully automated algorithm and expert consensus contours based on the Society for Cardiovascular Magnetic Resonance (SCMR) consensus data consisting of 15 cases with different pathologies [12].

CMR analyses

Volumetric analyses were performed manually in short-axis orientations by an experienced investigator according to standardized recommendations [11] using commercially available post-processing software (QMass®, Version 3.1.16.0, Medis Medical Imaging Systems, Leiden, Netherlands). Fully automated segmentation was performed using dedicated commercially available software (suiteHEART®, Version 4.0.6, Neosoft, Pewaukee, Wisconsin, USA). In both the manual (QMass®) and the automated (suiteHEART®) analyses, the papillary muscles were included within the myocardium and trabecular tissue was excluded from the myocardial mass. Manual segmentation was performed by simple delineation of the LV endocardial and epicardial borders and the RV endocardial border with Bézier smoothing at end-diastole and end-systole. No thresholding or edge detection was applied. Cross-referencing from long-axis locations was used to adjust for systolic atrioventricular ring descent. Fully automated segmentation was run overnight without any user interaction, i.e. with neither pre-processing of the acquired short-axis stack nor post-processing of the automatically traced borders. Analyses included LV mass and biventricular EDV, ESV, SV and EF. Agreement was tested between manual and fully automated analyses. Reproducibility was tested by reapplying the fully automated segmentation algorithm to 20 randomly selected patients and by manual volumetric analyses by two experienced investigators, including intra- and inter-observer reproducibility. All operators were blinded to each other’s results. Furthermore, the analysis time needed to perform manual segmentations was measured in the subset of 20 patients.

The presence and relevance of artefacts impacting image quality were graded by adopting the criteria proposed by Klinke et al. [13], taking wrap-around, respiratory ghost, cardiac ghost, image blurring, metal and shimming artefacts into account (Table 1). One point was given if the artefact impeded the visualization of > 1/3 of the ventricular endocardial border at end-systole and/or end-diastole on a single SAX slice. If such an artefact involved 2 slices or ≥ 3 slices, 2 and 3 points were given, respectively. Furthermore, accurate short-axis orientation was evaluated, resulting in an image quality score between 0 (= excellent quality) and 5 (= poor quality). Image quality scores were separately assessed for the LV and RV myocardium.
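The scoring rule can be illustrated with a short Python sketch. It is not part of the study's software. It assumes that artefact points depend only on the number of SAX slices in which any artefact obscures > 1/3 of the endocardial border, and that incorrect short-axis orientation adds 2 points as stated in the caption of Table 1. The function names and the good-quality threshold helper (score ≤ 1, as used for the subgroup analyses) are hypothetical.

```python
def artefact_points(n_affected_slices: int) -> int:
    """Artefact points: 1, 2 or 3 points for 1, 2 or >= 3 affected SAX slices."""
    if n_affected_slices < 0:
        raise ValueError("number of affected slices must be non-negative")
    return min(n_affected_slices, 3)


def image_quality_score(n_affected_slices: int, orientation_correct: bool) -> int:
    """Total score for one ventricle, from 0 (excellent) to 5 (poor).

    Assumption: incorrect short-axis orientation contributes 2 points.
    """
    orientation_points = 0 if orientation_correct else 2
    return artefact_points(n_affected_slices) + orientation_points


def is_good_quality(score: int) -> bool:
    """Subgroup split used for the agreement analyses (score <= 1 vs >= 2)."""
    return score <= 1
```

The score would be computed twice per examination, once for the LV and once for the RV, since both were graded separately.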

Table 1 Quality assessment of cine short-axis (SAX) images. The image quality score corresponds to the sum of qualitative scoring based on 6 criteria (range of score: 0–5). One point was given if an artefact impeded the visualization of > 1/3 of the ventricular endocardial border at end-systole and/or end-diastole on a single SAX slice. If such artefact involved 2 slices or ≥ 3 slices, 2 and 3 points were given, respectively. Incorrect short-axis orientation was graded with 2 points

For the SCMR consensus data, only LV parameters were compared between automated analyses and manual expert consensus parameters, since RV parameters were not provided. According to the method described by Suinesiaputra et al. [12], papillary muscles and trabecular tissue were excluded from the myocardial mass.
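For orientation, the functional parameters analysed here follow from the segmented volumes by standard definitions (SV = EDV − ESV, EF = SV / EDV). The Python sketch below is illustrative only. The summation-of-discs volume estimate is a common approach for SAX stacks and is an assumption here, since the internal computations of the two software packages are not described. All function and parameter names are hypothetical.

```python
def volume_from_slices(areas_mm2, slice_spacing_mm):
    """Summation of discs: sum of contour areas times slice spacing, in ml.

    Assumption: slice_spacing_mm is the centre-to-centre distance between
    adjacent SAX slices (slice thickness plus any inter-slice gap).
    """
    return sum(areas_mm2) * slice_spacing_mm / 1000.0  # mm^3 -> ml


def stroke_volume(edv_ml, esv_ml):
    """Stroke volume in ml."""
    return edv_ml - esv_ml


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def indexed(value, bsa_m2):
    """Index a volume (ml) or mass (g) to body surface area (m^2)."""
    if bsa_m2 <= 0:
        raise ValueError("BSA must be positive")
    return value / bsa_m2
```

Body surface area is taken as an input because the formula used to derive it is not stated.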

Statistics

Continuous variables were checked for normal distribution using the Shapiro-Wilk test and are presented as median with interquartile range (IQR). Biventricular volumes and LV mass were indexed to body surface area. Dependent variables were tested using the Wilcoxon signed-rank test. Agreement of manual and automated analyses as well as intra- and inter-observer variability were assessed using, first, Bland-Altman analysis [mean difference between measurements with 95% confidence interval (CI)] [14]; second, intra-class correlation coefficients (ICC) based on a model of absolute agreement, considered excellent if ICC > 0.74, good between 0.60 and 0.74, fair between 0.4 and 0.59 and poor below 0.4 [15]; and third, the coefficient of variation (CoV, = standard deviation [SD] of the differences divided by the mean) [16, 17]. P-values provided are two-sided; a p-value below 0.05 was considered statistically significant. Statistical analyses were performed using IBM SPSS Statistics Software Version 24 (International Business Machines, Armonk, New York, USA) and Microsoft Excel (Microsoft, Redmond, Washington, USA).
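The agreement measures can be sketched in plain Python as follows. This is an illustration, not the SPSS or Excel procedure used in the study. Several points are assumptions: the limits of agreement are taken as bias ± 1.96 SD of the differences, the CoV denominator is taken as the mean of all paired measurements, and the ICC is computed as the two-way, single-measures, absolute-agreement form, since the exact ICC variant is not specified. Function names are hypothetical.

```python
import statistics


def bland_altman(auto, manual):
    """Return (bias, lower LOA, upper LOA) for differences auto - manual."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def coefficient_of_variation(auto, manual):
    """SD of the differences divided by the mean of all paired measurements."""
    diffs = [a - m for a, m in zip(auto, manual)]
    pair_means = [(a + m) / 2 for a, m in zip(auto, manual)]
    return statistics.stdev(diffs) / statistics.mean(pair_means)


def icc_absolute_agreement(auto, manual):
    """Two-way, single-measures ICC for absolute agreement (two methods)."""
    n, k = len(auto), 2
    rows = list(zip(auto, manual))
    grand = statistics.mean([*auto, *manual])
    row_means = [statistics.mean(r) for r in rows]
    col_means = [statistics.mean(auto), statistics.mean(manual)]
    # Mean squares for subjects (rows), methods (columns) and residual error
    msr = k * sum((rm - grand) ** 2 for rm in row_means) / (n - 1)
    msc = n * sum((cm - grand) ** 2 for cm in col_means) / (k - 1)
    sse = sum(
        (x - rm - cm + grand) ** 2
        for r, rm in zip(rows, row_means)
        for x, cm in zip(r, col_means)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify_icc(icc):
    """Interpretation bands as defined in the Statistics section."""
    if icc > 0.74:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"
```

Because the ICC is based on absolute agreement, a constant offset between the two methods lowers it even when the measurements are perfectly correlated.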

Results

Study population

Patient characteristics and cardiac volumes for both manual and automated assessments are presented in Table 2. Biventricular automatic segmentation was feasible in all 300 cases. In comparison with manual evaluations, automatic assessments yielded higher LV volumes, lower LV EF and higher LV mass as well as higher RV EDV, RV SV and RV EF (p < 0.001 for all). The study population consisted of 100 referrals to evaluate ischemic heart disease, 120 patients with myocardial disease, 70 patients with congenital heart disease and 10 others. Table 3 provides an overview of clinical indications. There were 31 patients imaged after aortic valve replacement (AVR), of whom 18 had received transcatheter aortic valve replacement (Edwards SAPIEN 3™, Edwards Lifesciences, Irvine, California, USA), 7 open-surgery AVR using a bioprosthesis (Carpentier-Edwards Perimount™, Edwards Lifesciences) and 6 open-surgery AVR with a mechanical aortic valve (SJM Regent™, St. Jude Medical Inc., St Paul, Minnesota, USA).

Table 2 Demographics and biventricular volumes
Table 3 Clinical CMR indications

Image quality and post-processing

LV image quality was graded with 1.0 (SD 1.3) points (score 0: n = 168, score 1: n = 19, score 2: n = 46, score 3: n = 67). RV image quality was graded with 1.1 (SD 1.3) points (score 0: n = 151, score 1: n = 37, score 2: n = 39, score 3: n = 73). Appropriate short-axis orientation was fulfilled in 298 cases; the highest image quality score assigned was 3. Manual post-processing took on average 11.3 ± 1.5 min as opposed to automated post-processing with < 1 min/SAX stack. Representative examples of high and low segmentation accuracy are given in Fig. 1. Corresponding videos including automatic segmentation of all phases and SAX slices can be found in Additional file 1.

Fig. 1

Fully automated biventricular segmentation (upper panel) and manual segmentation (lower panel) of 6 representative cases. The figure comprises examples with good automated segmentation results (a-b) and limited automated segmentation results (c-f). Segmentation results of all phases and all short-axis slices can be found in the supplementary material. a End-diastolic phases of a patient with suspected myocarditis (1.5 T) and excellent biventricular segmentation. b End-systolic phases of a patient after transcatheter aortic valve replacement (TAVR) imaged at 3 T showing good segmentation results, except for insufficient segmentation of papillary muscles. c End-systolic phases of a patient with severe hypertrophic cardiomyopathy (HCM) imaged at 1.5 T with low accuracy of biventricular segmentations. d End-diastolic phases of a patient with dilated cardiomyopathy (DCM) imaged at 1.5 T with underestimation of right ventricular (RV) volume at the basal level. Also note misinterpretation of two apical thrombi as papillary muscles. e End-systolic phases of a patient with repaired Tetralogy of Fallot (ToF) with underestimation of RV volume due to severe metallic artefacts caused by sternal wires. f End-systolic phases of a patient with pulmonary arterial hypertension (PAH) with underestimation of RV volume most likely due to RV hypertrabeculation

Agreement of manual and automated analyses

Results comparing automated and manual volume assessments including mean differences with corresponding SD, ICC and CoV are presented in Tables 4 and 5. Corresponding Bland-Altman plots are presented in Figs. 2, 3 and 4. Agreement of manual and automated assessments in the overall cohort of 300 patients was excellent for all LV parameters (ICC ≥ 0.91), best for EDV (ICC 0.98), closely followed by ESV (ICC 0.96) as well as mass and EF (both ICC 0.95). The automated algorithm slightly overestimated LV mass, EDV and ESV while underestimating LV EF (mean difference −2.5%, limits of agreement [LOA] −14.6 to 9.6%, p < 0.001). Agreement for RV volumes was excellent for RV EDV and ESV (both ICC 0.92) and good for RV SV (ICC 0.73) and EF (ICC 0.72). Similar to the LV measurements, the automatic algorithm overestimated RV EDV, and also RV EF (mean difference 5.8%, LOA −13.0 to 24.6%, p < 0.001). Higher field strength (3 vs 1.5 Tesla) was associated with reduced agreement in biventricular volumes, though it was also associated with a decrease in image quality (1.5 T: LV image quality score 0.8 [SD 1.2], RV image quality score 0.7 [SD 1.1]; 3.0 T: LV image quality score 1.4 [SD 1.3], RV image quality score 1.1 [SD 1.3]; p < 0.001 for all). Similarly, aortic valve replacement resulted in lower agreement but was also accompanied by lower image quality (LV image quality score 1.9 [SD 1.2]; RV image quality score 2.0 [SD 1.2]). Repaired ToF was associated with decreased RV image quality (RV image quality score 1.8 [SD 1.1]) but preserved LV image quality (LV image quality score 0.62 [SD 0.99]). Despite preserved LV image quality, agreement was reduced for both LV and RV volumes (Table 4).

Table 4 Agreement between manual and automated segmentations. Agreement was analysed in the entire study group (n = 300) as well as in subgroups according to field strength, aortic valve replacement and repaired Tetralogy of Fallot
Table 5 Agreement between manual and automated analyses according to image quality
Fig. 2

Agreement of automatically and manually derived biventricular morphology and function. Bland Altman plots (automatic – manual) are shown for the entire study collective (n = 300). LV/RV: left/right ventricle, EDV/ESV: end-diastolic/systolic volume, SV: stroke volume, EF: ejection fraction, Δ: difference

Fig. 3

Agreement of automatically and manually derived left ventricular parameters according to image quality. Bland Altman plots (automatic – manual) are shown for studies with good image quality (score ≤ 1, n = 187) and for studies with reduced image quality (score ≥ 2, n = 113). LV: left ventricle, EDV/ESV: end-diastolic/systolic volume, SV: stroke volume, EF: ejection fraction, Δ: difference

Fig. 4

Agreement of automatically and manually derived right ventricular parameters according to image quality. Bland Altman plots (automatic – manual) are shown for studies with good image quality (score ≤ 1, n = 188) and for studies with reduced image quality (score ≥ 2, n = 112). RV: right ventricle, EDV/ESV: end-diastolic/systolic volume, SV: stroke volume, EF: ejection fraction, Δ: difference

When classified according to image quality score, 0 to 1 points were associated with considerably better agreement than 2 to 3 points, both for LV and RV automated analyses. For an image quality score of ≤ 1, both LV and RV agreements were excellent for all variables with a bias of −0.6% (LOA −7.6 to 6.4%) and 3.0% (LOA −9.2 to 15.2%) for LV EF and RV EF, respectively. However, large differences were observed in case of reduced image quality (score ≥ 2) with a bias of −5.6% (LOA −20.6 to 9.4%) and 10.6% (LOA −13.6 to 34.8%) for LV EF and RV EF, respectively. LV and RV stroke volumes were very consistent in automated analyses (median LV SV 48.1 ml/m² compared to median RV SV 47.6 ml/m², p = 0.435).

Results from the comparison between automatically and manually derived expert consensus LV parameters based on the SCMR consensus data are provided in Additional file 2. In accordance with the results of the main study cohort, agreement was excellent in the majority of cases (ICC ≥ 0.95 for all LV parameters) (Additional file 2: Tables S1 and S2). There was one patient with a 20% difference in LV EF between automatic and manual expert results (Case # 15), a patient with severe LV hypertrophy (Additional file 2: Table S1, Figure S1), similar to the case shown in Fig. 1c.

Reproducibility

Reproducibility for manual segmentations was better for LV than for RV measurements. The automated algorithm yielded exactly the same results when being reapplied. Table 6 shows ICC, CoV and mean differences (SD) within and between observers.

Table 6 Reproducibility of manual and automated analyses

Discussion

The present study demonstrates the feasibility of fully automated quantification of biventricular morphology and function and reveals its current pitfalls and limitations in a ‘real-world’ clinical setting. Several notable findings should be considered. First, automatically and manually derived volumes agree well in case of good image quality; however, substantial differences occur in case of reduced image quality. Second, agreement is better for LV than for RV volumes. Third, demanding anatomical circumstances (e.g. in patients with repaired ToF) result in lower agreement. Fourth, different field strengths or the presence of valve replacements do not impede automated assessments as long as image quality is preserved.

Agreement of automated and manual assessment

CMR represents the reference standard for cardiac volume assessment [1] with incremental accuracy and reproducibility as compared to echocardiography [18]. However, CMR acquisition is time-consuming and additionally requires time-consuming post-processing to extract clinically relevant information. Thus, efforts have been directed towards automated post-processing analyses based on deep-learning algorithms within the last decade [8, 19, 20]. The current literature demonstrates excellent agreement for automated and manual LV volume assessments [6, 10]; however, studies concerned with automatic RV segmentation are scarce [21]. Notably, the study by Queirós et al. [10] applied an automatic algorithm to cropped data, that is, after manually defining the most basal and apical slices and subsequently cropping the SAX stack to include only images effectively covering the LV. Furthermore, ED and ES time points were manually pre-selected. However, the correct definition of the most basal slice is among the most challenging steps in SAX volume assessments and one of the most important sources of observer variability [22]; manual pre-selection therefore represents a clear bias in testing the reliability of an automated algorithm. In the present study, we sought to simulate a real-world clinical scenario by randomly selecting patients from clinical routine imaging. We applied a commercially available automatic algorithm to clinically acquired SAX stacks, occasionally comprising both atria and ventricles, without any manual pre- or post-processing. The data were acquired on 1.5 and 3.0 T scanners. Our data confirm the excellent agreement between automatically and manually derived volumes in case of good image quality, with overall better agreement for LV than for RV measurements. Indeed, quantification of RV volumes is generally more challenging than that of LV volumes due to the complex RV anatomy [23, 24]. Nevertheless, LV and RV stroke volumes were consistent in automated analyses in this patient group without intra- or extracardiac shunt.

Manual post-processing took on average more than 11 min as compared to < 1 min for fully automated assessments. Importantly, automatic analyses of several CMR examinations (in this case 300 scans) ran completely user-independently and were performed overnight. Furthermore, automated analyses promise to overcome limitations in observer variability, since the algorithm yields exactly the same measures when reapplied by different users. Thus, the automated framework provides a highly reproducible approach and is able to substantially shorten post-processing times of CMR examinations, with subsequent potential to improve cost-effectiveness [25]. Furthermore, the framework may provide ‘on-the-fly’ post-processing while the CMR scan is still being completed (e.g. during late gadolinium enhancement acquisitions).

Impact of image quality

Our data demonstrate that image quality is the leading determinant of accuracy for fully automatic volume assessment. In case of good image quality (image quality score ≤ 1, adapted from the criteria proposed by Klinke et al. [13], Table 1), the bias of both LV and RV function was within acceptable limits. However, in case of reduced image quality (image quality score ≥ 2), a large bias of > 5% was observed for both LV and RV EF with wide LOA, particularly for RV EF. Importantly, the relevance of RV function and volumes is increasingly recognized in various diseases [26]. For example, the diagnosis of arrhythmogenic right ventricular cardiomyopathy is challenging and heavily relies on the assessment of RV EDV and RV EF [27]. If considered for clinical use and decision making, a precise volume assessment is of utmost importance, and this cannot yet be achieved with the proposed fully automatic algorithm in case of impaired image quality.

Technical and anatomical considerations

To further elucidate limitations of the commercially available software, we compared the agreement of automated and manually derived volumes for subgroups according to field strength, the presence of aortic valve replacements as well as repaired ToF. Agreement was better at 1.5 T than at 3 T; however, at 3 T considerably more artefacts (mainly due to inadequate breath-holding and shimming) were present. Reduced agreement at 3 T is therefore more likely a result of lower image quality. Due to the growing number of percutaneously implanted aortic valves [28] and increasing indications for CMR imaging [29] including aortic valve stenosis [30], the presence of valve replacement in CMR studies is likely to grow. As long as image quality was preserved in these patients, agreement of LV volumes remained acceptable, enabling the use of automated algorithms in this group of patients. In contrast, in patients with repaired ToF, both RV and LV agreement were considerably decreased, although low image quality solely affected the RV (metal artefacts resulting from sternal wires). Since LV image quality was good, reduced agreement is most likely due to the more demanding anatomy in these patients (distinctly larger RV than LV volumes), which points out the current limitations of fully automated analysis. Here, it remains to be investigated whether the proposed automatic deep-learning framework is able to learn further from these cases with subsequent improvement of accuracy.

Limitations

The study’s conclusions are derived from the comparison of 300 automatically and manually quantified clinical CMR examinations from a single centre. Although manual contouring was performed by experienced observers, intra- and inter-observer variability may limit its use as a reference standard. Details of the automatic algorithm are not disclosed by the software vendor and therefore cannot be reported. RV mass was not measured, since the automatic algorithm does not provide RV mass quantification.

Conclusion

Fully automated quantification of biventricular morphology and function is feasible and accurate in the majority of cases in a clinical routine setting and has the potential to substantially shorten post-processing times and to improve reproducibility. However, in case of limited image quality or in patients with demanding anatomy (e.g. in patients with repaired ToF), the proposed fully automatic framework does not yet provide satisfactory results and still requires manual re-contouring.