Left ventricular (LV) mass and volumes are essential for the management of patients with cardiovascular disease. In particular, LV mass (LVM) is an independent predictor of cardiovascular events [1], and end-diastolic volume (EDV) and end-systolic volume (ESV) are associated with adverse remodeling [2]. Cardiovascular magnetic resonance (CMR) is currently the most accurate and reproducible method for quantifying LV mass and volumes [3]. CMR is non-invasive, does not require geometrical assumptions, is non-ionising and has high signal-to-noise ratio. Because of these advantages, CMR is becoming widely used for the measurement of ventricular volumes, function and mass in many clinical centers, as well as in large research studies including the Multi-Ethnic Study of Atherosclerosis (MESA) [4] and the UK Biobank [5].

LV mass and volume quantification requires accurate delineation of the blood pool and myocardium. Although the contrast between flowing blood and the myocardium in steady-state free precession (SSFP) images is typically excellent, the precise placement of the contours is reader dependent [6]. High reproducibility of LV mass and volume measurement based on cine CMR has been shown within single centers [7, 8], but differences in training and standard operating procedures may occur between centers [6]. A multi-center consensus ground truth dataset would be valuable for evaluating sources of variability, training new readers, establishing standard protocols for multi-centre studies and validating new computer algorithms for automated contouring. Such a dataset is difficult to establish, due to the time-consuming nature of manual contouring. Although there are several recourses that offer large breadth [5, 9, 10], no resource offers depth of expert analysis from multiple centers. Greater depth of readers is valuable in evaluating sources of variation and an unbiased consensus. The aim of this study was to develop a consensus ground truth LV contour dataset for SSFP cine images, derived from expert readers representing seven independent CMR centers from many countries around the world, in accordance with the SCMR post-processing guidelines [11].



Cine CMR images from 15 subjects (five healthy volunteers, six patients with myocardial infarction, two patients with heart failure, and two patients with LV hypertrophy) were included in this study. CMR images were acquired with contiguous short axis slices and 2–3 long axis slices in accordance with SCMR guidelines using three different scanners (4 GE, 5 Siemens and 6 Philips). Spatial resolution varied with FOV, ranging from 92 × 72 to 280 × 280 mm2 (see Table 1). Temporal resolution was typically 20–30 frames, except for one case with 60 cardiac frames. Slice thickness was either 8 or 10 mm. Short-axes view series covering the LV from apex to base were defined in 10–15 slices. Anonymized images were contributed to the Cardiac Atlas Project database with the approval of local institutional review boards. Written informed consent was obtained from all participants.

Table 1 Patient and image characteristics, showing variability in patient pathologies, gender, age, MR vendors and MR parameters. All units (slice thickness (SLT), slice distance (SLD), field of view (FOV) and pixel size) are in mm. Number of slices and frames are taken from short-axis series


Seven expert readers, representing seven CMR core laboratories from six countries around the world (two USA, Canada, two UK, Germany and Netherlands) analyzed all cases independently. The contours were reviewed by the core laboratory principal investigator and represented the standard practice of each core laboratory. Each center was able to use their usual software, with the condition that all contours were placed manually, or manually corrected if initial contours were found automatically. Contouring was performed in accordance with the SCMR guidelines on standardized image interpretation and post-processing of CMR images [11]. Trabeculae and papillary muscles were included in the blood pool and excluded from the LV mass. No attempt was made to train the readers, influence their analysis or achieve consistent results between readers.

Epicardial and endocardial contours were drawn on the short-axis slices at end-diastole (ED) and endocardial contours drawn at end-systole (ES). The ED and ES frames were pre-determined for all readers by the coordinating center, based on smallest area in a mid-ventricular slice. All readers were asked to contour slices covering the whole ventricle from apex to base, but there was no restriction on which slices to include or exclude.

Readers used a range of software packages. Two readers used OsiriX (Pixmeo, Geneva, Switzerland), five readers used QMass (Medis, Leiden, the Netherlands), and two readers used CMR42 (Circle Cardiovascular Imaging Inc., Calgary, Canada). Two readers used two different software packages. Contours were imported from these software packages and pre-processed using Matlab R2010a (Mathworks, Natick, MA, USA). Consensus contours were only generated if most of the readers (i.e. four or more) contoured the slice; otherwise, no consensus contours were produced.

Consensus contour estimation

Consensus contours were estimated using the Simultaneous Truth and Performance Level Estimation (STAPLE) method [12]. This method calculated unique contours in each slice that maximized the conditional probability of the consensus given the readers’ contours. Briefly, contours from each reader were first converted to binary images (1 for pixels within the contour, 0 for pixels outside the contour). Since the contour resolution was higher than the original images, the binary images were calculated at a resolution 4x higher than the original image. Given an estimate of the consensus contours, the sensitivity of each reader was calculated as the proportion of pixels inside the consensus contours that were also inside the reader contours. Similarly, the reader specificity was calculated as the proportion of pixels outside the consensus contours, which were also outside the reader contours. The STAPLE method uses Expectation-Maximization [13] to calculate the optimal consensus contour, as well as the reader sensitivity and specificity. There are two steps that are performed iteratively until convergence. The first step (Expectation) estimates the consensus probability given the reader contours and current estimates of sensitivity and specificity. The second step (Maximization) updates the sensitivity and specificity of each reader based on this consensus probability. The result is not the same as simple averaging or pixel voting, since specificity and sensitivity behave as weights during the optimization process, which are not assumed to be equal across all readers. Instead, the voting solution is used as the initial estimate to start the iteration. The STAPLE method has been successfully applied in several medical imaging applications, and was recently used to estimate consensus ground truth contours from automated CMR analysis methods [9].

Cavity volumes and myocardial mass

LV cavity volumes at ED (EDV) and ES (ESV) were computed by slice summation. Two-dimensional cavity areas were multiplied by the inter slice distance to compute a slice volume. The myocardial mass (LVM) was defined at ED by subtracting EDV from epicardial volume and multiplying by 1.05 g/ml.

Functional assessment

Root mean squared errors (RMSE) in volumes and mass were computed to measure the agreement between each reader and all the other readers. This was defined as

$$ {E}_i(F)=\sqrt{{\displaystyle \sum_{j=1,\kern0.1em j\ne i}^R{\displaystyle \sum_{k=1}^N\frac{{\left({F}_i(k)-{F}_j(k)\right)}^2}{N\left(R-1\right)}}}} $$

where E i indicates the RMSE for reader i, j indicates all other readers, F indicates either EDV, ESV, LVM or ejection fraction (EF), R is the number of readers, k indicates the cases, and N is the number of cases. A similar RMSE was applied also to the consensus, denoted by E C (F), to measure the agreement between the consensus and all readers:

$$ {E}_C(F)=\sqrt{{\displaystyle \sum_{j=1}^R{\displaystyle \sum_{k=1}^N\frac{{\left({F}_C(k)-{F}_i(k)\right)}^2}{NR}}}} $$

Smaller values of E C (F) compared to E i (F) for all i = 1, 2, … , R were taken to indicate functionally acceptable consensus contours.

Visual assessment

An independent reader with over 10 years experience in CMR, who was not affiliated with any of the participating core laboratories, visually assessed the consensus contours by scoring as either acceptable or unacceptable according to whether the contour was clinically plausible.


Bland-Altman analysis was performed to evaluate variation in volumes and mass across all readers relative to the consensus. The limits of agreement were defined at 95 % of confidence interval from the bias. Individual reader bias was quantified by the mean of the differences from the consensus, and reader precision was quantified by the standard deviation of the differences. Volumes and mass estimated from the consensus contours were calculated with the standard error estimated between the readers and the consensus. Statistical analysis was performed using the open source R statistics package (The R Foundation of Statistical Computing Platform, ver. 3.1.1).


Visual assessment

Of all 15 cases, no unacceptable consensus contours were found by the independent reader. Figure 1 shows a representative case of consensus contours estimated from this study. Although large disagreements between readers could be found in some of the apex, base and outflow tract slices, the consensus contours generated from these difficult slices (shown by Fig. 2) were visually acceptable.

Fig. 1
figure 1

Estimated consensus contours from a myocardial infarction case

Fig. 2
figure 2

Three examples of difficult cases with larger reader disagreement, showing how STAPLE can estimate consensus contours that gives the best agreement among readers. Top rows: reader contours, bottom rows: the estimated consensus contours

Functional assessment

Table 2 shows the estimated consensus LV function per case in terms of EDV, ESV, LVM and EF (ejection fraction). Standard errors between the readers and the consensus were used to indicate confidence intervals for the estimated values.

Table 2 Individual case consensus LV parameters: End-Diastolic Volume in ml (EDV), End-Systolic Volume in ml (ESV), Mass in g (LVM) and Ejection Fraction in % (EF). All values are represented by the consensus estimation ± standard error. Note that LVM was calculated from the ED frame

The RMSE values from the consensus (E C ) were always the smallest compared to any reader RMSE values (E i ). For EDV, E C was 14.7 ml while E i was from 15.2 to 28.4 ml. For ESV, E C was 13.2 ml while E i was from 14 to 21.5 ml. For LVM, E C was 17.5 g, while E i was from 20.2 to 34.5 g. For EF, E C was 4.2 % while E i was from 4.6 to 7.5 %.

Bias and precision

Figure 3 shows reader bias and precision using the estimated consensus values (Table 2) as the reference. All readers showed good precision, with low standard deviations of the differences. The precision in EDV ranged from 5.0 to 11.6 mL (average 8.9 mL); precision in ESV ranged from 7.0 to 11.8 mL (average 9.5 mL); precision in LVM ranged from 10.0 to 12.9 g (average 10.9 g); precision in EF ranged from 1.5 to 5.6 % (average 3.6 %). However, differences in analysis protocols between readers were evident, with some readers exhibiting smaller endocardial and larger epicardial contours, whilst others showed larger endocardial and smaller epicardial contours. The bias in EDV ranged from −36.6 to 40.5 mL (average 0.9 mL); bias in ESV ranged from −32.9 to 41.2 mL (average 0.8 mL); bias in LVM ranged from −44.5 to 59.6 g (average 0.7 g); bias in EF ranged from −11.7 to 12.8 % (average 0.0 %).

Fig. 3
figure 3

Reader bias and precision against the estimated consensus. Each bar denotes the mean ± one standard deviation

Individual reader reports were generated automatically. Figure 4 shows an example report demonstrating similarity of the reader contour with the consensus, with the disparity between expert readers (dark red bands).

Fig. 4
figure 4

An example of the automatically generated report visualizing the similarity between a reader (green) and consensus (red) contours. Magenta bands indicate the range of contour positions for all seven readers. Arrows indicate points on the reader contour with distance > 3 mm from the consensus contour. Top: endocardial; bottom: epicardial; left: ED; right: ES


Readers were consistent within themselves, which was indicated by the relatively small standard deviations (precision) from the consensus within each reader (Fig. 3). However, different readers could be larger or smaller than the consensus (bias). This was due primarily to differing practices at each core lab, where contouring was consistently smaller or larger compared with the other readers. We also examined other sources of possible bias, in particular in the outflow and apical slices (Fig. 2), but the bias contributions from these slices to the final mass and volume estimate were insignificant.

For clinical studies collating results across core labs, steps should be taken to reduce bias between core labs, either prior to the analysis (by training) or after analysis (by bias correction). This study provides a standard set of cases that could be used as a training set. Alternatively, automated post-hoc bias correction methods [14] can also be applied.

Consensus contour quality

The consensus LV function estimates had the better agreement with all readers than any individual reader. All the RMSE values for EDV, ESV, mass and EF were smallest for the consensus compared with any reader. This indicated that the consensus had better agreement with the readers than any individual reader.

As demonstrated by Fig. 1, the consensus contours were visually acceptable in all cases. Greatest disagreements between readers appeared in the areas where tissue contrast ratios were low, such as in the apical slices and in the outflow tract. Even for these slices, the estimated consensus contours were visually acceptable. Note that the outflow tract was contoured as part of the LV cavity as recommended in the SCMR guidelines [11].

The STAPLE method was performed on each slice and each contour type independently, without taking into account any information about the geometry or anatomy of the myocardium or any pixel intensity values. The STAPLE method is not a vote counting mechanism, and therefore the functional consensus is not a simple average of mass or volume across readers. An example of the difference between voting and STAPLE consensus contours is shown in Fig. 5, demonstrating that STAPLE found an acceptable epicardial apex contour whereas voting did not.

Fig. 5
figure 5

Left: epicardial apical ED contours drawn by readers. Right: the difference of consensus between pixel voting (cyan contour) and the STAPLE algorithm (red contour)

Reader assessment

In this study, we developed a resource CMR dataset, where myocardial contours were defined through a consensus of seven independent expert readers. The dataset will be useful for training and assessing new readers on contouring CMR images. Feedback to new readers can include graphical displays of areas of maximum disagreement (Fig. 4). Quantitative measurements can be measured in terms of mean, maximum and standard deviation of the distance from the new contour to the consensus.

Differences in mass and volumes derived by new reader contours can be compared with Table 2. The CMR data are available on request from the Cardiac Atlas Project for the purpose of training new readers and benchmarking automated processing methods. Note the consensus is a fusion of a number of different core labs, and does not reflect the practice of any particular lab.


Only fifteen cases of varying pathology were included in the study, due to the time consuming nature of manual contouring. However, the purpose of the study was to provide a resource for assessing automated methods and facilitating training. We therefore chose increased depth of readers from different centres over breadth of subjects. This provides considerable power for analysis of differences [15, 16]. Although resources are becoming available with many hundreds and even thousands of studies [5, 17], these provide manual contours from a small number (typically one or two) of centers. Our study therefore provides a unique resource representing the largest number of expert readers to date.

The SCMR guidelines [11] recommend either inclusion or exclusion of papillary muscles. Our study excluded papillary muscles from the LV mass, but papillary muscles are myocardial tissue, which ideally should be included. The formation of a manual ground truth from many expert readers would be very time consuming, but very valuable for validating automated methods [18, 19]. In the future, more widely automated methods are likely to increase the number of centers quantifying papillary mass. Right ventricular and atria consensus contours would also be useful to establish in the future.


We have estimated a set of consensus myocardial contours from SSFP CMR short-axis slices from expert readers representing experienced core labs around the world. The consensus contours achieved better agreement in LV mass and volumes than between readers. This consensus dataset is valuable resource for the assessment of new readers, as a basis for multi-center analyses, as well as benchmarking automated methods. The main source of bias was small but consistent differences in contour placement in all areas.