Background & Summary

Cardiac malformations and changes in heart structure that are present at birth are collectively referred to as congenital heart disease (CHD)1. It is the leading cause of birth defect related deaths2. However, life expectancy is improving, leading to a growing population of adults with CHD who need ongoing care3. In CHD, vessels or chambers may be abnormally shaped (e.g., dilation), intracardiac connections may be atypical (e.g., double outlet right ventricle (DORV), in which the aorta connects to the right ventricle instead of the left ventricle), structure locations may be unexpected (e.g., the definitions of “left” and “right” ventricle are based on chamber characteristics and not location, so they may be swapped), and structures may be duplicated (e.g., two superior vena cavae) or missing (e.g., single ventricle and common atrium). Each CHD patient has a unique heart, with its own blend of original heart defects, prior surgical interventions, and transformations from long-term cardiac remodeling4.

Segmentation methods for CHD patients would be highly valuable5. Studies indicate that the greater appreciation of a patient’s unique anatomy provided by 3D patient-specific surface models may improve surgical planning and even prompt changes to surgical plans derived from imaging6,7,8,9,10. There are opportunities in surgical simulation, e.g., to simulate placement of valve implants, clips, baffles or Fontan conduits11,12,13,14. Moreover, automatic segmentation would facilitate quantitative metrics of cardiac function, such as chamber volumes, ejection fraction, and aortic dimensions15,16, which for such complex hearts are typically derived from hours of manual segmentation. However, this is a vulnerable, small, heterogeneous population unlikely to attract significant interest by industry17.

We describe the first public dataset for whole-heart segmentation from cardiovascular magnetic resonance (CMR) images from CHD patients. We use “whole-heart segmentation” to refer to the segmentation of 8 structures: the left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA), aorta (AO), pulmonary artery (PA), superior vena cava (SVC) and inferior vena cava (IVC). Some whole-heart segmentation datasets also include a label for the myocardium, which is not included here because it is much less important than the cardiac chambers and great vessels for the proposed clinical applications.

No currently available public dataset addresses all key factors. MM-WHS includes just 16 MR images from CHD patients, out of 60 CT images and 60 MR images18. The ImageCHD dataset contains CT images19. Babies and children with CHD undergo more surgery (and therefore imaging) to correct heart issues early in life. Avoiding ionizing radiation is particularly important for children, making CMR much more attractive. The 64 CMR scans released by Bidhendi et al.20 include LV/RV segmentations only. Finally, the first CHD MICCAI challenge, HVSMR, from our group contains 20 CMR images to be segmented into three classes: the global blood pool, myocardium/vessel walls, and background (http://segchd.csail.mit.edu)21. Relatively few methods had been developed for CHD before our challenge, which inspired many new works (e.g.22,23,24,25). However, a larger dataset of abnormal cases with a true whole-heart segmentation task is needed (see Arafati et al.26).

Some examples of our dataset are shown in Fig. 1. The HVSMR-2.0 dataset27 includes 60 CMR scans specifically chosen for their inclusion of diverse heart defects, including many not found in the datasets described above. Each image has a precise manual whole-heart segmentation. The main technical challenge is anatomical variability. CMR scans can also exhibit inhomogeneity and off-resonance artifacts, as well as large dark artifacts surrounding implanted stents. Furthermore, the valves and thin walls that separate neighboring structures are often beyond the imaging resolution.

Fig. 1
figure 1

Example whole-heart segmentations in the HVSMR-2.0 dataset of cardiovascular magnetic resonance scans from patients with congenital heart disease (CHD). For definitions of each CHD subtype, see Table 1.

We anticipate that HVSMR-2.0 will drive the development of innovative methodologies. These may also be useful to other applications with anatomical variability. Interactive segmentation may be particularly interesting, due to the challenges of this dataset, which may lead to new general-purpose algorithms to assist in dataset creation. It would also be beneficial to investigate the effectiveness of modern machine learning methods, such as diffusion and transformer-based models or topological priors, for CHD segmentation.

Methods

Ethical approval

The Institutional Review Board (IRB) at Boston Children’s Hospital approved this retrospective evaluation of imaging data and waived the requirement for written informed consent (IRB-P00011748). The IRB Chairs determined that the sharing of these 60 deidentified CMR scans and associated metadata via an open license is not inconsistent with the IRB’s approval of the original protocol or waiver of informed consent. The IRB has no objection to the sharing of these deidentified images and data.

Subject selection

The dataset includes 60 images from patients with a wide variety of cardiac anatomies and defects, many of whom have undergone cardiac surgery for CHD. The distribution of diagnoses is given in Table 1. Subjects range in age from <1 to 52 years, with a mean age of 11.6 years and a median age of 10 years. There are 24 subjects aged 0–4 (neonates, babies, and toddlers), 25 subjects aged 5–18 (children) and 11 subjects aged 19 + (adults).

Table 1 Heart defects and surgeries.

The first 20 images come from the original HVSMR dataset, and we subsequently chose an additional 40 cases that encapsulate a wider range of CHD subtypes. We aimed to create a balanced dataset that samples the different heart defects and their combinations as uniformly as possible, so that any models trained using this dataset will be applicable to a wide range of defects. However, some imbalance is inevitable, since some defects are much more common than others (e.g., ventricular and atrial septal defects co-occur with many other heart defects).

The 20 original HVSMR images were chosen on a rolling basis, as acquired under standard clinical practice at Boston Children’s Hospital. The inclusion criteria for this study encompassed four key parameters. First, a prerequisite was high image quality, necessitating minimal bulk motion of patients; images displaying substantial patient motion were excluded. Second, signal homogeneity in both blood and myocardium was crucial, leading to the exclusion of images featuring significant signal inhomogeneity where the border between myocardium and blood was not visually identifiable. Third, the strength of off-resonance artifacts causing significant signal void in blood or myocardium was considered. Finally, the fourth criterion involved ensuring an adequate field of view, specifically encompassing the entire LV, RV, and great vasculature within the scan.

To choose the remaining 40 images, 3606 potential cases were identified by searching the written radiology reports at Boston Children’s Hospital for keywords indicating that a 3D CMR scan was acquired. Each radiology report includes standardized “cardiology codes”, which enumerate hundreds of different diagnoses, abnormalities in cardiac anatomy or function, and prior interventions. We identified codes pertaining to important heart defects and corrective surgeries of interest under the advice of a pediatric cardiologist (Su.G.), and manually selected 40 CMR studies with both (1) the codes required to create an overall dataset that was as balanced as possible, and (2) high image quality, as described above.

Subject categorization

Each case was classified as having mild, moderate or severe anatomical malformations, under the advice of a pediatric cardiologist (Su.G.). Definitions of CHD subtypes and surgeries are provided in Table 1. Mild: roughly normal anatomy, prior CHD surgery with restoration of normal anatomy, and/or a mildly or moderately dilated chamber or vessel. Moderate: abnormal connectivity, septal defect, bilateral SVC, a severely dilated chamber or vessel, and/or congenital connective tissue disorder causing tortuous vessels. Severe: Heart malposition or situs inversus, L-loop TGA, single ventricle, common atrium, and/or major prior reconstructive surgery resulting in highly abnormal anatomy. There are 12 “mild” subjects, 11 “moderate” subjects, and 37 “severe” subjects. Note that these categories represent deviations from normal anatomy, and not the patient’s prognosis, and that “moderate” cases still have significant heart defects (e.g., DORV). Most subjects have a unique combination of heart defects and prior surgeries (ignoring coincident variants): 64% of moderate subjects and 95% of severe subjects.

Image acquisition

All images capture a snapshot of the heart at high resolution, i.e., they depict a static heart and not a beating heart. 3D CMR images were acquired in an axial view on a clinical 1.5 T scanner (Philips Achieva) during clinical routine at Boston Children’s Hospital. Most images were acquired using a respiratory navigator technique during a free-breathing acquisition using a steady-state free precession (SSFP) pulse sequence, with prospective ECG gating to freeze cardiac motion and generate a static image of the heart. Intravenous gadolinium-based contrast agent (Ablavar (gadofoveset) or Gadovist) was used in many patients. Whether or not to use contrast was decided by the clinician. Since some institutions use contrast and some do not, it is helpful clinically that our dataset includes examples of both.

All images were manually cropped at the level of the chin to ensure that no facial features are present. As the images were originally acquired in a clinical environment, each image has a different size (481 × 410 × 171 on average, ranging from [256–720] × [120–607] × [90–528]). Each image has near-isotropic resolution (0.73 × 0.73 × 0.81 mm on average, ranging from [0.52–1.15] × [0.52–1.15] × [0.38–1.60] mm). The intensity range within the entire dataset is ~[0, 7500]. Imaging parameters ranged between: echo time 1.5–2.4 ms, repetition time 3.1–4.9 ms, flip angle 55–110°, bandwidth 540–1575 Hz.

Image preprocessing

The images were manually cropped to a tight region around the heart. The acquired field of view is different for babies, children and adults, so this step ensures that the size of each anatomical structure within the cropped images is roughly the same for all ages, greatly simplifying model training. The cropped image sizes vary (150 × 193 × 154 on average, ranging from [83–273] × [95–322] × [77–220]). Each cropped image has the same resolution as the original.

All cropped images were normalized using a customized scheme. For each image, two mean intensities were estimated: one for the cardiac blood pool and one for the lungs. The estimated blood pool intensity was mapped to 0.8, the estimated lung intensity was mapped to 0.07, and a linear transfer function was used to rescale each image, yielding an intensity range within the entire dataset of ~[−0.1, 3.3]. Each cropped image was transformed into an approximate short-axis orientation, and we estimated the blood pool intensity by automatically extracting a slab of the cropped images that typically contains the ventricles, and using the Mean Shift Algorithm28 to find the peak of the intensity histogram that corresponds to the blood pool. Similarly, we estimated the typical lung intensity by extracting a slab in the upper portion of each transformed cropped image that typically contains the lungs only, and using the mode of the resulting intensity histogram.

Ground truth segmentation

A detailed description of the protocol for segmenting each structure is provided in Table 2, which was created in close consultation with pediatric cardiologists at Boston Children’s Hospital and Children’s Hospital Colorado (T.G., Su.G., A.J.P, and J.R.). Note that some structures have required and/or optional areas. Many interfaces between structures have no clear intensity boundary. Valves are often too thin and fast-moving to be imaged, and some boundaries at septal defects or other junctions (e.g., SVC/RA or IVC/RA) are determined by the gross anatomy and not by any image gradients. To standardize the segmentation process, manual segmentations were performed so that interfaces separating different structures were approximately planar, unless a curved interface was clearly present. All segmentations were performed based on the 3D CMR image for each patient, without access to additional scans such as angiography, motion or flow information.

Table 2 Ground truth definitions of each cardiac structure and their optional zones.

All of the tools we used to create the ground truth whole-heart segmentations are open source. All manual segmentations were performed using 3D Slicer (http://www.slicer.org)29, which includes helpful modules for manual painting (with or without an editable intensity range), island processing, logical operators, 3D surface model editing (e.g., using scissors) and smoothing.

Segmenting the 20 HVSMR images

The 20 HVSMR images already had ground truth manual segmentations of the blood pool and myocardium, which were created using manual painting in an approximate short-axis view. The main tool was 3D Slicer’s paint option with an editable intensity range, in which intensity thresholding is applied in the region under the paintbrush only, providing a more objective and precise way of determining the boundary between the blood pool and the neighboring myocardium or vessel/chamber wall. These segmentations had already been reviewed by an associate professor in pediatric radiology and cardiology (M.H.M.) and a pediatric cardiologist (A.J.P).

For our purposes, the blood pool label had to be further split into 8 compartments (LV, RV, LA, RA, AO, PA, SVC, IVC). Trained raters (graduate and undergraduate students in medical image analysis, computer science and the life sciences) manually divided each blood pool into its constituent parts by creating a 3D blood pool surface model from the existing HVSMR segmentation, dropping fiducial landmarks onto the surface at the interfaces between structures, and fitting local separating planes. Each rater was trained in a specific subtask (e.g., mitral valve annotation) to avoid inter-observer variability. The annotations from the different raters were combined to create a single whole-heart segmentation.

Segmenting the 40 new images

The 40 new images were segmented using a custom semi-automatic pipeline that leveraged the new whole-heart segmentations of the 20 HVSMR scans. An overview is shown in Fig. 2.

Fig. 2
figure 2

Ground truth whole-heart segmentations of the 40 new cases were created using a pipeline that merged manual annotations with outputs from a neural network ensemble trained on the 20 whole-heart segmentations from the HVSMR dataset.

First, trained raters (again, graduate and undergraduate students in medical image analysis, computer science and the life sciences) used a 3D Slicer module (SlicerHeart, https://github.com/SlicerHeart/SlicerHeart), originally designed for valve contouring in echocardiography17, to annotate roughly planar contours separating the 8 heart structures. Specifically, we used the “Valve Annulus Analysis” and “Valve Segmentation” modules. Raters had access to the list of heart defects for each case, as extracted from the cardiology codes, for help in forecasting which structures would be present, their approximate locations, and their connectivity to other structures.

Separately, an ensemble of four 3D U-Net30 convolutional neural networks was trained on the 20 cropped, intensity-normalized and manually-segmented HVSMR images. The 20 images were split into 4 training datasets as in 4-fold cross-validation. The model architecture had 4 levels, with 24 learned channels at the first level, a doubled number of features at each subsequent level, 3 × 3 × 3 maxpooling after the first level, and 2 × 2 × 2 maxpooling after the second and third levels31. Each model had approximately 3,600,000 learnable parameters. Data augmentation included random affine transformations, left-right and anterior-posterior flips (helpful due to dextrocardia and other cardiac malpositions in CHD), nonlinear transformations, constant intensity shifts, and additive Gaussian noise. All implementations were done using Keras (http://keras.io) with a Tensorflow backend (http://tensorflow.org). The loss function was a categorical cross-entropy loss with spatially-varying weights that address class imbalance and more strongly penalize errors near ground truth segmentation borders31. Model parameters were optimized using adadelta for 2000 epochs with a learning rate of 0.001 and a batch size of 1. Each of the four models produces a probability map over 9 classes (8 foreground structures plus background). For each new image, the segmentation inferred by the ensemble was created by averaging the probability maps from the four models and computing an argmax. Segmentations were post-processed to retain the largest connected component for each structure (or two largest connected components for cases with bilateral SVC or hepatics labeled as IVC).

This network did not perform very well, since it was trained on such a small dataset. For example, a subsequent study31 found that, using a similar model architecture and training scheme, the accuracy after training on segmentations of the 20 HVSMR whole-heart segmentations and testing on the 40 new subjects yielded an average dice score of 87.7 ± 14.6 for “mild”/“moderate” test subjects but only 64.6 ± 31.9 for “severe” test subjects.

Nevertheless, once combined with the manually contoured interfaces described above, segmentation cleanup via island relabeling and further painting or erasing was easier than manual segmentation from scratch. For details, see Table 3. A first pass was performed by trained raters (graduate and undergraduate students in medical image analysis, computer science and the life sciences), and the protocol was completed by the supervising annotator (D.F.P.). Note that the most difficult task, namely annotating the interfaces between structures, remained completely manual. In addition, the boundaries between each structure and the surrounding myocardium or vessel/chamber wall were carefully inspected and manually adjusted as necessary, avoiding bias towards the U-Net output. The time required is highly dependent on the complexity of the case. We estimate that our workflow requires approximately 4–8 hours per 3D CMR image, versus approximately 8–16 hours for purely manual segmentation.

Table 3 Protocol followed to manually correct automatic segmentations from an ensemble of four 3D U-Nets30, using manually contours separating the 8 heart structures.

After segmenting the 40 new scans, the whole-heart segmentations of the original 20 HVSMR images were re-reviewed to verify plane placement and segmentation boundaries, which were manually adjusted as needed. This aimed to mitigate potential annotation changes between the old and new scans.

Additional considerations for vessels

Results from a previous whole-heart segmentation challenge noted that fair evaluation can be problematic when the ground truth does not have standardized vessel lengths18. To avoid this, the ground truth segmentations of the AO, PA, SVC and IVC were created with consistent endpoints based on cardiac landmarks (see Table 2). Custom 3D Slicer python scripts were written so that the vessels could be appropriately cut using the relevant manually placed fiducials, manually placed slice planes, or automatically calculated slice planes. This process creates the best segmentations to use for model training.

However, algorithms that produce vessel segmentations that are slightly too short or too long are just as useful clinically18,32,33. To address this, we established “optional zones” for the AO, PA, SVC, IVC and pulmonary veins (PVs, within the LA) that are derived from concrete landmarks and embody both minimum and maximum vessel lengths. More details are provided in Table 2, and an example is shown in Fig. 3. Again, 3D Slicer scripts were written to assist in implementing the segmentation protocol. Before model evaluation, the optional zones should be subtracted from both the ground truth segmentation and the model’s estimated segmentation, so that only the required regions of each are compared.

Fig. 3
figure 3

Example optional zones in the ground truth vessel segmentations for the AO, PA, SVC, IVC and LA pulmonary veins, with highlighted explanations for the SVC.

Final post-processing and review

Automated post-processing removed any small islands in the background or within individual segmentations, performed a mild smoothing, and ensured only one or two connected components per structure (when known). Finally, all 60 images were reviewed by a pediatric cardiologist (J.R.) to evaluate the accuracy of segmentation, and edited where necessary.

Diagnosis information provided with each scan

Table 1 lists the applicable heart defects, prior surgeries, and coincident variants that were recorded for each scan. Under the advice of a pediatric cardiologist (J.R.), the presence or absence of each was determined based on the cardiology codes and written radiology report. Since the codes can be incomplete or refer to heart defects that had already been surgically corrected, the supervising annotator (D.F.P.), under the supervision of a pediatric cardiologist (J.R.), manually reviewed each image to verify that its list of defects, prior surgeries and coincident variants was correct and complete.

Relationship to previous work

The first 20 images of HVSMR 2.0 come from the original HVSMR dataset, which was held as a challenge at MICCAI 2016 (http://segchd.csail.mit.edu)21. However, the segmentations we now provide for these images are completely different, with ground-truth annotations for 8 foreground structures instead of only 2.

The full dataset described here was used in previous image segmentation methods research from our group31. Our aim here is to make the data from this publication public, since there are currently no open datasets for whole-heart segmentation in CMR images from CHD patients. We also provide more detailed descriptions of the steps used to create, annotate and validate the dataset. We performed an additional step of validation and annotation revisions with a pediatric cardiologist (J.R.) for all 60 images. Finally, we provide additional clinical and technical information corresponding to each scan.

Data Records

The dataset is available at figshare at https://doi.org/10.6084/m9.figshare.c.7074755.v227, with this section being the primary source of information on the availability and content of the data being described. All images and segmentations are in the Neuroimaging Informatics Technology Initiative (NIfTI) format. Segmentations and endpoints files (which delineate the optional zones) include labels 1-LV, 2-RV, 3-LA, 4-RA, 5-AO, 6-PA, 7-SVC, and 8-IVC. The data is stored in three.zip files:

  • orig: CMR images manually cropped at the chin without image normalization (pat#_orig.nii.gz), with corresponding whole-heart segmentations (pat#_orig_seg.nii.gz) and endpoints files (pat#_orig_seg_endpoints.nii.gz).

  • cropped: CMR images manually cropped around the heart without image normalization (pat#_cropped.nii.gz), with corresponding whole-heart segmentations (pat#_cropped_seg.nii.gz) and endpoints files (pat#_cropped_seg_endpoints.nii.gz).

  • cropped_norm: CMR images manually cropped around the heart after image normalization (pat#_cropped_norm.nii.gz), with corresponding whole-heart segmentations (pat#_cropped_seg.nii.gz) and endpoints files (pat#_cropped_seg_end-points.nii.gz).

We also provide two .csv files, hvsmr2_clinical.csv and hvsmr2_technical.csv, containing additional information for each scan. In each file, the Pat column gives the patient number, referring to the filenames listed above.

The hvsmr2_clinical.csv file contains demographic and clinical data. The Age column gives the age at scan time in years. Subjects under one year of age therefore have an age of ‘0’. The Category column indicates the degree of morphological malformations (“mild”, “moderate” or “severe”), as described in the “Subject categorization“ section above. The remaining columns include the detailed diagnoses, prior surgeries and coincident variants for each scan corresponding to Table 1. In each cell, X indicates that a condition applies.

The hvsmr2_technical.csv file contains image acquisition parameters and the dataset splits used in previous studies. The TE, TR, FA and BW columns show the echo time (ms), repetition time (ms), flip angle (°), and bandwidth (Hz) for the scan. The HVSMR2016 column indicates whether the image was in the original HVSMR dataset, and if so, whether they were in the training split or the testing split. The PaceMEDIA2022 column indicates whether the image was assigned to a cross-validation split (numbered 1–4) or the testing split in our group’s previous medical image analysis methods research that used this dataset31.

Technical Validation

Subject selection, subject categorization, and manual segmentation were performed under the advice of pediatric cardiologists (T.G., Su.G., A.J.P), and underwent a final review for accuracy by a fourth pediatric cardiologist (J.R.). The segmentation process was supervised by an associate professor in pediatric radiology and cardiology (M.H.M.). Raters were supervised by an expert in medical image analysis with over 10 years’ experience at that time (D.F.P.). Training for the graduate and undergraduate students who performed annotations included education on CMR and CHD, written instructions, and frequent one-on-one consultation with D.F.P.

A quality control step was performed after each step of the segmentation process (D.F.P.). This included reviewing images for image quality, verifying the cardiology codes extracted from radiology reports, manually checking and editing the contours that were manually placed to separate cardiac structures, and manually checking and editing the whole-heart segmentations. All segmentations were first reviewed by an associate professor in pediatric radiology and cardiology (M.H.M.) and then again by a pediatric cardiologist (A.J.P for the 20 HVSMR images and J.R. for all).

Usage Notes

Users should decide whether they want to use images that have been cropped around the heart and/or intensity normalized, and use the data in the corresponding folder as described in the “Data Records” section. We do not recommend mixing data between the orig, cropped and cropped_norm directories. Note that all images have a unique size and resolution, and typically must be resampled before model training.

When training models, users should use the provided whole-heart segmentations as the ground truth. However, during evaluation, the optional zones for vessels within the endpoints files should be considered. The optional zone segmentations should be subtracted from both the ground truth and the predicted segmentation (in one-hot representations) before computing a segmentation score, so that only the required regions are compared.

Users are free to use the metric of their choosing to evaluate segmentation accuracy. We recommend the Dice score (a widely accepted metric of volume overlap) and the Hausdorff distance (either the maximum Hausdorff distance or the 95th percentile, to provide a performance measure in millimeters, which is useful for clinicians). Note that some structures may be missing in the ground truth (e.g., in the case of single ventricle or common atrium).

Users can split the 60 cases into training, validation and testing datasets at their discretion. All segmentations are openly provided because we want to enable research in interactive segmentation, which may be necessary for challenging cases and for which there is no established biomedical challenge protocol. We do provide the splits used in our previous research31. This allows direct comparison between different works that use the same splits, while also providing flexibility to users to make the most sensible choices for their own research.

Our previous work31 found that segmentation neural networks typically perform well on both “mild” and “moderate” cases but are significantly challenged by the “severe” cases. We encourage authors to similarly report their results on “mild”/“moderate” and “severe” cases separately. In particular, networks may encounter challenges in distinguishing between left and right in severe cases with heterotaxy, dextrocardia, single ventricle, or common atrium. If a network mistakenly assigns the LA label to the RA, or the LV to the RV (or vice versa), dice scores will be significantly decreased.

Additional background on segmentation for congenital heart disease can be found in the Ph.D. thesis of Danielle F. Pace5, from which this paper was adapted.