Introduction

Congenital diaphragmatic hernia (CDH) is a rare congenital malformation characterized by a diaphragmatic defect that allows intrathoracic herniation of abdominal viscera, which impairs normal lung development and leads to lung hypoplasia and postnatal pulmonary hypertension [1,2,3]. CDH affects 1 in 2500 births, and neonatal survival depends on several factors, such as defect side and size, herniated organs, associated anomalies, and gestational age at birth [4, 5]. Therefore, advanced imaging is crucial for a complete prenatal assessment and for parental counseling. The combined evaluation of lung size, liver position, and defect side is conventionally used to stratify CDH fetuses into groups correlated with perinatal mortality and long-term morbidity [6, 7], and to guide prenatal intervention with fetal endoscopic tracheal occlusion (FETO) in selected cases [8, 9].

Fetal magnetic resonance imaging (MRI) enhances prenatal CDH evaluation by providing highly specific anatomic detail on the diaphragmatic defect, hernia location and content, and alterations in other fetal organs [10,11,12]. It can therefore be considered the most reliable technique to assess lung hypoplasia and calculate the observed/expected total fetal lung volume (O/E TFLV) [13]. It also permits volumetric quantification of the intrathoracic hepatic parenchyma, expressed as the liver herniation percentage (%LH) [14,15,16]. However, fetal MRI is an operator-dependent examination in which experience plays a key role, especially for segmentation, which is fundamental for accurate assessment of organ volume and shape. In addition, general-purpose medical image visualization software usually does not provide the physician with dedicated segmentation options, so contouring is still performed manually and is prone to imprecision. Moreover, the broad spectrum of disease presentation poses additional challenges to the clinician [17].

Recently, the application of novel artificial intelligence (AI) technologies has been spreading in the neonatal field to support medical data analysis. Through the traditional machine learning (ML) approach and its modern deep learning (DL) extension, forecasting algorithms are built to predict specific outcomes, guide interventions, segment organs and vessels, and improve the overall quality of care [18,19,20].

However, these methodologies have yet to be successfully applied to newborns with CDH, and manual segmentation remains time-consuming and operator-dependent.

In CDH patients, building an automatic segmentation software could facilitate and standardize lung volume measurement, improve data collection accuracy, and create solid AI algorithms to predict postnatal outcomes.

In this study, we explored the possible application of a publicly available DL-based automatic segmentation system (nnU-Net) for automatic MRI contouring of the lungs and liver of fetuses with CDH. We then extracted standard pyradiomics features from the manually and the nnU-Net-segmented ROIs to test the agreement between the two groups of features. Finally, a support vector machine (SVM) classifier was trained on shape features computed in both the manual and the automatic segmentations of the lungs and liver and was used to test the possibility of predicting liver herniation as a dichotomous variable (up/down).

Materials and methods

This study represents an exploratory secondary analysis of the CLANNISH retrospective cohort study (Clinical Trial Identification no. NCT04609163) performed at Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy, involving the Fetal Surgery Center, Pediatric Radiology Service, Pediatric Surgery Unit, and Neonatal Intensive Care Unit (NICU) [21]. At the same time, the Department of Mathematics and Physics of the Università del Salento (Lecce, Italy) and the Department of Physics and Chemistry of the Università degli Studi di Palermo (Palermo, Italy) were involved in ML and DL data analyses and segmentation algorithms. A comprehensive description of the main study design has been previously published [21].

Subjects

We enrolled 39 inborn patients, born between 01/01/2012 and 31/12/2020, with isolated CDH from singleton pregnancies, managed at the Fetal Surgery Unit of the Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico (Milan, Italy) before the 30th week of gestation. The only exclusion criterion was a pre- or postnatal diagnosis of non-isolated CDH.

Data collection

A retrospective data collection of clinical and radiological variables from newborns’ and mothers’ medical records was performed for eligible patients (Astraia, Astraia Software GmbH, Ismaning, Germany; NeoCare, GPI SpA, Trento, Italy). In addition, the native sequences from fetal MRI were collected, with separate acquisition for the lungs and liver.

Manual segmentation of lung and liver volumes

The imaging software used was Synapse PACS and Synapse 3D (FUJIFILM Medical Systems, Lexington, MA, USA). Lung volumes were calculated on the T2 HASTE sequences, selecting the plane with the best image quality and no motion-induced artifacts [22]. Liver volumes were calculated on T1 VIBE sequences [23]. A pediatric radiologist with 15 years of experience in fetal MRI performed the manual segmentation of the lung and liver volumes. In each slice, the left lung, right lung, and liver areas were determined separately by tracing freehand regions of interest (ROIs), excluding the pulmonary hila and mediastinal structures. The software then automatically summed the areas and multiplied the result by the sum of the slice thickness and the intergap to obtain the whole organ volume.
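
As a worked illustration of this volume computation (a minimal sketch only; the function and variable names are hypothetical and do not reproduce the Synapse software):

```python
def organ_volume_mm3(slice_areas_mm2, slice_thickness_mm, intergap_mm):
    """Whole-organ volume from freehand ROI areas traced on consecutive slices.

    Each traced area is assumed to represent a slab whose height equals the
    slice thickness plus the intergap, as described above.
    """
    return sum(slice_areas_mm2) * (slice_thickness_mm + intergap_mm)


# Illustrative numbers only: five slices, 4-mm thickness, 0.4-mm intergap
volume = organ_volume_mm3([520.0, 610.5, 655.2, 590.8, 470.1], 4.0, 0.4)
```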

The DICOM files were then anonymized, converted to the NIFTI format for easier manipulation, and fed to the segmentation pipeline.
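
A minimal conversion sketch using SimpleITK is shown below; this is one possible tool for this step, and the file names are hypothetical (the study does not specify which converter was used):

```python
import SimpleITK as sitk

def dicom_series_to_nifti(dicom_dir: str, out_path: str) -> None:
    """Read a DICOM series and write it as a compressed NIfTI volume.

    Writing to NIfTI drops the DICOM header, which also removes most
    patient-identifying metadata from the converted file.
    """
    reader = sitk.ImageSeriesReader()
    file_names = reader.GetGDCMSeriesFileNames(dicom_dir)  # sorted slice files
    reader.SetFileNames(file_names)
    image = reader.Execute()
    sitk.WriteImage(image, out_path)  # e.g., "case001_T2.nii.gz"
```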

Segmentation with nnU-Net (no-new-Net)

Automatic lung and liver MRI segmentation was achieved with a publicly available DL-based segmentation pipeline, nnU-Net (no-new-Net), a specialized DL framework for medical image segmentation [24]. The framework is based on the U-Net architecture, a popular convolutional neural network that is particularly effective for biomedical image segmentation. It was developed to address the challenge of designing neural network architectures well suited to a variety of medical imaging tasks without requiring manual configuration or architectural modifications for each new task. nnU-Net automatically adapts its architecture to the specific characteristics of the dataset: it analyzes the dataset and decides on the most appropriate network architecture, preprocessing steps, and training strategies, including the network depth, the convolutional kernel sizes, and the number of feature maps. This automation reduces the need for manual tuning and expert knowledge, making high-quality segmentation accessible even to those who are not specialists in deep learning or medical image analysis. The network can achieve good segmentation results even with datasets of limited size. The nnU-Net segmentation pipeline is organized in several steps: (1) structuring the dataset into a format compatible with the software; (2) extracting a dataset “fingerprint” containing dataset-specific properties, which is used to build several 2D/3D configurations, among which the best-performing one was the 3D cascade; (3) model training and validation, which we performed with the default fivefold cross-validation scheme. The software automatically provides Sørensen–Dice and Jaccard coefficients for segmentation quality evaluation. We ran the pipeline on a Supermicro 2023US-TR4 server with two AMD Rome 7282 CPUs (16C/32T, 2.8 GHz, 64 MB), 256 GB of RDIMM DDR4 RAM, and an NVIDIA Tesla V100 GPU with 32 GB HBM2 on PCIe 3.0 (property of INFN, the Italian National Institute for Nuclear Physics, Lecce branch). A cross-validation fold of each configuration took about one full day of computation.
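
For reference, the two overlap metrics used for quality evaluation can be computed from a pair of binary label maps as in the sketch below; this is a generic illustration, independent of the nnU-Net implementation, and the file names are hypothetical:

```python
import numpy as np
import nibabel as nib

def overlap_metrics(manual_path: str, auto_path: str, label: int = 1):
    """Sørensen–Dice and Jaccard coefficients between two label maps."""
    manual = nib.load(manual_path).get_fdata() == label
    auto = nib.load(auto_path).get_fdata() == label
    intersection = np.logical_and(manual, auto).sum()
    union = np.logical_or(manual, auto).sum()
    dice = 2 * intersection / (manual.sum() + auto.sum())
    jaccard = intersection / union
    return dice, jaccard

dice, jaccard = overlap_metrics("case001_manual.nii.gz", "case001_nnunet.nii.gz")
```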

Radiomics features

After segmentation, several standard 3D radiomics features were calculated. Pyradiomics was chosen for feature calculation [25]. This freely available software package allows the computation of many variables, both from the original images and after preprocessing with various filters (e.g., wavelet or Laplacian of Gaussian, LoG). It also allows automatic reslicing with a chosen interpolator. The computed features, a subset of those available in pyradiomics obtained after removing some correlated ones, are listed in Table 1. For the gray level co-occurrence matrix (GLCM) and the neighborhood gray tone difference matrix (NGTDM), only pixel pairs separated by a distance of 1 pixel were considered.

Table 1 Pyradiomics features (11 shape, 17 first-order, and 75 higher-order features in five groups, for an overall total of 103 features). Only features from the original images (no preprocessing) were considered

The MR images were preliminarily resampled to a common voxel size of 1 × 1 × 1 mm³ using the sitkBSpline interpolator.
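
A minimal pyradiomics extraction sketch consistent with the settings described above (isotropic 1-mm resampling with the sitkBSpline interpolator, pixel-pair distance of 1, features computed on the original images only) is shown below. The exact list of enabled feature classes and the file names are assumptions for illustration, not the study's configuration file:

```python
from radiomics import featureextractor

settings = {
    "resampledPixelSpacing": [1.0, 1.0, 1.0],  # isotropic 1 x 1 x 1 mm resampling
    "interpolator": "sitkBSpline",
    "distances": [1],                          # GLCM/NGTDM pixel-pair distance
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.disableAllFeatures()
# Shape, first-order, and higher-order classes (assumed to match Table 1)
for feature_class in ["shape", "firstorder", "glcm", "glrlm", "glszm", "gldm", "ngtdm"]:
    extractor.enableFeatureClassByName(feature_class)

# No filters enabled, so features are computed on the original image only
features = extractor.execute("case001_T2.nii.gz", "case001_lungs_mask.nii.gz")
```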

Reproducibility of pyradiomics features

To test whether the features calculated from the manually and automatically segmented ROIs had similar values, we performed a Wilcoxon rank-sum test and tests based on the intraclass correlation coefficient (ICC). The ICC is a statistical measure ranging from 0 to 1, with values close to 1 indicating stronger feature reproducibility between segmentations. McGraw and Wong [26] defined 10 forms of ICC. In this study, we calculated interrater reliability using a two-way mixed-effects, absolute-agreement, single-rater/measurement model, which considers the variation between two or more raters evaluating the same group of subjects (Eq. 1) [27]:

$${\text{ICC}}= \frac{{MS}_{R}-{MS}_{E}}{{MS}_{R}+\left(k-1\right){MS}_{E}+ \frac{k}{n}({MS}_{C}- {MS}_{E})}$$
(1)

where MSR is the mean square for rows, MSE is the mean square error, MSC is the mean square for columns, k is the number of observers involved, and n is the number of subjects.
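
For transparency, Eq. 1 can be implemented directly from the two-way ANOVA mean squares, as in the sketch below; this is a generic illustration and not the freely available code of reference [28]:

```python
import numpy as np

def icc_absolute_agreement(x: np.ndarray) -> float:
    """Single-measurement, absolute-agreement ICC as in Eq. 1.

    x has shape (n, k): n subjects (rows) rated by k raters (columns);
    here k = 2, i.e., the feature value from the manual and the automatic ROI.
    """
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    ms_r = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_c = n * ((col_means - grand_mean) ** 2).sum() / (k - 1)
    ss_e = ((x - row_means[:, None] - col_means[None, :] + grand_mean) ** 2).sum()
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))
```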

A freely available code was used for ICC computation [28]. According to ICC values, we stratified the features into four groups as having excellent (ICC ≥ 0.75), good (0.60 ≤ ICC < 0.75), fair (0.40 ≤ ICC < 0.60), or poor (ICC < 0.40) reproducibility [29]. The reproducibility within groups of features was also assessed using the Wilcoxon rank-sum test with a p value threshold set at 0.05.
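
A minimal sketch of the per-feature checks described above (scipy's rank-sum test plus the ICC-based stratification); the function and argument names are placeholders:

```python
import numpy as np
from scipy.stats import ranksums

def reproducibility_class(icc: float) -> str:
    """Stratification of a feature by its ICC value [29]."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"

def wilcoxon_agreement(manual_values: np.ndarray, automatic_values: np.ndarray) -> bool:
    """True if the rank-sum test does NOT reject equality at the 0.05 level."""
    _, p_value = ranksums(manual_values, automatic_values)
    return p_value >= 0.05
```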

Prediction of liver herniation by machine learning

To test the possibility of predicting liver herniation as a dichotomous variable (up/down), several ML forecasting algorithms were implemented in Matlab and in Python, according to the experimenters’ convenience, using the features calculated by pyradiomics. Several ML approaches were tested, including decision trees, linear and non-linear artificial neural networks (ANNs), and support vector machines (SVMs) with various standard kernels.

Decision trees are a widely used ML method for both classification and regression tasks. A decision tree works by breaking the classification procedure down into a series of steps, each represented by a tree node (or leaf), so that an associated decision tree is incrementally developed. Each step asks a question with a “yes” or “no” answer and, depending on the answer, redirects the flow towards different branches, moving down to another node or to a leaf. The path from the root to the final leaves (the classes) gives the overall classification rule.

ANNs are inspired by the structure of the human brain. They consist of layers of interconnected nodes, known as neurons, which process information. Each connection between neurons has a weight that is adjusted as the ANN learns from data. This structure allows ANNs to learn complex patterns and make predictions or decisions. ANNs can be linear or non-linear, depending on how the nodes and layers are arranged and interact. In simple terms, ANNs are complex webs that learn to recognize patterns from the data they are trained on.

Support vector machines (SVMs) are another method used for classification and regression tasks. SVMs work by finding the boundary that best separates the data into classes. This boundary is chosen to maximize the margin, that is, the distance between the boundary and the closest data points of each class, known as support vectors. SVMs are efficient in high-dimensional spaces and are versatile, as they can use various kernels (mathematical functions) to transform the data so that a boundary that is non-linear in the original space corresponds to a linear one in the transformed space.
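
As an illustration, the three families of classifiers described above are available off the shelf in scikit-learn; the hyperparameters below are arbitrary examples, not those used in the study:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

candidate_models = {
    "decision tree": DecisionTreeClassifier(max_depth=3),
    "linear ANN": MLPClassifier(hidden_layer_sizes=(8,), activation="identity", max_iter=2000),
    "non-linear ANN": MLPClassifier(hidden_layer_sizes=(16,), activation="relu", max_iter=2000),
    "linear SVM": SVC(kernel="linear", probability=True),
    "RBF-kernel SVM": SVC(kernel="rbf", probability=True),
}
```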

For this part, only left-sided CDH patients were considered because they were more numerous, more homogeneous, and variable in liver position, leaving out right-sided CDH cases, in which the liver is almost always herniated. The results obtained with the features computed in the manually segmented ROIs of the liver and lungs were compared with those obtained with the features calculated in the nnU-Net-segmented ROIs.

Since the MRI scans were very dissimilar in gray-level content, only shape features were used, discarding the variables computed on the gray levels in order to avoid further image manipulation (intensity standardization). This choice left 22 features, i.e., the 11 shape features of Table 1 computed for both the lungs and the liver.

We trained and validated the models with a Leave One Patient Out (LOPO) scheme, in which each patient in turn was the only element of the validation set while the remaining patients formed the training set. Classification quality was expressed as the area under the receiver operating characteristic (ROC) curve and by confusion matrices.
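
A minimal sketch of the LOPO scheme with a linear SVM and AUC evaluation (scikit-learn) is given below; the feature matrix and labels are placeholders, and the feature standardization step is an assumption, not a detail reported in the study:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, confusion_matrix

def lopo_linear_svm(X: np.ndarray, y: np.ndarray):
    """Leave-one-patient-out evaluation of a linear SVM.

    X: (n_patients, n_shape_features); y: liver position (1 = up, 0 = down).
    """
    scores = np.empty(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
        model.fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    auc = roc_auc_score(y, scores)
    cm = confusion_matrix(y, scores >= 0.5)
    return auc, cm
```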

Results

We enrolled 39 CDH cases, 30 with a left-sided and 9 with a right-sided diaphragmatic defect. The dataset was fairly balanced regarding liver herniation, with 22 liver-up and 17 liver-down cases. All the right-sided CDH cases were liver-up.

The MR images were very inhomogeneous in voxel size (in-plane resolution ranged from 0.21 mm to 0.78 mm, and slice thickness was 3 mm or 6 mm) and in gray-level range (Fig. 1).

Fig. 1

Inhomogeneity in the MR images. The gray-value histograms were calculated within the lung (left plots) and liver (right plots) manually segmented ROIs

Segmentation

Segmentation results showed very good agreement between the manual and automatic methods. Figure 2 reports an example of segmentation results for two MRI cases, one for the liver and one for the lungs, in which nearly perfect agreement was observed. Nonetheless, quality varied for other images, and in some lung segmentation tests one of the two lungs was lost during automatic segmentation. The Jaccard coefficient values for the whole dataset are reported as box plots in Fig. 3. The average Jaccard coefficient for lung segmentation was 0.65, while liver segmentation performed better, with an average Jaccard coefficient of 0.75. A Jaccard coefficient of 1 indicates perfect agreement, while a coefficient of 0 indicates no agreement.

Fig. 2

2D segmentation results for the liver (top row) and lungs (bottom row). 3D manually segmented ROIs are shown in red (A, D); automatic contouring is shown in green (B, E). The overlaps of the manual and automatic segmentations are shown in (C) and (F)

Fig. 3

Boxplots report the mean Jaccard coefficient values for the lungs and liver

Reproducibility of pyradiomics features

Figures S1 and S2 (Supplemental Materials) show, for the lungs and the liver respectively, scatterplots in which, for each feature, the value calculated in the manual ROI (x-axis) is plotted against the corresponding value calculated in the automatic ROI (y-axis). In case of perfect correspondence, the points would lie on the bisector of the quadrant.

To assess the agreement between the manual and automatic feature groups, we employed the Wilcoxon rank-sum test within each group of features. We also computed and examined ICCs for single features to test interrater reliability. Table 2 provides the results for single-measurement ICCs under a two-way mixed model with absolute agreement.

Table 2 Intraclass correlation coefficients (ICCs) between radiomic features derived from manual and automatic segmentations for the liver (A) and the lungs (B). The Wilcoxon rank-sum test was executed for single features and across groups of features (e.g., shape and first order). Features with ICC < 0.40 were considered poorly reproducible and highlighted in light gray. A p value < 0.05 was considered statistically significant, and the corresponding rows were marked with one or more asterisks (*p < 0.05; **p < 0.01; ***p < 0.001)

Based on the approach chosen by Owens et al., we then classified the 103 features into four groups according to their ICC values as having excellent (ICC ≥ 0.75), good (0.60 ≤ ICC < 0.75), fair (0.40 ≤ ICC < 0.60), or poor (ICC < 0.40) reproducibility [29]. The results are reported visually in Fig. S3 (Supplemental Materials) as a heat map. Of the 103 features, 46 (45%) showed excellent reproducibility, 11 (11%) showed good reproducibility, 22 (21%) showed fair reproducibility, and 24 (23%) showed poor reproducibility.

Machine learning

As previously stated, only MRI shape features were used to automatically classify liver herniation as up vs. down. To test whether the highly reproducible features were more predictive of liver herniation than the others, we also used ICC values as cut-offs for feature selection. In the first test, all the features were used without exclusion (case 1: no selection). In the subsequent tests, three thresholds were selected, 0.60, 0.70, and 0.75 (cases 2 to 4), and only the radiomic features with ICC values not lower than the threshold were considered. The features of both the lungs and the liver were included in each case, and the corresponding results are shown in Table 3; a sketch of the selection procedure is given below.
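
The ICC-based selection can be sketched as follows; the names of the feature matrix and ICC vector are placeholders for illustration:

```python
from typing import Optional

import numpy as np

def select_by_icc(X: np.ndarray, icc_values: np.ndarray, threshold: Optional[float]) -> np.ndarray:
    """Keep only the feature columns whose ICC is not lower than the threshold.

    threshold=None corresponds to case 1 (no selection).
    """
    if threshold is None:
        return X
    return X[:, icc_values >= threshold]

# Cases 1 to 4 of Table 3: no selection, then ICC cut-offs of 0.60, 0.70, and 0.75
# for threshold in (None, 0.60, 0.70, 0.75):
#     X_selected = select_by_icc(X_shape, icc_shape, threshold)
#     ...  # re-run the LOPO evaluation sketched in the Methods on X_selected
```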

Table 3 Various classification tests without and with feature selection. Case 1 included all the shape features, while cases 2 to 4 selected features based on three different cut-offs on the ICC values (see “Machine learning”). The last two columns report the AUCs obtained for manual and automatic ROIs

The best results were obtained without feature selection. Figure 4 shows the ROC curves for liver herniation (up/down) obtained with the best-performing classifier (a linear SVM). Without feature selection, the AUC for the dataset of manually segmented ROIs and that for the automatically segmented ROIs were 0.86 and 0.84, respectively. The confusion matrices obtained and the corresponding values for sensitivity, specificity, and accuracy are reported in Table 4.

Fig. 4

ROC curves for liver herniation prediction with SVM classifier and shape features. Left: manually segmented ROIs, right: nnU-Net segmentations

Table 4 Metrics of performance obtained for manually vs. automatically segmented ROIs
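
For reference, a minimal sketch of how the three metrics in Table 4 can be derived from a 2 × 2 confusion matrix, assuming the liver-up condition is treated as the positive class:

```python
import numpy as np

def binary_metrics(cm: np.ndarray):
    """Sensitivity, specificity, and accuracy from a 2 x 2 confusion matrix.

    Rows are true classes and columns predicted classes, ordered as
    [down, up]; liver-up is taken as the positive class.
    """
    tn, fp, fn, tp = cm.ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / cm.sum()
    return sensitivity, specificity, accuracy
```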

Discussion

In patients with CDH, automatic segmentation of the fetal lungs and liver is feasible and shows high agreement with manual results. To the best of our knowledge, this represents the first attempt to apply an automatic segmentation system to fetuses with CDH, aiming to standardize the assessment of lung and liver volume and provide a reliable automatic prediction of liver herniation, which are two of the main prognostic factors for postnatal outcome.

The segmentation software selected for this work was nnU-Net, a general-purpose 3D biomedical image segmentation tool. nnU-Net is designed to deal automatically with the dataset diversity found in the medical domain, arising from differences in imaging modality, image size, voxel spacing (isotropic/anisotropic), and pixel intensity (quantitative and standardized, as in computed tomography, or essentially qualitative and non-standardized, as in MRI). This flexibility is not offered by most segmentation frameworks, which are designed around specific image types and properties. Moreover, nnU-Net automates the key decisions required to design a segmentation system for a given dataset, significantly speeding up application development. Furthermore, if any improvement in segmentation quality is desired, the modular structure of nnU-Net allows easy integration of new architectures and methods. nnU-Net relies on Python 3 and PyTorch and needs the NVIDIA Compute Unified Device Architecture (CUDA) for most operations [24]. The quality of segmentation obtained with nnU-Net on the dataset of interest for this work was quite good, as demonstrated by the Jaccard coefficient values. An average Jaccard coefficient of 0.65 for lung segmentation indicates that, on average, the overlap between the automatic and the ground-truth segmentation (intersection over union) was 65%. This corresponds to good accuracy, as more than half of the segmented area correctly overlapped the actual area. The higher average Jaccard coefficient of 0.75 for liver segmentation indicates even better accuracy, with 75% overlap. Liver segmentation performed better than lung segmentation: such a result was expected because of the larger volume of the liver compared with the lungs, which are even smaller in these patients owing to the mechanism of the disease. Some Jaccard coefficient values were very high, but this was not true for all patients.

After directly comparing the ROIs produced manually with those segmented by nnU-Net, we compared the pyradiomics features computed in the automatically segmented ROIs with those extracted from the manual ROIs, as an indirect and practical test of segmentation quality. The rationale behind this test was that manual segmentation is a very time-consuming process that can hardly be applied to large datasets, so there is interest in ascertaining whether features extracted from automatically segmented ROIs can produce results as accurate and useful as those extracted from manually drawn regions.

For this purpose, we applied correlation and reproducibility tests to the two sets of features. Various techniques were employed to test feature reproducibility between manual and automatic ROIs. Figures S1, S2, and S3 qualitatively show that some variables are reproducible, so their use is warranted, whereas others are not; this is particularly true for very small lungs. The Wilcoxon rank-sum test was used to check significance. Our tests demonstrated that the two groups of features were significantly correlated and showed good agreement as measured by the ICC.

A further indirect test of segmentation quality was performed by building an ML application for binary liver herniation prediction/classification based on the features computed from the manually or automatically segmented ROIs of the lungs and the liver. As already noted, the MR images differed widely in gray scale, so using features based on gray values would have required an intensity standardization procedure. For this reason, we avoided further image manipulation and used only shape features, discarding the variables computed on the gray levels. Various classifiers were employed with similar results; the highest performing was a linear SVM, which was trained on both feature sets (shape features extracted from manual vs. automatic ROIs). The two sets yielded similarly high discrimination power between liver-up and liver-down cases, as measured by the AUC, and the shapes of the ROC curves were also quite similar. This result suggests that the automatic segmentations produced by nnU-Net can be employed in practice in ML applications. Even the less reproducible features helped classify the liver up/down condition, as selecting only highly reproducible features decreased classification quality. It is also interesting that when the whole sets of features were used (from manual vs. automatic ROIs), there was almost no difference in AUC between the two sets (AUC = 0.86 and 0.84, respectively), whereas feature selection led to a disparity in AUC, with larger values for the features extracted from the automatic ROIs. Shape features, being based only on the ROI contour, might be more deeply affected by segmentation errors, so the fact that the AUC did not substantially decrease from manual to automatic ROIs, and even increased when a reduced feature set was chosen, is particularly significant and supports the quality of the automatic segmentation.

Finally, it is noteworthy that, although the described ML application started as a convenient means of assessing feature reproducibility, and thus as a test of the quality of the nnU-Net segmentation, it is also a useful result per se, suggesting that such a reliable classification is feasible.

However, the limited number of cases must be taken into account when interpreting these results: CDH is a rare disease, which limits the number of images available.

To increase the study population, collaboration with other institutions and the inclusion of future cases could be considered. Moreover, data augmentation through the generation of synthetic data can artificially increase the dataset size during training, which may have a positive impact on the segmentation of the lungs and liver from the original data.

Traditional forms of augmentation exist (e.g., the application of spatial transformations to the images), and more recent approaches based on neural networks look promising, particularly for very small datasets [30].

Another critical aspect, mainly concerning the ML results, is data inhomogeneity, specifically the lack of a standard gray-level scale across images. To overcome this limitation, we chose to discard the features based on the gray-level content of the ROIs. Image standardization (i.e., carefully transforming the images to a common gray-level scale) is possible, although it is very delicate and demanding. The advantage would be that, after standardization, gray-level-based features (at least those with good reproducibility between manual and automatic ROIs) could also be used for classification, increasing ML quality. Image standardization might also improve nnU-Net segmentation quality by helping the segmentation algorithms.

Despite these limitations, the findings of our research are encouraging. The definition of an automatic segmentation software tool specifically designed for the fetal lung and liver would be relevant to clinical practice. Since CDH assessment is largely based on prenatal imaging, automatic segmentation would be key in simplifying and standardizing the diagnostic process. Moreover, it would provide more accurate imaging data for developing robust algorithms and tools for the early prediction of postnatal outcomes.

Artificial intelligence-based prediction systems are proving to be of great support in the interpretation of clinical data and images for various conditions in the NICU. For example, AI systems have been successfully developed to analyze retinal images for diagnosing retinopathy of prematurity and plus disease, where subtle and fine signals may escape the human eye [31,32,33,34]. AI models could also identify complex patterns and associations in the large volume of data available in preterm infants' electronic health records (EHRs) that traditional statistical methods or human experts may miss. These models can facilitate the early detection of complications such as sepsis and necrotizing enterocolitis [35,36,37,38].

AI enables data integration from multiple sources, such as imaging modalities and clinical features. As a future perspective, fetal MRI and ultrasound (US) data should be integrated with fetal-maternal clinical variables automatically extracted from electronic medical records. Identifying critical factors and assessing the relationship between clinical-radiologic variables and patient outcomes might help to further elucidate the major determinants of CDH pathophysiology, especially postnatal pulmonary hypertension. Through an integrated multimodal analysis, the early detection of key features could enable the building of prognostic forecasting algorithms and provide a unique advancement in managing fetuses and neonates with CDH, ultimately improving the overall quality of care. For example, parental counseling would be more accurate, helping parents to better understand the pathological condition and feel more involved in the care process. Prenatal risk stratification is also crucial for the appropriate selection of FETO candidates. After birth, algorithms may be able to anticipate critical events and guide timely interventions, such as determining the optimal timing for surgery or indicating the onset of complications. Patients at high risk of requiring extracorporeal membrane oxygenation (ECMO) could also be identified. In addition, more rational resource allocation and cost-effective management could be facilitated.

Conclusions

Within the limitations of this study, automatic MRI segmentation of the lungs and liver of CDH fetuses through nnU-Net is feasible, with good reproducibility of pyradiomics features. In addition, a machine learning approach for predicting liver herniation offers good reliability.

Our results could open the way to new applications of artificial intelligence in the neonatal field to standardize prenatal assessment and provide a reliable automatic tool for prognostic evaluation in CDH patients.