Background

The different image contrasts of computed tomography (CT) and magnetic resonance imaging (MRI) offer distinct clinical utilities. Segmentation is a prerequisite to automatically extract biomarkers, especially in large cohorts like the German National Cohort [1] or the UK Biobank [2]. While tools to extract the precise bone structure of the spine from CT are publicly available [3, 4], neither a segmentation method nor an annotated ground truth dataset for the whole spine including the posterior elements is currently available for MRI.

Accurate segmentations are not only vital for scientific studies but also enable the exact localization of abnormalities in clinical routine. Unlike CT, MRI provides additional information about bone marrow edema-like changes, intervertebral disc degeneration, degenerative endplate changes, ligaments, joint effusions, and the spinal cord. Robust and precise segmentation and quantification of such spinal structures are a prerequisite, e.g., to evaluate large epidemiologic studies or to enable automated reporting. An alternative to labor-intensive manual annotations is the potential use of image-to-image translation to extract bony structures. This approach may overcome challenges like partial volume effects (e.g., at the spinous process) and subtle signal differences (e.g., of vertebral end plates and ligaments in MRI), which are easily distinguishable in high-resolution CT but not in MRI.

Image-to-image translation involves transforming images from one domain to another, and several deep learning methods have been employed for this purpose, including Pix2Pix [5], CycleGAN [6], and contrastive unpaired translation (CUT) [7]. These methods have been used in various studies to generate missing sequences, translate to different domains, enhance image quality, and improve resolution [8]. In the medical domain, these methods have shown success in rigid structures like the brain, head, and pelvis, where registration guarantees that both domains have similar tissue distributions and anomalies [8]. However, if biases are not accounted for, the model may hallucinate new structures to fit both distributions [9]. Due to this difficulty, translating warpable structures like the spine is less explored in the literature. Some successful implementations have shown that translated images can be similar to the target images and might mislead medical experts [10,11,12,13,14]. However, none of these works has focused on using translations for downstream tasks, such as segmentations in the output domain.

This study aimed to develop and compare different image translation networks that allow pretrained CT-based segmentation models to be applied to MRI datasets (Fig. 1). The primary focus was on segmenting the entire spine, with special attention to accurately translating the posterior spine structures, as they pose challenges in MRI delineation. We compared generative adversarial network (GAN)-based approaches [5, 7] with new denoising diffusion models [15,16,17]. Denoising diffusion models are fundamentally different from GANs: they iteratively add noise to an image and learn to remove it, instead of relying on the zero-sum game between discriminator and generator in GANs. In the computer vision domain, denoising diffusion models have outperformed GANs in various tasks, including upscaling, inpainting, image restoration, and paired image-to-image translation [18]. While diffusion has been applied to medical image translation tasks in a limited number of papers [17, 19,20,21,22], we adapted conditional denoising diffusion for paired two-dimensional (2D) and three-dimensional (3D) image-to-image translation.

Fig. 1

Training pipeline. In our datasets, we identified the center of the vertebral body and spinous process (green box; raw data). Based on the center points, we rigidly registered CT onto MRI to align the bone structures between the two images (yellow box; training data). Aligned images were used to train our image-to-image models. Finally, the MRIs of validation and test sets were translated to CT images. Segmentation was performed on synthesized CT images and, consequently, was perfectly aligned with the original MRIs (blue box from left to right; prediction). The generated segmentations can be used for generating additional and new center points to iteratively optimize the registration

The purposes of this study were as follows: (1) to improve existing image-to-image translation for spine MRI-to-CT translation by improving all steps of the process, from data alignment through the implementation of new denoising diffusion translations and their comparison with GANs to the extension of our findings to 3D translation; (2) to utilize the translated CT images for automatic segmentation of the entire spine, eliminating the need for a manually labeled segmentation mask in the original MRI domain; and (3) to generate full spine segmentations on MRI, which are currently not available.

Methods

In brief, we aligned CT and MR spine images through rigid landmark registration [23]. With this paired data, we trained various image-to-image models to generate synthetic CT images. We used an available CT segmentation algorithm [3, 4] to generate vertebral masks in these synthesized CTs for the original MRI. These resulting segmentations were subsequently used to generate new landmarks for new training data (Fig. 1). During inference, the MRI is sufficient to generate a segmentation by translating the MRI to a synthetic CT and subsequently applying an existing CT segmentation algorithm. We compared different landmark registrations and 2D models. Finally, we adapted the results into 3D models and assessed the accuracy of the resulting segmentations.

Data

In this study, we retrospectively collected sagittal T1-weighted and T2-weighted MRI and corresponding CT images of the spine acquired from the same patient within one week. Approval from the local ethics committee was obtained, and informed consent was waived. Figure 2 illustrates our data selection process. Sixty-two T1-weighted image series (18 males, aged 66 ± 15 years [mean ± standard deviation]; 44 females, aged 72 ± 13 years) were used from another unpublished in-house study, including five thoracic and 57 lumbar volumes. Additionally, a new dataset of 201 T2-weighted image series (50 males, aged 65 ± 20 years; 42 females, aged 69 ± 17 years) from 92 patients was collected, including 38 cervical, 99 thoracic, and 70 lumbar volumes. Patients with fractures and degenerative changes were included, while those with motion artifacts, metastases, and foreign objects were excluded, because segmentation models would benefit if the translation suppressed such anomalies. We performed rigid registration of the matching MRIs and CTs based on the centers of mass of the vertebral body and the spinous process (Fig. 1, bottom left). The in-house training, validation, and test sets were split patient-wise, so that different MRI acquisitions of other spine regions from the same patient did not end up in different sets. For validation, six T1-weighted and nine T2-weighted MRIs were used, as they could not be aligned with the CTs due to substantially different patient positioning.

Fig. 2

Datasets, preparation, exclusion, and split. MRI data were acquired with 12 different scanners from 3 different vendors. Additionally, we used the MRSSegClg for external testing. For the 2D training, we only considered 2D slices containing the spine. For the figures in this paper, we demonstrated generalizability using a full-body MRI from the German National Cohort dataset

We used 172 lumbar MRI and segmentation volumes from the MRSpineSeg Challenge (MRSSegClg) [24, 25] for external evaluation of the Dice similarity coefficient (DSC). This dataset focuses on the lumbar region, but its segmentation exceeds the bony borders, which calls its validity into question. One subject was used for pipeline development and validation. Validation sets were used to find optimal inference parameters and to avoid overfitting. Since the labels in MRSSegClg encompass not only the bony spine but also adjacent ligaments and soft tissue, we manually adjusted the labels for a subset of 20 volumes to restrict them solely to the bone. We analyzed these subsets as two distinct datasets.

Image preprocessing

CT and MR datasets were rigidly registered [23] using landmarks to facilitate paired image translation. For the single-landmark approach, we selected the center of mass (CM) of each vertebral body. As rotational misalignment around the craniocaudal axis was frequently observed, the CM of the spinous process was added for the two-landmark approach. Landmarks for CT were automatically determined based on vertebral and subregion segmentations (Fig. 1). For the T2-weighted images, we manually identified the CM points of both the vertebral bodies and the spinous processes. The manual centroid selection and the ground truth segmentation corrections in the test sets were performed by J. S., a radiologist with 3 years of experience. To obtain the points for the T1-weighted images, we synthesized CTs by adapting the T2-weighted-to-CT translation, generated segmentations from the synthetic images, and extracted the CMs. The roughly 10 to 20% of failure cases were first excluded and then translated with models trained on the other T1-weighted images; this proved sufficient to generate all CM points. To assess the impact of additional landmarks on registration, we computed the DSC of our pipeline on the T2-weighted dataset using the manual ground truth as a reference.
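As an illustration of the landmark-based alignment, the following minimal sketch (Python/NumPy; the function name rigid_from_landmarks and all variable names are hypothetical) estimates a rigid transform from paired vertebral-body and spinous-process centroids with the Kabsch/Umeyama method. It is a conceptual sketch of point-based rigid registration under these assumptions, not the exact implementation of [23].

import numpy as np

def rigid_from_landmarks(mri_pts, ct_pts):
    """Estimate rotation R and translation t mapping CT landmarks onto MRI landmarks.

    mri_pts, ct_pts: (N, 3) arrays of paired centroids (vertebral bodies and
    spinous processes) in world coordinates (mm).
    """
    mri_c, ct_c = mri_pts.mean(axis=0), ct_pts.mean(axis=0)
    H = (ct_pts - ct_c).T @ (mri_pts - mri_c)                    # 3 x 3 covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mri_c - R @ ct_c
    return R, t  # apply as: x_mri = R @ x_ct + t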

CT images were transformed to the range of [-1, 1] by dividing the values by 1,000 HU and clamping outliers, thus retaining air, soft tissue, and bone while suppressing extreme intensities. Linear rescaling was applied to the MRI data, converting the range from [0, max] to [-1, 1]. To account for varying intensities, MRIs were augmented with a random color jitter (brightness and contrast randomization: 0.2). Image pairs were resampled to a uniform spatial resolution of 1 × 1 mm in the sagittal plane and a slice thickness of 2.5–3.5 mm, as acquired in the MRI. To enlarge the training data by a factor of 10 and to simulate weak scoliosis and unaligned acquisitions, we introduced 3D image deformations using the elastic deformation Python plug-in [26]. Subsequently, the volumes were sliced into 2D sagittal images, and slices without segmentation were removed. Random cropping was performed to adjust the image size to 256 × 256 pixels.
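A minimal sketch of the intensity scaling described above, assuming CT values in HU and non-negative MRI intensities (Python/NumPy; function names hypothetical):

import numpy as np

def normalize_ct(ct_hu):
    # divide by 1,000 HU and clamp to [-1, 1]; retains air, soft tissue, and bone, suppresses extremes
    return np.clip(ct_hu / 1000.0, -1.0, 1.0)

def normalize_mri(mri):
    # linear rescaling from [0, max] to [-1, 1]
    return 2.0 * (mri / mri.max()) - 1.0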

Models for image-to-image translation

To compare various image-to-image translation methods, we implemented two unpaired methods, namely CUT [7] and SynDiff [17], along with three paired methods: Pix2Pix [5], DDIM noise, and DDIM image. The training process involved unregistered and registered data using both the single- and the two-landmark approach. For DDIM, we employed a UNet architecture [26] with convolutional self-attention and embeddings for the timesteps, which we refer to as self-attention U-network (SA-UNet) [18, 27, 28]. The diffusion mechanism predicted either the noise or the image, with the other quantity computed during inference. A learning rate of 0.00002 was used, and we set the number of timesteps to t = 20 for DDIM inference. The value \(\eta = 1\) (noise generation is fully random) was determined by optimizing on the validation set. We compared our approach to CUT [7], Pix2Pix [5], and SynDiff [17]. During our experiments, we performed a hyperparameter search for the reference ResNet and UNet. Additionally, we introduced a weighted structural similarity index measure (SSIM) loss from a recent paper [29] to update the loss formulation. To further explore the impact of different models and methods, we also tested CUT and Pix2Pix with the SA-UNet. All models were randomly initialized. In our analysis of DDIM, we ablated three inference parameters [16, 30]. However, the results did not show substantial effects, and we have included them in the Supplementary material along with brief descriptions of the tested methods.
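For reference, one conditional DDIM update step can be sketched as follows (Python/NumPy; all names are hypothetical and simplified relative to our implementation). With eta = 1 the injected noise is fully random, with eta = 0 the sampling is deterministic; in image mode the network predicts the image and the corresponding noise estimate is recovered before this step.

import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=1.0, rng=np.random):
    """One DDIM update from timestep t to the previous timestep.

    x_t:      current noisy sample, conditioned on the MRI (e.g., concatenated as network input)
    eps_pred: network output in "noise" mode; in "image" mode the network predicts x0 directly
              and eps_pred is recovered as (x_t - sqrt(alpha_bar_t) * x0) / sqrt(1 - alpha_bar_t)
    """
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    sigma = eta * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
                          * (1.0 - alpha_bar_t / alpha_bar_prev))
    dir_xt = np.sqrt(1.0 - alpha_bar_prev - sigma ** 2) * eps_pred
    return np.sqrt(alpha_bar_prev) * x0_pred + dir_xt + sigma * rng.standard_normal(x_t.shape)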

Image quality

The evaluation of image quality involved comparing actual and synthesized CT images. To quantify this, we used the peak signal-to-noise ratio (PSNR). In this context, the reference image serves as the signal, while the divergence between the two images is considered the noise. A PSNR value above 30 dB indicates that the difference between the two images is imperceptible to the human eye [10]. It is important to note that we did not control the correspondence of soft tissue, as it fell outside the scope of our downstream task. To handle this in our evaluation, we masked pixels that were more than 10 pixels away from a segmented spine structure, setting them to zero. We also computed the absolute difference (L1), the mean squared error (MSE), the SSIM, and the visual information fidelity (VIFp).
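The masked PSNR can be sketched as follows, assuming images scaled to [-1, 1] (peak-to-peak range of 2) and a binary spine mask (Python with NumPy/SciPy; names hypothetical):

import numpy as np
from scipy import ndimage

def masked_psnr(ct_ref, ct_syn, spine_mask, max_dist=10, data_range=2.0):
    # zero out pixels farther than max_dist pixels from any segmented spine structure
    keep = ndimage.distance_transform_edt(~spine_mask.astype(bool)) <= max_dist
    a, b = np.where(keep, ct_ref, 0.0), np.where(keep, ct_syn, 0.0)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)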

Downstream task: segmentation

We applied a publicly available segmentation algorithm [3, 4] to the synthesized CT images. We then compared the DSC globally and at the vertebral level between the synthesized and ground truth segmentations in four datasets. The segmentation ground truth of the in-house datasets was derived from the aligned CT images and manually corrected (datasets 1 and 2). Dataset 3 was the MRSSegClg segmentation, which is known to exceed the bony structures, and dataset 4 was a manually corrected subset of the MRSSegClg [24, 25]. Figure 3c and d highlight where the MRSSegClg segmentation reaches beyond the bony structures. For the analysis, we excluded structures that the CT segmentation algorithm could not segment, such as the sacrum and partially visualized vertebrae.
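For clarity, the per-vertebra DSC can be computed as in the following sketch (Python/NumPy; names hypothetical), given integer label maps with one label per vertebra and 0 as background:

import numpy as np

def dice_per_vertebra(gt_labels, pred_labels):
    scores = {}
    for label in np.unique(gt_labels):
        if label == 0:          # skip background
            continue
        gt, pred = gt_labels == label, pred_labels == label
        denom = gt.sum() + pred.sum()
        scores[int(label)] = 2.0 * np.logical_and(gt, pred).sum() / denom if denom else np.nan
    return scores  # averaged over labels and volumes for the reported DSC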

Fig. 3

Difficulties of the MRI data for unpaired training and issues with the MRSSegClg segmentation. a The bone marrow of the posterior elements and the epidural fat were not easily differentiated. Unpaired learning has issues translating the arcus as bone and the epidural fat as soft tissue in the CT domain. b In the posterior elements, bone and soft tissue boundaries are weakly defined due to partial volume effects in and around the spinous process. c The segmentations of the MRSSegClg include soft tissue around the spinous process, caused by the difficulties for the original annotators described in b. d The soft tissue around the vertebrae is also segmented in the MRSSegClg. c and d show why we manually improved the segmentation in a small subset

3D image translation with diffusion

The first implementations of both DDIM and Pix2Pix in 3D, analogous to the 2D approach, did not converge. We thus implemented changes according to the recommendations of Bieder et al. [31]. To reduce graphics processing unit memory, we eliminated the attention layers and replaced concatenation skip connections with addition operations. Additionally, we introduced a position embedding by concatenating coordinate ramps, ranging from zero to one over the full dimensions of the original image, to the input. The training was done on 3D patches with a patch size of 128 × 128 × 32, where the left/right side was limited to 32 pixels due to the image shape. This setup is fully convolutional, which means that during inference an image of any size can be processed by the network as long as its sides are divisible by 8. To the best of our knowledge, this represents the first 3D image-to-image translation with diffusion. Since 3D translation requires the left/right direction to be included, we resampled all images to 1-mm isotropic resolution.
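The position embedding can be sketched as follows (Python/NumPy; names hypothetical): one coordinate ramp per spatial axis, normalized from zero to one over the full extent of the original image, is concatenated to the input channel so that each patch retains its global position.

import numpy as np

def add_position_ramps(patch, patch_origin, full_shape):
    """Concatenate three coordinate ramps (one per axis) to a 3D patch.

    patch:        array of shape (1, X, Y, Z), the image channel
    patch_origin: (x0, y0, z0) voxel index of the patch within the full volume
    full_shape:   (X_full, Y_full, Z_full) of the original image
    """
    coords = [np.arange(o, o + s) / (f - 1)            # ramps over the full image, 0..1
              for o, s, f in zip(patch_origin, patch.shape[1:], full_shape)]
    grids = np.meshgrid(*coords, indexing="ij")        # three (X, Y, Z) coordinate grids
    return np.concatenate([patch] + [g[None] for g in grids], axis=0)  # (4, X, Y, Z)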

Statistical analysis and software

We employed paired t-tests to assess the significance of differences in PSNR and DSC between models. To achieve a fixed size of 256 × 256 pixels for assessing image quality, we used one crop per image slice. When reporting differences across multiple experiments, we present the worst (i.e., highest) p-value. We skipped significance calculations for the other image quality metrics because the results were redundant. For 3D data, we padded the test data; the 3D models generated 1-mm isotropic volumes, which were later resampled to the original MRI size.
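The significance testing reduces to paired t-tests over per-case metric values, e.g., as sketched below (Python/SciPy; variable names hypothetical):

from scipy import stats

# per-crop (2D) or per-volume metric values of two models, same cases in the same order
t_stat, p_value = stats.ttest_rel(psnr_model_a, psnr_model_b)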

Results

Influence of rigid registration

Networks trained on unregistered data were incapable of learning the difference between soft tissue and bone. During our early testing, we noticed that most methods could correctly identify the vertebral body, but translating the posterior structures was impossible. In particular, the spinous process was often omitted in the translation, as shown in Fig. 4. "One point per vertebra" registration was sufficient for the vertebral body translation, but the spine could still rotate around the craniocaudal axis. This caused the spinous process to disappear in translated images (Fig. 4a, b). Additionally, confusion between epidural fat and bone shifted the entire posterior elements towards the spinal cord. Overcoming this issue required accounting for rotation by adding additional points to the rigid registration (Fig. 4). Beyond the visual findings, we observed a significant increase in DSC from one- to two-point-per-vertebra registration: Pix2Pix 0.68 to 0.73 (p < 0.003); SynDiff 0.74 to 0.77 (p < 0.001); DDIM noise 0.55 to 0.72 (p < 0.011); and DDIM image 0.70 to 0.75 (p < 0.001). Notably, the best unpaired method, SynDiff, could not learn the translation of the posterior structures without registration (DSC without registration 0.75).

Fig. 4

Comparison of one and two registration points per vertebra versus real data. a We registered with a single point in the center of the vertebral body. The vertebral body could rotate around the spine axis, which caused the posterior vertebral structures to be misaligned. b When we registered the images with an additional point on the spinous process, we avoided this rotation around the spine itself. The blue dashed lines are for locating the relation between axial and sagittal slices. c Translation with networks trained on registrations with 0, 1, or 2 points per vertebra. Images are from the in-house T2-weighted test dataset. Posterior structures are only reconstructed correctly with 2-point registration. DDIM Denoising diffusion implicit model

Image quality

The unpaired CUT models performed worse than all others (p < 0.001), while all other models performed on a similar level (Table 1 for PSNR and other common metrics). Example outputs from the test sets are shown in Fig. 5. Pix2Pix with the SA-UNet performed better on T1-weighted images and worse on T2-weighted images than with the smaller UNet (T1-weighted, p < 0.001; T2-weighted, p = 0.041). Even though SynDiff has an unpaired formulation, it achieved results similar to our paired Pix2Pix and DDIM noise (slightly worse on T1-weighted and better on T2-weighted images, all p < 0.003). The DDIM image mode performed slightly better than the DDIM noise mode (p < 0.001), SynDiff (p < 0.001), and Pix2Pix (p < 0.001). The DDIM image mode produces images with less noise than the original data, which should make segmentation easier. Overall, the DDIM image mode was our best-performing 2D model.

Table 1 Image quality for T1-weighted and T2-weighted MRI to CT translation
Fig. 5

Translation from the T1-weighted/T2-weighted MRI test sets to CT, from the neck to the lumbar vertebrae. We did not control the type of CT reconstruction; therefore, the noise level and appearance could differ from the reference and were still considered correct. The 3D variants were trained on an improved training set, which was only available for T2-weighted images. The reference is a registered real CT. * marks an off-angle acquisition with strong partial volume effects. The dataset contains a high number of fractured vertebral bodies, which causes them to be translated correctly as well. CUT Contrastive unpaired translation, DDIM Denoising diffusion implicit model, SA-UNet Self-attention U-network

Downstream task: segmentation

Three 2D models shared the best DSC: Pix2Pix SA-UNet, SynDiff, and DDIM image mode (Table 2): Pix2Pix SA-UNet versus SynDiff, p = 0.019; Pix2Pix SA-UNet versus DDIM image mode, p < 0.001; and DDIM image mode versus SynDiff, p = 0.455. DDIM in noise mode and Pix2Pix UNet (DDIM noise versus Pix2Pix UNet, p = 0.972) were worse than the three best models (p < 0.001). The CUT reconstruction was unsuited for segmentation and was the worst model (CUT versus all, p < 0.001). Figure 6 shows the segmentations obtained from the different translations for a full spine in an example dataset from the German National Cohort [1].

Table 2 Average Dice similarity coefficient↑ per volume and per vertebra on the T1-weighted and T2-weighted MRI and the MRSSegClg
Fig. 6

Translation from T2-weighted MRI to CT and the segmentation results in an external full spine scan. The MRI shown is a random image from the German National Cohort dataset. The CT translation is stitched, as the 2D networks only work on a fixed size of 256 × 256 pixels and the 3D models run out of memory for the entire image. The 2D networks needed classifier-free guidance (w = 1) for these out-of-distribution images, or else the neck region would not form correctly because the frontal area has a drop in magnetic resonance signal. The 3D networks do not delineate the background and soft tissue when a small number of steps is used (t = 25). A We observed underpredictions of the thoracic spinous processes. B The neck shows higher variability between different translations. Moving to 3D translation resolves these issues. CUT Contrastive unpaired translation, DDIM Denoising diffusion implicit model, SA-UNet Self-attention U-network

We observed comparable rankings in the MRSSegClg [24, 25] and T1-weighted datasets when excluding the vertebral body (Table 3). In the in-house T2-weighted test set, SynDiff had a considerably higher DSC than Pix2Pix SA-UNet and DDIM image mode (p < 0.001), indicating a better performance on the "more complicated" anatomical structures for this dataset only.

Table 3 Average posterior structures Dice similarity coefficient↑ per volume and per vertebra

The correction of the MRSSegClg segmentations resulted in an increase in DSC of up to 0.02. The rankings of all methods on the original versus the corrected MRSSegClg dataset were mostly consistent, indicating that no method had exploited the faulty delineation by overpredicting the segmentation.

Overall, Pix2Pix SA-UNet, DDIM image mode, and SynDiff were equally capable of producing CT images for the segmentation algorithm, closely followed by DDIM noise mode and the Pix2Pix UNet.

3D image translation with diffusion

All 3D models increased the DSC compared to our 2D models (p < 0.006). Pix2Pix 3D and DDIM 3D noise performed on a similar level, while DDIM 3D image was consistently slightly better, close to the rounding threshold (p < 0.001). PSNR dropped compared to the 2D variants. The 3D models outperformed all 2D models on the posterior structures (Fig. 7: T2-weighted, p < 0.024; MRSSegClg (ours), p < 0.005 for DDIM 3D image, p < 0.062 for DDIM 3D noise, and p < 0.462 for Pix2Pix 3D; posterior structures are unavailable in the original MRSSegClg). With the rescaling to 1-mm isotropic resolution, we obtain a super-resolved mask in the thick-slice direction that resembles a more realistic 3D shape than the native resolution (Fig. 7).

Fig. 7

3D visualization of the generated segmentations from the German National Cohort and in-house datasets. The 3D translation models produce isometric segmentations (iso) that look biologically correct. After downscaling to the native resolution (native), we observe that the spinous process gets deformed by the thicker slices, because the spinous process is thinner than two to three slices. The examples were translated by the DDIM image mode model. We observed no noticeable drop in translation quality for MRIs from other scanners. Degenerative changes that are not in the training set are often repaired during translation, while fusion of vertebral bodies, present in rare cases in the training set, can be partially reproduced. This can be observed in the over-segmentation in the right image from vertebrae 7 to 10, counted from the bottom. DDIM Denoising diffusion implicit model, ISO Isometric segmentation, Native Native resolution segmentation

Discussion

This study successfully demonstrated the feasibility of translating standard sagittal spine MRI into the CT domain, enabling subsequent CT-based image processing. Specifically, the registration process with a minimum of two points per vertebra enables accurate translation of the posterior structures, which are typically challenging for image translation and segmentation. To achieve this, a low-data registration technique was introduced for pairing CT and MRI images, which can be automated by our translation and segmentation pipeline. In our low-data domain, the paired translation methods performed on a similar level, with DDIM in image mode being the single best model. The spinous process was not always correctly translated by our 2D approaches; we resolved this issue by moving to 3D. Our 3D methods showed a drop in image quality compared to the 2D translation, which we attribute to the required resampling from the 1-mm isotropic output to the native resolution of the test data. Ultimately, the image-to-image translation enabled MRI segmentation using a pretrained CT segmentation algorithm for all spine regions.

Our results extend prior works, which have been limited to translations from high-resolution gradient-echo Dixon T1-weighted sequences to CT [14, 32, 33], as well as to intra-modality MR translations between contrasts, from standard T1-weighted and T2-weighted TSE sequences to short tau inversion recovery [34] or T2-weighted fat-saturated images [35], which are frequently used in spinal MRI. Commercial products are available for MRI-to-CT translation [36, 37]. However, in contrast to our approach, they require a dedicated, isotropic gradient-echo sequence and are unavailable for standard T1-weighted or T2-weighted TSE sequences. Acquiring an additional, dedicated image only for segmentation is resource- and time-demanding in everyday medical practice and not possible at all for existing data from large epidemiological studies such as the German National Cohort.

Mature preprocessing pipelines enable image translation in other body regions [8]. For example, in brain MRI, every sample can be rigidly registered to an atlas and the non-brain tissue removed. In the spine, however, where vertebrae may move between acquisitions, such simple rigid preprocessing is impossible. Additionally, the mapping of intensities from the MR to the CT domain is highly dependent on the anatomy: for example, fat and water have similar signals in T2-weighted MRI but substantially different density values in CT, despite being in close anatomical proximity with high intersubject variability. Consequently, a network cannot learn the relationship between anatomy and intensity translation from unpaired images; the tested unpaired method CUT [7] would require additional constraints to learn an anatomically correct translation. SynDiff [17] has an unpaired CycleGAN [6] in its formulation and worked on paired datasets similarly to the paired methods; still, it could not correctly translate the posterior structures on unmatched data. We demonstrated that our rigid registration is a required preprocessing step for a correct translation, even for SynDiff, and we believe that better preprocessing, such as deformable registration, can lead to better results. However, accounting for inter-vertebral movement between two acquisitions due to different patient positioning between CT and MR would require whole vertebral segmentation. Other papers avoid this issue by using axial slices, which only need a local vertebra registration [10,11,12], or by focusing only on the lumbar spine [5,6,7,8,9], where acquisitions can be performed in a more standardized patient position than for the cervical spine. Oulbacha and Kadoury [38] also used sagittal slices, as in our study; however, they faced similar challenges with incorrectly translated posterior structures, as observed in their figures. To address these issues, we employed dedicated preprocessing techniques and transitioned to a 3D approach.

Our study has limitations. Our pipeline enables us to generate segmentations that are available in other modalities; it cannot produce segmentations of structures that are visible in the input domain but not segmented in the output domain. We observed weaknesses in translating the neck and thoracic regions on external images, especially for the 2D methods. The posterior elements in the thoracic region were still the most difficult, and both the segmentation and the translation showed more errors compared to other regions. Classifier-free guidance showed substantial improvement in language-based DDIM generation [30] and had a visible impact in 2D translation on out-of-training-distribution data like the German National Cohort images. Still, the differences in image quality and DSC were too small to measure; we therefore excluded classifier-free guidance [30] from our analysis, as the effect was too small to be investigated in the available test sets. The same holds for testing a different number of time steps and the determinism parameter \(\eta\). We describe these inference parameters in more detail in the Supplementary material.

In conclusion, we showed that image segmentations can be generated in a novel target domain without manual annotations if segmentations exist for another image domain and paired data for both domains can be obtained. For the spine, we identified the minimum registration requirements for paired image-to-image translation. With this approach, SynDiff, Pix2Pix, and DDIM enabled the translation of 2D images, resulting in similarly good downstream segmentations. We introduced a 3D variant of conditional diffusion for image-to-image translation that improved the segmentation of the posterior spinal elements compared to 2D translation. The synthesized segmentations represent a novel ground truth for MRI-based spine segmentations, which are a prerequisite for spine studies involving large cohorts.