Introduction

Intervertebral discs (IVDs) play a key role in spinal flexibility and load distribution [1]. Magnetic resonance imaging (MRI) is the method of choice for IVD visualization, owing to its ability to produce detailed images without harmful ionizing radiation [2]. Precise MRI-based IVD segmentation is critical for diagnosing spinal disorders, guiding treatments, and facilitating accurate spinal interventions [3].

Deep learning (DL) techniques, especially convolutional neural networks (CNNs), have proven effective for IVD segmentation [4]. Research mainly relies on supervised methods such as encoder–decoder architectures [5, 6] and fully convolutional models [7], with growing interest in mixed-supervision [8, 9] and semi-supervised approaches [10]. However, these techniques often struggle with domain shift across different MRI devices and protocols [11], where variability in scanner settings and MR modalities leads to significant differences in image intensity [12].

In various fields of medical image analysis, domain adaptation (DA) techniques have been used to address the challenge of transferring knowledge from a source domain to an unlabeled target domain. These include adversarial learning, which aligns feature distributions [13]; self-ensembling methods, which generate pseudo-labels for unlabeled target-domain samples from trained model predictions [14]; cycle-consistency models, which synthesize target images from source images [15]; and DA methods based on variational autoencoders [16].

Fig. 1: In MRI scans, IVDs display considerable variation in signal intensity, shape, and texture, influenced by aging, degeneration, and pathology. The image highlights the distinct characteristics of the datasets, showing different contrast levels and illuminations. Target 2, in particular, comprises images from different scanners and from patients with pathology, resulting in greater complexity than the other datasets.

Recently, incorporating auxiliary tasks into computer vision models has proven a straightforward way to develop domain-invariant features, improving adaptability across different domains [17]. This approach, particularly beneficial in medical image analysis, simplifies training and addresses the generalization challenges posed by data variability across institutions and patient groups. Indeed, various researchers have applied this strategy to DA tasks. A structure-driven DA approach for unsupervised cross-modality cardiac segmentation is proposed in [18]: a set of 3D landmarks serves as representative points embodying the heart's anatomical structure across imaging modalities (computed tomography (CT) and MRI), and the model learns to predict the landmark positions, facilitating the identification and use of shared structural information. Cardiac structure segmentation from CT and MR volumes is also explored in [19], which uses an edge-generation auxiliary task to support the primary segmentation task in the target domain; to cope with domain shift, hierarchical low-level adversarial learning is employed to suppress domain-specific low-level features. Unsupervised DA is explored in [20] for abdominal multi-organ segmentation on CT scans, leveraging organ location information: a jigsaw-puzzle auxiliary task is devised to reconstruct a CT scan from shuffled patches, and a super-resolution network standardizes images from multiple domains. The auxiliary and super-resolution tasks are trained alongside the organ segmentation task to enhance overall performance.

The effectiveness of DA based on auxiliary tasks strongly depends on the optimal design of the pretext task, which can present challenges such as the domain shift between the pretext-task and final-segmentation domains [21]. Thus, drawing inspiration from recent research in fashion compatibility [22, 23], we recognize the significance of color and texture in analyzing and categorizing visual data. This approach applies color and texture pretext tasks to extract discriminative features while disregarding shape information. Color features in natural images translate into intensity features in medical imaging, as shown in [24], which recently reported a similar approach for histopathological image classification and out-of-distribution detection. This aspect gains importance in MRI, where hardware and software variations result in non-standard tissue intensities that are crucial for differentiating IVDs from vertebrae. The application of this specific pretext task is therefore particularly relevant.

Guided by these considerations, the contributions of this paper can be summarized as follows:

  1. First attempt at exploring self-supervised DA for IVD segmentation in MRI.

  2. First work to introduce intensity-based pretext tasks for self-supervised DA in medical image segmentation. The framework is designed to be end-to-end, offering a straightforward yet effective approach to this complex task.

All the experiments are performed on publicly available datasets to facilitate comparison with the presented methods.

Materials and methods

Dataset description

We used three MRI datasets: one as the primary source domain (S) and two others as target domains (T1 and T2). They were collected from different medical centers using various MRI scanners, leading to a rich diversity in patient demographics and MRI parameters, essential for enhancing the robustness and applicability of the developed segmentation model.

S: The dataset S, released by [4], was obtained from a single hospital in China and includes T2-weighted volumetric MRI scans of 215 subjects, acquired with a 3.0 Tesla MRI scanner (Ingenia, Philips, Amsterdam, the Netherlands). For training, validating, and testing the model, the dataset was split into sets of 172, 19, and 19 volumes, respectively.

T1: The dataset T1, released by [25], consists of T2-weighted MRI scans from 23 patients, acquired with a 1.5 Tesla Siemens MRI scanner (Siemens Healthcare, Erlangen, Germany). The dataset was split into training, validation, and testing sets of 14, 4, and 5 volumes, respectively. The testing set was meticulously labeled by three experienced operators.

T2: The dataset T2 consists of 30 MR volumes from 29 patients with different medical conditions. It was collected with various scanner models from different vendors, including Philips scanners (Achieva, Ingenia, and Elition), Siemens scanners (Avanto, Verio, Espree, Symphony, Amira, Aera, and Magnetom), and a GE scanner (Signa). MR scans from 19 patients were used to train the model, 5 for validation, and 6 for testing. As for T1, three experienced operators carefully labeled the testing set.

For all three datasets, voxels belonging to IVDs were labeled 1 and all others 0; for further details, refer to the corresponding papers. For each dataset, the test-set images were selected to ensure no patient overlap between the training and test sets. Moreover, to guarantee a robust evaluation of the model’s adaptability to a wide range of acquisition contexts, the test set of T2, which includes images acquired with various scanner models, was restricted to images from scanner models not present in dataset S. Samples of the test-set images are shown in Fig. 1.

Algorithm 1: DA method pseudo-code.

Proposed method

The proposed approach is a dual-task model for IVD segmentation, enhanced by an intensity-based pretext learning task. This additional task generates labels for S, T1, and T2 images according to the transformations applied to them, aiding the extraction of features that are invariant across domains. The main aim is to improve the intensity representation within the embedding space, thereby enhancing the model’s generalization and its ability to handle intensity variations in medical images.

Fig. 2: Our proposed framework for self-supervised DA focuses on learning domain-invariant feature intensity representations. This is achieved by incorporating a pretext learning task that automatically generates labels from images of both the source and the two target domains. The pretext task and the main task, IVD segmentation, are simultaneously trained using multi-task learning.

Figure 2 shows our method, with IVD semantic segmentation as the main task. We employ an encoder network \(e(\cdot )\), which serves as a feature extractor, and a decoder network \(d(\cdot )\) to recover spatial information and generate accurate IVD segmentations. This encoder–decoder structure adopts the same architecture as the baseline U-Net described in “Baseline and domain adaptation comparisons”. \(e(\cdot )\) is composed of four blocks, each consisting of two 3D convolutional (conv) layers with a kernel size of \(3\times 3\times 3\) and same-padding, each followed by a rectified linear unit (ReLU) activation function and a batch normalization layer; a max pooling operation with a stride of \(2\times 2\times 1\) closes each block. At each block, the number of channels doubles, enabling the incremental learning of more complex features; the number of channels starts at 32 and progressively increases to 512. \(e(\cdot )\) also incorporates a bottleneck section connecting it to \(d(\cdot )\). The bottleneck consists of two additional 3D conv layers with a kernel size of \(3\times 3\times 3\) and same-padding, followed by a ReLU activation function and a batch normalization layer to further enhance the learned representations. Like \(e(\cdot )\), \(d(\cdot )\) is composed of four blocks. Each block comprises two 3D conv layers with a kernel size of \(3\times 3\times 3\), followed by a ReLU activation function and an upsampling layer with a kernel size of \(2\times 2\times 1\); the number of feature channels is halved at each block. To recover features lost through downsampling in the \(e(\cdot )\) path, the input of each block is concatenated with the corresponding feature maps from \(e(\cdot )\). The last block consists of three 3D conv layers, the first two followed by a ReLU activation function and the last activated by softmax. The number of filters in the conv layers starts at 256 and is halved in each subsequent block until reaching 32. This CNN is trained end-to-end using labeled samples from the source domain (S = {\(X_{\textrm{s}}, Y_{\textrm{s}}\)}).
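For concreteness, a minimal TensorFlow/Keras sketch of this encoder–decoder is given below, assuming the input dimensions stated in “Parameter setting”. It is an approximation, not the exact implementation: the final three-convolution block is collapsed into a single softmax convolution, and all names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3x3 same-padded convolutions, each followed by ReLU and
    # batch normalization, as in the encoder/decoder blocks above.
    for _ in range(2):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    return x

def build_unet(input_shape=(256, 256, 18, 1), n_classes=2):
    inputs = layers.Input(input_shape)
    x, skips = inputs, []
    # Encoder e(.): four blocks with channels doubling from 32,
    # each followed by (2, 2, 1) max pooling.
    for f in (32, 64, 128, 256):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPool3D(pool_size=(2, 2, 1))(x)
    # Bottleneck: two further convolutions at 512 channels.
    x = conv_block(x, 512)
    # Decoder d(.): (2, 2, 1) upsampling, skip concatenation, and
    # filters halved from 256 down to 32.
    for f, skip in zip((256, 128, 64, 32), reversed(skips)):
        x = layers.UpSampling3D(size=(2, 2, 1))(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    # Simplified output layer: per-voxel softmax over IVD vs. background.
    outputs = layers.Conv3D(n_classes, 3, padding="same",
                            activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="ivd_unet")
```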

To learn intensity invariant features, \(e(\cdot )\) is also trained to recognize intensity distortions from both target domains (T = {\(X_{\textrm{t}}, Y_{\textrm{t}}\)}) and source domain (\(X_{\textrm{s}}, Y'_{\textrm{s}}\)). \(Y_{\textrm{t}}\) and \(Y'_{\textrm{s}}\) are derived automatically by applying image intensity transformations to their respective images and labeling them based on the specific transformation. This is further detailed in “Pretext tasks” section.

The entire DA method is outlined in Algorithm 1. During forward propagation, samples from both the source and target domains are processed by the shared encoder. Subsequently, the losses for the main task (\(L_{\textrm{seg}}\)) and the pretext task (\(L_{\textrm{p}}\)) are calculated, and these losses are then back-propagated and accumulated at \(e(\cdot )\). By training \(e(\cdot )\) with samples from all three domains, the model learns feature representations that are invariant to domain differences.
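A condensed sketch of one training step of Algorithm 1 is shown below, under two simplifying assumptions: \(e(\cdot )\), \(d(\cdot )\), and the pretext head are treated as separate Keras models, and the skip connections between encoder and decoder are omitted for brevity.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(x_s, y_s, x_p, y_p, encoder, decoder, intensity_head,
               seg_loss_fn, pretext_loss_fn):
    # x_s, y_s: labeled source volumes and masks.
    # x_p, y_p: volumes from S, T1, and T2 with automatically
    # generated intensity-transformation labels.
    with tf.GradientTape() as tape:
        # Main task: segmentation through the shared encoder and decoder.
        l_seg = seg_loss_fn(
            y_s, decoder(encoder(x_s, training=True), training=True))
        # Pretext task: intensity-transformation classification through
        # the same shared encoder.
        l_p = pretext_loss_fn(
            y_p, intensity_head(encoder(x_p, training=True), training=True))
        # Gradients of both losses accumulate at the shared encoder.
        l_total = l_seg + l_p
    variables = (encoder.trainable_variables
                 + decoder.trainable_variables
                 + intensity_head.trainable_variables)
    grads = tape.gradient(l_total, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return l_seg, l_p
```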

Pretext tasks

The intensity prediction pretext task proposed in this work is inspired by prior works from different fields [17, 22, 23]. Given a set of \(N_{\textrm{t}}\) and \(N_{\textrm{s}}\) training images from \(T = \{x^{\textrm{t}}_{i}\}^{N_{\textrm{t}}}_{i=0}\) and \(S = \{x^{\textrm{s}}_{i}\}^{N_{\textrm{s}}}_{i=0}\), respectively, three different sets of intensity transformations, namely Gaussian noise, Gaussian blur, and contrast enhancement, are applied. The intensity transformation prediction model \(i(\cdot )\) takes the feature maps generated by \(e(\cdot )\) as input and produces a probability distribution over the intensity transformations, including the option of no transformation. The \(i(\cdot )\) model is composed of three blocks, each containing two 3D conv layers with a kernel size of \(3\times 3 \times 3\) and same-padding; a ReLU activation function is applied after each of them, followed by batch normalization. The number of filters in the conv layers starts at 256 and is halved in each subsequent block until reaching 64. An additional 3D conv layer with a kernel size of \(3 \times 3\times 3\), same-padding, and a number of filters \(C=4\) reduces the number of channels to match the number of classes; this layer is activated by the softmax function. The \(L_{\textrm{p}}\) loss is the cross-entropy loss.
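A sketch of the pretext-label generation and of \(i(\cdot )\) follows. The transformation parameters (noise level, blur sigma, contrast factor) are not specified in the text and are chosen here purely for illustration, as is the global average pooling that reduces the per-voxel map to a per-volume prediction.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from scipy.ndimage import gaussian_filter

def make_pretext_sample(volume, rng):
    # rng: a np.random.Generator, e.g. np.random.default_rng().
    # Pick one of C = 4 classes: 0 = no transformation, 1 = Gaussian
    # noise, 2 = Gaussian blur, 3 = contrast enhancement.
    label = int(rng.integers(0, 4))
    v = volume.astype(np.float32)
    if label == 1:
        v = v + rng.normal(0.0, 0.05 * v.std(), size=v.shape)  # noise level assumed
    elif label == 2:
        v = gaussian_filter(v, sigma=1.0)                      # sigma assumed
    elif label == 3:
        v = np.clip((v - v.mean()) * 1.5 + v.mean(),           # factor assumed
                    v.min(), v.max())
    return v, label

def build_intensity_head(feature_shape, n_classes=4):
    inputs = layers.Input(feature_shape)
    x = inputs
    # Three blocks of two 3x3x3 convs (ReLU then batch norm),
    # with filters halved from 256 down to 64.
    for f in (256, 128, 64):
        for _ in range(2):
            x = layers.Conv3D(f, 3, padding="same", activation="relu")(x)
            x = layers.BatchNormalization()(x)
    # Final softmax-activated conv with C = 4 filters.
    x = layers.Conv3D(n_classes, 3, padding="same", activation="softmax")(x)
    # Reduction to a per-volume class probability (pooling assumed).
    outputs = layers.GlobalAveragePooling3D()(x)
    return tf.keras.Model(inputs, outputs, name="intensity_head")
```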

Parameter setting

For training, images from datasets S, T1, and T2 were resized to \(256\times 256 \times 18\) voxels for both the main task (IVD segmentation) and the pretext task. Optimization was carried out with the Adam optimizer for 100 epochs, with an initial learning rate of 0.001. The main task used a batch size of 1, the pretext task a batch size of 4. Dice loss (\(L_{\textrm{seg}}\)), effective under the severe class imbalance typical of small structures such as IVDs, was the chosen segmentation loss.
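A minimal soft-Dice implementation consistent with this setup might look as follows; the exact formulation and smoothing constant are not given in the text, so both are assumptions.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # Soft Dice loss for one-hot masks of shape (batch, H, W, D, C);
    # well suited to heavily imbalanced targets such as IVD voxels.
    spatial_axes = (1, 2, 3)
    intersection = tf.reduce_sum(y_true * y_pred, axis=spatial_axes)
    denom = (tf.reduce_sum(y_true, axis=spatial_axes)
             + tf.reduce_sum(y_pred, axis=spatial_axes))
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - tf.reduce_mean(dice)
```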

To improve generalization, on-the-fly data augmentation was applied during IVD segmentation training. This included geometrical transformations (horizontal flipping, \(\pm \,30^{\circ }\) rotation) and intensity transformations (random brightness correction), applied randomly in each training iteration. For the pretext task, only geometrical transformations (random vertical flipping and \(\pm \,30^{\circ }\) rotation) were used to enhance perspective generalization without affecting intensity transformation identification.
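The augmentation for the main task could be sketched as below; the brightness range is illustrative, since the text does not specify it.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_segmentation_pair(img, mask, rng):
    # Random horizontal flip (in-plane axis 1).
    if rng.random() < 0.5:
        img, mask = img[:, ::-1], mask[:, ::-1]
    # Random in-plane rotation in [-30, 30] degrees; nearest-neighbor
    # interpolation (order=0) keeps the mask binary.
    angle = rng.uniform(-30.0, 30.0)
    img = rotate(img, angle, axes=(0, 1), reshape=False, order=1)
    mask = rotate(mask, angle, axes=(0, 1), reshape=False, order=0)
    # Random brightness correction (range assumed).
    img = img * rng.uniform(0.9, 1.1)
    return img, mask
```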

The best model was selected based on the lowest total loss (\(L_{\textrm{total}} = L_{\textrm{seg}} + L_{\textrm{p}}\)) on the validation set of S. All analyses were conducted using TensorFlow 2.x on an NVIDIA RTX 2080 Ti, supported by a Xeon E5 CPU and 128 GB of RAM.

Table 1 Results of the performance metrics computed on the test sets of the three datasets, obtained from the baseline model (i.e., U-Net trained only on the S dataset), the proposed model (t1t2s-int) and all the other dual-task models trained with various pretext task configurations

Performance metrics

We evaluated the performance of our end-to-end model by calculating metrics for 3D segmentation, as outlined in [26]. On the test sets of S, T1, and T2, we computed overlap-based metrics, namely the Dice similarity coefficient (DSC), sensitivity (Sen), and specificity (Spec).

Furthermore, the Hausdorff distance (HD), a distance-based metric, was employed as an additional measure for assessing boundary delineation.
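For reference, these metrics can be computed from binary masks as sketched below; the Hausdorff distance is taken here over voxel coordinates via SciPy, whereas surface-based variants are also common.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def overlap_metrics(gt, pred):
    # gt, pred: binary 3D masks (IVD voxels = 1).
    tp = np.sum((gt == 1) & (pred == 1))
    tn = np.sum((gt == 0) & (pred == 0))
    fp = np.sum((gt == 0) & (pred == 1))
    fn = np.sum((gt == 1) & (pred == 0))
    dsc = 2 * tp / (2 * tp + fp + fn)
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    return dsc, sen, spec

def hausdorff_distance(gt, pred):
    # Symmetric Hausdorff distance between the voxel coordinate sets
    # of the two masks (surface extraction omitted for brevity).
    a, b = np.argwhere(gt), np.argwhere(pred)
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```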

Baseline and domain adaptation comparisons

We first set the stage by evaluating our proposed strategy against a baseline model, namely the U-Net model trained only on the S dataset. We then conducted a comprehensive analysis comparing different training data configurations to investigate the impact of introducing different domains into the pretext task:

  1. Dual-task model, trained by applying the pretext task exclusively on T1 (t1-int).

  2. Dual-task model, trained by applying the pretext task exclusively on T2 (t2-int).

  3. Dual-task model, trained by applying the pretext task on T1 and T2 (t1t2-int).

  4. Dual-task model, trained by applying the pretext task on T1, T2, and S (t1t2s-int).

Additionally, we compared the proposed intensity prediction pretext task (t1t2s-int) with a traditional rotation prediction task (t1t2s-rot). In t1t2s-rot, images were randomly rotated by 0, 90, 180, or 270\(^{\circ }\), and the model, identical to that in the “Pretext tasks” section, was trained to identify the rotation angle. As in the previous configuration, the rotation pretext was applied to the T1, T2, and S datasets. By comparing these two pretext tasks, we aimed to assess their influence on the dual-task framework and determine the relative effectiveness of intensity prediction against the conventional rotation approach. The combination of the two pretext tasks (t1t2s-rot-int) was also explored to estimate how additional tasks impact the result; for the rotation pretext, the branch from the intensity pretext was replicated.
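Label generation for this rotation pretext reduces to in-plane 90° rotations, e.g. (a sketch with hypothetical names):

```python
import numpy as np

def make_rotation_sample(volume, rng):
    # Rotate the volume in-plane by k * 90 degrees (k = 0..3) and use
    # the rotation index as the pretext label.
    k = int(rng.integers(0, 4))
    return np.rot90(volume, k=k, axes=(0, 1)), k
```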

Results

Fig. 3: Qualitative outcomes of the U-Net and t1t2s-int training strategies on two randomly selected test images from target 1 and target 2. The U-Net approach tends to yield less precise segmentation of the discs, particularly those situated in the outer regions of the image.

Fig. 4: Qualitative results from the comparison of all the tested configurations on two random test images from each of the three datasets (source, target 1, and target 2, from left to right). In the source domain, the improvements brought by the proposed model (t1t2s-int) are particularly evident in the accurate segmentation of the disc contours. In target 1 and target 2, the proposed model demonstrates fewer false negatives, successfully segmenting all the discs present in the images. Yellow boxes display close-ups of poorly segmented discs, whereas cyan boxes indicate areas of missing segmentation.

The performance metrics calculated for the baseline model (i.e., U-Net) and for the dual-task models obtained with different training data and pretext tasks (t1-int, t2-int, t1t2-int, t1t2s-rot, t1t2s-rot-int, and t1t2s-int) are presented in Table 1.

For the S dataset, all tested models exhibited comparable performance, and the proposed model was consistent across the different pretext tasks. All training strategies performed well on T1, with the t1t2s-int configuration achieving the best values of HD, Sen, and Spec; the second best strategy was t1-int. Similar trends were observed for T2, albeit with lower mean metric values than T1. Here too, the t1t2s-int model achieved the highest DSC and the lowest HD among all tested models, while t2-int and t1t2-int obtained slightly higher mean DSC values than t1-int. Comparing t1t2s-int with a more traditional pretext (t1t2s-rot) and with the combination of multiple pretexts (t1t2s-rot-int), the former achieved similar results for all metrics on the S dataset, except for the HD, which was lower in the t1t2s-rot-int configuration. The proposed model achieved better results on both target datasets across all metrics. Moreover, on dataset T2, introducing only rotation as the pretext task deteriorated performance compared to U-Net. Qualitative results shown in Figs. 3 and 4 further support the effectiveness of the proposed model in accurately segmenting the IVDs while minimizing segmented spots outside the designated region.

Discussion

This research explores self-supervised learning for unsupervised DA in IVD segmentation, examining whether pretext tasks enhance learning. IVD segmentation in MRI is complicated by the discs' wide appearance variations and limited voxel representation, which leads to indistinct boundaries due to the partial volume effect. The study introduces a novel pretext task focused on predicting intensity variations, such as Gaussian noise, Gaussian blur, and contrast enhancement, which are more pertinent to IVD segmentation than traditional tasks like rotation prediction, with the aim of improving model robustness across different imaging sources.

As a result, the model achieves robust segmentation performance across a source and two target datasets that contain patients with different pathological conditions and were acquired with various scanning devices (namely Philips Ingenia, Achieva, and Elition, Siemens, and GE Signa).

Evaluating the proposed strategy against the baseline model (U-Net), t1t2s-int reached the best performance metrics on both T1 and T2 while maintaining the same performance as U-Net on S, as shown in Table 1. Figure 3 illustrates the qualitative results of this experiment, showing that the U-Net approach tends to produce sub-optimal disc segmentation, especially for discs situated in the outermost regions of the image, and highlighting the superiority of the proposed DA model in these challenging regions. This indicates that the strategy not only excelled in the intended domains but also preserved the effectiveness it had demonstrated in the source domain. This aspect is crucial, as it ensures that implementing the strategy does not cause detrimental effects or a decrease in performance in the original domain.

The best training strategy incorporates the intensity pretext task across multiple datasets, including both target domains and the source domain. This is confirmed by comparing the proposed model with the other pretext configurations, as it reached \({\text {DSC}} = 0.92 \pm 0.04\), \({\text {HD}} = 13.59 \pm 17.70\), and \({\text {Sen}} = 0.90 \pm 0.07\) for T1, and \({\text {DSC}} = 0.77 \pm 0.18\), \({\text {HD}} = 37.63 \pm 25.40\), and \({\text {Sen}} = 0.90 \pm 0.07\) for T2. When the pretext task is applied to only one target dataset (t1-int and t2-int), performance is better on that dataset than on the datasets the pretext task was not applied to, as reported in Table 1. Spec achieved consistently high values of 0.99 in all experiments, underscoring the model’s ability to correctly identify negatives in all cases. A similar trend is observed when applying the pretext task to the two target datasets but not to S (t1t2-int). This behavior is expected, as such models learn features from specific datasets only, while t1t2s-int improves generalization and adaptability across diverse data distributions. This is also evident in the qualitative evaluation, as can be observed in Fig. 4. The reduced presence of segmented spots outside the true label region is a critical advancement in IVD segmentation, as it improves the reliability of the segmentation results. This outcome is particularly important in medical applications, where precise delineation of the anatomical region of interest (here, the IVDs) is crucial for accurate diagnosis and treatment planning.

Comparing our novel intensity-based pretext task (t1t2s-int) with a traditional rotation pretext (t1t2s-rot) and with the combination of intensity and rotation pretexts (t1t2s-rot-int), we found the intensity-based approach to be more effective, likely because intensities are crucial in differentiating IVDs from vertebral bodies across diverse MRI devices. Furthermore, our findings suggest that introducing a second task may distract the model from focusing on the primary task, resulting in a degradation of performance. This underscores the importance of maintaining task specificity, particularly in domains where nuances in data characteristics, such as MRI intensities, play a critical role.

Our model showed promising results but has limitations. It was tested on small datasets; using larger datasets could improve its generalizability across populations and imaging protocols. We initially used a basic U-Net for feasibility; future enhancements could include multi-scale pyramid structures and self-attention modules for better performance [27]. Other future developments include a wider variety of intensity-based pretext tasks, such as predicting intensity histograms or shapeless local patch discrimination [22], to enable the model to learn more comprehensive and adaptable features. An additional adversarial loss could also be exploited to further reduce domain shift and improve model generalization [13].

Conclusions

In this study, we developed an innovative unsupervised domain adaptation method for IVD segmentation, using a dual-task model for segmentation and intensity-distortion recognition. Trained on unlabeled multi-domain data, the model learns domain-invariant features, enhancing segmentation across MRI datasets. This strategy overcomes intensity-variation challenges, outperforms traditional models like U-Net, and represents a promising direction in medical image analysis.