1 Introduction

With the advancement of deep learning technologies, their applications in biomedical image analysis, such as classification, localization and segmentation, have gained significant momentum (Ker et al., 2017). Most medical applications require delineating objects or regions (damaged tissues, cells, nuclei, organs, etc.) with fine boundaries from medical imaging such as CT scans, X-rays and ultrasound for diagnosis, monitoring and treatment. This delineation is generally performed by expert clinicians or radiologists, which is a complex and tedious task. With biomedical image segmentation being a precursor to computer-aided classification and localization, various deep learning based approaches have been developed to automate the segmentation process for faster diagnosis and better treatment (Haque and Neubert 2020). Among these approaches, U-Net (Ronneberger et al., 2015) based segmentation models gained significant popularity due to their mutable and modular structure, which lends itself to state-of-the-art diagnosis systems. In this context, several U-Net variants have been introduced to address the diverse challenges associated with biomedical image segmentation applications (Punn and Agarwal 2021a). For instance, Isensee et al. (2021) proposed a self-adapting framework, no-new-U-Net (nnU-Net), that dynamically tunes the U-Net based segmentation pipeline, covering the data pre-processing, data augmentation and post-processing required for different applications such as tumor segmentation and nuclei segmentation. Without manual tuning, nnU-Net achieved state-of-the-art performance on the majority of segmentation tasks.

However, the potential of deep learning segmentation models is only unlocked by training them with large amounts of annotated data, i.e., in a fully supervised manner. Generating annotations for such large datasets requires expert biomedical analysts and extensive manual effort; it is a tedious and expensive task that is also vulnerable to human error. To address this issue, various strategies have been adopted to train models efficiently with limited labelled data, such as data augmentation, transfer learning and self-supervised learning. Image data augmentation (Shorten and Khoshgoftaar 2019) aims to increase the number of labelled samples through geometric transformations, feature space augmentation, generative adversarial networks, etc. However, the diversity of the augmented samples is limited by the available annotated samples, which can cause the model to overfit. Several attempts have also been made to use transfer learning to improve model performance with limited annotated data. Though this strategy works very well with natural images, it is ineffectual in biomedical image analysis (Alzubaidi et al., 2020; Raghu et al., 2019) due to the large variation in the complex patterns of biomedical imaging compared to natural images.

Self-supervised learning (Jing and Tian 2020) is an emerging technology that is effectively closing the gap with fully supervised methods on large computer vision benchmarks, providing an effective solution to the limited availability of annotated data. The aim is to pre-train the model with an unsupervised strategy to learn useful representations of the data, and then fine-tune the pre-trained model with limited annotated samples for the actual task such as segmentation or classification. Recent developments in self-supervised learning can be categorized as contrastive learning (MoCo (He et al., 2020), PIRL (Misra & Maaten, 2020), SimCLR (Chen et al., 2020)), clustering (DeepCluster (Caron et al., 2018), SeLA (Asano et al., 2019), SwAV (Caron et al., 2020)), distillation (BYOL (Grill et al., 2020), SimSiam (Chen and He 2021)) and redundancy reduction (Barlow Twins (Zbontar et al., 2021)). The approaches categorized under contrastive learning, clustering and distillation are based on similarity maximization, which requires efficient generation of positive (related) and negative (unrelated) samples for pre-training. However, in biomedical image analysis identifying negative samples is a tedious and complex task (Zeng et al., 2021), since samples tend to be similar at the low level while differing in high level feature representations. Barlow Twins has no such requirement and is therefore more suitable for biomedical image segmentation. With this motivation, the present work proposes a self-supervised learning framework called BT-Unet for biomedical image segmentation, where the Barlow Twins strategy is integrated with U-Net segmentation models. The main contributions of the present research work are highlighted below:

  • The challenge of limited annotated biomedical data availability is addressed by integrating a redundancy reduction based self-supervised learning approach with U-Net segmentation models.

  • The pre-training of the U-Net encoder is performed with the Barlow Twins strategy to learn feature representations in an unsupervised manner (without data annotations).

  • The effect of pre-training on biomedical image segmentation performance is analyzed with multiple U-Net models over diverse datasets.

The rest of the paper is divided into several sections, where Sect. 2 presents the literature review of the recent developments in the self-supervised segmentation approaches, followed by the methods adopted in the proposed framework in Sects. 3 and 4. Sections 5 and 6 present the experimental setup and the obtained results respectively. Finally, concluding remarks are presented in Sect. 7.

2 Related work

In recent years, owing to developments in deep learning technologies, researchers have taken a keen interest in computer-aided diagnosis systems to promote better healthcare services across a variety of applications (Lei et al., 2020) such as classification, detection and segmentation. With segmentation being one of the critical aspects of diagnosis and follow-up treatment planning, various deep learning based segmentation models have been developed. However, the use of self-supervised learning strategies to improve segmentation performance remains relatively underexplored.

In the context of biomedical image segmentation, most of these approaches can be grouped into pretext based and contrastive learning based strategies. In pretext based self-supervised learning, a proxy task is performed to learn feature representations. A variety of pretext or proxy tasks can be used for pre-training, such as inpainting (Pathak et al., 2016), jigsaw puzzles (Noroozi and Favaro 2016), predicting the position of image patches (Doersch et al., 2015) and predicting rotations (Gidaris et al., 2018). However, there is a large gap between these tasks and the actual downstream tasks, due to which such strategies have achieved only limited success in deep learning applications. In contrastive learning based strategies, the feature representations are learned by effectively distinguishing positive (similar) and negative (dissimilar) pairs, and such unsupervised feature representations have recently gained significant interest. Following this context, Chaitanya et al., (2020) proposed a contrastive learning framework that adapts global and local features using unannotated samples during pre-training in a stage-wise manner for biomedical image segmentation. Similarly, Zheng et al., (2021) proposed a hierarchical self-supervised framework, where multiple heterogeneous datasets across multiple modalities are utilized for multi-level contrastive pre-training, which is then adapted to multiple segmentation tasks by fine-tuning. Dhere and Sivaswamy (2021) performed kidney segmentation with a self-supervision strategy in which contrastive learning is used with a pretext task defined as classifying which side a pair of kidneys belongs to; the pre-training is performed using a siamese network and the pre-trained encoder is fine-tuned in the U-Net model for final segmentation.

However, these contrastive approaches require generating effective positive and negative pairs, which is not feasible in every task: in nuclei segmentation or skin lesion segmentation, for example, the input samples are closely related and it is relatively hard to generate negative pairs (Li et al., 2021). Following this context, a redundancy reduction based strategy is adopted that does not require the generation of positive and negative pairs for pre-training. Here, the aim is to obtain feature representations that are invariant to distortions and independent across the neurons of a model, by driving the observed cross-correlation matrix between the representations of two distorted views toward the identity matrix, thereby reducing redundancy between the representations.

3 Methods

In this section, the background of the redundancy reduction based Barlow Twins approach for self-supervised learning is presented, along with the U-Net based models that are integrated with Barlow Twins for biomedical image segmentation.

3.1 Barlow twins

Inspired by Horace Barlow’s efficient coding hypothesis, in which neurons communicate via spiking codes that aim to reduce redundancy between neurons, Zbontar et al., (2021) proposed the redundancy reduction based Barlow Twins (BT) framework for self-supervised learning. The objective is to make each neuron produce feature representations that satisfy two conditions: (1) invariance - the representation is invariant under different augmentations, and (2) redundancy reduction - each neuron's output is independent of the other neurons. The overall BT framework is presented in Fig. 1. Two identical encoders (\(f_\theta \), a siamese network), sharing the same parameters and weights, generate feature representations (\(Z^A\) and \(Z^B\)) of the augmented or corrupted images (\(Y^A\) and \(Y^B\)). A cross-correlation matrix (\(\mathcal {C}\)) is then computed from the batch normalized feature representations \(Z^A\) and \(Z^B\). Finally, to satisfy the above two properties, the model is trained to bring the matrix \(\mathcal {C}\) as close as possible to the identity matrix using the loss function \( L_{{BT}} \) defined in Eq. 1. In the BT-Unet framework, the encoder of the U-Net models is pre-trained with the BT strategy and later fine-tuned to perform the actual segmentation.

Fig. 1 Schematic representation of Barlow twins (Zbontar et al., 2021)

$$ L_{BT} = \sum\limits_{i} \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum\limits_{i}\sum\limits_{j \ne i} \mathcal{C}_{ij}^2 $$
(1)
$$ \mathcal{C}_{ij} = \frac{\sum\limits_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum\limits_{b} \left(z^{A}_{b,i}\right)^2}\; \sqrt{\sum\limits_{b} \left(z^{B}_{b,j}\right)^2}} $$
(2)

where \(\sum _i (1-\mathcal {C}_{ii})^2\) is the invariance term (diagonal term) that directs neurons to produce the same output under different augmentations, \(\sum _{i}\sum _{j \ne i} {\mathcal {C}_{ij}}^2\) is the redundancy reduction term (off-diagonal term) that makes each neuron produce an output independent of the others, and \(z^A_{b,i}\) denotes the \(i\)-th component of the representation of sample \(b\) in view \(A\). The term \(\lambda \) balances the contributions of the invariance and redundancy reduction terms and is kept equal to 0.2 (Zbontar et al., 2021).
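For concreteness, a minimal TensorFlow sketch of this loss, matching Eqs. 1 and 2, is given below (TensorFlow is the library used in Sect. 5); the small epsilon added during normalization is our addition for numerical stability, and \(\lambda = 0.2\) follows the setting above.

```python
import tensorflow as tf

def barlow_twins_loss(z_a, z_b, lam=0.2):
    """Computes L_BT (Eq. 1) from two batches of embeddings.

    z_a, z_b: (batch, dim) representations of the two distorted views.
    lam: weight balancing the invariance and redundancy reduction terms.
    """
    batch_size = tf.cast(tf.shape(z_a)[0], tf.float32)

    # Normalize each feature dimension along the batch (mean 0, std 1).
    z_a = (z_a - tf.reduce_mean(z_a, axis=0)) / (tf.math.reduce_std(z_a, axis=0) + 1e-9)
    z_b = (z_b - tf.reduce_mean(z_b, axis=0)) / (tf.math.reduce_std(z_b, axis=0) + 1e-9)

    # Cross-correlation matrix C (Eq. 2), shape (dim, dim).
    c = tf.matmul(z_a, z_b, transpose_a=True) / batch_size

    # Invariance term: push the diagonal toward 1.
    on_diag = tf.reduce_sum(tf.square(1.0 - tf.linalg.diag_part(c)))
    # Redundancy reduction term: push the off-diagonal entries toward 0.
    off_diag = tf.reduce_sum(tf.square(c)) - tf.reduce_sum(tf.square(tf.linalg.diag_part(c)))
    return on_diag + lam * off_diag
```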

3.2 U-net models

U-Net (Ronneberger et al., 2015) is the most widely used model for biomedical image segmentation. As shown in Fig. 2, it follows a symmetric encoder-decoder design to extract and reconstruct feature maps respectively. The encoder uses stacks of ReLU activated convolution and pooling operations for feature extraction, and the resulting feature maps are concatenated with the corresponding decoder blocks via skip connections during feature up-sampling. Finally, a \(1 \times 1\) convolution in the output layer generates the segmentation mask, categorizing each pixel of the input image. The model was trained with the pixel-wise weighted cross-entropy function defined in Eq. 3 and achieved state-of-the-art results in the ISBI cell tracking challenge. Building on this potential, various U-Net based models have been developed for different biomedical image segmentation applications (Punn and Agarwal 2021a).

Fig. 2 U-net architecture (Ronneberger et al., 2015)

$$\begin{aligned} E = \sum _{{x} \in \Omega }\left( w_c({x}) + w_0 \cdot \exp \left( - \frac{(d_1({x}) + d_2({x}))^2}{2\sigma ^2}\right) \right) \log ({p}_{\ell ({x})}({x})) \end{aligned}$$
(3)

where \(p_{\ell (x)}(x)\) is the softmax probability of the true class \(\ell (x)\) at pixel x, \(d_1\) and \(d_2\) denote the distances to the nearest and second nearest object boundaries, \(w_c\) is the class-balancing weight map, and \(w_0\) and \(\sigma \) are constants.
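As an illustration, the weight map inside Eq. 3 could be computed from an instance-labelled mask roughly as in the following sketch; the inverse-frequency choice for \(w_c\) is an assumption made here, while \(w_0 = 10\) and \(\sigma = 5\) follow Ronneberger et al., (2015).

```python
import numpy as np
from scipy import ndimage

def unet_weight_map(instance_mask, w0=10.0, sigma=5.0):
    """Pixel-wise weight map of Eq. 3 for an instance-labelled mask.

    instance_mask: 2-D integer array, 0 = background, 1..K = object ids.
    """
    labels = np.unique(instance_mask)
    labels = labels[labels > 0]

    # Class-balancing weights w_c: inverse frequency of fg/bg (an assumption).
    fg = instance_mask > 0
    w_c = np.where(fg, 0.5 / max(fg.mean(), 1e-6),
                   0.5 / max(1.0 - fg.mean(), 1e-6))

    if len(labels) < 2:          # the border term needs at least two objects
        return w_c

    # Distance from every pixel to the border of each object.
    dists = np.stack([ndimage.distance_transform_edt(instance_mask != k)
                      for k in labels], axis=0)
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]  # nearest and second-nearest object borders
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
```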

In the present article, U-Net, attention U-Net (A-Unet) (Oktay et al., 2018), inception U-Net (I-Unet) (Punn and Agarwal 2020) and residual cross spatial attention guided inception U-Net (RCA-IUnet) (Punn and Agarwal 2021b) are considered to establish a broad comparative analysis of segmentation performance. In contrast to the U-Net model, A-Unet adds attention filters in the skip connections to suppress irrelevant features of the input image, while following a similar encoder-decoder structure. To efficiently capture the varied shape, size and location of the target structure, I-Unet introduces inception convolution layers in which multi-scale features are extracted at the same layer, forming a wider network; it also proposes a hybrid pooling layer that combines spatial max pooling and spectral pooling. Drawing on the potential of A-Unet and I-Unet, the RCA-IUnet model advances the attention filter to capture multi-scale feature maps and generate better attention descriptors for target regions, while also using the hybrid pooling and inception convolution layers, and reduces the cost of computation and training parameters with depthwise separable convolution (Chollet 2017).

4 Proposed framework

In the present article, the state-of-the-art potential of U-Net models is extended by integrating redundancy reduction based Barlow Twins self-supervised learning for better segmentation performance with limited annotated data samples. The schematic representation of the proposed framework is shown in Fig. 3.

Fig. 3 BT-Unet framework. a Pre-training U-Net encoder network, and b fine-tuning the U-Net model initialized with pre-trained encoder weights

The BT-Unet framework is divided into two phases: (1) pre-training and (2) fine-tuning. In pre-training, the aim is to learn complex feature representations from unannotated data samples; accordingly, the encoder network of the U-Net model is pre-trained using the BT self-supervised learning strategy. Initially, the input image is augmented or corrupted with certain distortions, such as random crops and rotations, to generate two distorted views. This choice of distortions follows the results of Zbontar et al., (2021) on the effect of augmentations on pre-training performance. Each augmented image is processed by the U-Net encoder followed by a projection network to generate encoded feature representations of the desired dimension. The projection network takes the feature maps produced by the encoder and applies global average pooling followed by blocks of fully connected layers, ReLU activation and batch normalization (FC + ReLU + BN); the final encoded feature representations are generated by another FC layer. Following empirical observations, the number of neurons in each fully connected layer is kept at half the spatial dimension of the input image for efficient pre-training, e.g., for input \(I\in \mathbb {R}^{s\times {s} \times {c}} \) the number of neurons is s/2, where s is the spatial dimension of the image. The number of neurons could be increased further, but at the cost of heavy computation, and no significant improvement was observed with increased dimensions. Since the later layers learn task specific features that are not aligned with the downstream segmentation task, the weights learned by the projection network can be discarded, whereas the weights of the entire encoder network are transferred to the U-Net model. In the second phase, the encoder network of the U-Net model is initialized with the pre-trained weights from the first phase, while the rest of the network is initialized with default weights. Finally, the U-Net model is fine-tuned with limited annotated samples for biomedical image segmentation.
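A minimal sketch of this pre-training head is given below; the number of FC + ReLU + BN blocks and the single-channel input are illustrative assumptions, while the s/2 neuron count follows the empirical setting above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_pretraining_network(encoder, img_size=256, channels=1, n_blocks=2):
    """Attaches the projection head to a U-Net encoder for BT pre-training.

    encoder: Keras model mapping an image to its deepest feature maps.
    img_size: spatial dimension s of the input; each FC layer has s // 2
    neurons. The number of blocks (n_blocks) is an illustrative assumption.
    """
    units = img_size // 2
    inputs = tf.keras.Input(shape=(img_size, img_size, channels))
    x = layers.GlobalAveragePooling2D()(encoder(inputs))
    for _ in range(n_blocks):              # FC + ReLU + BN blocks
        x = layers.Dense(units)(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization()(x)
    z = layers.Dense(units)(x)             # final encoded representation
    return Model(inputs, z, name="bt_pretraining_net")

# After pre-training, the projection head is discarded; only the encoder
# weights are transferred before fine-tuning:
#   unet_encoder.set_weights(pretrained_encoder.get_weights())
```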

5 Experiment configuration

This section covers the training and testing environment of the BT-Unet framework, along with the datasets and the modifications to the U-Net models used in the comparative analysis. To establish robust results with the BT-Unet framework, various state-of-the-art U-Net models are considered in the experiments: vanilla U-Net, attention U-Net (A-Unet), inception U-Net (I-Unet) and residual cross-spatial attention guided inception U-Net (RCA-IUnet). Inspired by the RCA-IUnet model, the following minor modifications are made to the U-Net, A-Unet and I-Unet architectures (a sketch of the resulting encoder stage follows the list):

  • Standard 2D convolution operations are replaced with 2D depthwise separable convolution to reduce the number of training parameters and multiplication operations without affecting performance.

  • Batch normalization is performed after every convolution operation for stable training.

  • Each encoder layer is equipped with residual skip connection (mini-skip connection) to avoid the vanishing gradient problem.

  • Encoding and decoding phases are divided into four stages. With each stage in the encoding phase, the number of channels increases by a factor of 2 (starting with 16 in the first layer) and the spatial resolution decreases by a factor of 2 (starting with 256\(\times \)256).
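A hedged sketch of one such modified encoder stage is shown below; the two-convolution layout and the 1\(\times \)1 shortcut projection are illustrative assumptions, while the separable convolutions, batch normalization, mini-skip connection and stage-wise down-sampling follow the list above.

```python
from tensorflow.keras import layers

def encoder_stage(x, filters):
    """One modified encoder stage: separable convolutions, batch
    normalization after every convolution, and a residual mini-skip."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)   # match channels
    y = layers.SeparableConv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                           # mini-skip
    y = layers.ReLU()(y)
    return layers.MaxPooling2D(2)(y), y  # down-sampled map, skip features

# Four stages on 256x256 inputs, doubling the channels: 16, 32, 64, 128.
```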

5.1 Dataset description and setup

The performance of the BT-Unet framework is validated on four datasets with different segmentation tasks, as summarized in Table 1. The datasets comprise images of organs, cells and lesions acquired under different imaging protocols.

Table 1 Summary of biomedical datasets used in our experiment

The Kaggle data science bowl 2018 (KDSB18) challenge targets automated nuclei segmentation. It contains annotated histopathological images with varying nuclei shapes, cell types, magnifications and modalities (fluorescence/brightfield). The breast ultrasound image segmentation (BUSIS) benchmark dataset comprises breast ultrasound scans annotated with binary masks of the tumor regions, covering a wide diversity of samples collected from various medical institutes and organizations. In the ISIC18 dataset, skin lesion segmentation is performed with the help of annotated dermoscopy images. To add further diversity, the brain tumor segmentation 2018 (BraTS18) challenge is considered, which comprises 3D volumes of MRI modalities with high-grade gliomas (HGG) and low-grade gliomas (LGG) annotated with different tumor regions: whole tumor (WT), tumor core (TC) and enhancing tumor (ET). This task is simplified by extracting 4,200 2D slices from the 3D volumes of the FLAIR modality and analyzing only the segmentation masks associated with the WT region.
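For illustration, one plausible way to perform this slice extraction is sketched below; the nibabel reader, the file layout and the minimum-tumor-area filter are assumptions for this sketch, not the authors' pipeline.

```python
import numpy as np
import nibabel as nib  # a common reader for BraTS .nii.gz volumes

def extract_flair_slices(flair_path, seg_path, min_tumor_pixels=100):
    """Illustrative 2-D axial slice extraction from one BraTS volume.

    Keeps FLAIR slices whose whole-tumor (WT) mask is non-trivial; the
    threshold and axis convention are assumptions.
    """
    flair = nib.load(flair_path).get_fdata()      # (H, W, depth)
    seg = nib.load(seg_path).get_fdata()
    wt = (seg > 0).astype(np.float32)             # all tumor labels -> WT
    slices, masks = [], []
    for k in range(flair.shape[-1]):
        if wt[..., k].sum() >= min_tumor_pixels:
            slices.append(flair[..., k])
            masks.append(wt[..., k])
    return np.array(slices), np.array(masks)
```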

5.2 Training and testing

The overall framework is developed using the TensorFlow v2.6 library on an Nvidia RTX 2070 Max-Q GPU system. For all experiments, images from the KDSB18, BUSIS, ISIC18 and BraTS18 datasets are resized to 256\(\times \)256. Each dataset is split into \(70\%\) training data and \(30\%\) testing data. Pre-training is performed on the complete training data without considering annotations. To simulate limited annotated data availability, \(20\%\) of the KDSB18 and BUSIS training data and \(10\%\) of the ISIC18 and BraTS18 training data are used for fine-tuning the segmentation models, with 5-fold and 10-fold cross-validation respectively. The Adam optimizer with learning rate initialized at \(1e-3\) is used for all experiments, decayed by a factor of 0.1 once learning stagnates, for better segmentation results. The training phase also uses an early-stopping strategy that avoids overfitting by terminating the training process when the loss stops decreasing. Pre-training minimizes the cross-correlation loss function defined in Eq. 1, whereas the U-Net models are fine-tuned with a segmentation loss function \(\mathcal {L}\), defined as the average of the binary cross-entropy loss \(\mathcal {L}_{BC}\) and the dice coefficient loss \(\mathcal {L}_{DC}\), as shown in Eq. 4.

$$ \mathcal{L} = \frac{1}{2}\,\mathcal{L}_{BC} + \frac{1}{2}\,\mathcal{L}_{DC} $$
(4)
$$ \mathcal{L}_{BC}\left(y, p(y)\right) = -\sum^{N}_{i}\left( y_i \log\left(p\left(y_i\right)\right) + \left(1 - y_i\right)\log\left(1 - p\left(y_i\right)\right) \right) $$
(5)
$$ \mathcal{L}_{DC}\left(y, p(y)\right) = 1 - \frac{2\sum^{N}_{i} y_i\, p(y_i)}{\sum^{N}_{i} y_i^2 + \sum^{N}_{i} p(y_i)^2} $$
(6)

where \(y_i\) is the ground truth label of pixel i, \(p(y_i)\) is the predicted probability for that pixel and N is the total number of pixels.
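A direct TensorFlow sketch of this composite loss is given below; the per-pixel mean used for \(\mathcal {L}_{BC}\) differs from the sum in Eq. 5 only by a constant scaling factor.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice coefficient loss of Eq. 6 over a batch of binary masks."""
    num = 2.0 * tf.reduce_sum(y_true * y_pred)
    den = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred))
    return 1.0 - num / (den + eps)

def segmentation_loss(y_true, y_pred):
    """Equal-weighted average of binary cross-entropy and dice loss (Eq. 4)."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    return 0.5 * bce + 0.5 * dice_loss(y_true, y_pred)
```

The fine-tuning setup of Sect. 5.2 can then be assembled roughly as follows; the patience values, epoch budget and batch size are assumptions, while the Adam optimizer, the 0.1 decay factor and early stopping follow the text.

```python
def finetune(model, x_train, y_train):
    """Fine-tunes a (pre-initialized) U-Net model with the loss above."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss=segmentation_loss)
    callbacks = [
        # Decay the learning rate by a factor of 0.1 once learning stagnates.
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                             factor=0.1, patience=5),
        # Stop training when the validation loss stops decreasing.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                         restore_best_weights=True),
    ]
    return model.fit(x_train, y_train, validation_split=0.2,
                     epochs=200, batch_size=8, callbacks=callbacks)
```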

The performance of the trained U-Net models is validated on the test sets using various evaluation metrics: precision (Pr.) (Eq. 7), dice coefficient (DC) (Eq. 8) and mean intersection-over-union (mIoU) (Eq. 9).

$$ Pr. = \frac{TP}{TP + FP} $$
(7)
$$ DC = \frac{2TP}{2TP + FN + FP} $$
(8)
$$ mIoU = \frac{1}{10}\sum_{t}\frac{TP}{TP + FN + FP}, \qquad t \in \{0.50, 0.55, \ldots, 0.95\} $$
(9)

where TP denotes true positives, TN true negatives, FP false positives, FN false negatives, and t the prediction threshold at which the predicted mask is binarized.
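As a reference implementation of Eq. 9, the threshold-averaged mIoU could be computed as in the following sketch, assuming binary ground-truth masks and predicted probability maps.

```python
import numpy as np

def miou(y_true, y_prob):
    """Mean IoU of Eq. 9, averaged over thresholds 0.50, 0.55, ..., 0.95."""
    ious = []
    for t in np.linspace(0.50, 0.95, 10):   # the 10 prediction thresholds
        y_pred = y_prob >= t
        tp = np.logical_and(y_true == 1, y_pred).sum()
        fp = np.logical_and(y_true == 0, y_pred).sum()
        fn = np.logical_and(y_true == 1, ~y_pred).sum()
        ious.append(tp / max(tp + fp + fn, 1))  # guard against empty masks
    return float(np.mean(ious))
```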

6 Results and discussion

The proposed framework generates a segmentation mask for a given medical image. The quantitative results of the U-Net models with and without Barlow Twins based pre-training on the four biomedical imaging datasets are presented in Table 2, along with the percentage change in segmentation performance in Fig. 4. Figure 5 presents a qualitative comparison of the segmentation performance. The following observations are made for each dataset:

Table 2 Quantitative comparative analysis of segmentation models. The best results with and without BT are shown with bold black and blue color respectively
Fig. 4 Impact of BT pre-training on segmentation performance of the U-Net models

Fig. 5 Qualitative comparative analysis of the segmentation performance

Table 3 Performance analysis of U-Net variants with and without pre-training using Barlow Twins (BT) over different fractions of training datasets (DS)
  • KDSB18 In the cell nuclei segmentation task of KDSB18, the BT enabled U-Net models outperform the models without BT (as shown in Table 2). It is also observed that as the architecture design becomes more complex, pre-training exhibits a more positive influence on segmentation performance (as shown in Fig. 4, the RCA-IUnet model achieves the maximum performance gain). Moreover, the minimal change in the performance of the vanilla U-Net model indicates that simpler encoder structures (close to vanilla U-Net, e.g. A-Unet) face difficulty in extracting feature maps from limited annotated samples. A similar pattern can be observed in the qualitative results in Fig. 5.

  • BUSIS The automated segmentation of breast tumors using ultrasound imaging achieves promising results with BT pre-training (as shown in Table 2). It is observed that the U-Net and A-Unet models without pre-training are unable to learn and extract feature maps concerning the tumor regions (achieving 0 precision, DC and mIoU); with pre-training, these models achieve noticeable improvement (as shown in Fig. 5). In the case of the I-Unet and RCA-IUnet models, considerable improvements are observed with pre-training, with the dice coefficient increasing by \(5\%\) for I-Unet and precision by \(11\%\) for RCA-IUnet (as shown in Fig. 4).

  • ISIC18 Skin lesion segmentation is another challenging task, in which the U-Net models with BT achieved satisfactory improvements. The I-Unet and RCA-IUnet models are influenced the most, achieving \(5.1\%\) and \(2.2\%\) increases in precision respectively. However, a slight decline in performance is observed for vanilla U-Net and A-Unet with BT pre-training (as shown in Fig. 4). In contrast to I-Unet and RCA-IUnet, these models have simpler encoder structures, due to which pre-training fails to learn efficient feature representations of the complex lesion regions. Furthermore, as observed in Fig. 5, the BT+RCA-IUnet model achieved the best skin lesion segmentation results with smoother boundaries.

  • BraTS18 In the brain tumor segmentation challenge, the models behave similarly to the ISIC18 dataset. The I-Unet and RCA-IUnet models achieve significant gains in segmentation performance under the BT-Unet framework, whereas the same behaviour is not observed for the vanilla U-Net and A-Unet models because of their inability to effectively capture tumor feature representations during pre-training. As observed from Fig. 4, the RCA-IUnet model achieves gains of \(5.3\%\), \(6.1\%\) and \(6.7\%\), while I-Unet achieves gains of \(4.4\%\), \(4.7\%\) and \(4.6\%\) in precision, dice coefficient and mIoU respectively.

Furthermore, the performance of the U-Net variants with and without pre-training is analysed for multiple fractions of the training datasets, as shown in Table 3. For all datasets, with training fractions below \(50\%\) the models show changes in performance similar to those discussed for Table 2. In addition, without pre-training, the larger fractions of BUSIS training samples allow U-Net and A-Unet to improve over the zero scores observed in Table 2, and a corresponding gain is also achieved with pre-training. However, for fractions greater than \(50\%\) the performance gap narrows, i.e. the results produced by the models with and without pre-training are not significantly different. This indicates that the pre-training strategy is most beneficial when annotations are limited within a large pool of data samples.

7 Conclusion

In this research work, a self-supervised learning assisted biomedical image segmentation framework, BT-Unet, is proposed to address one of the major challenges in the field: limited annotated data availability. The BT-Unet framework uses the redundancy reduction based Barlow Twins strategy to pre-train the encoder network of the U-Net model on unannotated data in an unsupervised manner, followed by fine-tuning of the U-Net model for the downstream biomedical image segmentation task with limited annotated samples. Exhaustive experimental trials make it evident that BT-Unet tends to improve the segmentation performance of U-Net models in such situations, and that this improvement is influenced by the underlying encoder structure and the nature of the biomedical image segmentation task. In future, more experiments can be conducted by modifying or exploring different pre-training strategies to generate better feature representations and ensure finer biomedical image segmentation.