Introduction

The paranasal sinuses, air-filled spaces within the craniofacial complex, comprise the maxillary, frontal, sphenoid, and ethmoid sinuses and vary considerably between individuals [1]. Common pathologies such as retention cysts, polyps, and mucosal thickening are identifiable through radiological screening [2,3,4]. However, their diagnosis is challenging due to their incidental nature and the variability in sinus appearance [5]. Research underscores their prevalence and the importance of accurate diagnosis for patient care [6]. 3D imaging from computed tomography (CT) and magnetic resonance imaging (MRI) is vital for precise diagnosis, and misdiagnosis can lead to patient distress and increased healthcare costs [7, 8]. The anatomical variability of the sinuses [9] necessitates careful application of deep learning for reliable diagnoses.

Convolutional neural networks (CNNs) are recognized for diagnosing paranasal pathologies, as evidenced in sinusitis classification [10, 11], differentiation of inverted papilloma from carcinoma [12], and detection of maxillary sinus (MS) fungal balls and chronic rhinosinusitis in CT scans [13]. Prior studies have explored contrastive learning and cross-entropy loss for MS anomaly classification [14] and MS extraction techniques from MRI [15]. However, all of the aforementioned methods use supervised learning. Despite the difficulty of obtaining well-labelled datasets in clinical settings [16] and the relative ease of acquiring unlabelled data, self-supervised learning (SSL), which learns representations from unlabelled data to improve a downstream task, has not yet been explored for paranasal anomaly classification. SSL exploits unlabelled data through tasks such as nonlinear compression [17, 18], denoising [19], feature alignment between augmented views [20,21,22], and inpainting of masked image regions [23]. However, these methods were designed to improve the performance of models exposed to 2D natural images and therefore lack a specific focus on enhancing MS anomaly classification from 3D MRI. Our aim is to design an SSL task that enables models trained on it to achieve maximum data efficiency in classifying paranasal anomalies. We hypothesize that anomaly segmentation within the MS could be a suitable SSL task. Lacking ground-truth segmentation masks, we use an unsupervised anomaly detection (UAD) framework, previously applied to brain [24, 25] and paranasal anomaly detection [26], to localize MS anomalies. A 3D convolutional autoencoder (CAE) trained on a labelled normal dataset reconstructs MS volumes; because it fails to reconstruct anomalies, its reconstruction errors localize anomalies in an unlabelled dataset. These errors, serving as pseudo segmentation masks, are used as targets in the SSL task. We investigate whether a 3D CNN that predicts these errors as an SSL task learns features that better discriminate anomalous from normal MS in our labelled dataset. Our SSL task thus leverages the normal MS data that is in any case required for supervised downstream training.

Overall, our main contributions can be summed up as follows:

  • We present a self-supervised method that improves the downstream classification of normal vs anomalous MS. Our self-supervision task explicitly learns to coarsely localize anomalies by reconstructing the residual volumes generated through the UAD-trained autoencoder. This distinguishes our approach from the compared methods, where anomaly localization is not a primary focus for the self-supervision task.

  • Our self-supervised method effectively utilizes labelled healthy MS data reserved for downstream tasks. Hence, we explore how varying the CAE training set impacts downstream classification performance.

  • We investigate the post-processing strategies and loss functions used in the self-supervision task to learn better transferable features for the downstream task.

Methods

Fig. 1

a Extraction of MS volumes from cranial MRI. b Exemplary coronal images of a normal MS volume and of MS with mucosal thickening, polyp, and cyst anomalies. c Our CAE architecture; here, k refers to kernel size, s to stride, p to padding, and c to channels, where, for example, 1/16 denotes an input channel of 1 and an output channel of 16. Each stage of the encoder and decoder consists of a 3D convolution followed by batch normalization and leaky ReLU; Upsample refers to trilinear upsampling. d Generation of the residual volume required for the self-supervision task using our CAE. e Our self-supervision task, in which the encoder and decoder are trained to reconstruct the residual volume. f Downstream task, in which the self-supervision-trained encoder is trained to classify between normal and anomalous MS

Description of dataset

As part of the Hamburg City Health Study (HCHS) [27], cranial MRI scans were obtained from individuals aged 45–74 years to evaluate neuroradiological parameters. The scans were acquired at the University Medical Center Hamburg-Eppendorf using fluid-attenuated inversion recovery (FLAIR) sequences and stored in the NIfTI format. The MRI scans had a resolution of 173 mm × 319 mm × 319 mm. The labelled dataset consisted of 1067 patients: 489 exhibited no pathologies in their left and right MS, while 578 had at least one MS presenting a polyp, cyst, or mucosal thickening. All these anomalies were grouped into the "anomaly" class. Our unlabelled dataset consists of 1559 patient MRIs. The diagnoses were established by two ENT specialists and one radiologist specialized in ENT. Figure 1b shows coronal slices highlighting the diverse set of anomalies present in our dataset.

Dataset preprocessing

In our dataset preprocessing, as outlined in previous work [14, 15], we first align all MRIs with a fixed sample from our dataset. Centroid locations of the left and right MS regions were recorded for 20 patients, and the mean centroid location from these 20 recordings was then used to extract the left and right MS volumes from every cranial MRI in the dataset. This step isolates the MS volumes relevant to our task of classifying healthy and anomalous MS. The extracted volumes, sized 64 mm × 64 mm × 64 mm, cover the entire MS. Figure 1a illustrates this extraction process.
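A minimal sketch of this cropping step follows; the function name, the NIfTI loading call, and the example centroid are illustrative, not values taken from the study:

```python
import numpy as np

def extract_ms_volume(mri: np.ndarray, centroid: tuple, size: int = 64) -> np.ndarray:
    """Crop a cubic sub-volume of side `size` centred on `centroid`.

    `mri` is a 3D array, e.g. loaded from a NIfTI file with nibabel;
    `centroid` is the mean MS centroid estimated from the 20 annotated patients.
    Assumes the centroid lies at least `size // 2` voxels from every border.
    """
    half = size // 2
    z, y, x = centroid
    return mri[z - half:z + half, y - half:y + half, x - half:x + half]

# Illustrative usage with a hypothetical centroid:
# mri = nibabel.load("subject_flair.nii.gz").get_fdata()
# left_ms = extract_ms_volume(mri, centroid=(90, 160, 120))
```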

Each cranial MRI yielded one left and one right MS volume. To enhance symmetry, right MS volumes were horizontally flipped to match the left ones. All volumes were normalized to an intensity range of 0 to 1. We employed fivefold cross-validation for evaluation, ensuring that labelled training subsets of varying size (10%, 20%, 40%, 60%, 80%) maintain the anomaly-to-normal ratio. The separation of training, validation, and test sets was strictly maintained, with the left and right MS volumes of a patient assigned to only one set. Table 1 details our dataset division across these sets.
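One way to realize such a patient-level, class-stratified split is scikit-learn's StratifiedGroupKFold; the library choice and the dummy labels below are assumptions for illustration, not details from the paper:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Dummy stand-ins: one label and one patient id per MS volume, so the left
# and right MS volumes of a patient always land in the same fold.
labels = np.random.randint(0, 2, size=2134)   # 0 = normal, 1 = anomalous
patient_ids = np.repeat(np.arange(1067), 2)   # two MS volumes per patient

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(np.zeros(len(labels)), labels, groups=patient_ids):
    pass  # build train/val/test sets; subsample train to 10-80% keeping the class ratio
```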

Table 1 Statistics of our labelled dataset \(D_l\)

Architecture

Our CAE, depicted in Fig. 1c, uses 3D convolutional operations with a latent bottleneck dimension of 512. The CNN architecture is U-Net inspired, featuring a 3D ResNet18 encoder \(E(.)\) [28] with four stages and channel dimensions of 64, 128, 256, and 512. The decoder \(D(.)\) mirrors the encoder, with reversed channel dimensions and trilinear upsampling. Skip connections pass encoder features to the decoder. For Bootstrap Your Own Latent (BYOL), SimSiam, and SimCLR training, only the encoder \(E(.)\) is used, with an attached MLP that projects the final-layer features to a dimension of 512.
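The following is a simplified sketch of such a U-Net-inspired 3D autoencoder. The channel widths are scaled down for brevity, the additive (rather than concatenated) skip connections and the final sigmoid are our assumptions, and the paper's actual encoder is a 3D ResNet18:

```python
import torch
import torch.nn as nn

def enc_stage(cin, cout):
    """3D conv (k=3, s=2, p=1) + batch norm + leaky ReLU; halves each spatial dim."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=2, padding=1),
        nn.BatchNorm3d(cout),
        nn.LeakyReLU(inplace=True),
    )

def dec_stage(cin, cout, final=False):
    """Trilinear upsampling followed by a 3D conv; mirrors an encoder stage."""
    layers = [
        nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),
        nn.Conv3d(cin, cout, 3, stride=1, padding=1),
    ]
    if not final:
        layers += [nn.BatchNorm3d(cout), nn.LeakyReLU(inplace=True)]
    return nn.Sequential(*layers)

class CAE(nn.Module):
    def __init__(self, chans=(16, 32, 64, 128)):  # illustrative widths
        super().__init__()
        ins = (1,) + chans[:-1]
        self.enc = nn.ModuleList(enc_stage(i, o) for i, o in zip(ins, chans))
        self.dec = nn.ModuleList(
            dec_stage(o, i, final=(i == 1))
            for i, o in reversed(list(zip(ins, chans)))
        )

    def forward(self, x):
        feats = []
        for enc in self.enc:
            x = enc(x)
            feats.append(x)
        feats.pop()  # the deepest feature map is `x` itself
        for dec in self.dec[:-1]:
            x = dec(x) + feats.pop()  # additive skip connection
        return torch.sigmoid(self.dec[-1](x))  # intensities in [0, 1]

# CAE()(torch.randn(2, 1, 64, 64, 64)).shape -> torch.Size([2, 1, 64, 64, 64])
```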

Autoencoder training and inference on unlabelled dataset

Consider \(D_{l}\) to be our labelled dataset containing normal and anomalous MS and \(D_{u}\) to be our unlabelled dataset. Further, let \(D_{l}^{n} \subset D_{l}\) be the subset consisting of only normal MS volumes. Let \(x \in R^{64 \times 64 \times 64}\) be an MS volume in \(D_{l}\), and let the autoencoder be represented as \(A(.)\) such that \(x' = A(x)\) is the reconstructed MS volume. We train the autoencoder on \(D_{l}^{n}\) using the L1 reconstruction loss \(\Vert x - x'\Vert _1\). Once trained, we use the autoencoder \(A(.)\) to generate residual volumes, i.e. the voxel-wise reconstruction errors, on \(D_{u}\). Figure 1d illustrates our residual volume generation method.
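As a sketch, one training step and the residual generation can be written as follows; taking the absolute value of the voxel-wise error is our assumption, consistent with common UAD practice:

```python
import torch
import torch.nn.functional as F

def cae_train_step(cae, x, optimizer):
    """One optimization step on a batch x of normal MS volumes from D_l^n."""
    optimizer.zero_grad()
    loss = F.l1_loss(cae(x), x)  # L1 reconstruction loss ||x - x'||_1
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def residual_volume(cae, x):
    """Residual volume for a batch x from D_u: the voxel-wise reconstruction error."""
    cae.eval()
    return (x - cae(x)).abs()  # large where anomalies were poorly reconstructed
```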

Fig. 2

Our data processing pipeline comprises several steps: a The labelled dataset \(D_l\). b Splitting \(D_l\) into training, validation, and test subsets for downstream classification of normal versus anomalous MS. c Normal MS samples from the labelled training set form \(D_{l}^{n}\), used to train the 3D CAE \(A(.)\) within the UAD framework. d The unlabelled dataset \(D_u\). e The trained 3D CAE \(A(.)\) generates residual volumes from the unlabelled dataset \(D_u\). f The unlabelled dataset of residual volumes. g The 3D CNN undergoes self-supervised training to reconstruct these residual volumes. h The 3D CNN's encoder is initialized with weights from the SSL task and then undergoes supervised training for the final task of classifying normal versus anomalous MS, using the training set created in step (b)

Transfer learning

Since transfer learning (TL) is a method to achieve data efficiency, we also trained our models initialized with transfer learning weights. However, since our downstream task involves MRI and lies in the 3D domain, ImageNet [29] weights may not be appropriate. Hence, the initial weights we utilized were obtained through training on eight diverse public 3D segmentation datasets covering both MRI and CT modalities. We believe these weights are more suitable than those derived from natural images and therefore employed them as the basis for our 3D CNN. For further information on the transfer learning model, please see the accompanying GitHub repository.
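A typical way to initialize \(E(.)\) from such a checkpoint is a partial state-dict load; the file name below is hypothetical, and `encoder` denotes the 3D ResNet18 \(E(.)\) described above:

```python
import torch

state = torch.load("pretrained_3d_encoder.pth", map_location="cpu")  # hypothetical path
# strict=False skips pretraining-specific weights (e.g. segmentation heads)
# that have no counterpart in E(.).
missing, unexpected = encoder.load_state_dict(state, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```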

Self-supervised training

With the residual volumes generated for \(D_{u}\), we train \(E(.)\) and \(D(.)\) to reconstruct these residual volumes. In effect, this makes the encoder and decoder learn features relevant for anomaly localization within the unlabelled dataset \(D_{u}\). We train \(E(.)\) and \(D(.)\) using \(L_{recon}\), which in our case is the binary cross-entropy (BCE) loss. Figure 1e illustrates our self-supervised training task. We evaluated our self-supervised learning method against the autoencoder (AE), denoising autoencoder (DAE), BYOL, SimSiam, SimCLR, and sparse masked modelling with hierarchy (SparK). These methods use similar encoders \(E(.)\) and decoders \(D(.)\), with BYOL, SimSiam, and SimCLR employing an additional MLP for feature projection. Pretraining with the SparK framework requires a sparse encoder \(E'(.)\) and a special lightweight decoder containing 3 convolutional blocks and 3 upsampling blocks [23]. A patch size of \(8 \times 8 \times 8\) and a masking ratio of 60% were used during pretraining. Detailed descriptions and implementation details of the state-of-the-art (SOTA) SSL methods are provided in supplementary material sections 1–7. More details about the other masking ratios and patch sizes tested for SparK can be found in supplementary material section 11.
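A minimal sketch of one self-supervision step, under our reading of Fig. 1e that the network maps an MS volume to its residual volume (the decoder is assumed to end in a sigmoid so that predictions lie in [0, 1]):

```python
import torch.nn.functional as F

def ssl_train_step(encoder, decoder, x, target_residual, optimizer):
    """Predict the CAE-generated residual volume for the MS volume x."""
    optimizer.zero_grad()
    pred = decoder(encoder(x))                            # predicted residual
    loss = F.binary_cross_entropy(pred, target_residual)  # L_recon (BCE)
    loss.backward()
    optimizer.step()
    return loss.item()
```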

Table 2 The table displays the mean and 95% confidence intervals of metrics evaluating model performance in the downstream classification task

Finetuning

Having trained \(E(.)\) and \(D(.)\) using self-supervision, we move on to the finetuning phase. We discard \(D(.)\) and train \(E(.)\) on samples from the labelled dataset \(D_{l}\). For TL models, we initialize \(E(.)\) with the transfer learning weights. Next, we introduce an MLP that projects the encoder features from their original dimension of 512 to an intermediate dimension of 256 and then maps them to a final dimension of 2, corresponding to the number of classes. We finetune \(E(.)\) using the BCE loss.
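A sketch of the resulting classifier; the global average pooling used to obtain a 512-dimensional vector from the encoder's final feature map is our assumption:

```python
import torch.nn as nn

class Classifier(nn.Module):
    """SSL- (or TL-) initialized encoder E(.) followed by the 512 -> 256 -> 2 MLP."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),  # assumed pooling to a 512-d vector
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2),                      # normal vs anomalous MS
        )

    def forward(self, x):
        return self.head(self.encoder(x))
```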

Figure 2 illustrates the data processing pipeline and elucidates how the different components fit into our overall method.

Implementation details

Our PyTorch and PyTorch Lightning-based code accommodates a maximum batch size of 256 on an NVIDIA A6000 with 48 GB VRAM for self-supervised pretraining. We optimize models using LARS [30] with a learning rate of 0.2 for 500 epochs, incorporating a 20-epoch linear warmup and cosine annealing. For finetuning, AdamW [31] is employed with a constant learning rate of 1e-4 for 100 epochs at a batch size of 16. Models yielding the lowest validation loss are preserved for final evaluation on the test set. The CAE was trained on 708 normal MS volume samples without augmentation. For the self-supervised methods and MS anomaly classification, we applied data augmentations such as random affine transformations, flipping, and Gaussian noise. The DAE specifically used Gaussian noise with a mean of 0 and standard deviation of 0.6 at 100% probability, while the other augmentations were applied 50% of the time. The supplementary material offers comprehensive descriptions and visualizations of the SOTA SSL methods.
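The stated augmentations could be realized, for instance, with TorchIO; the library choice, the flip axes, and the noise range for the non-DAE pipeline are assumptions, not details from the paper:

```python
import torchio as tio

# General pipeline: each augmentation applied 50% of the time.
augment = tio.Compose([
    tio.RandomAffine(p=0.5),
    tio.RandomFlip(axes=(0, 1, 2), p=0.5),
    tio.RandomNoise(mean=0.0, std=(0.0, 0.1), p=0.5),  # std range is an assumption
])

# DAE corruption: Gaussian noise with a fixed std of 0.6, always applied.
dae_noise = tio.RandomNoise(mean=0.0, std=(0.6, 0.6), p=1.0)
```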

Results

Comparison to state of the art

Results in Table 2 show our method outperforming others in AUROC, AUPRC, and F1 scores across different labelled dataset scenarios (10%, 20%, 100% of \(D_{l}\)). Our method demonstrated notable improvements in AUROC (3.34% and 4.93% over SimSiam) and AUPRC (5.33% over BYOL and 5.12% over AE) for the 10% and 20% dataset scenarios, respectively. SparK-trained models generally perform worse than the other SSL and TL methods, with the performance gap between SparK MAE and our method widening as the training set percentage increases. Our method achieved an AUPRC 8.21% higher than the TL method when finetuned on a 10% training set. Pretraining models with our method significantly boosted AUPRC by 14.49% and AUROC by 9.45% compared to no pretraining when trained on a 10% training dataset. At 100% dataset finetuning, our method achieved the highest scores, with AE and SimSiam showing similar performance; compared to no pretraining, our method improved AUPRC by 3.33%. Figure 3 illustrates the AUPRC and AUROC trends with increasing training set percentages. Our method excels in settings with 40% or less training data but aligns with SOTA performance beyond that.

Fig. 3

(Left) AUPRC trend vs. training set percentage. (Right) AUROC trend vs. training set percentage

Table 3 The table shows the mean and 95% confidence intervals of metrics for evaluating model performance in downstream classification

Effect of varying the CAE training set

The effectiveness of our self-supervised task is contingent on the CAE's proficiency in reconstructing healthy MS volumes: inaccurate reconstructions yield unreliable residuals, affecting self-supervision. To assess the impact of training set size, the CAE was trained with different proportions (20%, 40%, 60%, 80%, 100%) of the healthy MS dataset \(D_{l}^{n}\). After training, the CAE processed dataset \(D_{u}\) to produce residual volumes, which were refined using a median filter with a kernel size of 5. Subsequent supervised training utilized 10% of our labelled dataset \(D_{l}\). Table 3 presents improvements in the downstream task metrics correlating with increased healthy MS training set sizes, suggesting that a larger normal dataset \(D_{l}^{n}\) enhances normal MS representation learning and improves anomaly localization.
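The median filtering step amounts to a single call, e.g. with SciPy on a residual volume `residual` stored as a NumPy array:

```python
from scipy.ndimage import median_filter

# Kernel size 5 suppresses isolated high-error voxels while preserving
# larger anomalous regions in the residual volume.
residual_filtered = median_filter(residual, size=5)
```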

Discussion

Tailoring SSL tasks to specific downstream tasks offers distinct advantages [32]. Current SOTA SSL methods [20,21,22], primarily developed for 2D image classification on datasets like ImageNet, do not address the unique challenges of 3D MRI modalities and the specifics of paranasal anomalies. Our SSL task is specifically tailored to address the challenges associated with 3D environments, MRI modality, and the classification of paranasal anomalies.

We conjecture that segmenting anomalies as an SSL task, which requires knowledge of anomaly locations, enhances the learning of class-discriminative features for distinguishing normal and anomalous MS. Because our SSL task is a segmentation task, it requires segmentation masks highlighting anomalies. To avoid the high costs of annotation, we use a CAE trained in the UAD framework to generate approximate annotations, an approach effective in localizing paranasal anomalies [26]. This CAE training utilizes labelled normal datasets, typically accessible in supervised settings. Unlike generic SOTA SSL methods, which do not prioritize anomaly localization, our approach demonstrates improved AUROC and AUPRC (as shown in Table 2), suggesting that effective anomaly localization can enhance classification performance, even with limited labelled data.

Methods like BYOL and SimSiam, which aim to maximize agreement between augmented views, are less effective for paranasal anomaly classification. SimCLR's performance shortfall is likely due to smaller batch sizes, a necessity given the impracticality of large batches in 3D settings despite SimCLR's recommendation of 4096 [33]; our method is better suited to such constrained computational resources. AE and DAE, focusing on compression-decompression and denoising, do not guarantee discriminative feature learning for downstream classification [34] and were found less effective in our context. When the entire training set is used, our method, AE, and SimSiam yield comparable results, with ours marginally outperforming.

We also explored MAE-style pretraining using SparK. However, the results suggest that its finetuning performance is notably weaker, particularly with training set percentages of 40% and above. These findings imply that reconstructing masked regions contributes to representation learning, but the acquired representations do not appear to enhance downstream classification. It is noteworthy that the SparK framework was initially developed and evaluated on 2D natural images; although we adapted the framework for 3D applications, our findings underscore the need for further methodological advances to effectively support tasks in the 3D domain.

Further, TL models exhibit performance comparable to the SSL methods when finetuning on training sets exceeding 20%. This suggests that transfer learning remains viable for paranasal anomaly classification given an ample supply of labelled samples. However, with an extremely limited labelled dataset, such as 10%, our method outperforms TL, indicating that the representations acquired by our approach are especially advantageous in low-data environments. Overall, compared to approaches without pretraining, our tailored SSL task consistently shows superior downstream classification performance, underlining its efficacy.

Our analysis of the impact of the CAE training set size, shown in Table 3, demonstrates that a substantial cohort of normal MS volumes yields notable benefits for both the self-supervision task and the subsequent downstream task, suggesting that a larger cohort enables better anomaly localization by the CAE and thereby better representation learning by the CNN in the self-supervision task. We also analysed the influence of the loss function and post-processing used in the self-supervision task; this analysis can be found in supplementary material sections 8 and 9.

Our study has limitations that require further investigation. It is based on single-centre, MRI-only data, so multi-centre studies with varied imaging modalities are needed to establish generalizability. Unlike other self-supervised tasks, our method relies on a cohort of healthy MS volumes. We focused on convolutional autoencoders and did not explore models such as variational autoencoders, generative adversarial networks, transformer-based architectures, or diffusion models, which might offer better anomaly localization. We compared the L1, L2, and BCE loss functions but not others such as the Structural Similarity Index or perceptual loss. Future research should examine these aspects and apply this self-supervision approach to other domains, such as brain anomaly detection.

Conclusion

We developed a novel self-supervision task that focuses on anomaly localization to better classify paranasal anomalies in the maxillary sinus, addressing the lack of methods that effectively use unlabelled datasets to learn discriminative features for this purpose. Our approach uses an autoencoder trained on healthy MS volumes to generate residual volumes from an unlabelled dataset. These residuals serve as coarse segmentation masks for localizing anomalies. By training a CNN to reconstruct these volumes, it implicitly learns anomaly localization and thereby develops transferable features for the downstream classification task. Our method outperforms existing self-supervision techniques, demonstrating its effectiveness in this specific domain.