Background

Artificial intelligence (AI) has become an important tool in healthcare and medical image analysis [1]. Its application in radiology [2], specifically in the automated diagnosis of chest radiographs [3], has gained increasing traction. Given the complexity and variability of chest radiographs, leveraging AI for improved interpretation is an important area of research and application. Since the number of labeled chest radiographs with definitive diagnoses available for training AI models is limited, interest in self-supervised learning (SSL) has grown.

SSL is a learning paradigm that allows models to derive rich representations from unlabeled data [4,5,6]. Unlike traditional supervised learning (SL), which relies on accurately labeled datasets that can be laborious and resource-intensive to create, SSL can be applied to images that carry no labels at all, offering a promising alternative for robust feature extraction. In addition, exciting possibilities arise from AI advancements such as the evolution of transformer architectures from the realm of natural language processing (NLP) to computer vision [7]. The “vision transformer” (ViT), introduced in 2021 by Dosovitskiy et al. [8], replaces traditional convolution-based techniques with self-attention mechanisms [7], showing promise for healthcare applications. Nevertheless, further exploration is needed to fully integrate these advancements with existing pretraining methodologies [9], and we tackle this problem in our investigation.

It has been established in the literature that selecting an appropriate weight initialization for deep neural networks is a critical step that can influence the performance of AI models [10,11,12]. Usually, this is done by pretraining the network with SL on an unrelated task before training on the actual task. Numerous large-scale, public, annotated pretraining image datasets are available for this paradigm. The most widely used such datasets are ImageNet [13], the dataset of the Canadian Institute for Advanced Research, CIFAR [14] (CIFAR-10 and CIFAR-100), PASCAL Visual Object Classes [15], Microsoft Common Objects in Context [16], and Places [17]. These datasets provide a valuable resource for initializing network weights when dedicated task-related pretraining weights are not accessible. In particular, the ImageNet database and its extended versions such as ImageNet-21K [13], comprising roughly 14 million annotated images, have enabled substantial performance increases of AI models and are widely regarded as the benchmark for pretraining deep learning models for image classification tasks [10,11,12].

One drawback is that pretraining in this manner requires the images to carry labels describing what can be seen in them. This naturally limits the number of available images, since labeling is a costly and resource-intensive procedure. Methods that use SSL, such as those described in the literature [4,5,6,18,19,20], on the other hand, have the advantage that images do not need to be labeled, and thus, much larger databases can be constructed (Fig. 1).

Fig. 1

The process and advantages of utilizing self-supervised learning (SSL) as a pretraining method for medical AI models. a Supervised learning: the traditional process of AI pretraining using labeled datasets, which can be resource- and time-intensive due to the need for manual annotation. b SSL paradigm: AI models are trained on unlabeled non-medical images, taking advantage of freely available data and bypassing the need for costly and time-consuming manual labeling. c Transfer of the representations learned by the SSL-pretrained model on non-medical images to a supervised model for accurately diagnosing medical images, highlighting the potential for improved performance in medical AI models due to the large-scale knowledge gained from SSL

In this study, we investigate whether pretraining with SSL on large unannotated image databases, based on DINOv2 [18], can improve the performance of medical AI models compared with pretraining with SL. We examine this by training AI models to diagnose over 20 radiological imaging findings on an international multi-site dataset spanning three continents and comprising over 800,000 chest radiographs.

Methods

Patient cohorts

We analyzed frontal chest radiographs from six international patient cohorts across three continents, sourced from the VinDr-CXR [21], ChestX-ray14 [22], CheXpert [23], MIMIC-CXR [24], UKA-CXR [3, 25,26,27,28], and PadChest [29] datasets. Collectively, the study encompassed 805,805 radiographs from patients aged between 1 and 111 years. The median patient age was 61 years, with an average of 59 years and a standard deviation of 18 years. An overview of the characteristics for each dataset can be found in Table 1.

Table 1 Characteristics of the datasets utilized in this study

Label generation and parameters

This subsection delves into the label generation process, details the specific labels associated with each chest radiograph dataset, and references imaging parameters provided in the original studies. The labeled diseases within each dataset were not identical but overlapped partially; details are given in Table 2.

Table 2 Distribution of different labels provided across datasets, considering only frontal images

VinDr-CXR

The VinDr-CXR [21] dataset, collected between 2018 and 2020, sourced over 100,000 chest radiographs from the picture archiving and communication system servers of two Vietnamese hospitals. These images were captured using a broad spectrum of scanners from different medical equipment brands. The dataset was carefully anonymized to protect patient privacy. A Python script removed digital imaging and communications in medicine (DICOM) tags containing protected health information (PHI) [30], keeping only vital image processing attributes. Textual data on the images was automatically erased, with a manual check ensuring no text remained. While the primary focus was on adult posteroanterior-view chest radiographs, the collection did have outliers, which were filtered out using a binary classifier. The dataset was annotated for 28 findings and diagnoses, including 22 localized and 6 global labels. Expert radiologists curated these labels based on condition prevalence and visibility in chest radiographs. Using a web-based system [31], 17 radiologists labeled the data. From the refined data, 18,000 radiographs were selected, with 15,000 designated for training and 3,000 for testing. Three radiologists independently annotated each image, and for the test set, any disagreements were resolved by two senior radiologists to ensure label accuracy [21].

ChestX-ray14

The ChestX-ray14 [22] dataset targets fourteen common thoracic pathologies, identified through radiologists’ input. Using these pathologies as keywords, related radiological reports and images were extracted from the picture archiving and communication system. Through NLP techniques [32], reports were labeled based on the presence or absence of the specified pathologies while also excluding negations and uncertainties. The labeling process involved two main steps [22]: (i) initially detecting disease concepts primarily from report sections and then (ii) categorizing undetected reports as “normal.” Disease identification was enhanced using DNorm [33] and MetaMap [34]. To ensure accurate labeling, the team integrated advanced methodologies for handling negations and uncertainties, leveraging tools like NLTK [35], the Bllip parser [36], David McClosky’s biomedical model [37], and the Stanford dependencies converter [38]. A “normal” label was applied if no disease was detected or if the report indicated normalcy. The labeling approach’s accuracy was validated using the OpenI API [39, 40].

CheXpert

The CheXpert [23] dataset includes 224,316 frontal and lateral chest radiographs from 65,240 patients, collected from Stanford Hospital between 2002 and 2017. Each radiograph is annotated for 14 clinically relevant observations [41] as positive, negative, or uncertain. The selection of these observations emerged from the manual review of 1,000 associated radiology reports by a board-certified radiologist. The labeling process relied on a rule-based NLP labeler and proceeded in three stages. First, key observations were extracted from the Impression section of the radiology reports using a comprehensive list of phrases meticulously curated by radiologists. Second, the extracted mentions were classified as negative, uncertain, or positive. Any ambiguities in the report, or direct expressions of uncertainty by the radiologist, were categorized as “uncertain.” If a mention was not distinctly categorized, it defaulted to a positive label. Following a procedure similar to NegBio [42], this classification leaned on tools such as NLTK [35], the Bllip parser [36], and Stanford CoreNLP [43], seeking a universal dependency parse of the report. Finally, the individual mention classifications were aggregated to assign a conclusive label to each of the 14 observations. The absence of a mention was labeled as blank [23].

MIMIC-CXR

The MIMIC-CXR [24] dataset encompasses 377,110 frontal and lateral images stemming from 227,835 radiographic studies conducted at Beth Israel Deaconess Medical Center, Boston, MA, USA. Chest radiographs from 2011 to 2016 were identified, and all corresponding reports within this timeframe were extracted. The radiographs, sourced in DICOM format, underwent rigorous de-identification, particularly for potential PHI in meta-data and “burned in” annotations [24]. Furthermore, the reports underwent a detailed, rule-based de-identification, producing two primary segments, an optional addendum and the primary report body, both written by radiologists. Extraneous details were trimmed, and any PHI was uniformly replaced with underscores. Notably, the same NLP labeler employed in the CheXpert [23] dataset was applied to these reports. This facilitated the automatic generation of labels for the chest radiographs, categorizing the 14 imaging findings, consistent with CheXpert, as positive, negative, or uncertain. To validate the de-identification process, 2,238 radiology reports were manually annotated to detect PHI. This manual process identified eight tokens of PHI that the automated method overlooked, which were subsequently removed [24].

UKA-CXR

The UKA-CXR [3, 25,26,27,28], an internal dataset from University Hospital RWTH Aachen, Germany, includes frontal chest radiographs collected between 2009 and 2020. The images were acquired in 10 different intensive care units using 18 distinct mobile radiography systems operated by over 70 specialized radiologic technologists; acquisition evolved from conventional screen-film systems to digital flat-panel detectors by 2016. Despite diverse patient positioning and source-to-detector distances, all images were consistently acquired in the anteroposterior orientation with automatic exposure control. Labeling involved a rigorous review of each radiograph by one of 98 radiologists on designated clinical workstations, employing a standardized template. These radiologists, accredited or guided by board-certified colleagues, adhered to established radiologic conventions while evaluating the images [3]. The dataset features labels such as pleural effusion, pneumonia, atelectasis, congestion, and cardiomegaly, each segmented into five distinct severity or extent gradations. For instance, cardiomegaly ranged from “normal” to “massively enlarged,” whereas other labels spanned classifications such as “negative,” “mild,” “moderate,” “severe,” and “uncertain mild” [3, 25].

PadChest

The PadChest [29] dataset, derived from the Hospital Universitario de San Juan in Alicante, Spain, encompasses studies from 2009 to 2017, totaling 109,931 studies and 168,861 distinct frontal and lateral images. All data were de-identified. Image intensities were dynamically rescaled based on DICOM parameters, and no spatial resizing was applied so as to maintain resolution. Projection and body position information were used to categorize images into six primary groups: standard posteroanterior, standard lateral, anteroposterior vertical, anteroposterior horizontal, pediatric, and rib views [29]; 27% of the reports, corresponding to 27,593 studies, were manually annotated by radiologists. This was streamlined by an automated topic extraction process, which presented radiologists with frequently occurring sentences, allowing for more efficient and consistent labeling. Once this subset of data was labeled, it was used to train a multilabel text classifier, which was then employed to automatically annotate the remaining 73% of the reports [29].

Experimental design

A schematic representation of the study methodology is presented in Fig. 2. The process commenced with step 1, i.e., the pretraining of a ViT [8] base model. This was achieved through three distinct strategies: (i) SSL with non-medical images, DINOv2 [18], (ii) SL on ImageNet-21K [13], and (iii) SL with MIMIC-CXR chest radiographs [24]. Step 2 involved fine-tuning the models using labeled chest radiographs. Finally, in step 3, the refined models underwent an evaluation process, where they were tested using images from held-out test sets of chest radiographs from different domains.

Fig. 2

General methodology. a Pretraining: the vision transformer base (ViT-B) undergoes pretraining through three avenues: (i) self-supervised learning (SSL) on non-medical images (DINOv2 [18]), (ii) supervised learning (SL) using ImageNet-21K [13], and (iii) SL based on MIMIC-CXR [24] chest radiographs. b Fine-tuning: ViT-B models are subsequently fine-tuned using labeled chest radiographs from various datasets. c Prediction: diagnostic performance of these models is assessed using images from unseen test sets from various datasets. Although this figure exemplifies pneumonia prediction using a single dataset, steps 2 (fine-tuning) and 3 (systematic evaluation) were consistently implemented across six major datasets: VinDr-CXR (n = 15,000 training, n = 3,000 testing), ChestX-ray14 (n = 86,524 training, n = 25,596 testing), CheXpert (n = 128,356 training, n = 39,824 testing), MIMIC-CXR (n = 170,153 training, n = 43,768 testing), UKA-CXR (n = 153,537 training, n = 39,824 testing), and PadChest (n = 88,480 training, n = 22,045 testing). The refined models identify a total of 22 distinct imaging findings

Network architecture

Our study employed the original 12-layer vision transformer (ViT) base (ViT-B) model as devised by Dosovitskiy et al. [8]. This network ingested image inputs of dimensions 224 × 224 × 3 in batches of 32. For compatibility with the red, green, and blue (RGB) format of the pretraining images, grayscale radiographs were replicated across three channels while retaining their grayscale nature. The patch embedding layer used patch sizes of either 16 × 16 or 14 × 14, depending on the available pretrained weights, and was implemented as a convolution with a matching stride of 16 × 16 or 14 × 14, followed by a positional embedding layer. This sequence generated an output sequence of vectors with a hidden size of 768, which were subsequently fed into a standard transformer encoder. A fully connected layer constituted the classification head, employing a sigmoid function to convert the outputs into individual per-label class probabilities.
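
For illustration, the following is a minimal sketch of how such a multilabel ViT-B classifier could be assembled in PyTorch with the timm library; the model name, the absence of pretrained weights, and the number of labels are illustrative assumptions rather than the exact implementation used in this study.

```python
import timm
import torch

NUM_LABELS = 10  # number of findings for a given dataset (illustrative)

# ViT-B/16 backbone: 12 layers, hidden size 768, 16 x 16 patches, 224 x 224 x 3 inputs.
# Any of the three sets of pretrained weights (DINOv2, ImageNet-21K, MIMIC-CXR)
# could be loaded into such a backbone before fine-tuning.
model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=False,
    num_classes=NUM_LABELS,  # fully connected classification head
)

x = torch.randn(32, 3, 224, 224)   # batch of 32 grayscale radiographs replicated to 3 channels
logits = model(x)                  # shape: (32, NUM_LABELS)
probs = torch.sigmoid(logits)      # independent per-label probabilities
```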

Step 1: pretraining

SSL pretraining on non-medical images (DINOv2)

DINOv2 [18], an advancement of the DINO [44] method by Meta AI, focuses on self-supervised learning, striving to extract diverse visual features from a vast, curated dataset. Initially comprising 1.2 billion images drawn from a variety of online sources, the dataset went through a rigorous deduplication process [45, 46], culminating in the refined LVD-142M [18] dataset with 142 million unique images. This curation integrated images from notable datasets like ImageNet, Google Landmarks, and an array of broader public and internal web repositories. Using embeddings from the “Huge” iteration of the ViT network architecture (ViT-H) [8] pretrained on ImageNet [13], a connection was established between curated and uncurated images, paving the way for the LVD-142M dataset. From this foundation, several ViT models, aligned with the DINOv2 training methodology, were developed. The ViT base (ViT-B) [8] iteration of this model served as the weight reference for our study.

The DINOv2 objective synthesizes elements of the DINO [44] and iBOT [47] losses, enhanced by the centering technique of SwAV [48]. The approach incorporates two primary objectives: an image-level and a patch-level objective. The image-level objective applies a cross-entropy loss between features extracted from different crops of the same image using a ViT, comparing a student network with a teacher network built as an exponential moving average of past iterates [49]. In contrast, the patch-level objective selectively masks certain input patches for the student, followed by a cross-entropy loss between the patch features of the student and teacher networks [47]. To combat overfitting and underfitting, the weights associated with these two objectives were decoupled. To ensure uniform feature distribution, Sinkhorn-Knopp [50] normalization and the KoLeo regularizer [51] were employed [48, 52]. While models trained at a 416 × 416 resolution showed optimal performance across various resolutions, they required nearly triple the computational capacity of training at 224 × 224. A balanced approach was therefore adopted: self-supervised training was conducted at 224 × 224, and the resolution was increased only in the concluding iterations, delivering near-optimal results without an exorbitant computational burden [53]. For more detailed information regarding data preparation, training, and optimization steps, please refer to the original paper [18].
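
As a conceptual illustration only, the sketch below shows the image-level part of such a student-teacher objective: the teacher is an exponential moving average (EMA) of the student, and a cross-entropy is applied between teacher targets from one crop and student predictions from another crop of the same image. Temperatures, centering, the patch-level iBOT term, Sinkhorn-Knopp normalization, and the KoLeo regularizer are omitted; all names and values here are assumptions, not the DINOv2 reference implementation.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights as an exponential moving average of past student weights."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def image_level_loss(student, teacher, crop_a, crop_b,
                     t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher targets (crop A) and student predictions (crop B).

    Simplified: no centering, no patch-level masking term, no Sinkhorn-Knopp/KoLeo."""
    with torch.no_grad():  # the teacher receives no gradients
        targets = F.softmax(teacher(crop_a) / t_teacher, dim=-1)
    log_preds = F.log_softmax(student(crop_b) / t_student, dim=-1)
    return -(targets * log_preds).sum(dim=-1).mean()

# Typical usage: teacher = copy.deepcopy(student); after each optimizer step on the
# student, call ema_update(teacher, student).
```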

SL pretraining on non-medical images (ImageNet)

ImageNet [13] is a vast database of diverse, annotated non-medical images. Its ImageNet-21K release comprises over 14 million images of various resolutions across 21,841 categories. Using supervised learning (SL), a ViT-B model (patch size 16 × 16, input size 224 × 224 × 3) was trained end to end on the complete ImageNet-21K dataset to predict among the 21,841 available categories.

SL pretraining on chest radiographs (MIMIC-CXR)

MIMIC-CXR [24] stands as the largest public chest radiograph dataset to date. Adopting a training approach similar to that of ImageNet [13], a ViT-B model was trained on MIMIC-CXR for classifying specific imaging findings relevant to our fine-tuning datasets. Unlike the foundational models established using DINOv2 [18] and ImageNet, this strategy directly targets the specific task at hand. Despite the smaller dataset size compared to the prior two methods, the task-specific nature and substantial scale of MIMIC-CXR suggest potential for enhanced performance at first glance.

Step 2: fine-tuning (SL training on chest radiographs)

Choice of the training chest radiographs for fine-tuning

For benchmarking, six chest radiograph datasets were standardized using only frontal images for both fine-tuning and evaluation. The original training and test sets of VinDr-CXR and ChestX-ray14 were retained, while CheXpert, MIMIC-CXR, UKA-CXR, and PadChest were divided into 80% training and 20% test sets based on patients. This ensured radiographs from one patient stayed together, preserving patient-specific integrity. Training sets comprised 128,356, 170,153, 153,537, and 88,480 images for CheXpert, MIMIC-CXR, UKA-CXR, and PadChest, respectively; the corresponding test sets contained 29,320, 43,768, 39,824, and 22,045 images. Consistent sets were used across all steps for comparable evaluations [25,26,27].
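
A minimal sketch of such a patient-level 80/20 split, assuming a metadata table with one row per frontal radiograph and a patient_id column (both names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    """80/20 split in which all radiographs of a given patient end up on the same side."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# train_df, test_df = split_by_patient(metadata)  # metadata is the assumed per-image table
```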

Label unification

In line with previous studies [25, 26, 28], a binary multilabel classification approach was employed, permitting each image to receive a positive or negative diagnosis for each disease. Optimization was centered on the average performance across all labels, without delving into detailed comparisons for individual diseases. For datasets with certainty levels (CheXpert and MIMIC-CXR), labels were converted to binary: classifications marked as “certain negative” and “uncertain” were categorized as negative, while “certain positive” was deemed positive. The final breakdown of the labels employed for each dataset’s multilabel diagnosis in this study is provided in Table 3. Labels with minimal representation were excluded from the final label selection; for example, “lung cyst” and “edema” in the VinDr-CXR dataset had only 6 and 1 positive instances, respectively (Table 2), and were therefore not used for that dataset (Table 3).
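
A sketch of this label unification, assuming the CheXpert/MIMIC-CXR encoding in which 1.0 denotes “certain positive,” 0.0 “certain negative,” -1.0 “uncertain,” and blank a missing mention; how blanks were handled in the study is not stated, so mapping them to negative is an additional assumption here:

```python
import pandas as pd

def binarize_labels(df: pd.DataFrame, label_cols: list[str]) -> pd.DataFrame:
    """Map certainty-coded labels to binary multilabel targets.

    'Certain positive' (1.0) -> 1; 'certain negative' (0.0) and 'uncertain' (-1.0) -> 0.
    Blank (missing) mentions are also mapped to 0 (assumption)."""
    out = df.copy()
    for col in label_cols:
        out[col] = (out[col] == 1.0).astype(int)
    return out
```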

Table 3 Breakdown of labels used for multilabel diagnosis across datasets in this study

Overall, our analysis encompassed 30 labels spanning all datasets. The specific number of these labels within the VinDr-CXR, ChestX-ray14, CheXpert, MIMIC-CXR, UKA-CXR, and PadChest datasets was 11, 14, 10, 10, 9, and 17, respectively. A detailed breakdown of these labels per dataset can be found in Table 3.

Standardized image preprocessing

To ensure equitable comparisons across the various SL fine-tuning experiments, we applied the same image preprocessing pipeline to all chest radiograph datasets. This preprocessing sequence began with resizing all images to a consistent dimension of 224 × 224 pixels. Subsequently, min–max feature scaling, as suggested by Johnson et al. [24], was employed. Finally, to enhance image contrast and thereby aid more accurate disease identification, we applied histogram equalization to the processed images [25,26,27].
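
A minimal sketch of this pipeline using NumPy and scikit-image; the interpolation settings, the particular equalization routine, and the channel replication step are assumptions:

```python
import numpy as np
from skimage.transform import resize
from skimage import exposure

def preprocess(radiograph: np.ndarray) -> np.ndarray:
    """Resize to 224 x 224, min-max scale to [0, 1], then histogram-equalize."""
    img = resize(radiograph.astype(np.float32), (224, 224), preserve_range=True)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # min-max feature scaling
    img = exposure.equalize_hist(img)                          # contrast enhancement
    return np.stack([img, img, img], axis=0)                   # replicate to 3 channels for the ViT
```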

SL training configuration

All ViT models were optimized using the AdamW [54] optimizer with learning rates set at 1 × 10⁻⁵. The network comprised approximately 86 million trainable parameters. Data augmentation strategies included random rotation within the range of [0, 8] degrees and random flipping [25]. Each network was trained end to end, i.e., optimizing all the parameters, in a supervised learning manner employing each of the three sets of pretrained weights as initial weights.
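
The following sketch illustrates such a fine-tuning setup in PyTorch; the flip direction and any augmentation details beyond the stated rotation range are assumptions:

```python
import timm
import torch
from torchvision import transforms

model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10)  # as sketched above

# Data augmentation: random rotation within [0, 8] degrees plus random flipping
# (horizontal flipping is assumed here).
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=(0, 8)),
    transforms.RandomHorizontalFlip(p=0.5),
])

# AdamW over all ~86 million trainable parameters (end-to-end fine-tuning), lr = 1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```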

It is noteworthy that class imbalance is a pervasive issue in numerous medical image datasets, often resulting in biased model training that disproportionately favors the majority class [55]. This is evidenced in our study by Table 2, which presents the distribution of positive labels for each dataset, revealing distinct variations in distributions. To address this concern, binary weighted cross-entropy [56], a modification of the standard binary cross-entropy, was utilized as our loss function. Weights for individual labels were determined based on the inverse frequency of each label within the training data for the respective dataset [3, 25,26,27].
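
One common way to realize such inverse-frequency weighting is through the pos_weight argument of PyTorch's binary cross-entropy with logits; the exact weighting formula used in the study may differ, so the negative-to-positive ratio below is an assumption:

```python
import torch

def inverse_frequency_pos_weights(train_labels: torch.Tensor) -> torch.Tensor:
    """Per-label positive-class weights from a binary label matrix of shape
    (n_images, n_labels): number of negatives divided by number of positives."""
    positives = train_labels.sum(dim=0).clamp(min=1)
    negatives = train_labels.shape[0] - positives
    return negatives / positives

# The resulting criterion penalizes errors on rare positive labels more strongly:
# pos_weight = inverse_frequency_pos_weights(train_labels)   # train_labels: assumed tensor
# criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# loss = criterion(logits, targets.float())                  # per-batch logits and targets
```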

Step 3: evaluation and statistical analysis

Test sets, held out from the training sets of each dataset, remained consistent across all experiments for benchmarking. The primary evaluation metric for our study was the area under the receiver operating characteristic curve (ROC-AUC), supported by accuracy, specificity, and sensitivity, calculated with a threshold determined according to Youden’s criterion [57]. We employed bootstrapping [58] with replacement on each test set, with 1,000 redraws for each ROC-AUC value, to determine the statistical spread in terms of mean ± standard deviation and to calculate p-values. Multiplicity-adjusted p-values were determined based on the false discovery rate to account for multiple comparisons, and the family-wise alpha threshold was set at 0.050.
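
A sketch of this evaluation for a single label, assuming NumPy arrays of binary ground-truth values and predicted probabilities (the p-value computation and the false-discovery-rate adjustment are omitted):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def bootstrap_auc(y_true: np.ndarray, y_score: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """Mean +/- standard deviation of the ROC-AUC over bootstrap resamples (with replacement)."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # resample the test set with replacement
        if len(np.unique(y_true[idx])) < 2:       # skip resamples containing a single class
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(aucs)), float(np.std(aucs))

def youden_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Operating point maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return float(thresholds[np.argmax(tpr - fpr)])
```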

Results

Pretraining with SSL versus SL for medical AI models

We compare two settings for the pretraining stage of AI models: in the first setting, pretraining is performed with SSL on non-medical images using DINOv2 [18]; in the second setting, pretraining is done with SL on ImageNet-21K [13]. For both settings, we subsequently fine-tune the AI model on radiographs to classify the presence of a disease. We consistently observe superior classification performance for the first setting. The models pretrained with SSL exhibit significantly superior performance in terms of the average over all ROC-AUC values for individual labels compared with those pretrained with SL, for all datasets (VinDr-CXR 88.92 ± 4.59% [mean ± standard deviation] versus 86.38 ± 6.27%; ChestX-ray14 79.79 ± 6.55% versus 79.10 ± 6.34%; CheXpert 80.02 ± 6.60% versus 79.56 ± 6.51%; MIMIC-CXR 80.52 ± 6.17% versus 79.92 ± 6.35%; UKA-CXR 89.74 ± 3.57% versus 89.45 ± 3.62%; and PadChest 87.62 ± 4.86% versus 87.12 ± 5.05%; p < 0.001 for all dataset pairs). Figures 3 and 4 display the receiver operating characteristic curves for all individual labels, encompassing a total of 30 unique labels, which consist of 22 specific imaging findings and healthy participants, across each dataset for both methodologies. Table 3 provides a detailed breakdown of the classification targets for each dataset, and Table 4 provides a comprehensive comparison of the average ROC-AUC, accuracy, sensitivity, and specificity for each fine-tuning dataset. For an even more detailed comparison, Supplementary Tables S1–S6 provide individual evaluation metrics for each label.

Fig. 3

Evaluation contrasting pretraining using self-supervised learning (SSL) on non-medical images with supervised learning (SL). Models were either pretrained with SSL (DINOv2, shown in blue) or with SL (ImageNet [13], shown in orange) on non-medical images. Subsequently, these models were fine-tuned on chest radiographs in a supervised manner for six datasets: (a) VinDr-CXR [21], (b) ChestX-ray14 [22], (c) CheXpert [23], (d) MIMIC-CXR [24], (e) UKA-CXR [3, 25,26,27,28], and (f) PadChest [29] with fine-tuning training images of n = 15,000, n = 86,524, n = 128,356, n = 170,153, n = 153,537, and n = 88,480, respectively, and test images of n = 3,000, n = 25,596, n = 39,824, n = 43,768, n = 39,824, and n = 22,045, respectively. The box plots present the mean area under the receiver operating characteristic curve (ROC-AUC) values across all labels within each dataset. A consistent pattern emerges, showing SSL-pretrained models outperforming SL-pretrained ones. Crosses denote means; boxes define the interquartile range (from Q1 to Q3), with the central line signifying the median (Q2). Whiskers stretch to 1.5 times the interquartile range above Q3 and below Q1. Points beyond this range are marked as outliers. Statistical differences between the DINOv2 and ImageNet approaches were evaluated through bootstrapping, with corresponding p-values displayed. Note the varying y-axis scales

Fig. 4

Receiver operating characteristic (ROC) curves of individual labels comparing diagnostic models pretrained with self-supervised learning (SSL) on non-medical images against fully supervised learning (SL) on non-medical images. Models pretrained via SSL used DINOv2 (solid lines), while SL utilized ImageNet (dotted lines). These models were subsequently fine-tuned in a supervised manner on chest radiographs from six datasets: VinDr-CXR, ChestX-ray14, CheXpert, MIMIC-CXR, UKA-CXR, and PadChest. The number of training images for SL fine-tuning for each dataset was n = 15,000, n = 86,524, n = 128,356, n = 170,153, n = 153,537, and n = 88,480, and the number of test images was n = 3,000, n = 25,596, n = 39,824, n = 43,768, n = 39,824, and n = 22,045, respectively. Corresponding area under ROC curve values for each label, presented as mean ± standard deviation (95% CI), are provided in the bottom right, contrasting DINOv2 versus ImageNet pretraining strategies

Table 4 Comparative evaluation of pretraining with self-supervision on non-medical images versus full supervision on non-medical images

SSL pretraining on non-medical images versus SL pretraining on radiographs

In the preceding experiment, we investigated pretraining using SSL and SL on non-medical images. An alternative to such pretraining on unrelated tasks is pretraining on medical images, potentially even with SL if labels are available. Here, we compare two settings: (i) pretraining with SSL on non-medical images (as before) versus (ii) pretraining with SL on 210,625 radiographs from the MIMIC-CXR [24] dataset, currently the most comprehensive publicly available dataset of chest radiographs. We pretrained the network on this dataset by aligning the MIMIC-CXR labels with those of each of the other datasets and selecting all overlapping labels. This led to the identification of up to 10 shared imaging findings per dataset.

For both scenarios, we then trained networks for the task at hand, i.e., for classification in VinDr-CXR, ChestX-ray14, CheXpert, UKA-CXR, and PadChest. Table 5 presents the ROC-AUC values for individual labels for each dataset. We find that for large datasets, approach (i) performs better: CheXpert (ROC-AUC 80.02 ± 6.60% [mean ± standard deviation] versus 79.45 ± 6.60%, p < 0.001) and UKA-CXR (ROC-AUC 88.49 ± 2.65% versus 88.32 ± 2.77%, p = 0.001). However, for small datasets, approach (ii) performs better: VinDr-CXR (ROC-AUC 91.58 ± 3.45% versus 94.47 ± 3.30%, p < 0.001); ChestX-ray14 (ROC-AUC 77.99 ± 6.38% versus 78.68 ± 6.77%, p < 0.001); and PadChest (ROC-AUC 87.89 ± 4.30% versus 89.30 ± 4.45%, p < 0.001).

Table 5 Comparison of pretrained weights: self-supervised learning with large non-medical images versus supervised learning with a large, task-specific chest radiograph dataset

Together, these results show that both approaches (i) and (ii) have their merits in different regimes: (ii) can help to steer the network in the right direction when only few data are available for the fine-tuning stage, while (i) prevails when enough training images are available such that weights pretrained on an unrelated task can be fine-tuned effectively.

Discussion

We investigated different pretraining methods for the task of image classification in thoracic radiographs. Since AI performance often depends on the training and testing domain, we gathered over 800,000 chest radiographs spanning six distinct institutions across the USA, Europe, and Asia to evaluate our results over a wide variety of data sources.

Our primary exploration centered on understanding the effectiveness and benefits of SSL on non-medical images for the downstream task of image classification on chest radiographs. We compared three different pretraining strategies: SSL pretraining on a dataset of non-medical images (DINOv2 [18]), supervised pretraining on non-medical images (ImageNet-21K [13]), and supervised pretraining on medical images (MIMIC-CXR [24]). We employed a state-of-the-art vision transformer [8] architecture and found that SSL on non-medical images serves as a highly effective method for initializing network weights, significantly and consistently improving the ROC-AUC of AI models for chest radiograph classification. Notably, our results demonstrate that under specific circumstances, initializing networks with weights obtained via SSL from non-medical images such as the LVD-142M dataset [18] can outperform initialization with weights derived from supervised learning on a task-specific, large-scale chest radiograph dataset. This research opens up new perspectives for the application of AI within the medical image analysis domain and is particularly important for situations where large, modality-specific public datasets for pretraining are not available.

The significantly superior performance of models pretrained with SSL on non-medical images based on the DINOv2 [18] method, compared with those pretrained with supervised learning on the ImageNet-21K [13] dataset, substantiates the claim that weights derived from SSL on non-medical images may generalize better to unrelated tasks than weights derived from SL on non-medical images.

It is important to note that these findings were consistent across a variety of imaging findings and across datasets of different origins covering over 800,000 images.

Interestingly, even when compared with supervised pretraining on the largest dedicated public chest radiograph dataset to date (MIMIC-CXR [24]), pretraining with SSL on non-medical images demonstrated competitive performance. These results hold promising implications, especially when access to large amounts of annotated medical data is a challenge. Hence, leveraging SSL on non-medical images can be an effective strategy to compensate for the scarcity of annotated medical datasets.

Our study, while yielding promising outcomes for the application of SSL with non-medical images to medical image interpretation, is not without constraints, suggesting avenues for prospective research. First, despite our paired comparison design, we fine-tuned all models with radiograph inputs sized 224 × 224. However, prior studies [59, 60] employing convolutional networks have determined resolutions between 256 × 256 and 448 × 448 to be sufficient for diagnostic purposes in chest radiographs. Moreover, our chosen network architecture, the ViT [8], has consistently delivered competitive results in the existing literature [61,62,63] with 224 × 224 inputs. Second, we propose to extend the analysis to other medical imaging modalities, such as magnetic resonance imaging, computed tomography, or gigapixel imaging in histopathology [64], and to further downstream tasks such as segmentation [65]. Our current endeavor serves as a starting point for exploration into leveraging freely available non-medical images via SSL for medical diagnostics. Third, given the multimodal nature of medical imaging [63], leveraging SSL for these different medical imaging types could yield even richer and more diverse representations, potentially enhancing the diagnostic capabilities of AI models. A persistent challenge, however, remains in sourcing vast volumes of medical images, even if they are unlabeled. Collaborative efforts might be the key to addressing data accessibility challenges.

Our findings highlight the potential of SSL on non-medical images for network initialization in the task of chest radiograph interpretation. The promising results of this approach could inspire further exploration of SSL strategies in the realm of medical imaging, particularly when access to large, annotated medical datasets is limited.