Introduction

Chest radiography, or chest X-ray (CXR), is one of the most frequently used medical imaging methods for timely and accurate diagnosis of various chest and pulmonary diseases. CXRs are easy to acquire, cost-effective, and contain a large amount of information about the region under study, which makes them useful for early screening and diagnosis [1, 2]. CXRs can be used to identify diseases such as tuberculosis [3], pneumonia [4], cancer [5], and cardiomegaly [6]. However, a major limitation of CXRs is that accurate diagnosis requires careful interpretation by experienced radiologists, which can consume considerable time and resources [7]. Furthermore, the interpretation of CXRs varies from radiologist to radiologist, with large discrepancy rates reported [8]. Factors that affect radiologists' diagnostic accuracy include heavy workload, negligence, lack of knowledge, and faulty reasoning, among others [8].

In recent years, owing to the increase in computational power and the availability of large amounts of data, deep learning techniques have emerged as the state of the art in various image processing and computer vision applications [9,10,11]. Consequently, many studies have been carried out to aid radiologists using deep learning approaches, especially convolutional neural networks (CNNs), for classification, localization, and segmentation of medical images [12,13,14]. Recently, however, CNNs have been outperformed by the attention-based architecture known as the Transformer [15, 16].

The Transformer architecture has since become the state of the art for natural language processing (NLP). Transformers are built on a self-attention mechanism that learns dependencies between input and output sequences without relying on recurrence, which allows transformer implementations to be easily parallelized and computationally efficient. Inspired by the success of transformers in NLP, [15] adapted the transformer architecture for computer vision, calling the result the Vision Transformer (ViT). In ViT, the input image is split into a sequence of fixed-size patches that are treated analogously to words in NLP applications, and the model is trained for image classification. When trained from scratch, ViT achieves lower performance than CNNs. However, when ViT is pre-trained on a large dataset and then transferred to a smaller dataset, it outperforms CNN architectures in many computer vision tasks such as object detection [17, 18], semantic segmentation [19], and image classification [15]. Although vision transformers have seen success on natural images, little work has been done in the medical imaging domain, where CNN-based architectures are still the common choice for imaging and diagnostics [20]. Compared to transformers, CNNs have some disadvantages: convolution operations struggle to capture global information [21], and CNNs cannot capture long-range dependencies that may be present in medical datasets [20].

To address these issues with CNNs, some authors have proposed transformer-based architectures for medical imaging and diagnostics [22,23,24,25]. [20] proposed a multi-modal medical image classification method that combines a CNN with a transformer to learn both low-level and global features for an effective image fusion and classification strategy. In [26], the authors employed the ViT architecture for COVID-19 classification using computed tomography (CT) scans and outperformed the CNN-based DenseNet [27] architecture. In another study, [28] evaluated several deep learning architectures, including DenseNet, EfficientNet, ResNet, and ViT, for COVID-19 diagnosis using CT images and found ViT to outperform all other architectures. Although these methods have shown ViT to outperform CNN-based architectures, they do not analyze the pretraining and finetuning aspects of transformers. Transformers require a large amount of training data to exploit their capability effectively [15], yet large datasets are of limited availability in the medical domain [29, 30]. When trained on small datasets, ViT suffers from a lack of inductive bias, which results in poor generalizability [15].

It has previously been shown that CNN-based architectures improve when they are pre-trained on large natural image datasets such as ImageNet [31] and finetuned on medical datasets [32]. Therefore, in this study, we explore the transfer learning capability of pre-trained transformers when finetuned on a medical dataset. We apply a pre-trained ViT to the CheXpert dataset [1] and show performance improvements over the pre-trained CNN-based VGG-16 [33] and ResNet [9] architectures. We also analyze the impact of pretraining by comparing the pre-trained model with one trained from scratch and show that the pre-trained model has the advantage. The rest of the paper is structured as follows. The “Literature Review” section discusses related work. The “Transformer Background” section reviews the background of the transformer model. The “Methodology” section presents the proposed methodology. The “Experiments and Results” section discusses the results, and finally, the “Conclusions” section concludes the paper.

Literature Review

Transfer Learning in Medical Imaging

With advances in computing, transfer learning, a core technique of deep learning, has become an integral part of many applications. In radiology, it has been applied by training Inception and ResNet on retinal fundus images [34,35,36,37] and DenseNet and ResNet on chest X-rays [38, 39], with similar approaches used in ophthalmology. Related research in ophthalmology has received FDA approval [40], with appropriate clinical deployment [41]. [42] extracts features from chest X-ray images using several neural network models pre-trained on ImageNet, trains five distinct models, analyzes their performance, and proposes an ensemble model that integrates the outputs of all pre-trained models. Early detection of Alzheimer’s disease is another prominent application [43]. Transfer learning has also been applied to 3D medical data; for example, [44] builds the Med3D network for 3D medical data classification and segmentation using a pre-trained ResNet-152. Other applications include dermatologist-level identification of skin cancer from photographs [45] and assessment of human embryo quality for IVF procedures [46]. [47] demonstrates that deep CNNs such as Inception-V3 trained on real-world radiographs can be used for transfer learning in fracture detection; after training with a small sample set, the results were comparable to the state of the art in automated fracture diagnosis. [48] proposed the Multi-view Convolutional Recurrent Neural Network (MVCRecNet), a deep learning approach that uses shape, size, and cross-slice changes in CT scan images to identify lung cancer nodules; the model is given several viewpoints, allowing it to generalize better by learning robust characteristics, and was evaluated on the LIDC-IDRI and ELCAP datasets. [49] proposed a Bayesian-based Convolutional Neural Network (B-CNN) that exploits model uncertainty and Bayesian confidence to increase tuberculosis (TB) detection and validation accuracy; evaluated on the Montgomery and Shenzhen TB benchmark datasets, the method shows significant gains in TB identification accuracy.

Using transfer learning, [50] added an extra layer of convolutional neural network blocks to integrate pre-trained ResNet and DenseNet models, achieving higher performance than either model alone; the resulting network was able to accurately classify lung diseases. [51] present their findings on the classification of histopathology images of oral cancer using various image classification models such as Inception, ResNet, and MobileNet, concluding that transfer learning models perform well on histopathology. Despite the popularity and significance of transfer learning in medical imaging, relatively little work has examined it closely in this field. Moreover, recent research on transfer learning in the natural image setting has challenged several common beliefs [52,53,54,55,56]. For instance, [53] shows that transfer between similar tasks does not always improve performance, and [55] illustrates that pre-trained features may generalize less well than is commonly thought. In the medical imaging setting, many such open questions remain. As described above, the current standard in medical imaging is to take an existing architecture designed for natural image datasets such as ImageNet, for example ResNet or Inception, together with its pre-trained weights, and then finetune the model on medical imaging data.

However, there are considerable differences between medical image diagnosis and ImageNet classification. A prominent feature of medical imaging tasks is that they begin with a large image of the region of interest in the body and then rely on variations in local textures to identify pathologies. For instance, small red spots indicate microaneurysms and diabetic retinopathy in retinal fundus images [57], and pneumonia can be confirmed on chest X-rays by observing small local white opaque patches [1]. This is the opposite of natural image datasets such as ImageNet, in which each image typically has a clear global subject. To what extent ImageNet feature reuse is helpful for medical images therefore remains an open question.

Transformers for Medical Images

In NLP, self-attention models such as transformers [16] have become increasingly popular. Self-attention has also been tried in CNNs, for example by restricting attention to a local neighborhood around each query pixel instead of applying it globally [58], and in [59] the output of a CNN is further processed by self-attention. Pre-training transformers on large corpora is now widespread [60], and in the medical field the use of transformers on text is well established, for example BioBERT [61], SciBERT [62], and ClinicalBERT [63]. The computational efficiency and scalability of transformers have made it possible to train models of massive scale, on the order of 100B parameters; such models, for example the generative pre-trained transformer GPT-3, are state of the art in various NLP tasks [64].

Owing to the massive success of transformers in NLP, they have also been applied to computer vision tasks. Recent transformers used for image classification include OpenAI’s Image GPT (iGPT) [65], which uses GPT for image generation and is trained on ImageNet but is limited by high computational requirements and low image quality; Google’s Vision Transformer (ViT) [15], which uses the original transformer architecture for image classification by converting images into patches that are fed to the transformer; and Facebook’s Data-efficient image Transformer (DeiT) [66], which uses the same architecture as ViT and employs knowledge distillation, with a CNN acting as the teacher model, for better training. Medical Transformer [23] proposed a novel transfer learning framework using the transformer model; it represents 3D volumetric images as a sequence of 2D image slices. TransUNet [22] proposed a transformer-based U-Net architecture for medical image segmentation, motivated by the limited capability of CNNs to model long-range dependencies and the better representations enabled by the transformer’s self-attention mechanism; both CNN and transformer components are used in TransUNet. TransFuse [25] uses a transformer and a CNN in parallel for medical image segmentation, with a BiFusion module created to fuse features from both branches. Segtran [67] is a transformer-based medical image segmentation system that contextualizes features by utilizing the unlimited receptive fields of transformers; it captures both the big picture and minute details, resulting in excellent segmentation results. In image denoising, [68] uses a transformer-based neural network to investigate long-range dependencies between low-dose computed tomography (LDCT) pixels.

Fig. 1 Image transformer

Transformer Background

Transformers are based on the self-attention mechanism and have become the de facto standard in natural language processing (NLP) as well as the state of the art in image classification and object detection. A key characteristic of transformers, well established in NLP, is their effective transfer learning to downstream tasks.

The transformer model used for image classification is based on the original transformer [16], which consists of an encoder block and a decoder block. For image classification, only the encoder part of the transformer is used (Fig. 1). To feed an image to the encoder, embeddings are generated from image patches, and positional encodings are added to these patch embeddings to preserve the order of the patches; the position-encoded patch embeddings are then passed to the encoder block. The encoder block consists of multi-head self-attention, layer normalization, and feed-forward layers. In the self-attention layer, attention scores between patches are computed using query, key, and value matrices. Multiple self-attention heads are computed to obtain a richer representation; the outputs of these heads are concatenated into one vector, the input vector is added via a skip connection, and layer normalization is applied. The result is fed to the feed-forward layer, whose output is again added to the previous layer's output through a skip connection followed by normalization. These skip connections allow representations at different levels to interact with each other. Multiple encoder blocks can be stacked together in the image transformer. Finally, the output of the transformer encoder, which acts as the image representation, is fed to a classifier.
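For illustration, the following is a minimal PyTorch sketch of one such encoder block (post-norm variant, as described above); the layer sizes and names are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: multi-head self-attention and a
    feed-forward layer, each followed by a skip connection and layer
    normalization, as described above."""
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (batch, num_patches, dim)
        attn_out, _ = self.attn(x, x, x)  # self-attention over patch embeddings
        x = self.norm1(x + attn_out)      # skip connection + normalization
        x = self.norm2(x + self.mlp(x))   # feed-forward + skip + normalization
        return x
```

Several such blocks are stacked, and the resulting encoded representation is passed to the classification head.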

Fig. 2 Vision transformer

Methodology

To evaluate the transfer learning performance of ViT from natural images to medical images, we train a standard ViT model both from random initialization and by transfer learning from the ImageNet [31] dataset. The ViT model we use is closely related to the original Transformer [16] and inspired by [15]. An overview of the model used for medical image classification is shown in Fig. 2. The input images are reshaped into fixed-size 2D patches, which are flattened and combined with position embeddings before being fed to the ViT as a sequence. The transformer encoder consists of repeated blocks, each containing normalization, multi-head attention, and multi-layer perceptron (MLP) layers. The output of the encoder blocks is connected to a classification head, an MLP that maps the encoded feature vector to one of the output classes. We compare the transfer learning performance of ViT with CNN-based architectures, namely VGG-16 [33] and ResNet-50 [9], which are also trained by transfer learning from ImageNet [31]. The performance of ViT and the CNNs is evaluated using accuracy, precision, recall, and F1-score.
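As a rough sketch of this transfer learning setup, a pre-trained ViT can be loaded and its classification head replaced to match the target classes; the `timm` model name and settings below are illustrative assumptions, not necessarily the exact configuration used here.

```python
import timm
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 2  # e.g., normal vs pneumonia; 14 for CheXpert

# Load a ViT pre-trained on ImageNet and attach a fresh classification head.
vit = timm.create_model("vit_base_patch16_224", pretrained=True,
                        num_classes=NUM_CLASSES)

# CNN baselines are set up analogously, e.g., a pre-trained ResNet-50
# whose final fully connected layer is replaced.
resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)
```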

Transformer Encoder

The encoder block in the vision transformer takes radiograph scans sliced into patches of size 16 \(\times\) 16. After adding positional encoding, which preserves the spatial structure of the radiograph, the patches are represented by a patch feature matrix X. The dependencies between patches are modeled by a self-attention mechanism that operates on three embeddings, Query (Q), Key (K), and Value (V), defined as follows:

$$\begin{aligned} Query(Q)=X \times W_q \end{aligned}$$
(1)
$$\begin{aligned} Key(K)=X \times W_k \end{aligned}$$
(2)
$$\begin{aligned} Value(V)=X \times W_v \end{aligned}$$
(3)

\(W_q\), \(W_k\), and \(W_v\) project the patch features onto the embeddings Q, K, and V, respectively. The ViT encoder pipeline works in two main steps: self-attention and attention-based feature weighting. The self-attention mechanism models the dependencies between patches and proceeds as follows. In the first step, a similarity between patch embeddings is computed by taking the dot product of Q and K:

$$\begin{aligned} Q\times K^{T} \end{aligned}$$
(4)

The scores are then scaled down by dividing by the square root of the dimension of Q and K. This yields more stable gradients, since the raw dot products can grow large in magnitude:

$$\begin{aligned} \frac{QK^{T}}{\sqrt{d_k}} \end{aligned}$$
(5)

A softmax layer then converts the similarity scores between Q and K into a probability distribution, making the model more decisive about which patches to attend to:

$$\begin{aligned} Softmax(\frac{QK^{T}}{\sqrt{d_k}}) \end{aligned}$$
(6)

The attention-based feature weighting step weights the value embeddings V by the self-attention scores computed in the previous step:

$$\begin{aligned} Attention(Q,K,V)=Softmax_k(\frac{QK^{T}}{\sqrt{d_k}})V \end{aligned}$$
(7)

The attention computation of Eq. (7) is illustrated in Fig. 3, where MatMul stands for matrix multiplication and concat is an abbreviation for concatenation.
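A compact sketch of Eqs. (1)-(7) in PyTorch, with the projection matrices \(W_q\), \(W_k\), and \(W_v\) realized as linear layers; the dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention, Eqs. (1)-(7)."""
    def __init__(self, dim=768, d_k=64):
        super().__init__()
        self.W_q = nn.Linear(dim, d_k, bias=False)   # Eq. (1)
        self.W_k = nn.Linear(dim, d_k, bias=False)   # Eq. (2)
        self.W_v = nn.Linear(dim, d_k, bias=False)   # Eq. (3)
        self.d_k = d_k

    def forward(self, X):                            # X: (batch, patches, dim)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        scores = Q @ K.transpose(-2, -1)             # Eq. (4): Q K^T
        scores = scores / math.sqrt(self.d_k)        # Eq. (5): scale by sqrt(d_k)
        weights = F.softmax(scores, dim=-1)          # Eq. (6): attention distribution
        return weights @ V                           # Eq. (7): weighted values
```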

Multi-headed Attention

In multi-headed attention, each attention mechanism acts as a head, and each head learns something distinct, giving the encoder model greater representational power. Before applying self-attention, the query, key, and value are divided into N vectors to make this a multi-headed attention calculation, as shown in Fig. 3. The divided vectors then go through the self-attention process independently; each such self-attention computation is referred to as a head. The output vectors of all heads are concatenated into a single vector before passing through the final linear layer.
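The multi-head computation can be sketched as running several heads of the SelfAttention module from the previous sketch in parallel and concatenating their outputs before a final linear layer; this is a simplified illustration, not the exact implementation used here.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Runs N attention heads in parallel and concatenates their outputs (Fig. 3)."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        d_k = dim // num_heads
        # Each head uses the SelfAttention module defined above.
        self.heads = nn.ModuleList(
            [SelfAttention(dim, d_k) for _ in range(num_heads)])
        self.proj = nn.Linear(dim, dim)   # final linear layer after concatenation

    def forward(self, X):
        # Each head attends independently; outputs are concatenated along the
        # feature dimension and projected back to the model dimension.
        return self.proj(torch.cat([head(X) for head in self.heads], dim=-1))
```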

Fig. 3 Transformer encoder block with multi-head attention and scaled dot product attention

Evaluation Metrics

To evaluate the performance of the models, different evaluation metrics are used: accuracy, precision, recall, and F1 score.

Accuracy

Accuracy is the ratio between correct predictions and the total number of predictions; it measures how often a model predicts the correct label.

$$\begin{aligned} Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(8)

Precision

Precision measures how many of the positive predictions made by the model are actually positive.

$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(9)

Recall / Sensitivity

Recall measures how many of the actual positive samples the model correctly predicts as positive.

$$\begin{aligned} Recall / Sensitivity=\frac{TP}{TP+FN} \end{aligned}$$
(10)

F1 Score

The F1 score combines and balances precision and recall. A perfect model has an F1 score of 1 and the worst has 0; a higher F1 score indicates that the model produces few false positives and false negatives.

$$\begin{aligned} F1\ Score=2 \times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$
(11)
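These metrics can be computed directly from the model's predictions, for example with scikit-learn; the variable names below are illustrative, and `y_true` and `y_pred` are assumed to hold the ground-truth and predicted labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth labels, y_pred: predicted labels (array-like)
accuracy  = accuracy_score(y_true, y_pred)                    # Eq. (8)
precision = precision_score(y_true, y_pred, average="macro")  # Eq. (9)
recall    = recall_score(y_true, y_pred, average="macro")     # Eq. (10)
f1        = f1_score(y_true, y_pred, average="macro")         # Eq. (11)
# For the binary pneumonia task, average="binary" (the default) can be used instead.
```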

Experiments and Results

This section describes the datasets used for experimentation and presents the results and discussion.

Datasets

Two datasets are used to evaluate the proposed methodology: CheXpert [1] and the Pediatric Pneumonia dataset [69].

CheXpert

CheXpert [1] is a large public dataset of 224,316 chest radiographs from 65,240 patients for chest radiograph analysis. It was compiled from examinations performed at Stanford Hospital, in both inpatient and outpatient settings, between October 2002 and July 2017, together with the accompanying radiology reports. CheXpert contains chest X-rays of different sizes, which are resized to 224 \(\times\) 224. Each X-ray is labelled for 14 observations (No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, and Support Devices), each as positive, negative, or uncertain. The distribution of CheXpert instances across the 14 observations is shown in Table 1. We replace all uncertain labels with positive labels, which is also reasonable in a real-world scenario: a patient who receives a false positive result is likely to seek a second opinion that corrects the classification, whereas a false negative is more likely to be accepted. Samples from the CheXpert dataset are shown in Fig. 4.
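As a sketch of this label policy, the uncertain entries (coded as -1 in the CheXpert CSVs) can be mapped to positive with pandas; the file path and the handling of blank entries as negatives below are assumptions of this illustration.

```python
import pandas as pd

LABELS = ["No Finding", "Enlarged Cardiomediastinum", "Cardiomegaly",
          "Lung Lesion", "Lung Opacity", "Edema", "Consolidation",
          "Pneumonia", "Atelectasis", "Pneumothorax", "Pleural Effusion",
          "Pleural Other", "Fracture", "Support Devices"]

df = pd.read_csv("CheXpert-v1.0-small/train.csv")  # hypothetical path to the CheXpert CSV
df[LABELS] = df[LABELS].fillna(0.0)                # unmentioned observations treated as negative
df[LABELS] = df[LABELS].replace(-1.0, 1.0)         # uncertain (-1) mapped to positive
```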

Fig. 4 CheXpert samples

Table 1 Distribution of CheXpert images across different classes

Pediatric Pneumonia Dataset

Pneumonia is the leading infectious cause of death among infants, outnumbering all other infectious diseases [70]. For the labelled chest X-ray classification dataset [69], anterior-posterior chest X-ray images were selected from retrospective cohorts of pediatric patients aged one to five years at the Guangzhou Women and Children’s Medical Center in Guangzhou. The collection contains 5863 chest X-ray images divided into two classes: normal and pneumonia. Sample images from both classes are shown in Fig. 5.

Fig. 5 Pediatric chest X-ray samples

Table 2 CheXpert and pediatric pneumonia dataset split for training, validation and testing

Network Training

For all networks, we use transfer learning. As shown in Fig. 6, several strategies can be employed for this purpose: in some scenarios the entire model is trained after being initialized with pre-trained weights, while another strategy freezes some or most of the model's layers. In our case, the pre-trained model is loaded, its classification head is removed and replaced with a new head matching the dataset classes, the backbone parameters are frozen, and the model is trained; finally, the model is finetuned. Table 2 shows the training, validation, and test splits for both the CheXpert and Pediatric Pneumonia datasets.
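A minimal sketch of this strategy (freezing the pre-trained backbone, replacing the classification head, then optionally unfreezing for finetuning), using a torchvision ResNet-50 purely as an example; the class count is illustrative.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)        # load ImageNet weights

for param in model.parameters():                # freeze the pre-trained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)   # new head for the dataset classes

# ... train the new head, then optionally unfreeze everything for finetuning:
for param in model.parameters():
    param.requires_grad = True
```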

All networks are optimized with the Adam optimizer, using a learning rate of 0.0001 and a batch size of 32. Early stopping with a patience of 10 epochs is used as the stopping condition. The networks are trained on a GPU-based desktop machine with 128 GB RAM, an Nvidia TitanX Pascal GPU (12 GB VRAM), and a ten-core Intel Xeon processor.
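The corresponding training configuration can be sketched as follows; this is a simplified loop in which `model`, `train_loader`, `val_loader`, `max_epochs`, and the `evaluate` helper are assumed to be defined elsewhere, and the early-stopping logic is an illustrative implementation of the 10-epoch patience criterion.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate 0.0001
criterion = torch.nn.CrossEntropyLoss()
best_val_loss, patience, wait = float("inf"), 10, 0

for epoch in range(max_epochs):
    model.train()
    for images, targets in train_loader:        # batch size 32
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)      # assumed helper returning validation loss
    if val_loss < best_val_loss:
        best_val_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                    # stop after 10 epochs without improvement
            break
```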

Fig. 6 Transfer learning strategies

Experiments on Pediatric Pneumonia Dataset

Pediatric pneumonia classification is a binary task in which X-ray images are classified as normal or pneumonia. Experiments on the pediatric pneumonia dataset [69] reveal that transfer learning with a pre-trained transformer performs better than state-of-the-art CNN-based vision models. A performance comparison of the vision transformer and the CNN-based deep learning models is shown in Table 3.

Table 3 Performance comparison on pediatric pneumonia dataset

ResNet-50

The experimental results of ResNet-50 on pediatric pneumonia are obtained using a pre-trained ResNet-50 base model. The ResNet model uses residual (skip) connections for representation learning. An additional classifier is added on top of the base model to classify chest X-ray images into the normal and pneumonia classes. The training and validation accuracy curves of the ResNet model are shown in Fig. 8; accuracy and loss are computed over 30 epochs of training.

After training and validating the ResNet-50 model, we computed its receiver operating characteristic (ROC) curve, which shows performance above 0.5 with an area under the curve (AUC) of 0.72, as shown in Fig. 9.

For a better understanding of the results, the confusion matrix for ResNet-50 is also computed; it shows the number of TP, FP, TN, and FN when classifying X-ray images into the normal and pneumonia classes, as shown in Fig. 10.
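The ROC curve, AUC, and confusion matrix reported in this and the following subsections can be obtained from the test-set predictions, for example with scikit-learn; the variable names are illustrative, with `y_true` holding the true labels and `y_score` a NumPy array of predicted pneumonia probabilities.

```python
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

# y_true: true labels (0 = normal, 1 = pneumonia); y_score: predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve

y_pred = (y_score >= 0.5).astype(int)                # threshold probabilities at 0.5
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```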

Inception-V3

The Inception-V3 experimental results on pediatric pneumonia are also based on a model pre-trained on the ImageNet dataset. We use Inception-V3 as the base model with an extra classifier implemented on top of it to classify chest X-ray images into the normal and pneumonia classes. Figure 11 shows the training and validation accuracy curves for the Inception-V3 model; accuracy and loss are computed over 30 training epochs.

We also computed the receiver operating characteristic (ROC) curve for the Inception-V3 model after training and validation; it demonstrates performance above 0.5 with an area under the ROC curve (AUC) of 0.78, as shown in Fig. 12.

For a better understanding of the results, the confusion matrix of Inception-V3 is produced; it indicates the number of TP, FP, TN, and FN when classifying X-ray images into the normal and pneumonia classes, as shown in Fig. 13.

VGG-16

The results of the VGG-16 experiment on pediatric pneumonia are also based on a model pre-trained on the ImageNet dataset. To classify chest X-ray images into the normal and pneumonia classes, we use VGG-16 as the base model and add an additional classifier on top of it as the transfer learning technique. The training and validation accuracy curves for the VGG-16 model are shown in Fig. 14; accuracy and loss are computed over 30 training epochs. After training and validation, we also computed the receiver operating characteristic (ROC) curve for the VGG-16 model, which shows performance above 0.5 with an area under the ROC curve (AUC) of 0.80, as shown in Fig. 15. The confusion matrix of VGG-16 is produced and shows the number of TP, FP, TN, and FN when classifying X-ray images into the normal and pneumonia classes, as shown in Fig. 16.

Vision Transformer

The results of the vision transformer experiment on the pediatric pneumonia dataset are also based on an ImageNet pre-trained model. We use the vision transformer as the base model and add a new classifier head on top of it as the transfer learning strategy to classify chest X-ray images into the normal and pneumonia classes. The training and validation curves for the vision transformer model are shown in Fig. 17; accuracy and loss are computed over 30 training epochs. After training and validation, we also computed the receiver operating characteristic (ROC) curve for the vision transformer model, which shows performance above 0.5 with an area under the ROC curve (AUC) of 0.87, as shown in Fig. 18. The confusion matrix of the vision transformer is produced and shows the number of TP, FP, TN, and FN when classifying X-ray images into the normal and pneumonia classes, as shown in Fig. 19.

Pre-trained Vision Transformer vs Training from Scratch

The results of training the vision transformer from pre-trained weights versus from scratch on the pediatric pneumonia dataset are shown in Fig. 7. In both cases, the vision transformer serves as the base model with a new classifier head added on top as the transfer learning strategy to classify chest X-ray images into the normal and pneumonia classes. Figure 7 shows the training and validation accuracy curves for the pre-trained vision transformer (ViT-PTM) and the vision transformer trained from scratch (ViT-NPTM). Accuracy is computed over 30 training epochs, with training stopped after 10 epochs without improvement. The results show that the pre-trained vision transformer performs better than the model trained from scratch.
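The two configurations compared in Fig. 7 differ only in their weight initialization; a minimal sketch of this difference, with an illustrative `timm` model name and class count, is:

```python
import timm

# ViT-PTM: initialized from ImageNet pre-trained weights
vit_ptm = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)

# ViT-NPTM: same architecture, trained from random initialization
vit_nptm = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=2)
```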

Fig. 7 Pre-trained vision transformer vs training from scratch

Experiments on CheXpert Dataset

CheXpert is a multi-label classification task in which X-ray images are classified into 14 observations. Experiments on the CheXpert dataset reveal that transfer learning with a pre-trained transformer performs better than state-of-the-art CNN-based vision models. A performance comparison of the vision transformer and the CNN-based deep learning models is shown in Table 4.

ResNet-50

The experimental results of ResNet-50 on CheXpert are obtained using a pre-trained ResNet-50 base model. The ResNet model uses residual (skip) connections for representation learning. An additional classifier is added on top of the base model to classify chest X-ray images into the 14 classes. The classification report for the ResNet model is shown in Table 5; accuracy and loss are computed over 30 epochs of training.

VGG-16

The results of VGG-16 on CheXpert are also based on a model pre-trained on the ImageNet dataset. To classify chest X-ray images into the 14 classes, we use VGG-16 as the base model and add an additional classifier on top of it as the transfer learning technique. The classification report for the VGG-16 model is shown in Table 6; accuracy and loss are computed over 30 training epochs.

Vision Transformer

The results of the vision transformer experiment on the CheXpert dataset are also based on an ImageNet pre-trained model. We use the vision transformer as the base model and add a new classifier head on top of it as the transfer learning strategy to classify chest X-ray images into the 14 classes. The classification report for the vision transformer model is shown in Table 7; accuracy and loss are computed over 30 training epochs.

Table 4 Performance comparison on CheXpert dataset

Conclusions

This paper evaluates the transfer learning of transformers for medical imaging. For this purpose, a transformer-based approach is used to classify chest X-ray images, with CheXpert and the Pediatric Pneumonia dataset used to assess performance. With transfer learning, the proposed vision transformer outperforms existing CNN-based models in classifying medical images. Our method is based on the original transformer architecture together with transfer learning techniques, using the transformer's encoder block for the image classification model. In the future, newer models, as well as combinations of CNN and transformer architectures, may be evaluated for their efficacy in medical imaging.