Introduction

Effective classification of medical images plays an essential role in aiding clinical care and treatment. For example, chest X-ray analysis is the best approach to diagnosing pneumonia [1], which causes about 50,000 deaths per year in the US [2], but classifying pneumonia from chest X-rays requires professional radiologists, who are a rare and expensive resource in some regions.

Traditional machine learning methods, such as support vector machines (SVMs), have long been used in medical image classification. However, these methods have the following disadvantages: their performance is far from the practical standard, and their development has been quite slow in recent years. In addition, feature extraction and selection are time-consuming and vary across different objects [3]. Deep neural networks (DNNs), especially convolutional neural networks (CNNs), are widely used in challenging image classification tasks and have achieved significant performance since 2012 [4]. Some research on medical image classification by CNNs has achieved performance rivaling human experts. For example, CheXNet, a CNN with 121 layers trained on a dataset of more than 100,000 frontal-view chest X-rays (ChestX-ray 14), achieved better performance than the average of four radiologists. Moreover, Kermany et al. [3] proposed a transfer learning system to classify 108,309 optical coherence tomography (OCT) images, and its weighted average error equals the average performance of six human experts.

Medical images are hard to collect, as the collection and labeling of medical data are confronted with both data privacy concerns and the need for time-consuming expert annotation. There are two general directions for addressing this. One is to collect more data, for example through crowdsourcing [5] or by mining existing clinical reports [6]. The other is to study how to increase performance on a small dataset, which is very important because the knowledge gained from such research can transfer to research on big datasets. In addition, the largest published chest X-ray image dataset (ChestX-ray 14) is still far smaller than the largest general image dataset, ImageNet, which had reached 14,197,122 instances by 2010 [7, 8].

CNN-based methods have various strategies for increasing the performance of image classification on small datasets. One method is data augmentation [9,10,11,12]. Wang and Perez [13] studied the effectiveness of data augmentation in image classification and found that traditional transform-based data augmentation performs better than generative adversarial network (GAN) and other neural network-based methods. Another method is transfer learning [3, 12, 14, 15]. Kermany et al. [3] achieved 92% accuracy on a small pneumonia X-ray image dataset by transfer learning. The third method is the capsule network. Sabour et al. [16] introduced a new neural network structure, the capsule network, which achieves state-of-the-art performance on the Modified National Institute of Standards and Technology (MNIST) database [17] as well as the best performance on other small datasets. Afshar et al. [18] used a capsule network to detect brain tumors and obtained 86.56% accuracy.

However, some gaps need to be noted. A limitation of the work of Kermany et al. is that they use the InceptionV3 model and stop retraining the convolutional layers of InceptionV3 because of overfitting. Therefore, other models and the effects of retraining the convolutional layer are evaluated in this research. Moreover, Afshar et al. [18] did not compare the performance of the capsule network with other methods. Therefore, the contributions of this report include:

  • A performance comparison of three different classification methods: an SVM classifier with oriented FAST and rotated binary robust independent elementary features (ORB), transfer learning with VGG16 and InceptionV3, and a capsule network trained from scratch.

  • An analysis of the effects of data augmentation, network complexity, fine-tuning of convolutional layers, and other mechanisms for preventing overfitting on the classification of a small chest X-ray dataset by CNN transfer learning.

This article conducts four groups of experiments. The SVM with ORB runs on a standard machine. The convolutional neural network (CNN) related experiments are all run on a virtual machine with an Nvidia Tesla K80 graphics card in Google Cloud [19].

The remainder of the article is organized as follows: the “Literature review” section reviews the related literature on medical image classification. The “Experimental design” section describes the design of the experiments. The “Experimental results” section presents the results of the experiments, and the “Discussion” section discusses them. Finally, conclusions are drawn and future work is described, followed by references.

Literature review

Medical image classification is a subfield of image classification, and many image classification techniques can also be applied to it, such as image enhancement methods that strengthen the discriminative features for classification [20]. However, as a CNN is an end-to-end solution for image classification, it learns the features by itself. Therefore, the literature on how to select and enhance features in medical images is not reviewed. The review mainly focuses on the application of traditional methods, CNN-based transfer learning, and the capsule network in medical image related papers, to investigate which factors in those models are essential to the final result and which gaps they have not covered.

ORB and SVM application on medical image classification

Paredes et al. [21] used small patches of medical images as local features and k-nearest neighbor (k-NN) to classify the whole medical image, achieving state-of-the-art accuracy. Parveen and Sathik [22] studied the detection of pneumonia from X-rays. The authors extracted features by discrete wavelet transform (DWT), wavelet frame transform (WFT), and wavelet packet transform (WPT), and used fuzzy C-means to detect pneumonia. Caicedo et al. [23] used scale-invariant feature transform (SIFT) as a local feature descriptor and support vector machine (SVM) classifiers to classify medical images, obtaining state-of-the-art precision of 67%. However, SIFT is a patented algorithm. Thus, Rublee et al. [24] proposed a free, faster local feature descriptor, oriented FAST and rotated binary robust independent elementary features (ORB), which performs comparably to SIFT and even better than SIFT under some conditions. SVM is also a high-performance classification algorithm that is widely used in different medical image classification tasks by other researchers and achieves excellent performance [25, 26]. Therefore, this report uses ORB and SVM to represent the traditional methods.

CNN on medical image classification

As different CNN-based deep neural networks were developed and achieved significant results on the ImageNet Challenge, the most significant image classification and segmentation challenge in the image analysis field [27], CNN-based deep neural systems have become widely used in medical classification tasks. A CNN is an excellent feature extractor, so using it to classify medical images can avoid complicated and expensive feature engineering. Qing et al. [28] presented a customized CNN with shallow ConvLayers to classify image patches of lung disease. The authors also found that the system can be generalized to other medical image datasets. Moreover, other research found that CNN-based systems can be trained on big chest X-ray (CXR) film datasets and reach state-of-the-art results with high accuracy and sensitivity on those datasets, such as the Stanford Normal Radiology Diagnostic Dataset, containing more than 400,000 CXRs, and a new CXR database (ChestX-ray8), which consists of 108,948 frontal-view CXRs [29]. However, limited data makes it hard to train an adequate model. Therefore, CNN transfer learning is widely used in medical image classification tasks. Kermany et al. [3] used InceptionV3 with ImageNet-trained weights and transfer learning on a medical image dataset containing 108,312 optical coherence tomography (OCT) images. They obtained an average accuracy of 96.6%, with a sensitivity of 97.8% and a specificity of 97.4%. The authors also compared the results with six human experts. Most of the experts had high sensitivity but low specificity, while the CNN-based system had high values for both. Moreover, on the weighted average error measure, the CNN-based system outperformed two of the human experts. The authors also verified their system on a small pneumonia dataset of about five thousand images and achieved an average accuracy of 92.8%, with a sensitivity of 93.2% and a specificity of 90.1%. Such a system may ultimately help accelerate the diagnosis and referral of patients and therefore enable earlier treatment, resulting in an increased cure rate. Moreover, Vianna [30] studied how to use transfer learning to build an X-ray image classification system, the critical component of a computer-aided diagnosis system. The author found that a fine-tuned transfer learning system with data augmentation effectively alleviates overfitting and yields better results than two other models: training from scratch, and a transfer learning model with only a retrained last classification layer.
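For reference, the accuracy, sensitivity, and specificity figures quoted above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of that arithmetic; the function name and the example counts are illustrative assumptions and are not taken from any of the cited studies.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn count diseased cases predicted correctly/incorrectly;
    tn/fp count healthy cases predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall on the diseased class
        "specificity": tn / (tn + fp),  # recall on the healthy class
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
example = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(example)
```

A system can score high on accuracy while being weak on one of the other two measures when the classes are imbalanced, which is why the reviewed studies report all three.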

Capsule neural network on medical image classification

As mentioned in the previous section, CapsNet was introduced in 2017 [16], so research on it is not as extensive as research on CNNs. However, there is still some research applying it to different datasets and fields because of its key property, equivariance: the spatial relationship of objects in an image is preserved, while the result is not affected by the object’s orientation and size. Afshar et al. [18] applied CapsNet to classifying brain tumors on magnetic resonance imaging (MRI) images and obtained 86.56% prediction accuracy with a modified CapsNet that reduces the number of feature maps from the original 256 to 64.

Moreover, Tomas and Robertas [31] presented a CapsNet-based solution to classify four types of breast tissue biopsies from breast cancer histology images. They achieved 87% accuracy with the same high sensitivity. Jimenez-Sanchez et al. [5] evaluated CapsNet on medical image challenges. The authors selected a CNN with three ConvLayers as the baseline and compared CapsNet’s performance with LeNet and the baseline on four datasets, MNIST, Fashion-MNIST, mitosis detection (TUPAC16), and diabetic retinopathy detection (DIARETDB1), under three conditions: a partial subset of the dataset, an imbalanced subset of the dataset, and data augmentation. The final results show that CapsNet performed better than the other two networks on small, imbalanced datasets. Beşer et al. [32] implemented a sign language recognition system with CapsNet and achieved 94.2% validation accuracy. Moreover, some researchers studied the internal mechanics by varying network structures under different conditions. Xi et al. [33] studied the impact of different network structures on a complex dataset, CIFAR10. The authors evaluated the following options:

  1. Increase the number of primary capsule layers.

  2. Increase the number of capsules in the primary capsule layer.

  3. Ensemble multiple models and average the results.

  4. Adjust the scaling factor of the reconstruction loss.

  5. Add more ConvLayers.

  6. Evaluate other activation functions.

Finally, the authors found that adding more ConvLayers and ensembling more models had the greatest effect on improving the final accuracy. They achieved their highest result with a 7-model ensemble of CapsNets with more ConvLayers than Sabour’s original version. Furthermore, the CapsNet that Tomas and Robertas used to classify breast cancer increased the number of ConvLayers to five. On the other hand, Afshar et al. [18] also evaluated different CapsNet options. They tuned the input size, number of feature maps, number of ConvLayers, number of capsules in the primary CapsLayer, number of dimensions in the primary capsule, and number of neurons in the reconstruction layers. The authors obtained the best results with a CapsNet with a \(64\times 64\) input image (the original is \(28\times 28\)) and fewer feature maps, reduced to 64 from the original 256. Also, the authors found that increasing the number of routing iterations beyond three does not improve the performance on the four datasets: MNIST, Fashion-MNIST, the Street View House Numbers (SVHN) dataset, and the Canadian Institute for Advanced Research 10 (CIFAR10) dataset.

From the previous reviews, it can be seen that the traditional method (SVM with ORB features), CNN-based transfer learning, and the capsule network can all be used on medical image datasets. Judging only by the accuracy values on different datasets, CNN-based transfer learning appears to perform better than the other two methods. However, they have not been compared on the same dataset. Therefore, this paper compares their performance on the same dataset, the pneumonia dataset.

Moreover, there are many options when fine-tuning the parameters of those methods. The traditional method has many features and classification algorithms that could be evaluated, and they cannot all be covered in this paper because of limited time. As the baseline, the traditional method uses ORB as the feature and a linear SVM as the classifier. As data augmentation is a data preprocessing method that can be applied to all three methods, it is also evaluated on the traditional method. For CNN-based transfer learning, the number of retrained ConvLayers, the complexity of the classification layers, and the dropout rate have significant effects on the final result. Therefore, they are evaluated in this research. Based on the reviewed research, the critical factors in the capsule network, namely the number of feature maps, the number of capsules, and the number of capsule channels, are also evaluated in this report.

Experimental design

Dataset

The dataset comes from the work of Kermany et al. [34]. It contains two kinds of chest X-ray images, NORMAL and PNEUMONIA, which are stored in two folders. In the PNEUMONIA folder, two specific types of pneumonia can be recognized by the file name: BACTERIA and VIRUS. Table 1 describes the composition of the dataset. The training dataset contains 5232 X-ray images, while the testing dataset contains 624 images. In the training dataset, the NORMAL class accounts for only one-fourth of all data. In the testing dataset, PNEUMONIA accounts for 62.5% of all data, which means that a classifier that always predicts PNEUMONIA already reaches 62.5% accuracy, so the test accuracy of a useful model should be higher than 62.5%.
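The 62.5% floor mentioned above is the accuracy of a majority-class predictor. The short sketch below reproduces it; the per-class counts are derived from the stated figures (62.5% of 624 test images is 390), and the function name is an illustrative choice.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# 62.5% of the 624 test images are PNEUMONIA, i.e. 390 images; the rest are NORMAL.
test_counts = {"PNEUMONIA": 390, "NORMAL": 234}
baseline = majority_baseline_accuracy(test_counts)
print(f"majority-class baseline: {baseline:.3f}")
```

Any reported test accuracy is best read against this baseline rather than against 50%.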

Table 1 The composition of chest X-ray dataset

Figure 1 shows examples of chest X-rays from the dataset. The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in the right upper lobe (red rectangle), whereas viral pneumonia (right) manifests with a more diffuse interstitial pattern in both lungs.

Fig. 1 Examples of chest X-rays [3]

Environment setup

Hardware

For ORB and SVM classification, an ordinary high-performance computer is sufficient, for example one with 16 GB of memory, an i7 (2.3 GHz) CPU, and a 256 GB solid-state drive (SSD). However, training a deep neural network should use a GPU to accelerate the process. In this report, a Google Cloud virtual machine instance with four CPU cores, 16 GB of memory, and an NVIDIA Tesla K80 GPU is used. For detailed setup instructions, please refer to the Google guide and other web pages [19, 35].

Software

To test the ORB and SVM classification, a Python program that was initially used to classify plants was ported [36]. It was modified to use the new dataset and run on a laptop. One iteration of the test takes about four hours [15]. Because the CNN-based method is computationally intensive, it needs to run on a GPU VM in Google Cloud. To test the capsule network, a Python capsule network implementation originally aimed at detecting brain tumors was ported to the pneumonia dataset [37]. It also needs to run on the GPU VM.

Data augmentation design

In this paper, three data augmentation settings are evaluated, as shown in Table 2. Aug0 means using the original dataset without augmentation. Aug1 applies simple geometric transforms to the image: random horizontal and vertical flips, random rotation within 0.05\(^\circ\), horizontal shear within 0.05 times the image width, and zoom in by up to 0.05 times. Aug2 is a more complicated transform than Aug1: besides all transforms of Aug1, it also applies a slight horizontal and vertical shift. To avoid a combinatorial explosion of data augmentation models and classification algorithms, this paper only evaluates the effects of the different augmentation algorithms on VGG16, which is the best classification algorithm in this paper. However, to analyze the effects of data augmentation, all three classification algorithms are also evaluated on Aug0 and on the best augmentation model obtained from this test.
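To make the difference between the augmentation settings concrete, the sketch below draws one random set of transform parameters per image. It is only an illustration under stated assumptions: the dictionary layout and function name are hypothetical, the zoom is treated as symmetric around 1.0, and the shift range for Aug2 is not given in the text, so 0.05 is a placeholder. Applying the parameters to pixels is left to an image library.

```python
import random

# Ranges taken from the Aug1 description; the shift range for Aug2 is not
# stated in the text, so the 0.05 used here is a placeholder assumption.
AUG1 = {"rotation": 0.05, "shear": 0.05, "zoom": 0.05, "shift": 0.0}
AUG2 = {**AUG1, "shift": 0.05}


def sample_transform(config, rng):
    """Draw one random set of geometric transform parameters."""
    def uniform(limit):
        return rng.uniform(-limit, limit)

    return {
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
        "rotation": uniform(config["rotation"]),
        "shear": uniform(config["shear"]),
        "zoom": 1.0 + uniform(config["zoom"]),  # assumed symmetric around 1.0
        "shift_x": uniform(config["shift"]),
        "shift_y": uniform(config["shift"]),
    }


rng = random.Random(0)
params = sample_transform(AUG1, rng)
print(params)
```

Aug0 corresponds to skipping this step entirely, and Aug2 differs from Aug1 only in the non-zero shift range.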

Table 2 Augmentation models

ORB and SVM application experiments design

ORB features, vector of locally aggregated descriptors (VLAD) encoding, and SVM classification are chosen as the baseline. Two experiments are conducted: the first classifies Normal and Pneumonia with the original dataset, and the second performs the same classification with the best augmentation model.
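The VLAD step turns a variable number of local descriptors into one fixed-length vector that an SVM can consume. The following is a minimal sketch of the standard encoding, not the ported program itself: it assumes a codebook of centroids has already been learned (for example by k-means), uses Euclidean distance on real-valued vectors for simplicity even though ORB descriptors are binary, and works on toy data.

```python
import math


def vlad_encode(descriptors, centroids):
    """Encode local descriptors as one VLAD vector.

    Each descriptor is assigned to its nearest centroid, the residuals
    (descriptor - centroid) are summed per centroid, and the concatenated
    sums are L2-normalized.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    for desc in descriptors:
        # Index of the nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((d - c) ** 2 for d, c in zip(desc, centroids[k])),
        )
        for i in range(dim):
            sums[nearest][i] += desc[i] - centroids[nearest][i]
    flat = [value for row in sums for value in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy 2-D descriptors and a 2-word codebook, purely illustrative.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vector = vlad_encode(descriptors, centroids)
print(vector)
```

The output length is the number of centroids times the descriptor dimension, regardless of how many keypoints an image yields, which is what allows a linear SVM to be trained on it.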

Transfer learning experiments design

Because the chest X-ray dataset is small and different from ImageNet, whose weights are used in the transfer learning experiments, three groups of experiments are conducted to fine-tune the final model. The first group of experiments aims to evaluate the effect of the classification layer size on the final classification accuracy. Five models are used on two CNNs, VGG16 and InceptionV3, as shown in Table 3. The second column describes the classification model. For example, model3 consists of eight layers after the ConvLayers: a global average pooling (GAP) layer, a fully connected (FC) layer with 512 neurons, a dropout layer with a 50% drop rate, a second FC layer with 256 neurons, a second dropout layer with a 50% drop rate, a third FC layer with 128 neurons, a third dropout layer with a 50% drop rate, and a classification layer with a SoftMax activation function. The last two columns list the number of parameters to be trained in VGG16 and InceptionV3, which indicates the complexity of the model.
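The size of such a classification head can be estimated by hand, since only the FC layers carry trainable parameters. The sketch below counts them for the model3 layout. It rests on assumptions that should be checked against Table 3: two output classes, and GAP outputs of 512 channels for VGG16 and 2048 for InceptionV3 (the channel counts of their final convolutional feature maps). The helper names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def head_params(feature_channels, hidden_units, n_classes):
    """Trainable parameters of a GAP -> FC stack -> softmax head.

    GAP and dropout layers add no trainable parameters.
    """
    total, n_in = 0, feature_channels
    for units in (*hidden_units, n_classes):
        total += dense_params(n_in, units)
        n_in = units
    return total


# model3 head: FC 512 -> FC 256 -> FC 128 -> softmax.
# Assumptions: 2 output classes; 512 and 2048 are the channel counts of the
# last convolutional feature maps of VGG16 and InceptionV3 respectively.
vgg16_head = head_params(512, (512, 256, 128), 2)
inception_head = head_params(2048, (512, 256, 128), 2)
print(vgg16_head, inception_head)
```

Under these assumptions the same head is roughly three times larger on InceptionV3 than on VGG16, because the first FC layer scales with the number of input channels.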

Table 3 Classification layer model configuration

The second group of experiments aims to evaluate how many ConvLayers should be unfrozen and trained. A total of three experiments are conducted (shown in Table 4). The first experiment evaluates the best classification model with one unfrozen ConvLayer. Because the number of trainable parameters in the last ConvLayer is quite large, the second experiment uses a smaller classification model to prevent overfitting. To test the limit on the number of unfrozen ConvLayers, the third experiment unfreezes one more ConvLayer in the better of the previous two models.
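Unfreezing amounts to choosing which convolutional layers receive gradient updates. The sketch below shows the selection logic on plain dictionaries that stand in for framework layer objects; the layer names and the shortened backbone are hypothetical. In Keras, the equivalent step is setting each layer's `trainable` attribute before compiling the model.

```python
def unfreeze_last_conv_layers(layers, n_unfrozen):
    """Mark only the last n_unfrozen convolutional layers as trainable.

    `layers` is an ordered list of dicts with "name", "kind" and "trainable"
    keys, a stand-in for the layer objects of a real framework.
    """
    conv_indices = [i for i, layer in enumerate(layers) if layer["kind"] == "conv"]
    keep = set(conv_indices[-n_unfrozen:]) if n_unfrozen > 0 else set()
    for i, layer in enumerate(layers):
        if layer["kind"] == "conv":
            layer["trainable"] = i in keep
    return layers


# Hypothetical, shortened backbone; names are illustrative only.
backbone = [
    {"name": "conv_a", "kind": "conv", "trainable": True},
    {"name": "pool_a", "kind": "pool", "trainable": False},
    {"name": "conv_b", "kind": "conv", "trainable": True},
    {"name": "conv_c", "kind": "conv", "trainable": True},
]
unfreeze_last_conv_layers(backbone, n_unfrozen=1)
trainable = [layer["name"] for layer in backbone if layer["trainable"]]
print(trainable)
```

Calling the function with `n_unfrozen=2` corresponds to the third experiment, where one more ConvLayer is released for training.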

The third group of experiments fine-tunes other parameters based on the best model from the previous two experiment groups, for example by increasing the drop rate of the dropout layers, reducing the learning rate, or adding a batch normalization layer, which makes learning more stable and quicker.

Table 4 Configuration of fine-tuned Convlayer model
Table 5 Experiments configuration of CapsNet

Capsule neural network design

For CapsNet, the number of feature maps, the size of the PrimaryCaps layer, and the input image size affect the classification performance. Thus, Table 5 shows the experiments aimed at evaluating the effects of those parameters.

Experimental results

In Table 6, the first column is the augmentation algorithm used in the test, the second column is the total number of training images generated by augmentation, and the last column is the average accuracy achieved by VGG16 transfer learning with all default parameters. The results show that Aug1 is a better augmentation model than Aug2; therefore, the following experiments use Aug1 as the default augmentation model.

Table 6 Data augmentation experiments result

ORB and SVM classification

In Table 7, the first column is the augmentation method, and the second column is the average accuracy of the linear SVM classifier with ORB features. It can be seen that augmentation with more images increases the accuracy.

Table 7 ORB And SVM classification experiments results

Transfer learning classification

Table 8 shows that VGG16 is better than InceptionV3 and that Model3 on VGG16 is the best classification model. Thus, in the following experiments, VGG16 with Model3 continues to be fine-tuned.

Table 8 Experimental result of evaluating classification model

From Table 9, it can be seen that classification model2 with the last ConvLayer unfrozen was the best model in all three experiments. Thus, this model continues to be fine-tuned in the following experiments.

Table 9 Experiments result of fine-tuned Convlayer

The experiments in Table 10 are exploratory tests. The most successful models in the previous experiments, Model3 and Model2 with the last ConvLayer unfrozen, are chosen as the baselines. The parameters are then adjusted according to the results and the effects of dropout, learning rate, and more training data. Across all the experiments, Model2 with the last ConvLayer unfrozen and all other parameters at their defaults still has the best results (shown in Tables 9 and 10).

Table 10 Experiments result of fine-tuned other parameters

Capsule neural network

Table 11 shows the experimental process and results for the capsule network. Augmentation is evaluated first, and Aug1 is better than no augmentation. Then the number of feature maps, input size, number of primary capsules, number of capsule channels, and number of training images are varied. The best results come from tests no. 11 and 13. They both have fewer feature maps, primary capsules, and capsule channels. The difference between them is the number of feature maps: one has 24 feature maps, while the other has 32.

Table 11 Experiments result of CapsNet

Verify on OCT dataset

To check whether the findings can be applied to other datasets, some experiments are conducted on the OCT dataset, which was published together with the pneumonia dataset but includes 108,309 OCT images. Table 12 shows that the best result is obtained from test 5.

Table 12 Experiments result on OCT dataset

Discussion

The effects of data augmentation

Table 13 summarizes the experimental results with and without augmentation.

Table 13 The comparison of data augmentation experiments

It can be seen that augmentation improves performance regardless of the model. This is because augmentation geometrically transforms the picture, which helps the machine learning algorithm learn the underlying features without being affected by rotation and scale. However, Table 6 shows that complicated transforms are not always better than simple ones. Overly complicated transforms introduce noise into the features that disturbs the learning process.

Findings on fine-tuning of transfer learning

  1. Effects of the model complexity of the neural network: The left table of Fig. 2 combines Tables 3 and 8, sorted by the number of parameters in ascending order. It can be seen that the number of parameters has a significant impact on accuracy: both too many and too few parameters give poor results. The right graph of Fig. 2 shows that the highest results for VGG16 and InceptionV3 are obtained with model3, whose number of parameters matches the size of the dataset.

  2. Effects of techniques for preventing overfitting: Table 14 shows the exploratory test results of model2 with the last ConvLayer unfrozen. Because the whole training process tends to overfit, no single factor has a stable and significant impact on the final accuracy. When comparing the results of model3 under different conditions (Table 15), it can be seen that increasing the dropout rate and the number of augmented images in each training iteration consistently increases the accuracy. The opposite holds for the model with the last ConvLayer unfrozen. This is understandable because the last ConvLayer has too many parameters, so the training process overfits.

Fig. 2 The evaluation of model complexity: a Combination of Tables 3 and 8, sorted in ascending order by the number of parameters. b Model complexity evaluation graph for VGG16 and InceptionV3

Table 14 Evaluation of dropout, batch normalization and learning rate for model2 with last unfrozen Convlayer
Table 15 Evaluation of dropout, batch normalization and learning rate for model3

Findings on the capsule network

  1. The effects of feature maps: The effect of the number of feature maps is examined by fixing the input size (64), the number of primary capsules (8), and the number of capsule channels (32) and varying the number of feature maps. The results in Fig. 3 show that the models with 24 and 32 feature maps obtained the best results.

  2. The effects of input size: The effect of the input size is examined by fixing the number of feature maps (32), the number of primary capsules (8), and the number of capsule channels (32) and varying the input size. Figure 4 shows that the model with input size 64 obtained the best accuracy.

  3. The effects of primary capsules: The effect of the number of primary capsules is examined by fixing the number of feature maps (32), the input size (64), and the number of capsule channels (32) and varying the number of primary capsules. Figure 5 shows that the model with 4 primary capsules obtained better accuracy than the model with 8.

  4. The effects of capsule channels: The effect of the number of capsule channels is examined by fixing the number of feature maps (32), the input size (64), and the number of primary capsules (4) and varying the number of capsule channels. Figure 6 shows that the model with 16 capsule channels obtained the best accuracy.

  5. The best model: The model combining 32 or 24 feature maps, input size 64, 4 primary capsules, and 16 capsule channels should give the best results. This is confirmed by tests 11 and 13 in Table 11, which are the best of all the tests. It also agrees with the finding for transfer learning: the complexity of a model should match the scale of the dataset.

Fig. 3 The effects of feature maps in CapsNet

Fig. 4 The effects of input size in CapsNet

Fig. 5 The effects of primary capsule in CapsNet

Fig. 6 The effects of capsule channel

Horizontal comparison

To evaluate the performance of the models in this paper, Table 16 compares the best results of different models on the same pneumonia dataset. From Table 16, it can be seen that the neural network-based methods are significantly better than the traditional method, because a neural network is an effective feature learner, whereas the traditional method relies on just one feature, ORB. On version 2 of the dataset, the best model in this paper, VGG16, obtained slightly lower accuracy and recall than the state-of-the-art result but a higher specificity. On the latest dataset, the performance of VGG16 was generally higher. The VGG16 model unfreezes the last ConvLayer so that it can learn the specific features of the dataset, which should help significantly to improve performance. Kermany’s work also retrains the ConvLayers of InceptionV3, but the model overfits too much to reach excellent test performance. The reason our model does not overfit as much may be that the VGG16 model is not as complicated as InceptionV3.

Table 16 Evaluation of dropout, batch normalization and learning rate for model2 with last unfrozen Convlayer

Findings from verification on the OCT dataset

Table 12 shows that the best model comes from test 5 instead of test 1. The new model adds more complicated FC layers, so its overall complexity is better matched to the new dataset. Unfreezing two ConvLayers makes the system too complicated for the new dataset, so it cannot find the local maximum. The best result is slightly lower than the state-of-the-art result of Kermany’s work (96.6%). However, this experimental result also confirms our findings: learning dataset-specific features is most important for improving accuracy, and a proper model complexity helps to find the best result.

Conclusions and future work

Because of the importance of medical image classification and the particular challenge of small medical image datasets, this paper studied how to apply CNN-based classification to a small chest X-ray dataset and evaluated its performance. The experiments lead to the following findings:

  • CNN-based transfer learning is the best of the three methods, and the capsule network is better than the ORB and SVM classifier. Generally speaking, CNN-based methods are better than traditional methods because they can learn and select features automatically and effectively.

  • The best results come from transfer learning of VGG16 with one retrained ConvLayer, which is slightly higher than the state-of-the-art result. With the unfrozen ConvLayer, dataset-specific features can be learned from the new dataset. Therefore, learning specific features is an essential factor in improving accuracy.

  • A balance between a model’s expressive power and overfitting is necessary. A network that is too simple usually cannot learn enough from the data and therefore cannot reach high accuracy. On the other hand, a very complex network is hard to train and tends to overfit quickly, so its accuracy is still low. Only a network model of proper size combined with other effective methods for preventing overfitting, such as a proper dropout rate and proper data augmentation, can give the best results.

However, because of limited time, the following remains for future research:

  • In transfer learning, training a fine-tuned deep neural network with unfrozen ConvLayers tends to overfit. What effective methods can stabilize the training process?

  • Other more powerful CNN models, such as ResNetv2, and ensembles of multiple CNN models have not been evaluated, but they could improve the results.

  • Visualization needs to be added to improve the understanding and explanation of the results of the CNN-based system, because these are essential for the adoption of a CNN-based system in real clinical applications.