1 Introduction

Classification of medical images is an important component of computer-aided detection and diagnosis systems [1]. Automatic localization or identification of anatomy is very useful for initializing organ-specific processing, such as detecting liver tumors [2]. Achieving high accuracy in automated classification of anatomy is challenging because anatomical structures vary in contrast, are deformed by pathologies, and may be occluded. In image classification problems, the descriptiveness and discriminative power of the extracted features are important for achieving good classification results. Feature extraction techniques commonly used in medical imaging include filter-based features [3] and the very popular scale-invariant feature transform (SIFT) [4].

Neural networks (NNs) have been studied for many years to solve complex classification problems, including image classification. Their distinct advantage is that the algorithm can be generalized to solve different kinds of problems using similar designs. The convolutional neural network (CNN) is a successful example of attempts to model the mammalian visual cortex using NNs. We use CNNs for anatomy-specific classification because they have outperformed contemporary methods in natural image classification [5]. CNNs have also made substantial advances in biomedical applications [6]. In addition, recent work has shown that CNNs can significantly improve the performance of state-of-the-art computer-aided detection (CADe) systems [7–9]. In this study we evaluate the comparative performance of three milestones in the development of CNNs for anatomy-specific classification: LeNet [10], AlexNet [5] and GoogLeNet [11].

2 Related Work

2.1 Convolutional Neural Nets

Convolutional neural networks (CNNs) are a special kind of deep model responsible for many exciting recent results in computer vision. Initially proposed in the 1980s by K. Fukushima and later developed by Y. LeCun and colleagues as LeNet [10], CNNs gained acclaim through the success of LeNet on the challenging task of handwritten digit recognition in 1989. It took a few decades for CNNs to produce another leap forward in computer vision, beginning with AlexNet [5] in 2012, which won the overall ImageNet challenge.

In a CNN, the key computation is the convolution of a feature detector with an input signal. Convolution with a bank of filters, like the learned filters in Fig. 1, enriches the representation at the first layer of a CNN: the components go from individual pixels to simple primitives such as horizontal and vertical lines, circles, and patches of color. Unlike conventional single-channel image processing filters, these CNN filters are computed over all of the input channels. Convolutional filters are translation-invariant, so they yield a high response wherever a feature is detected.
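The two properties described above (a filter spanning all input channels, and a high response wherever the feature occurs) can be illustrated with a minimal pure-Python sketch. The helper names, the 6 × 6 three-channel toy image and the 2 × 2 patch detector are all hypothetical choices made for illustration; they are not the filters learned in this study.

```python
def conv2d_multichannel(image, kernel):
    """Valid cross-correlation of a C x H x W image with a C x k x k kernel.

    The products are summed over every channel, so one filter yields
    one single-channel response map.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k = len(kernel[0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            total = 0.0
            for c in range(channels):
                for dy in range(k):
                    for dx in range(k):
                        total += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(total)
        out.append(row)
    return out


def make_image(channels, size, top, left):
    """Hypothetical toy image: a 2 x 2 bright patch at (top, left) in every channel."""
    img = [[[0.0] * size for _ in range(size)] for _ in range(channels)]
    for c in range(channels):
        for dy in range(2):
            for dx in range(2):
                img[c][top + dy][left + dx] = 1.0
    return img


def argmax2d(response):
    """Row and column of the first maximum in a 2-D response map."""
    best = max(max(row) for row in response)
    for y, row in enumerate(response):
        for x, value in enumerate(row):
            if value == best:
                return (y, x)


# Hypothetical 3-channel, 2 x 2 "patch detector" (all weights equal to 1).
patch_detector = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]

response_a = conv2d_multichannel(make_image(3, 6, 1, 1), patch_detector)
response_b = conv2d_multichannel(make_image(3, 6, 3, 2), patch_detector)

# The peak response follows the patch when it moves.
print(argmax2d(response_a), argmax2d(response_b))  # (1, 1) (3, 2)
```

The same filter weights are reused at every position, which is why moving the patch only moves the location of the peak response.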

Fig. 1 Filters learned by convolutional layer

2.1.1 LeNet

LeNet [10] comprises five layers that contain trainable parameters, as shown in Fig. 2. The input is a 28 × 28 pixel image.

Fig. 2 LeNet architecture

Layer 1, conv1, is a convolutional layer with 20 feature maps and a kernel size of 5, which means that each unit in each feature map is connected to a 5 × 5 neighborhood in the input. Conv1 contains 1520 learned parameters. Layer 2, pool1, is a pooling layer that aggregates the convolved features to make the representation invariant to transformations; it has 20 feature maps of size 12 × 12. Layer 3, conv2, is again a convolutional layer; it convolves the pooled feature maps and has 25,050 learned parameters. Layer 4, pool2, aggregates the convolved features from conv2. After the convolutions and pooling, layer 5, ip1, applies an inner product operation followed by a rectified linear unit (ReLU) activation function, which accounts for 400,500 learned parameters. Layer 6, ip2, applies a second inner product operation with a reduced set of 2505 learned parameters. In total, 429,575 parameters are learned. The output of ip2 is passed to a softmax classifier to determine the loss with respect to the actual output.
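The parameter counts quoted above can be reproduced with a short calculation. The sketch below is a consistency check under stated assumptions, not the authors' configuration: the text gives the 28 × 28 input, the 20 feature maps and 5 × 5 kernel of conv1, and the 12 × 12 size of pool1, while the 3 input channels, the 2 × 2 pooling with stride 2, the 50 feature maps of conv2 and the 500 units of ip1 are values inferred here because they reproduce the reported counts. The 5 outputs of ip2 correspond to the five anatomies.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output feature map."""
    return out_channels * in_channels * kernel * kernel + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of an inner product layer."""
    return n_in * n_out + n_out


IN_CHANNELS = 3   # assumption: colour input, inferred from the 1520 count
CONV1_MAPS = 20   # from the text
CONV2_MAPS = 50   # assumption: inferred from the 25,050 count
IP1_UNITS = 500   # assumption: inferred from the 400,500 count
NUM_CLASSES = 5   # five anatomies

size = 28                            # input is 28 x 28
size = conv_out(size, 5)             # conv1 -> 24
size = conv_out(size, 2, stride=2)   # pool1 -> 12 (assumed 2 x 2, stride 2)
pool1_size = size
size = conv_out(size, 5)             # conv2 -> 8
size = conv_out(size, 2, stride=2)   # pool2 -> 4

params = {
    "conv1": conv_params(IN_CHANNELS, CONV1_MAPS, 5),
    "conv2": conv_params(CONV1_MAPS, CONV2_MAPS, 5),
    "ip1": dense_params(CONV2_MAPS * size * size, IP1_UNITS),
    "ip2": dense_params(IP1_UNITS, NUM_CLASSES),
}
total = sum(params.values())

print(params)  # {'conv1': 1520, 'conv2': 25050, 'ip1': 400500, 'ip2': 2505}
print(total)   # 429575
```

Under these assumptions the pooling layers contribute no learned parameters, so the total is the sum over conv1, conv2, ip1 and ip2.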

2.1.2 AlexNet

AlexNet [5], proposed by Alex Krizhevsky et al. and shown in Fig. 3, is a convolutional neural net that revolutionized image classification by beating the state-of-the-art image classification methods in 2012.

Fig. 3 AlexNet architecture comprising 8 layers

AlexNet comprises 11 layers when its three pooling layers are counted together with the eight layers that have learned weights (five convolutional and three fully connected). The first layer, conv1, is combined with relu1 and norm1 and has a kernel size of 11 and a stride of 4, which means that the convolution is performed at every fourth pixel. Conv1 is followed by the pooling layer pool1, as explained above for LeNet; the pooling kernel size is set to 3 with a stride of 2. Pool1 is followed by the convolution conv2 with kernel size 5 and stride 2. Relu2 is applied to the conv2 output, followed by norm2. The conv2 features are then pooled in pool2 by max pooling with kernel size 3 and stride 2. The pooled feature maps are convolved again in conv3, with a kernel size of 3, a stride of 1 and a padding of 1. These features are convolved in conv4 with the same settings as conv3, followed by relu4, and then in conv5 with the same settings as conv4, followed by relu5. The features from conv5 are pooled in pool5, which is followed by the fully connected layers fc6, fc7 and fc8. Two operations are applied in fc6: relu6 and drop6. The dropout operation helps prevent deep nets from overfitting. Fc6 is followed by fc7, which is accompanied by relu7 and drop7. Finally, the features are fully connected through fc8 to the softmax classifier, which determines the loss with respect to the actual output.
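How kernel size, stride and padding determine the size of a feature map can be sketched with the usual output-size formula. The example uses the conv1 and pool1 settings given above; the input width of 227 pixels and the width of 13 used for the padding check are assumptions chosen for illustration and are not stated in the text.

```python
def output_size(input_size, kernel, stride=1, pad=0):
    """Number of window positions along one axis."""
    return (input_size + 2 * pad - kernel) // stride + 1


def window_starts(input_size, kernel, stride=1):
    """Left edge of every kernel window along one axis (no padding)."""
    return list(range(0, input_size - kernel + 1, stride))


INPUT_SIZE = 227  # assumption for illustration, not taken from the text

# conv1: kernel 11, stride 4 -> the window moves four pixels at a time.
starts = window_starts(INPUT_SIZE, kernel=11, stride=4)
conv1_size = output_size(INPUT_SIZE, kernel=11, stride=4)

# pool1: kernel 3, stride 2, applied to the conv1 output.
pool1_size = output_size(conv1_size, kernel=3, stride=2)

# Kernel 3, stride 1, padding 1 (the conv3 to conv5 setting) keeps the size.
ARBITRARY_SIZE = 13  # hypothetical width, used only to show the property
same_size = output_size(ARBITRARY_SIZE, kernel=3, stride=1, pad=1)

print(starts[:4])               # [0, 4, 8, 12]
print(conv1_size, pool1_size)   # 55 27
print(same_size)                # 13
```

A stride of 4 therefore reduces the feature map to roughly a quarter of the input width, while a 3 × 3 kernel with padding 1 and stride 1 leaves the width unchanged.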

2.1.3 GoogLeNet

GoogLeNet [11] is a deep network for which the authors proposed an inception architecture, based on how an optimal local sparse structure in a convolutional vision network can be approximated and covered by available components [11]. The architecture is based on the Hebbian principle, which states that neurons that fire together, wire together. Following this principle, and because correlations in images tend to be local, very local clusters are covered by 1 × 1 convolutions, and more spread-out clusters are then covered by 3 × 3 convolutions, as illustrated in Fig. 4.

Fig. 4 Convolution for local and spread out clusters that are correlated

Clusters that are spread out even further are covered by 5 × 5 convolutions, which results in a heterogeneous set of convolutions. GoogLeNet comprises 9 inception modules.
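One way to picture a heterogeneous set of 1 × 1, 3 × 3 and 5 × 5 convolutions is as parallel branches whose outputs are stacked along the channel axis. The sketch below only tracks shapes and is a simplification under assumptions: the 28 × 28 input size, the per-branch filter counts and the use of size-preserving padding are hypothetical values chosen for illustration, and any other components of an inception module are left out.

```python
def output_size(input_size, kernel, stride=1, pad=0):
    """Number of window positions along one axis."""
    return (input_size + 2 * pad - kernel) // stride + 1


def same_padding(kernel):
    """Padding that keeps the spatial size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


INPUT_SIZE = 28                           # hypothetical feature-map width
BRANCH_FILTERS = {1: 64, 3: 128, 5: 32}   # hypothetical filters per kernel size

branch_shapes = {}
for kernel, filters in BRANCH_FILTERS.items():
    size = output_size(INPUT_SIZE, kernel, pad=same_padding(kernel))
    branch_shapes[kernel] = (filters, size, size)  # (channels, height, width)

# Every branch keeps the same height and width, so the branch outputs
# can be concatenated along the channel axis.
spatial_sizes = {shape[1:] for shape in branch_shapes.values()}
concat_shape = (
    sum(shape[0] for shape in branch_shapes.values()),
    INPUT_SIZE,
    INPUT_SIZE,
)

print(branch_shapes)  # {1: (64, 28, 28), 3: (128, 28, 28), 5: (32, 28, 28)}
print(concat_shape)   # (224, 28, 28)
```

The small kernels respond to tightly clustered correlations and the larger ones to more spread-out clusters, while the concatenated output presents all of them to the next layer at once.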

3 Experimental Evaluation of LeNet, AlexNet and GoogLeNet for Anatomy Specific Classification

We started our experimentation with a data set acquired from the U.S. National Library of Medicine, National Institutes of Health, Department of Health and Human Services. This open-access medical image database contains thousands of anonymous medical images from various modalities such as CT, MRI, PET and X-ray, including images with various pathologies. For our experimental evaluation we downloaded 5500 images of various anatomies: lung, liver, heart, kidney and lumbar spine. We downloaded both normal and pathological images so that the networks would generalize to any image of the same organ even if it varies in shape or contrast. We supplied 1000 images per category for training, of which 25 % were used for validation. For testing we used a separate test set, also acquired from the same database, which contains 66 images of the anatomies mentioned above. We used 3851 images for training and 1149 for validation.
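The split sizes reported above can be checked with a few lines of arithmetic. The sketch uses only counts stated in this section; the variable names are illustrative.

```python
CATEGORIES = ["lung", "liver", "heart", "kidney", "lumbar spine"]
IMAGES_PER_CATEGORY = 1000
TRAIN_IMAGES = 3851
VALIDATION_IMAGES = 1149

# Images supplied for training and validation together.
supplied = len(CATEGORIES) * IMAGES_PER_CATEGORY

# Share of the supplied images that ended up in the validation set.
validation_fraction = VALIDATION_IMAGES / supplied

print(supplied, TRAIN_IMAGES + VALIDATION_IMAGES)  # 5000 5000
print(f"{validation_fraction:.1%}")                # 23.0%
```

The reported training and validation counts add up to the 5000 supplied images, and the validation share they imply is slightly below the nominal 25 %.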

3.1 Experimental Evaluation of LeNet

We started our experimentation with LeNet. Before training the net, we resized the images to 28 × 28 and preprocessed them by subtracting the mean image from each image. We then trained LeNet with a batch size of 50, which means that 50 images were supplied to the network at each training iteration, and we used stochastic gradient descent as the training algorithm with a learning rate of 0.01. The training of LeNet is shown in Fig. 5, which depicts how the accuracy and training loss change with each iteration.
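The preprocessing and training settings described above can be sketched in plain Python. This is an illustration under assumptions rather than the training code used in the study: the two flattened 2 × 2 toy images, the example weights and the example gradients are hypothetical, while the batch size of 50, the learning rate of 0.01 and the 3851 training images are taken from the text.

```python
import math


def mean_image(images):
    """Pixel-wise mean over a list of equally sized, flattened images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def subtract_mean(image, mean):
    """Subtract the mean image from one image, pixel by pixel."""
    return [pixel - m for pixel, m in zip(image, mean)]


def sgd_step(weights, gradients, learning_rate=0.01):
    """One plain stochastic gradient descent update: w <- w - lr * grad."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Hypothetical toy data: two 2 x 2 images flattened to four values each.
images = [[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]]
mean = mean_image(images)
centered = [subtract_mean(image, mean) for image in images]

# With 3851 training images and a batch size of 50, one pass over the
# training set takes this many iterations (the last batch is partial).
iterations_per_epoch = math.ceil(3851 / 50)

# Hypothetical weights and gradients, updated with the stated learning rate.
new_weights = sgd_step([0.5, -0.25], [1.0, -2.0], learning_rate=0.01)

print(mean)                  # [1.0, 3.0, 5.0, 7.0]
print(centered[0])           # [-1.0, -1.0, -1.0, -1.0]
print(iterations_per_epoch)  # 78
```

In a real training run the gradients would be averaged over the 50 images of each batch before the update is applied.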

Fig. 5 Training loss and validation accuracy with each epoch

Fig. 6 Training loss and validation accuracy for AlexNet

Fig. 7 Training loss and validation accuracy for GoogLeNet

Fig. 5 shows an accuracy of 45 % on the validation data. The training loss decreases with each iteration while the validation loss stays higher than the validation accuracy, which suggests that the model is overfitting. We then examined how the network performs on unknown data, i.e. the test data. The test data is evaluated on AlexNet, and the top nine predictions used to classify the data into the respective classes are shown in Figs. 8 and 9. The summarized results of training and validation are shown in Table 1.

Fig. 8 Top 9 predictions of AlexNet for each anatomy

Fig. 9 Top 9 predictions of AlexNet for each anatomy

Table 1 Comparative results of LeNet, AlexNet and GoogLeNet

3.2 Experimental Evaluation of AlexNet and GoogLeNet

The parameter settings for AlexNet differ from those for LeNet. The image dimensions for AlexNet are set to 256 × 256. The images are also mean-subtracted, and the network is trained with the same training algorithm, i.e. stochastic gradient descent. The batch size for AlexNet is 50, whereas the default batch size is 100; because of the limited capability of our machine we chose a batch size of 50, and the same setting was adopted for GoogLeNet. The training of AlexNet and GoogLeNet with each iteration is shown in Figs. 6 and 7, respectively. It is evident from the figures that GoogLeNet does not perform well on the medical imaging data, whereas AlexNet has much higher validation accuracy than LeNet and GoogLeNet. Although the training error of AlexNet increases with each epoch, it still performs better than the other two CNNs in terms of validation accuracy (Figs. 8 and 9).

4 Conclusion

In this study we compared three state-of-the-art convolutional neural networks for anatomy-specific classification. We experimented with five different anatomies. It is evident from the results that the CNN with the AlexNet architecture performs better than the other two architectures. One useful outcome of this study is the insight that increasing the number of layers, as in GoogLeNet, does not always increase performance. To obtain better accuracy, an optimization that addresses overfitting is needed in future work so that these nets perform better on medical image data.