Background

Each year, almost 10 million individuals in the United States suffer from macular degeneration, also known as age-related macular degeneration (AMD), and more than 200,000 people develop choroidal neovascularization, a severe blinding form of advanced AMD [1, 2]. Additionally, nearly 750,000 individuals aged 40 or older suffer from diabetic macular edema (DME) [3], a vision-threatening form of diabetic retinopathy that causes fluid accumulation in the central retina. Many researchers have attempted to develop effective artificial intelligence algorithms by using medical images to diagnose key pathologies of AMD and DME quickly and accurately.

Naz et al. [4] addressed the problem of automatically classifying optical coherence tomography (OCT) images to identify DME. They proposed a practical and relatively simple approach to using OCT image information and coherent tensors for robust classification of DME. The features extracted from thickness profiles and cysts were tested using 55 diseased and 53 normal OCT scans in the Duke Dataset. Comparisons revealed that the support vector machine with leave-one-out had the highest accuracy of 79.65%. For identifying DME, however, acceptable accuracy (78.7%) was achieved by using a simple threshold based on the variation in OCT layer thickness. Najeeb et al. [5] used a computationally inexpensive single layer convolutional neural network (CNN) structure to classify retinal abnormalities in retinal OCT scans. After training using an open-source retinal OCT dataset containing 83,484 images from patients, the model achieved acceptable classification accuracy. In a multi-class comparison (choroidal neovascularization (CNV), DME, Drusen, and Normal), the model achieved 95.66% accuracy. Nugroho [6] used various methods, including histogram of oriented gradient (HOG), local binary pattern (LBP), DenseNet-169, and ResNet-50, to extract features from OCT images and compared the effectiveness of handcrafted and deep neural network features. The evaluated dataset contained 32,339 instances distributed in four classes (CNV, DME, Drusen, and Normal). The accuracy values for the deep neural network-based methods (88% and 89% for DenseNet-169 and ResNet-50, respectively) were superior to those for the non-automatic feature models (50% and 42% for HOG and LBP, respectively). The deep neural network-based methods also obtained better results in the underrepresented class. In Kermany et al. [7], a diagnostic tool based on a deep-learning framework was used to screen patients with common treatable blinding retinal diseases. 
By using transfer learning, the deep-learning framework could train a neural network with a fraction of the data required in conventional approaches. When an OCT image dataset was used to train the neural network, accuracy in classifying AMD and DME was comparable to that of human experts. In a multi-class comparison among CNV, DME, Drusen, and Normal, the framework achieved 96.1% accuracy. In Perdomo et al. [8], an OCT-NET model based on CNN was used for automatically classifying OCT volumes. The OCT-NET model was evaluated on a dataset of OCT volumes for DME diagnosis by using a leave-one-out cross-validation strategy. Accuracy, sensitivity, and specificity all equaled 93.75%. The above results of research on AMD and DME indicate that automatic classification accuracy needs further improvement.

Therefore, the motivation of this study was to find CNN-based models, together with their appropriate hyperparameters, that use transfer learning to classify OCT images of AMD and DME. The CNN-based models used for transfer learning included an 8-layer Alexnet model [9], a 22-layer Googlenet model [10], 16- and 19-layer VGG models (VGG16 and VGG19, respectively; [11]), and 18-, 50- and 101-layer Resnet models (Resnet18, Resnet50, and Resnet101, respectively; [12]). The algorithm hyperparameters included optimizer, mini-batch size, max-epochs, and initial learning rate. The experiments showed that, after transfer learning, the VGG19, Resnet101, and Resnet50 models with their appropriate algorithm hyperparameters had excellent performance and capability in classifying OCT images of AMD and DME.

This paper is organized as follows. The research problem is described in Sect. 2. Section 3 describes the research methods and steps. Section 4 presents and discusses the results of experiments performed to evaluate performance in classifying OCT images of AMD and DME. Finally, Sect. 5 concludes the study.

Problem description

AMD and DME

The macula, which is located in the center of the retina, is essential for clear visualization of nearby objects such as faces and text. Various eye problems can degrade the macula and, if left untreated, can even cause loss of vision. Age-related macular degeneration is a medical condition that can cause blurred vision or loss of vision in the center of the visual field. Early stages of AMD are often asymptomatic. Over time, however, gradual loss of vision in one or both eyes may occur. Loss of central vision does not cause complete blindness but can impair performance of daily life activities such as recognizing faces, driving, and reading. Macular degeneration typically occurs in older people. AMD is classified as early, intermediate, or late. The late type is further classified as “dry” or “wet” [13]. In the “dry” type, which comprises 90% of AMD cases, retinal deterioration is associated with formation of small yellow deposits, known as Drusen, under the macula. In the “wet” type, abnormal blood vessel growth (i.e., CNV) occurs under the retina and macula. Bleeding and fluid leakage from these new blood vessels can then cause the macula to bulge or lift up from its normally flat position, thus distorting or destroying central vision. Under these circumstances, vision loss may be rapid and severe. DME is characterized by breakdown of blood vessel walls in the retina, resulting in accumulation of fluid and proteins in the retina. The resulting distortion of the macula then causes visual impairment or loss of visual acuity. One precursor of DME is diabetic retinopathy, in which blood vessel damage in the retina causes visual impairment [5].

OCT images of AMD and DME

In this study, all OCT images of AMD and DME used in the experiments were obtained from Kermany et al. [14]. The images were divided into four classes: CNV, DME, Drusen, and Normal. Figure 1 shows representative images of the four OCT classes.

Fig. 1 Representative optical coherence tomography images of the CNV, DME, Drusen, and Normal classes. CNV choroidal neovascularization, DME diabetic macular edema

Considered problem

The considered problem was how to classify large numbers of different OCT images of CNV, DME, Drusen, and Normal efficiently and accurately. Since OCT images of CNV, DME, Drusen, and Normal can differ even for the same illness, a specialist or machine learning is needed to assist the physician in classifying the images.

Methods

The research methods and steps were collecting data, processing OCT images of AMD and DME, selecting a pre-trained network for transfer learning, classifying OCT images of AMD and DME by CNN-based transfer learning, comparing performance among different CNN-based transfer learning approaches, and comparing performance with other approaches in classifying OCT images of AMD and DME. The detailed steps were as follows.

Collecting data and processing OCT images of AMD and DME

The OCT images of AMD and DME in Kermany et al. [14] were split into a training set and a testing set of images. The training set had 83,484 images, including 37,205 CNV images, 11,348 DME images, 8,616 Drusen images, and 26,315 images of a normal eye condition. The testing set used for network performance benchmarking contained 968 images, 242 images per class. To maintain compatibility with the CNN-based architecture, each OCT image was processed as a 224 × 224 × 3 image, where 3 is the number of color channels.
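As a quick consistency check on the figures quoted above, the following minimal Python sketch sums the stated class counts and reports the share of each class in the training set. The variable names are illustrative and are not taken from the study, which used Matlab.

```python
# Class counts for the training set, as stated in the text.
train_counts = {"CNV": 37205, "DME": 11348, "Drusen": 8616, "Normal": 26315}
# The testing set is stated to hold 242 images per class.
test_per_class = 242

train_total = sum(train_counts.values())
test_total = test_per_class * len(train_counts)

# Fraction of the training set that belongs to each class.
train_share = {name: count / train_total for name, count in train_counts.items()}

print("training images:", train_total)
print("testing images:", test_total)
for name, share in train_share.items():
    print(f"{name}: {share:.1%}")
```

The totals agree with the stated 83,484 training images and 968 testing images. The shares also show that the training set is imbalanced (CNV is roughly 45% of the images and Drusen roughly 10%), whereas the testing set is balanced.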

Selecting pre-trained network for transfer learning

Transfer learning is a machine learning method in which a model developed for a task is reused as the starting point for a model developed for another task. In transfer learning, pre-trained models are used as the starting point for performing computer vision and natural language processing tasks. Transfer learning is widely used because it reduces the computation time, the computational resources, and the expertise needed to develop neural network models for solving these problems [15]. In his NeurIPS 2016 tutorial, Ng [16] highlighted the potential uses of transfer learning and predicted that, after supervised learning, transfer learning will be the next major commercial application of machine learning. In transfer learning, a pre-trained model is used to construct a predictive model. Thus, the first step is to select a pre-trained source model from available models. The pool of candidate models may include models developed by research institutions and trained using large and complex datasets. The second step is to reuse the model. The pre-trained model can then be used as the starting point for a model used to perform the second task of interest. This may involve using all or parts of the model, depending on the modeling technique used. The third step is to tune the model. Depending on the input–output pair data available for the task of interest, the user may consider further modification or refinement of the model.

The widely used commercial software program Matlab R2019 by MathWorks provides pre-trained neural networks for deep learning. These image classification networks, which were pre-trained to extract powerful and informative features from natural images, served as the starting point for learning a new task. Most pre-trained networks were trained with a subset of the ImageNet database [17] used in the ImageNet Large-Scale Visual Recognition Challenge [18]. After training on more than 1 million images, the networks could classify images into 1000 object categories, e.g., keyboard, coffee mug, pencil, and various animals. Transfer learning with a pre-trained network is typically much faster than training a network from scratch.

Classifying OCT images of AMD and DME by CNN-based transfer learning

Fine-tuning a pre-trained CNN with transfer learning is often faster and easier than constructing and training a new CNN. Although a pre-trained CNN has already learned a rich set of image features, it can be fine-tuned to learn features specific to a new dataset, in this case, OCT images of AMD and DME. Fine-tuning a network is slower and requires more effort than simple feature extraction. However, since the network can learn to extract a different feature set, the final network is often more accurate. The starting point for fine tuning deeper layers of the pre-trained CNNs for transfer learning (i.e., Alexnet, Googlenet, VGG16, VGG19, Resnet18, Resnet50, and Resnet101) was training the networks with a new data set of OCT images of AMD and DME. Figure 2 is a flowchart of the CNN-based transfer learning procedure.

Fig. 2 Flowchart of CNN-based transfer learning procedure

Comparison of transfer learning performance in different CNN models

In the experiments, Alexnet, Googlenet, VGG16, VGG19, Resnet18, Resnet50, and Resnet101 were used to classify OCT images of AMD and DME in five independent runs. The results recorded for the training set and the testing set included (1) the accuracy in each run of the experiment, (2) the average accuracy for five independent runs, and (3) the standard deviation in the accuracy for five independent runs. Accuracy was defined as the proportion of correct results (true positives plus true negatives) among all cases examined.
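The three recorded quantities can be sketched in a few lines of Python. This is an illustration only: the run accuracies below are hypothetical placeholders rather than values from Table 3, and the text does not state whether the sample or the population standard deviation was used, so both are computed.

```python
from statistics import mean, pstdev, stdev


def accuracy(true_labels, predicted_labels):
    """Fraction of images whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical accuracies from five runs (illustrative only, not from Table 3).
run_accuracies = [0.98, 0.99, 0.97, 0.99, 0.98]

average_accuracy = mean(run_accuracies)
sd_sample = stdev(run_accuracies)       # divides by n - 1
sd_population = pstdev(run_accuracies)  # divides by n

print(f"average accuracy: {average_accuracy:.4f}")
print(f"sample SD: {sd_sample:.4f}, population SD: {sd_population:.4f}")
```

With only five runs the two standard deviations differ noticeably, so the choice of estimator matters when SDs of different models are compared.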

Classification performance in comparison with other approaches

The accuracy, precision, recall (i.e., sensitivity), specificity, and F1-score values were used to compare performance with other approaches. Precision was assessed by positive predictive value (number of true positives over number of true positives plus number of false positives). Recall (sensitivity) was assessed by true positive rate (number of true positives over the number of true positives plus the number of false negatives). Specificity was measured by true negative rate (number of true negatives over the number of false positives plus the number of true negatives). The F1-score, a function of precision and recall, was used to measure prediction accuracy when classes were very imbalanced. In information retrieval, precision is a measure of the relevance of results while recall is a measure of the number of truly relevant results returned. The formula for F1-score is

$${\text{F}}_{1} {\text{ - score}} = 2 \times \frac{precision \times recall}{{precision + recall}}$$
(1)
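The definitions above can be applied to a multi-class confusion matrix by treating each class in turn as the positive class (one-vs-rest) and then averaging over the classes. The Python sketch below assumes that rows are true classes and columns are predicted classes, and that the reported "average" values are unweighted means over the four classes; neither convention is stated explicitly in the text. The confusion matrix is hypothetical and is not one of the matrices in Tables 4 to 8.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics for each class.

    cm[i][j] is the number of images of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # another class predicted as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


def macro_average(results, key):
    """Unweighted mean of one metric over all classes."""
    return sum(r[key] for r in results) / len(results)


def overall_accuracy(cm):
    """Correctly classified images (the diagonal) over all images."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical 4-class matrix with 242 images per true class (illustrative only).
example_cm = [
    [240, 0, 1, 1],
    [1, 239, 1, 1],
    [0, 2, 240, 0],
    [2, 0, 0, 240],
]
example_metrics = per_class_metrics(example_cm)
for name in ("precision", "recall", "specificity", "f1"):
    print(name, round(macro_average(example_metrics, name), 4))
print("accuracy", round(overall_accuracy(example_cm), 4))
```

Because the testing set is balanced, the unweighted mean of the per-class recalls equals the overall accuracy in this setting.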

Results

The proposed CNN-based transfer learning method with appropriate hyperparameters was experimentally used to classify OCT images of AMD and DME. The OCT images in Kermany et al. [14] were used to train models and to test their performance. The experimental environment was Matlab R2019 and its toolboxes developed by The MathWorks. The network training options were the options available in the Matlab toolbox for CNN-based transfer learning with algorithm hyperparameters, i.e., ‘Optimizer’, ‘MiniBatchSize’, ‘MaxEpochs’ (maximum number of epochs), and ‘InitialLearnRate’.

The experimental data for OCT images of AMD and DME included a training set and a testing set. To maintain compatibility with the CNN-based architecture, each OCT image was processed as a 224 × 224 × 3 image, where 3 is the number of color channels. Table 1 shows the training and testing sets of OCT images of AMD and DME.

Table 1 Training and testing set of OCT images of AMD and DME

For training, different CNN-based models require different algorithm hyperparameters. The hyperparameter values are set before the learning process begins. Table 2 shows the selected CNN-based models with algorithm hyperparameters. The selected optimizer was ‘sgdm’, i.e., stochastic gradient descent with momentum. MiniBatchSize set the mini-batch to 40 observations at each iteration. MaxEpochs set the maximum number of epochs for training. InitialLearnRate set the learning rate used at the start of training.

Table 2 Selected CNN-based models with algorithm hyperparameters

For each CNN-based model, Table 3 shows the accuracy in each experiment, the average accuracy for all experiments, and the standard deviation (SD) in accuracy in classifying OCT images of AMD and DME. Data are shown for five independent runs of the experiments performed in the training set and in the testing set.

Table 3 shows that the average accuracy in the testing set ranged from 0.9750 to 0.9942 when using the CNN-based models with appropriate hyperparameters for transfer learning. For the testing set, the VGG19, Resnet101, and Resnet50 models had average accuracies of 0.9942, 0.9919, and 0.9909, respectively, all of which exceeded 0.99. Moreover, the SDs in accuracy obtained by VGG19 and Resnet101 were both 0.0005. That is, VGG19 and Resnet101 had the most robust performance in classifying OCT images of AMD and DME.

Table 3 Accuracy of CNN models and standard deviation (SD) for classification of OCT images of AMD and DME in five independent runs of the experiments

Figure 3 shows how model training progressively improved accuracy in VGG19. The selected training option was the sgdm optimizer. MiniBatchSize was set to 40 observations at each iteration. The number of iterations per epoch was 2087, i.e., the number of training images divided by MiniBatchSize (83,484/40 = 2087.1) rounded down to a whole number. MaxEpochs, the maximum number of epochs, was set to 3. Therefore, the maximum number of iterations was 6261 (= 2087 × 3), i.e., iterations per epoch × MaxEpochs. The blue line shows the progressive improvement in accuracy for the training set, and the black line shows the progressive improvement in accuracy for the testing set.

Fig. 3 Progressive improvement in accuracy of VGG19

Figures 4 and 5 show how model training progressively improved accuracy in Resnet101 and Resnet50, respectively. The training option was the sgdm optimizer. MiniBatchSize was set to 40 observations at each iteration. The number of iterations per epoch was 2087. MaxEpochs was set to 5. Therefore, the maximum number of iterations was 10,435 (= 2087 × 5). The blue line shows the progressive improvement in accuracy when using the training set, and the black line shows the progressive improvement in accuracy when using the testing set.
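The iteration counts quoted for Figs. 3 to 5 can be reproduced with the short Python sketch below. It assumes that an incomplete final mini-batch is dropped in each epoch, which is what the quoted figure of 2087 iterations per epoch implies (83,484 is not an exact multiple of 40). The function name is illustrative.

```python
def iteration_counts(num_train_images, mini_batch_size, max_epochs):
    """Return (iterations per epoch, maximum iterations).

    Assumes the incomplete final mini-batch of each epoch is discarded,
    so the division is rounded down.
    """
    iterations_per_epoch = num_train_images // mini_batch_size
    return iterations_per_epoch, iterations_per_epoch * max_epochs


# VGG19 setting described in the text: MaxEpochs = 3.
print(iteration_counts(83484, 40, 3))
# Resnet101 and Resnet50 setting described in the text: MaxEpochs = 5.
print(iteration_counts(83484, 40, 5))
```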

Fig. 4 Progressive improvement in accuracy of Resnet101

Fig. 5 Progressive improvement in accuracy of Resnet50

The accuracy metric was used to measure the transfer learning performance of the CNN-based models. Precision, recall, specificity, and F1-score were further used to validate classification performance. The results were depicted by creating a confusion matrix of the predicted labels versus the true labels for the respective disease classes. Tables 4, 5 and 6 show the confusion matrices used in multi-class comparisons of Normal, CNV, DME, and Drusen for VGG19, Resnet101, and Resnet50 for the testing data.

Table 4 Confusion matrix for Normal, CNV, DME, and Drusen obtained by VGG19 in Experiment #5
Table 5 Confusion matrix for Normal, CNV, DME, and Drusen obtained by Resnet101 in Experiment #5
Table 6 Confusion matrix for Normal, CNV, DME, and Drusen obtained by Resnet50 in Experiment #5

Table 4 shows that, in Experiment #5, VGG19 achieved an accuracy of 0.9948 with an average precision of 0.9949, an average recall of 0.9948, an average specificity of 0.9983, and an average F1-score of 0.9948.

Table 5 shows that, in Experiment #5, Resnet101 achieved an accuracy of 0.9928 with an average precision of 0.9928, an average recall of 0.9928, an average specificity of 0.9976, and an average F1-score of 0.9928.

Table 6 indicates that, in Experiment #5, Resnet50 achieved an accuracy of 0.9917 with an average precision of 0.9918, an average recall of 0.9917, an average specificity of 0.9972, and an average F1-score of 0.9917.

Next, the performance of the proposed CNN-based transfer learning approach in classifying OCT images of AMD and DME was compared with the results reported in Kermany et al. [7], Najeeb et al. [5], and Nugroho [6]. Table 7 shows the confusion matrix for Normal, CNV, DME, and Drusen obtained by Kermany et al. [7]. The model in Kermany et al. [7] achieved an accuracy of 0.9610 with an average precision of 0.9610, an average recall of 0.9613, an average specificity of 0.9870, and an average F1-score of 0.9610. Table 8 shows the confusion matrix for Normal, CNV, DME, and Drusen obtained by Najeeb et al. [5]. The model in Najeeb et al. [5] achieved an accuracy of 0.9566 with an average precision of 0.9592, an average recall of 0.9566, an average specificity of 0.9855, and an average F1-score of 0.9563.

Table 7 Confusion matrix for Normal, CNV, DME, and Drusen obtained by Kermany et al. [7]
Table 8 Confusion matrix for Normal, CNV, DME, and Drusen obtained by Najeeb et al. [5]

For the testing set, Table 9 shows the classifier accuracy, average precision, average recall/sensitivity, average specificity, and average F1-score obtained by the different CNN-based models. When the testing set was used in Experiment #5, the accuracies obtained by VGG19, Resnet101, and Resnet50 were 0.9948, 0.9928, and 0.9917, respectively, which are all very high and were superior to the accuracies obtained by the models in Kermany et al. [7], Najeeb et al. [5], and Nugroho [6]. In Experiment #5, other measures (i.e., average precision, average recall/sensitivity, average specificity, and average F1-score) obtained by VGG19, Resnet101, and Resnet50 were higher than those obtained by the models in Kermany et al. [7], Najeeb et al. [5], and Nugroho [6]. That is, by using transfer learning with appropriate hyperparameters, the proposed CNN-based models VGG19, Resnet101, and Resnet50 had excellent performance and capability in classifying OCT images of AMD and DME.

Table 9 Classifier accuracy, precision, recall/sensitivity, specificity, and F1-score obtained by different CNN-based models for testing set

Discussions

In this study, the appropriate algorithm hyperparameters for CNN-based transfer learning were very important for classifying OCT images of AMD and DME. This was demonstrated by experiments in which the VGG19, Resnet50, and Resnet101 models achieved a classification accuracy exceeding 99%. If an inappropriate combination of algorithm hyperparameters is used, the classification accuracy will be reduced. For example, the algorithm hyperparameters for Googlenet transfer learning and the results in Table 10 indicate that, when the Optimizer (sgdm) and InitialLearnRate (10⁻⁴) are kept identical, an appropriate choice of the remaining hyperparameters can provide good performance for transfer learning. Therefore, the combination of algorithm hyperparameters of the third case (i.e., Optimizer of sgdm, MiniBatchSize of 40, MaxEpochs of 5, and InitialLearnRate of 10⁻⁴) was selected for the study because it achieved high accuracy in the training and testing sets. Tables 11 and 12 show the algorithm hyperparameters for Resnet50 and Resnet101 transfer learning and their respective results. Tables 11 and 12 show that, if all other hyperparameters are identical (Optimizer of sgdm, MiniBatchSize of 40, and InitialLearnRate of 10⁻⁴), changing MaxEpochs from 3 to 5 improves accuracy for the test set to more than 0.99. Therefore, this combination of algorithm hyperparameters (i.e., Optimizer of sgdm, MiniBatchSize of 40, MaxEpochs of 5, and InitialLearnRate of 10⁻⁴) was selected for Resnet50 and Resnet101 transfer learning in classifying OCT images of AMD and DME.

Table 10 Googlenet model with different algorithm hyperparameters: accuracy for training and testing sets
Table 11 Resnet50 model with different algorithm hyperparameters: accuracy for training and testing sets
Table 12 Resnet101 model with different algorithm hyperparameters: accuracy for training and testing sets

Figure 6 displays four sample images with predicted labels and the predicted probabilities of images with those labels. For the four randomly selected sample images, the predicted labels agreed with the image categories, and the predicted probabilities approached 100%, indicating that the model established by CNN-based transfer learning had high classification ability.

Fig. 6 Four sample images with predicted labels and predicted probabilities of images with those labels

Presently, CNN-based transfer learning is very efficient and stable [19, 20]. The key to successful image classification is ensuring that the original images are correctly labeled. With correctly labeled images, the CNN-based models in this study achieved a classification accuracy exceeding 99%. Therefore, CNN-based transfer learning with appropriate hyperparameters performed best among the compared approaches in classifying OCT images of AMD and DME.

Conclusions

This study used CNN-based transfer learning with appropriate algorithm hyperparameters for effectively classifying OCT images of AMD and DME. The main contribution of this study is the confirmation that suitable CNN-based models with their algorithm hyperparameters can use transfer learning to classify OCT images of AMD and DME. Various metrics were used to verify the usability of the adopted CNN-based models. As Table 3 shows, the average accuracies of models VGG19, Resnet101, and Resnet50 in the testing set were 0.9942, 0.9919, and 0.9909, respectively, all of which exceeded 0.99. Moreover, the SDs of accuracy obtained by VGG19 and Resnet101 were both 0.0005. That is, VGG19 and Resnet101 were robust models for classifying OCT images of AMD and DME. Table 9 shows that, when the testing set was used in Experiment #5, the accuracies of VGG19, Resnet101, and Resnet50 were 0.9948, 0.9928, and 0.9917, respectively, which were all higher than the accuracies obtained by the models in Kermany et al. [7], Najeeb et al. [5], and Nugroho [6]. Other measures (average precision, average recall/sensitivity, average specificity, and average F1-score) obtained by VGG19, Resnet101, and Resnet50 in Experiment #5 were also higher than those obtained by the models in Kermany et al. [7], Najeeb et al. [5], and Nugroho [6]. That is, the CNN-based models VGG19, Resnet101, and Resnet50 obtained by transfer learning with appropriate algorithm hyperparameters were effective and useful for classifying OCT images of AMD and DME.