Introduction

Eye disorders have become a growing concern among older individuals in recent years. Often, these conditions progress unnoticed until symptoms appear, emphasizing the importance of regular eye examinations for early detection1. This is especially critical when it comes to Age-related Macular Degeneration (AMD), a prevalent condition that affects individuals over 45 and is one of the leading causes of vision impairment in the elderly2. Traditional diagnosis methods like slit-lamp examinations by ophthalmologists have limitations due to skill variations and record-keeping issues. However, there is a promising avenue for AMD diagnosis and management through the use of machine learning (ML) algorithms for classifying fundus images3.

Located at the posterior pole of the retina, the macula plays an important role in sharp central color vision. Abnormalities in this region can cause blurred vision, dark spots, and visual distortions. The exact causes of AMD are not fully understood, but genetics, chronic light exposure, and nutritional imbalances are involved. To better understand AMD progression and effective treatment, it is necessary to classify fundus images into groups such as geographic atrophy (GA), intermediate AMD, normal, and wet AMD4,5. GA denotes the advanced dry stage, with progressive retinal pigment epithelium (RPE) cell loss resulting in distinct atrophic patches and central vision loss. Intermediate AMD falls between the early and advanced stages and is characterized by drusen and pigment changes. The “normal” category includes images without AMD-related changes, which generally show no clinical evidence of the disease. Wet AMD, the severe form, involves abnormal blood vessel growth beneath the retina, leading to retinal degeneration and, if left untreated, rapid loss of central vision6,7,8.

ML techniques hold significant promise in precisely categorizing fundus images into different AMD stages. These methods efficiently analyze extensive datasets and acquire intricate patterns and relationships from images, allowing for the identification of subtle indicators of various AMD phases. By automating the classification process, ML algorithms offer consistent and objective assessments, reducing discrepancies between observers and facilitating prompt diagnoses. A range of ML algorithms have been explored for the classification of AMD fundus images, as documented in9.

Traditional ML methods10 such as random forest (RF), multilayer perceptron (MLP), decision tree (DT), logistic regression (LR), support vector machine (SVM), and K-nearest neighbors (KNN) have shown promising results in previous studies, with exceptional performance in image classification tasks11. Reliable ML-based systems for classifying fundus images into the GA, intermediate AMD, normal, and wet AMD categories have great potential for clinical application. Such systems can help eye-care specialists provide accurate and timely diagnoses, facilitating appropriate treatment options for patients12. Furthermore, they can play an important role in large-scale screening programs, enhancing early detection and intervention and ultimately improving the visual outcomes of individuals with AMD13.

The aim of the present study is to evaluate the accuracy of ML methods for classifying AMD stages from fundus images. Beyond detailing the full model-selection process, the analysis also considers the metrics employed to evaluate the developed classification model. The paper further explores the findings, possible obstacles, and future approaches to make ML-based AMD classification systems more accurate and clinically useful. In this regard, this study advances knowledge in this domain and paves the way for better patient care and outcomes associated with computer-aided AMD diagnosis. The following points summarize the present study’s contributions:

  • Development of a non-invasive CAD system: the study successfully developed a non-invasive CAD system for AMD using ML methods, which provides a valuable tool for the early diagnosis of this disease.

  • Improved AMD classification: through extensive research and testing, the study enhanced the accuracy and reliability of AMD classification from fundus images. This improvement ensures a more consistent classification of AMD across different stages.

  • Enhanced patient care: the CAD system, by automating the AMD diagnostic process and bridging gaps between caregivers, has the potential to significantly improve patient care. This advancement supports timely and accurate examinations, contributing to enhanced overall healthcare delivery.

Paper organization

The paper is organized as follows: “Related studies” discusses related work on AMD classification. “Materials” describes the research materials. “Methodology” presents the proposed approach for AMD classification and its phases in detail. “Experiments” presents the experimental results and discussion. “Overall discussion” discusses the work and experiments. “Limitations” highlights the study’s limitations. Finally, “Conclusions and future directions” addresses the conclusions and future directions.

Related studies

Recently, several algorithms have been developed to address the challenge of classifying fundus images of age-related macular degeneration (AMD) by leveraging patterns and features present in the data. These efforts have resulted in a significant body of academic work focused on the classification of AMD fundus images. Furthermore, various classification techniques and methodologies have been explored in these studies.

Notable examples include the work of Bhuiyan et al.14, who used convolutional neural networks (CNNs) to classify referable AMD using the AREDS dataset, which contains about 116,875 images. Their results show that disease/no-disease classification achieves about 0.992 accuracy, and four-class AMD severity classification about 0.961 accuracy. Zapata et al.15 proposed a CNN-based disease/no-disease classification approach for AMD using the Optretina dataset, which contains about 306,302 images. This research achieved an accuracy of 0.863 and an AUC of 0.936.

Bulut et al.16 proposed a deep learning approach (i.e., the Xception model) for detecting retinal abnormalities based on color fundus images. During the analysis, the Xception model was trained with 50 different parameter combinations. The highest accuracy achieved was 82.5%. Gayathri et al.17 proposed automated binary and multiclass classification of diabetic retinopathy (DR). Their work focuses on extracting Haralick and Anisotropic Dual-Tree Complex Wavelet Transform (ADTCWT) features that enable reliable DR classification from retinal fundus images. The evaluation results show that, with the proposed feature extraction method, Random Forest outperforms all other classifiers, with average accuracies of 99.7% and 99.82% for binary and multiclass classification, respectively.

Furthermore, Rajagopalan et al.18 proposed a deep convolutional neural network (DCNN) architecture for the efficient classification and diagnosis of diabetic macular edema (DME) and drusen macular degeneration (DMD). First, the input OCT image is despeckled with Kuan filters to remove inherent speckle noise. The CNN is then tuned with hyperparameter optimization methods, and K-fold validation is performed to guarantee full use of the datasets. Chakravorti et al.19 proposed an efficient CNN for AMD classification. The network was trained on fundus images to classify them into the four AMD categories, achieving high accuracy with reduced computational complexity. Thomas et al.20 developed an algorithm for the diagnosis of AMD in retinal OCT images based on the detection of RPE layers and baseline estimation using statistical approaches and randomization.

Additionally, Zheng et al.21 designed a five-category intelligent auxiliary diagnosis model for common fundus diseases. The accuracy rates of the three intelligent auxiliary diagnosis models were all above 90%, and the kappa values were all above 88%. For the four common fundus diseases, the best results of sensitivity, specificity, and F1-scores were 97.12%, 99.52%, 96.43%, and 98.21%, respectively. Vaiyapuri et al.22 presented a new multi-retinal disease diagnosis model using the IDL-MRDD technique to determine different types of retinal diseases. The experimental values pointed out a superior outcome over the existing techniques, with a maximum accuracy of 0.963. Lee et al.23 proposed two deep learning models, CNN-LSTM and CNN-Transformer, which combine a CNN with a Long Short-Term Memory (LSTM) network and a Transformer, respectively, to capture the sequential information in longitudinal CFPs. The proposed models outperformed baseline models that used only single-visit CFPs to predict the risk of late AMD (AUC 0.879 vs. 0.868 for 2-year prediction, and 0.879 vs. 0.862 for 5-year prediction).

Moreover, Kar et al.24 introduced an innovative method for precise retinal blood vessel detection in fundus images. Their approach features a generative adversarial network (GAN)25 with a unique architecture, combining a multi-scale residual convolutional neural network as the generator and a vision transformer as the discriminator. The GAN model, employing adversarial learning, achieves state-of-the-art results. Preprocessing involves contrast enhancement using a contrast-limited adaptive histogram equalization algorithm. Rigorous evaluations on multiple databases confirm the method’s robustness and efficacy, outperforming existing approaches with notable accuracy scores on CHASE_DB1, DRIVE, HRF, and ARIA databases.

In addition, Elangovan et al.26 proposed a robust automated glaucoma diagnosis system utilizing a deep ensemble model and stacking ensemble learning. The study focuses on the efficiency of thirteen pre-trained models, including Alexnet, Googlenet, VGG-16, VGG-19, Squeezenet, Resnet-18, Resnet-50, Resnet-101, Efficientnet-b0, Mobilenet-v2, Densenet-201, Inception-v3, and Xception. The ensemble model, evaluated in 65 configurations, employs a two-stage ensemble selection technique and a probability averaging approach. The final classification integrates an SVM classifier. The method demonstrates exceptional performance on modified publicly available databases (DRISHTI-GS1-R, ORIGA-R, RIM-ONE2-R, LAG-R, and ACRIMA-R), achieving overall classification accuracies of 93.4%, 79.6%, 91.3%, 99.5%, and 99.6%, respectively.

Furthermore, Haider et al.27 introduced ESS-Net and FBSS-Net for accurate OD and OC segmentation in retinal fundus images, addressing challenges like size and pixel variations. Both networks, with 3.02 million trainable parameters, demonstrated excellent segmentation on datasets like REFUGE and Drishti-GS, providing efficient solutions for computer-assisted glaucoma diagnosis. Additionally, Arsalan et al.28 introduced the vessel segmentation ultra-lite network (VSUL-Net) to accurately extract retinal vasculature without image preprocessing. With only 0.37 million trainable parameters, VSUL-Net utilizes a retention block for improved sensitivity, eliminating the need for expensive preprocessing schemes. Tested on DRIVE, STARE, and CHASE-DB1 datasets, the method achieved robust segmentation with Sensitivity, Specificity, Accuracy, and Area Under the Curve values of 83.80%, 98.21%, 96.95%, and 98.54% for DRIVE, 81.73%, 98.35%, 97.17%, and 98.69% for CHASE-DB1, and 86.64%, 98.13%, 97.27%, and 99.01% for STARE datasets.

Similarly, Singh et al.29 proposed an efficient glaucoma detection system using customized particle swarm optimization (CPSO) and four state-of-the-art machine-learning classifiers. The interconnected architecture involves pre-processing, segmentation, feature extraction, selection of critical features, and classification using CPSO-machine learning. The study focuses on a public dataset, Digital Retinal Images for Optic Nerve Segmentation. Unlike using all 20 extracted features, the system selects critical features based on univariate and feature importance methods. The best performance is achieved with a CPSO-K-nearest neighbor hybrid method, recording a maximum accuracy of 0.99, specificity of 0.96, sensitivity of 0.97, precision of 0.97, F1-score of 0.97, and Kappa of 0.94. Singh et al.30,31 also addressed feature selection challenges in machine learning, focusing on glaucoma detection using benchmark datasets. The study introduces a metaheuristics-based feature selection technique employing emperor penguin optimization and bacterial foraging optimization, proposing a hybrid algorithm. From 36 features extracted from retinal fundus images, the technique minimizes the feature set while enhancing classification accuracy. Six machine learning classifiers evaluate smaller subsets provided by the optimization techniques. The hybrid optimization technique, paired with random forest, achieves the highest accuracy at 0.95410.

In summary, these studies collectively provide valuable insights into the performance of diverse classification techniques for AMD fundus image classification. The results highlight the effectiveness of deep learning methods and the importance of feature extraction techniques in achieving accurate and reliable classifications. While Random Forest and SVM often excel in terms of classification accuracy, it is crucial to consider the dataset, feature extraction methods, and evaluation metrics when interpreting specific results from these studies.

While previous investigations have provided valuable insights into the performance of diverse classification techniques for AMD fundus image classification, a notable gap remains in interpreting the data through the extraction of both local and global features. Many of the mentioned studies have focused on the application of deep learning and convolutional neural networks (CNNs) for classification, achieving impressive accuracy rates. However, these studies have predominantly emphasized deep learning methods without comprehensively exploring feature extraction techniques that capture both local and global characteristics of fundus images.

The extraction of local features, which pertain to specific regions or structures within the image, can provide valuable information about subtle abnormalities in the retina. Similarly, global features, which encompass broader characteristics of the entire image, can offer insights into overall patterns and structures. Combining both types of features can enhance the interpretability of the classification process and potentially lead to more robust and explainable results.

Materials

Patient selection and characteristics

Patient selection required the collection of retinal fundus images from a diverse group of real patients who showed symptoms associated with AMD, covering different stages and types of the disease. The database used in this study consists of more than 864 retinal images spanning the AMD categories. Each patient’s demographic and clinical characteristics, including age, sex, and specific AMD category, were recorded. The experimental protocols were approved by the authors’ and patients’ institutions: the University of Louisville and Mansoura University.

Imaging techniques

The retinal imaging in this study primarily used color fundus photography. These two-dimensional images were obtained from light reflected by the retina32. Complete data were obtained by imaging the left and right eyes of each patient. Using state-of-the-art equipment, high-quality, high-resolution images were acquired.

  • Fundus color images were taken with standard retinal imaging techniques.

  • High quality/resolution imaging equipment was used to ensure image quality and detail.

  • Images of the left and right eyes were obtained for each patient for all analyses.

Data collection and analysis

Data collection involved the systematic acquisition of retinal images from the fundus image database maintained for this study. The collected data were intensively analyzed, and the relevant features and attributes necessary for an ML-based diagnostic model were subsequently extracted.

Data categorization

The data used in this study included four distinct categories of AMD, namely geographic atrophy (GA), intermediate AMD, normal, and wet AMD. Each category represents a different stage or form of AMD. Classification of cases was based on clinical evaluation and expert review, which ensured labeling accuracy.

Study design and ethical considerations

In the context of this research, a study design was established to investigate the use of ML-based medical diagnostic techniques for the classification of AMD using retinal fundus images. Ethical considerations were taken into account, and all procedures adhered to the relevant ethical guidelines and regulations. Informed consent was obtained from all patients involved in the study. The current study does not contain any studies with human participants and/or animals performed by any of the authors.

Consent to participate

All patients have provided informed consent for the present study.

Methodology

A detailed CAD framework for AMD analysis is introduced and shown graphically in Fig. 1. It comprises data acquisition to obtain the necessary information, followed by pre-processing techniques applied to improve data quality. Feature extraction methods are then used to extract meaningful patterns from the data, and the classification stage enables an accurate classification of AMD cases. Moreover, the Tree of Parzen Estimators (TPE) is used as an additional tool to enhance the models. This pipeline supports a precise diagnosis of AMD and thus offers a viable strategy for improving patient health.

Figure 1

The proposed framework for AMD diagnosis in the current study comprises distinct stages, including data acquisition, pre-processing, feature extraction, classification, and the utilization of Tree of Parzen Estimators (TPE) for optimization.

Data pre-processing phase

Data preprocessing is an important stage in image analysis, consisting of several steps that aim to improve data quality and prepare it for further analysis33,34. The data preprocessing pipeline (Fig. 2) denotes a systematic way of preparing a dataset for analysis. The suggested pipeline is split into a set of steps, each designed to enhance the quality and consistency of the subsequent classification:

  • Average CDF calculation: calculate the average CDF for the different classes to understand the pixel intensity distribution.

  • CLAHE enhancement: apply the CLAHE algorithm to improve image contrast while limiting noise amplification.

  • Interpolation: interpolate the individual image CDFs with the mean CDF to standardize the intensity distribution.

  • ROI extraction: use masks to extract regions of interest from images, focusing on relevant regions.

  • Contour analysis: use contour analysis to adjust ROIs, define object boundaries, and calculate object properties.

Figure 2

Image pre-processing steps: illustration of the various stages of image preparation for dataset samples. The top row displays the original data. The middle row shows the data after applying CLAHE. The bottom row exhibits the resized data following CDF interpolation. The right side shows contours at different distances from 0 (representing the original mask) up to 1500.

Average cumulative distribution function for class-specific pixel intensities

An important concept in data preprocessing is the cumulative distribution function (CDF). A CDF represents the cumulative probability distribution of pixel intensities in an image and provides valuable insight into its properties. To prepare the data, the average CDF (ACDF) is calculated for each class within the dataset. This average CDF, denoted as \(ACDF_C(x)\) for class C, is computed by aggregating the individual CDFs of all images in that class. The mathematical representation is as follows: \(ACDF_C(x) = \frac{1}{N_C} \times \sum _{i=1}^{N_C}{{CDF_{C_{i}}(x)}}\) where \(ACDF_C(x)\) represents the average CDF for class C at pixel intensity x, \(N_C\) is the total number of images in class C, and \({{CDF_{C_{i}}(x)}}\) is the CDF of the i-th image in class C. Figure 3 shows the normalized average CDFs for each class.
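The ACDF computation above can be sketched in Python with NumPy; the 256-bin resolution and the toy images are illustrative assumptions, not the study's actual data or settings:

```python
import numpy as np

def image_cdf(img, bins=256):
    """Normalized cumulative distribution of pixel intensities for one image."""
    hist, _ = np.histogram(img.ravel(), bins=bins, range=(0, bins))
    cdf = hist.cumsum().astype(float)
    return cdf / cdf[-1]  # normalize so the CDF ends at 1.0

def average_cdf(images, bins=256):
    """ACDF_C(x): the mean of the per-image CDFs for one class."""
    return np.mean([image_cdf(im, bins) for im in images], axis=0)

# Toy example: two synthetic 8-bit "images" from one class.
rng = np.random.default_rng(0)
imgs = [rng.integers(0, 256, size=(32, 32)) for _ in range(2)]
acdf = average_cdf(imgs)
```

Since each per-image CDF is non-decreasing and ends at 1, the class average inherits both properties, which keeps later interpolation against it well-behaved.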

Figure 3

Visualization of the normalized average CDFs for each class: GA, Intermediate, Normal, and Wet. The x-axis is the intensity while the y-axis is the probability.

Contrast limited adaptive histogram equalization (CLAHE)

Histogram equalization is a technique used to enhance image contrast by redistributing pixel intensities. However, it can inadvertently amplify noise35. Contrast Limited Adaptive Histogram Equalization (CLAHE) builds on this concept by applying histogram equalization locally, in small regions of an image. The theoretical foundation of CLAHE involves several key aspects:

  • Histogram equalization: traditional histogram equalization maps each pixel through the image’s CDF, stretching intensity values across the entire image and aiming for a uniform distribution. CLAHE instead applies a clipped CDF computed per region: \(CLAHE(x,y)=CDF_{clip}(I(x,y))\) where CLAHE(x, y) denotes the CLAHE-enhanced pixel intensity at coordinates (x, y), and \(CDF_{clip}(I(x,y))\) is the clipped CDF of the pixel intensity I(x, y).

  • Adaptive approach: CLAHE adapts the histogram equalization process by dividing the image into smaller tiles or regions. Each region is equalized independently, allowing for localized contrast enhancement.

  • Contrast limiting: to prevent excessive amplification of pixel values, CLAHE limits the slope of the CDF within each region. This limitation balances contrast enhancement with noise control.

Interpolate the average CDF with images

Interpolation is a vital step in making individual images conform to the average CDF of their class. This guarantees that images within a given class are uniform in terms of intensity distribution, which eases comparison and later analysis. The interpolation operation is defined as follows: \(I_{eq}(x,y) = InterpolateCDF(I(x,y), targetFreq, targetBins)\) where \(I_{eq}(x,y)\) represents the equalized pixel intensity at coordinates (x, y), and InterpolateCDF(I(x, y), targetFreq, targetBins) is the interpolation function. The goal is to adjust the pixel values of each image so that they align with the target CDF, effectively normalizing the data.
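One plausible realization of InterpolateCDF is histogram matching via `np.interp`; this is a sketch under that assumption, not the authors' exact implementation, and the uniform target CDF is only for illustration (the study matches against the class ACDF):

```python
import numpy as np

def match_to_cdf(img, target_cdf, bins=256):
    """Remap pixel intensities so the image's CDF follows target_cdf."""
    hist, _ = np.histogram(img.ravel(), bins=bins, range=(0, bins))
    src_cdf = hist.cumsum().astype(float)
    src_cdf /= src_cdf[-1]
    # For each source intensity, pick the target intensity with the same CDF value.
    lut = np.interp(src_cdf, target_cdf, np.arange(bins))
    return lut[img].astype(np.uint8)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
uniform_target = np.linspace(0, 1, 256)  # illustrative stand-in for the class ACDF
matched = match_to_cdf(img, uniform_target)
```

Building a 256-entry lookup table once and indexing with `lut[img]` keeps the remapping vectorized over the whole image.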

Extract the ROIs using masks

The ROIs are important to the analysis, since they represent the parts of the images that contain important information36,37. They are obtained by means of binary masks, where a value of 1 represents the region of interest and 0 indicates background. The image is multiplied element-wise with the binary mask, so that the areas of interest are kept while the background is set to zero.

Contours handling

Contour analysis is a pivotal step in refining ROIs and identifying object boundaries within images38,39. Contours represent the boundaries of objects and offer valuable properties like area, perimeter, and centroid40. The centroid \((C_x,C_y)\) of an object within a contour is calculated using moments, which are mathematical descriptors of the shape and spatial distribution of an object. The centroid coordinates are defined as: \(C_x=\frac{M_{10}}{M_{00}}\) and \(C_y=\frac{M_{01}}{M_{00}}\) where \(C_x\) and \(C_y\) are the x and y coordinates of the centroid, \(M_{10}\) and \(M_{01}\) are the first-order moments, and \(M_{00}\) is the zeroth moment (i.e., the total area of the contour). Contour analysis allows us to precisely locate object boundaries, measure object properties, and define buffer regions, which is essential for various image analysis tasks. The contours are taken at different distances from 0 (representing the original mask) up to 1500. Figure 4 shows a visualization of the extracted ROIs on a sample.

Figure 4

Contours handling: visualization of the extracted ROIs on a sample, showing the ROIs at distances from 0 (representing the original mask) up to 1500.

Features extraction phase

In the field of image processing and texture analysis, feature extraction plays a crucial role in quantifying the characteristics of an image. The study extracted both first-order and second-order features using GLCM (Gray-Level Co-occurrence Matrix) and GLRLM (Gray-Level Run-Length Matrix) methods for each contour resulting from the pre-processing step41.

GLCM is a statistical method used to capture the spatial relationships between pixel values in an image. It is defined based on the co-occurrence of pairs of pixel values at various distances and angles in the image42. GLCM is typically calculated for a given gray-level image I, with discrete gray levels \(\{0, 1,\ldots , L-1\}\). Second-order texture features consider the spatial relationships between pixel pairs in an image and provide information about how pixel intensities are distributed relative to each other. The equations of the first- and second-order features are presented in the appendix.

Scaling phase

The current study utilized several data scaling techniques to preprocess the dataset effectively, enhancing the performance of ML models. These scaling methods are crucial for ensuring that features are on compatible scales and for optimizing the behavior of various algorithms during the analysis43,44. The study employed three main scaling techniques: Standardization (Z-score scaling), Min-Max scaling (normalization), and Max Absolute scaling.

Standardization (Eq. 1) transforms data to have a mean of 0 and a standard deviation of 1. It is beneficial when dealing with features of different units, making them comparable by subtracting the mean and dividing by the standard deviation45. Min-Max scaling (Eq. 2) scales data to a specified range, often between 0 and 1. It maintains relative relationships between data points by subtracting the minimum value and dividing by the range46. Max Absolute (Eq. 3) scaling scales data based on the maximum absolute value within each feature, maintaining the sign of the data while restricting it within a consistent range47.

$$\begin{aligned} X_{i_{new}}= & {} \frac{X_i-\mu _i}{\sigma _i}. \end{aligned}$$
(1)
$$\begin{aligned} X_{i_{new}}= & {} \frac{X_{i}-X_{i_{min}}}{X_{i_{max}}-X_{i_{min}}}. \end{aligned}$$
(2)
$$\begin{aligned} X_{i_{new}}= & {} \frac{X_{i}}{\max (|X_{i}|)}. \end{aligned}$$
(3)
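Equations (1) to (3) correspond directly to scikit-learn's standard scalers; a minimal sketch on toy feature vectors:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Toy feature matrix: 3 samples, 2 features on very different scales.
X = np.array([[1.0, -5.0],
              [2.0, 0.0],
              [3.0, 10.0]])

X_std = StandardScaler().fit_transform(X)  # Eq. (1): zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)     # Eq. (2): rescale each feature to [0, 1]
X_ma = MaxAbsScaler().fit_transform(X)     # Eq. (3): divide by the max absolute value
```

Note that MaxAbsScaler preserves the sign and any sparsity of the data, whereas Min-Max scaling shifts all values into the non-negative range.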

Classification and optimization phase

The present study used a diverse set of classification algorithms to analyze the data. These include LightGBM (LGBM), Histogram-based Gradient Boosting (HGB), XGBoost (XGB), AdaBoost, Random Forest (RF), Multi-Layer Perceptron (MLP), Decision Tree (DT), logistic regression (LR), support vector machine (SVM), and K-nearest neighbors (KNN)48. This wide selection was chosen to evaluate the algorithms’ performance and suitability for the particular classification task at hand49,50,51.

LGBM, known for its speed and accuracy, was used to efficiently handle large datasets through its leaf-wise decision tree growth. HGB uses histogram-based binning to optimize computation and memory usage, which is particularly useful for large datasets52,53.

XGB, a versatile gradient boosting algorithm, was also included because of its ability to handle complex data relationships54. AdaBoost is an ensemble method focused on combining weak learners; it uses an iterative approach to improve classification accuracy55. RF is used for its robustness and flexibility across data types for classification and regression tasks56.

Furthermore, MLP, a neural network with multiple interconnected layers, was used to tackle complex, nonlinear data57. DT provided a straightforward yet powerful method for data partitioning based on key features, offering interpretability58. LR served as a simple yet effective baseline model, particularly suited for binary and multiclass classification tasks59.

SVM was leveraged for its versatility in handling high-dimensional data and linear or nonlinear classification tasks60. Finally, KNN offered an intuitive approach to classification, considering the majority class among the nearest neighbors of data points61.

The current study utilized Bayesian optimization with the Tree of Parzen Estimators (TPE) to optimize ML models, a powerful and efficient approach for hyperparameter tuning. Tree of Parzen Estimators is a probabilistic model-based optimization algorithm that effectively navigates the hyperparameter search space to find optimal configurations for ML models62.

TPE is particularly well-suited for hyperparameter optimization because it leverages a probabilistic model to make informed decisions about where to explore the hyperparameter space. Its workflow can be summarized as follows62,63:

  • Modeling probability distributions: TPE begins by modeling the probability distributions of the hyperparameters. It maintains two distributions, one for promising configurations (i.e., exploitation) and another for less promising ones (i.e., exploration).

  • Sampling: the algorithm then samples hyperparameters from these distributions. It does so in a way that favors promising regions based on the exploitation distribution but also explores other areas based on the exploration distribution.

  • Evaluating the objective function: the sampled hyperparameters are used to train and evaluate the ML model using a chosen objective function, such as accuracy or loss. The performance of the model is recorded.

  • Updating distributions: based on the performance of the sampled configuration, TPE updates the probability distributions for both exploitation and exploration. It allocates more samples to regions of the hyperparameter space that have shown promise.

  • Iterative process: TPE iteratively repeats the process of sampling, evaluating, and updating distributions over a predefined number of iterations. This allows it to gradually refine its search and converge towards the optimal hyperparameters.

  • Final configuration: at the end of the optimization process, TPE provides the best-found hyperparameters, which can then be used to train the final ML model.

TPE is known for its efficiency in finding near-optimal hyperparameter configurations with a relatively small number of model evaluations. It is particularly valuable when the hyperparameter space is high-dimensional or when manual tuning becomes impractical63. Table 1 presents the different hyperparameters for each ML classifier in the current study.

Table 1 The different hyperparameters for each utilized ML classifier in the current study.

Performance evaluation phase

In the present study, different performance metrics were used to evaluate the effectiveness of the ML models in the AMD classification task. These metrics play an important role in assessing model quality and in guiding decisions on model selection and optimization50,64.

The confusion matrix is a tabular representation of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It is important for evaluating model performance and for deriving other metrics such as accuracy and the F1-score65,66,67. Accuracy is a widely used metric that measures the proportion of correctly classified observations across all observations68. It provides a general understanding of the efficiency of the classification model but can be misleading on imbalanced datasets.

Sensitivity indicates how well the model identifies positive cases69. It calculates the proportion of true positives out of all actual positives and is important when reducing false negatives is paramount, as in clinical research70. Specificity examines how well the model detects negative cases71. It calculates the proportion of true negatives out of all actual negatives and is important when avoiding false positives matters, as in applications such as fraud detection72.

The receiver operating characteristic (ROC) curve is a graphical representation that illustrates a model’s performance across different thresholds73. It plots the TP rate against the FP rate at various threshold settings, with the area under the ROC curve (AUC-ROC) quantifying the model’s ability to distinguish between positive and negative instances74,75. Balanced accuracy (BAC) is an adjusted accuracy measure used for imbalanced datasets; it provides a more reliable performance assessment under skewed class distributions. Equations (4) to (10) present the utilized metrics.

$$\begin{aligned}{} & {} \text {Accuracy} = \frac{\text {TP}+\text {TN}}{\text {TP}+\text {FP}+\text {FN}+\text {TN}} \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \text {Specificity} = \frac{\text {TN}}{\text {FP}+\text {TN}} \end{aligned}$$
(5)
$$\begin{aligned}{} & {} \text {Recall (or Sensitivity)} = \frac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$
(6)
$$\begin{aligned}{} & {} \text {Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}} \end{aligned}$$
(7)
$$\begin{aligned}{} & {} \text {ROC} = \frac{1}{\sqrt{2}} \times \sqrt{(\text {Sensitivity} ^ 2 + \text {Specificity} ^ 2)} \end{aligned}$$
(8)
$$\begin{aligned}{} & {} \text {F1-score} = \frac{2 \times (\text {Precision} \times \text {Recall})}{\text {Precision}+\text {Recall}} \end{aligned}$$
(9)
$$\begin{aligned}{} & {} \text {Balanced Accuracy (BAC)} = \frac{\text {Recall}+\text {Specificity}}{2} \end{aligned}$$
(10)
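Given binary confusion-matrix counts, Eqs. (4) to (10) can be computed directly. The sketch below uses hypothetical counts purely for illustration:

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the study's metrics (Eqs. 4-10) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (4)
    specificity = tn / (fp + tn)                                  # Eq. (5)
    sensitivity = tp / (tp + fn)                                  # Eq. (6), recall
    precision = tp / (tp + fp)                                    # Eq. (7)
    roc = ((sensitivity**2 + specificity**2) / 2) ** 0.5          # Eq. (8)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (9)
    bac = (sensitivity + specificity) / 2                         # Eq. (10)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "roc": roc, "bac": bac}

# Hypothetical counts for a single binary split (illustration only)
m = binary_metrics(tp=8, tn=9, fp=1, fn=2)
```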

To reach a precise decision, a weighted sum metric (WSM) is used, as presented in Eq. (11). It combines the aforementioned performance metrics into a single comprehensive measure of model performance. The WSM is designed to reflect the overall effectiveness of the models76.

$$\begin{aligned} \text {WSM} = w_1 \times \text {Accuracy} + w_2 \times \text {Sensitivity} + w_3 \times \text {Specificity} + w_4 \times \text {Precision} + w_5 \times \text {F1} + w_6 \times \text {ROC} + w_7 \times \text {BAC} \end{aligned}$$
(11)

By assigning weights to the individual metrics such as accuracy, sensitivity, and F1, the WSM can be aligned with the specific objectives of the classification task. In Eq. (11), \(w_1\) to \(w_7\) represent the weights assigned to each respective performance metric. This WSM provides a clear and interpretable way to balance trade-offs between different types of classification errors77. It enables decision-makers to make informed choices about the model performance that align with the study goals.
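As a sketch, Eq. (11) can be applied to the majority-voting results reported in the Experiments section. The equal weights \(w_1 = \dots = w_7 = 1/7\) used here are an assumption for illustration; the exact weight values are not restated in this excerpt:

```python
def wsm(metrics, weights):
    """Eq. (11): weighted sum of the seven performance metrics."""
    order = ["accuracy", "sensitivity", "specificity",
             "precision", "f1", "roc", "bac"]
    return sum(w * metrics[k] for w, k in zip(weights, order))

# Majority-voting results reported in the Experiments section (in %)
reported = {"accuracy": 96.85, "sensitivity": 93.72, "specificity": 97.89,
            "precision": 93.86, "f1": 93.72, "roc": 95.85, "bac": 95.81}

# Equal weighting (assumed, illustrative)
score = wsm(reported, [1 / 7] * 7)
```

With equal weights, these metrics yield approximately 95.39, close to the reported WSM of 95.38 — consistent with, though not confirming, an equal-weight configuration.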

Experiments

For all experiments, the reported performance metrics for the various ML models are: accuracy, sensitivity, specificity, precision, F1, ROC, BAC, and WSM. Also, as mentioned, the experimental protocols were approved by the authors’ and patients’ institutions: the University of Louisville and Mansoura University.

Table 2 presents the performance results of the implemented framework across different phases, each corresponding to a mask positioned at varying distances from 0 to 1500. The distances are measured in units that align with the dimensions of the mask. The table provides a detailed overview of various evaluation metrics for each configuration, shedding light on the framework’s efficacy under different spatial settings. The “Distance” column specifies the distance of the mask from its original position, and the “Combinations” column denotes the specific combinations of parameters used in each experiment.

The subsequent columns contain performance metrics, including Accuracy (ACC), Sensitivity (SNS), Specificity (SPC), Precision (PRC), F1 score (F1), Receiver Operating Characteristic (ROC) score, Balanced Accuracy (BAC), and Weighted Sum Metric (WSM). Examining the results, notable trends emerge. For instance, as the distance increases, there is a discernible impact on metrics such as accuracy, sensitivity, and specificity.

The table indicates that the performance varies across different combinations of parameters and distances. Notably, at 1500 units, the framework achieves impressive results across all metrics, indicating its robust performance when the mask is positioned farther from its original location. Tables 5, 6, 7, 8, 9, 10 and 11 in the Appendices present the inner details of each row/distance.

Table 2 The performance results obtained by implementing the framework phases with the mask positioned at polygon distances from 0 to 1500, where distance 0 corresponds to the mask itself.

The experiment conducted at a distance of 1500 units stands out as the most noteworthy configuration in Table 2. In this setting, the mask is positioned at a considerable distance from its original location, and the framework achieves outstanding performance across all evaluated metrics. With an impressive accuracy (ACC) of 96.31%, the model demonstrates a high degree of correctness in its predictions. Moreover, the sensitivity (SNS) and specificity (SPC) scores, which measure the model’s ability to identify positive and negative instances, are notably high at 92.65% and 97.54%, respectively. The precision (PRC) and F1 score (F1) further highlight the framework’s precision and its balance between precision and recall. The receiver operating characteristic (ROC) score, balanced accuracy (BAC), and weighted sum metric (WSM) collectively reinforce the exceptional performance of the model in accurately classifying instances when the mask is positioned at this substantial distance. This result underscores the model’s resilience and effectiveness, particularly when faced with spatial variations in the positioning of the mask.

Majority voting: by applying weighted majority voting to the best classifiers for each polygon distance from 0 to 1500, the performance is enhanced, with an overall accuracy of 96.85%, sensitivity of 93.72%, specificity of 97.89%, precision of 93.86%, F1 score of 93.72%, ROC of 95.85%, BAC of 95.81%, and WSM of 95.38%. The improvement obtained by voting across the best per-distance classifiers suggests that the model learns different patterns at different polygon distances. This matters because the model is then more likely to generalize to new instances, even when their polygon distances differ from those in the training instances.
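The voting scheme can be sketched as below. The weights here are hypothetical validation accuracies, since the exact per-classifier weighting used in the study is not restated in this section:

```python
from collections import Counter

def weighted_majority_vote(labels, weights):
    """Weighted majority voting: each classifier's predicted label
    contributes its weight (e.g., a validation accuracy) to the tally;
    the label with the largest weighted tally wins."""
    tally = Counter()
    for label, weight in zip(labels, weights):
        tally[label] += weight
    return max(tally, key=tally.get)

# Hypothetical votes from three per-distance best classifiers
final = weighted_majority_vote(["Wet", "GA", "Wet"], [0.95, 0.90, 0.92])
```

Note that a single strongly weighted classifier can outvote two weaker ones, which is the intended behavior of the weighted variant.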

Enhancing interpretation through contour overlay: in image analysis and classification, the overlay of contours on an image plays an important role in improving the interpretation of the results. This method is particularly valuable when dealing with multiple classes, each of which is assigned a specific label.

Figure 5 displays the overlaid contours, where most of them are correctly identified, except for two. The Wet class is depicted in red, the GA class in green, and the Normal class in yellow, all with an opacity set to 0.1. This visual representation allows for a quick and intuitive assessment of which contours have been inaccurately diagnosed. By assigning distinct colors to each class, it becomes evident which specific classes are affected by misclassifications.

Figure 5
figure 5

Visualization of an image with overlaid contours, where most of them are correctly identified, except for two. The Wet class is depicted in red, the GA class in green, and the Normal class in yellow, all with an opacity set to 0.1. White contours have been incorporated to enhance the visualization.

While the visual identification of incorrectly diagnosed contours is vital, it is also essential to consider the broader context. Figure 6 demonstrates an overlaid image with different contours, all correctly diagnosed. This comprehensive visualization not only assures the accuracy of individual contours but also provides confidence in the overall diagnosis of the entire image.

Figure 6
figure 6

Visualization of an overlaid image featuring the Wet class with accurately diagnosed contours. The overlay is in red with an opacity of 0.1. White contours have been incorporated to enhance the visualization.
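The low-opacity, class-colored overlay used in Figs. 5 and 6 can be sketched as a simple alpha blend. The helper below is illustrative (NumPy assumed) and omits the white contour outlines:

```python
import numpy as np

# Class colors following the figures' scheme (RGB), opacity 0.1 as in the paper
CLASS_COLORS = {"Wet": (255, 0, 0), "GA": (0, 255, 0), "Normal": (255, 255, 0)}

def overlay_class(image, mask, class_name, alpha=0.1):
    """Alpha-blend a class-colored fill over the masked contour region."""
    out = image.astype(np.float64).copy()
    color = np.array(CLASS_COLORS[class_name], dtype=np.float64)
    out[mask] = (1 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)

# Tiny synthetic example: blend a "Wet" region onto a black image
img = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
blended = overlay_class(img, mask, "Wet")
```

The low alpha keeps the underlying fundus detail visible while still color-coding each diagnosed region.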

Overall discussion

In our study, we have presented a detailed Computer-Aided Diagnosis (CAD) framework for Age-related Macular Degeneration (AMD) analysis. The CAD system is designed to assist in the diagnosis of AMD through a multi-stage process involving data acquisition, pre-processing, feature extraction, scaling, classification, and optimization. The entire methodology is encapsulated within a systematic framework, as illustrated in Fig. 1.

The framework begins with the data acquisition phase, where necessary information is obtained. This is followed by the pre-processing stage, where various techniques such as Average CDF calculation, CLAHE enhancements, interpolation, ROI extraction, and contour analysis are applied to improve data quality and prepare it for analysis. The feature extraction phase involves extracting meaningful patterns from the pre-processed data using Gray-Level Co-occurrence Matrix (GLCM) and Gray-Level Run-Length Matrix (GLRLM) methods. This step is crucial in quantifying the characteristics of the images.
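A minimal sketch of the GLCM step is shown below, using a single pixel offset and two representative texture features; the study’s exact offsets, quantization levels, and full feature set are not restated here, and GLRLM features would be computed analogously from run lengths:

```python
import numpy as np

def glcm_features(img, dx=1, dy=0, levels=8):
    """Build a normalized gray-level co-occurrence matrix for one pixel
    offset and derive two classic texture features."""
    h, w = img.shape
    glcm = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[img[y, x], img[y + dy, x + dx]] += 1
    glcm /= glcm.sum()
    i, j = np.indices((levels, levels))
    contrast = float(np.sum(glcm * (i - j) ** 2))
    homogeneity = float(np.sum(glcm / (1 + np.abs(i - j))))
    return contrast, homogeneity

# A flat patch has zero contrast and perfect homogeneity
c, hgy = glcm_features(np.zeros((4, 4), dtype=int))
```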

In our research, the choice of feature extraction methods was driven by a consideration of the unique characteristics of the data and the specific requirements of the AMD diagnosis task. For capturing global appearance markers from the entire fundus image, we opted for techniques such as average cumulative distribution function (ACDF) calculation and contrast limited adaptive histogram equalization (CLAHE). These methods are adept at providing an understanding of pixel intensity distributions and enhancing contrast throughout the entire fundus image. This global perspective is essential for ensuring that the diagnostic model considers the overall structure and characteristics of the retina.
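The ACDF marker can be sketched as the mean of per-image empirical intensity CDFs. This is an illustrative reading of the method (NumPy assumed); the paper’s exact formulation may differ:

```python
import numpy as np

def intensity_cdf(img, levels=256):
    """Empirical CDF of pixel intensities for one image."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(np.float64)
    return np.cumsum(hist) / hist.sum()

def average_cdf(images, levels=256):
    """Average the per-image CDFs across a set of fundus images."""
    return np.mean([intensity_cdf(im, levels) for im in images], axis=0)

# Two synthetic extremes: an all-dark and an all-bright image
dark = np.zeros((8, 8), dtype=np.uint8)
bright = np.full((8, 8), 255, dtype=np.uint8)
acdf = average_cdf([dark, bright])
```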

For local appearance markers, particularly in the optical disc section, we employed techniques such as region of interest (ROI) extraction and contour analysis. By focusing on specific regions of interest within the image and using binary masks and contour analysis, we aimed to capture detailed local information. This emphasis on local features is critical for identifying subtle patterns or abnormalities around the optical disc, contributing to a nuanced and accurate AMD diagnosis.

After feature extraction, the data goes through a scaling phase, where different scaling techniques such as standardization, min-max scaling, and max absolute scaling are applied to ensure that features are on compatible scales, optimizing the behavior of various machine learning (ML) algorithms.
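The three scaling techniques can be sketched column-wise as follows (equivalent in spirit to scikit-learn’s StandardScaler, MinMaxScaler, and MaxAbsScaler):

```python
import numpy as np

def standardize(X):
    """Zero mean, unit variance per feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    """Rescale each feature to the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def max_abs(X):
    """Divide each feature by its maximum absolute value."""
    return X / np.abs(X).max(axis=0)

# Toy feature matrix: two samples, two features on very different scales
X = np.array([[1.0, 20.0], [3.0, 40.0]])
```

In practice the scaler is fit on the training split only and then applied to the test split, to avoid information leakage.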

The heart of the CAD system lies in the classification and optimization phase, where a variety of ML algorithms, including LightGBM, Histogram-based Gradient Boosting, XGBoost, AdaBoost, Random Forest, Multi-Layer Perceptron, Decision Tree, Logistic Regression, Support Vector Machine, and K-nearest neighbor, are employed. The Tree of Parzen Estimators (TPE) is used for hyperparameter optimization, ensuring the ML models are finely tuned for the specific task.

Finally, the performance evaluation phase employs various metrics such as confusion matrix, accuracy, sensitivity, specificity, ROC, F1-score, and balanced accuracy (BAC) to comprehensively evaluate the effectiveness of the ML models in AMD classification. These metrics are not only presented individually but are also combined into a Weighted Sum Metric (WSM) to provide an insightful measure of model performance.

Concerning the consideration of using transformers and convolutions for feature extraction, it is important to note that our decision was influenced by several factors. In the context of medical imaging, obtaining large labeled datasets for training deep learning models can be challenging. Conventional methods remain robust and effective even with smaller datasets. Additionally, the interpretability of results is a crucial consideration in medical contexts. The use of ACDF, CLAHE, and contour analysis provides transparency and interpretability, allowing healthcare professionals to understand the features contributing to a diagnosis. Furthermore, computational efficiency is a key factor, and deep learning models, particularly those involving Transformers and convolutions, may demand substantial computational resources. The simplicity of our chosen methods ensures computational efficiency without compromising the diagnostic accuracy required for AMD diagnosis. While deep learning approaches have shown success in various domains, the specific requirements of our AMD diagnosis task, including interpretability, dataset size considerations, and computational efficiency, led us to choose the outlined feature extraction methods.

Limitations

Several limitations should be noted in this study. The effectiveness of the CAD system relies on data quality and quantity, including variations in image quality and demographic representation, potentially affecting its generalizability. Labeling fundus images with AMD stages can introduce inter-observer variability, impacting ML model training and evaluation. Additionally, the study primarily focuses on broad AMD stage classification, excluding finer subtypes. External validation in diverse clinical settings is necessary to confirm real-world applicability. Regulatory approvals, clinical integration, and addressing ethical concerns are essential for the CAD system’s responsible deployment. Deep learning models lack transparency, and the CAD system should always complement clinical expertise. Long-term follow-up and adaptation to geographic variability are areas for future exploration.

Despite these limitations, this research represents a significant step towards enhancing AMD diagnosis through ML. Addressing these challenges and conducting further research can contribute to the continued improvement and responsible implementation of AI-driven diagnostic tools for AMD.

Conclusions and future directions

This study has critically examined the effective application of ML methods to accurately classify the AMD stage from fundus images. All aspects of AMD diagnosis are discussed, from data acquisition and preprocessing to feature extraction and the selection of ML algorithms. The goal was to develop a non-invasive CAD system that maximizes the accuracy and clinical utility of AMD classification. By applying weighted majority voting on the best classifiers, the performance is enhanced with an overall accuracy of 96.85%, sensitivity of 93.72%, specificity of 97.89%, precision of 93.86%, F1 score of 93.72%, ROC of 95.85%, BAC of 95.81%, and WSM of 95.38%. These results suggest a successful CAD system that can play an important role in the early detection and diagnosis of AMD.

One of the noteworthy outcomes of this study is the improvement in the classification of AMD stages. Through rigorous experimentation and analysis, an advance in the accuracy and reliability of categorizing AMD into geographic atrophy (GA), intermediate AMD, normal, and wet AMD categories has been achieved. This enhanced precision is a critical step towards facilitating appropriate treatment strategies for patients. Furthermore, intricate patterns and relationships within fundus images have been illuminated by the study. These patterns enable the identification of subtle indicators of different AMD phases, which in turn can aid in early intervention and treatment. Through the automation of the AMD diagnosis process and the reduction of inter-observer discrepancies, enhanced patient care is facilitated by our CAD system.

Looking ahead, several promising directions for future research in this field can be explored. Firstly, further optimization of the ML models can be investigated. Techniques like hyperparameter tuning and the integration of more advanced techniques such as transformers and deep learning can potentially boost the system’s accuracy even further. Additionally, the scalability and applicability of the CAD system to larger datasets and diverse populations can be explored. Robustness across different demographic groups and geographic regions can ensure its broad clinical utility. Moreover, the performance of the CAD system can be enhanced by the incorporation of more extensive image datasets and advanced imaging technologies. This could involve the utilization of high-resolution images and multimodal data fusion for a more comprehensive assessment. In terms of clinical application, validation studies involving real-world patient data and collaboration with healthcare institutions can validate the effectiveness of the CAD system in a clinical setting. Regulatory approval and integration into routine clinical practice would be significant milestones. Lastly, ongoing research into the underlying mechanisms of AMD, including genetics, biomarkers, and treatment options, can complement the diagnostic capabilities of the CAD system. Combining ML-based diagnosis with cutting-edge treatment strategies can usher in a new era of precision medicine for AMD.