Introduction

Glaucoma is the second leading cause of irreversible blindness and the fourth leading cause of Moderate and Severe Vision Impairment (MSVI) worldwide [1]. Glaucoma prevalence increases with age, and it is one of the main causes of blindness and MSVI in people older than 50 [1].

The number of people blind due to glaucoma increased from 2.5 million in 1990 to 3 million in 2015. Moreover, the number of people with MSVI due to glaucoma rose from 3 million to 4 million over the same period [2].

It has been predicted that the number of people aged 40 to 80 years with glaucoma will increase from 64.3 million in 2013 to 76 million in 2020 and 111.8 million in 2040 [3].

Previous studies have identified several risk factors for glaucoma, such as age, gender, race, Intraocular Pressure (IOP), diabetes, and family history [4,5,6,7].

Across studies of populations older than 40, the reported prevalence of glaucoma ranges from 1.44% to 4%, and more than 50% of patients with glaucoma are unaware of their illness [5, 8, 9]. The risk factors identified for glaucoma in Iran are age, IOP, diabetes, axial length, and gender [5].

Glaucoma is usually an asymptomatic disorder: until severe visual impairment occurs, patients are often unaware of their disease. For this reason, some physicians call glaucoma the silent thief of sight [10].

Blindness caused by glaucoma is irreversible, but early treatment such as lowering IOP and surgical interventions can help control disease progression. Early detection and diagnosis of glaucoma and identification of high-risk groups can therefore reduce its irreversible adverse effects.

Artificial intelligence and machine learning methods have various applications in solving medical and healthcare problems [11,12,13], including in ophthalmology [14]. Developing automated methods for ophthalmological diseases [15], analysing ophthalmologic images [16], performing network analysis of gene expression data for eye diseases [17], predicting the progression of ophthalmologic diseases [18], and evaluating eye disease progression [19,20,21] are some of these applications.

Automatic methods for predicting and diagnosing glaucoma can be used as computer-assisted diagnosis (CAD) and decision support system (DSS) tools to improve the accuracy of glaucoma diagnosis and management. The main aim of this study is to design and develop machine learning models for glaucoma prediction. For this purpose, demographic characteristics, optometry, biometry, and perimetry features, and ophthalmologic examination results are used as the input variables.

The main novelties of this study are as follows:

  • Proposing and designing a two-step classification task: the first step trains base (single) classifiers on the training dataset, evaluates their performance on a subset of the training dataset (the validation dataset), and identifies the superior classifiers; the second step designs a novel stacked ensemble classifier based on the superior classifiers.

  • Using comprehensive ophthalmology features such as perimetry and biometry to develop the glaucoma prediction model without any fundus features; the features in this study come from non-interventional ophthalmologic examinations.

  • Addressing a highly imbalanced dataset with only 87 instances in the glaucoma class (1.9% of all instances) without generating artificial instances.

  • Analysing a cohort dataset.

Literature review

Different methods based on artificial intelligence and machine learning have been applied to glaucoma management in recent years [22], for instance for building glaucoma interaction networks [17], assessing the optic disc [23, 24], detecting visual field progression [19,20,21], and diagnosis and screening [10, 11, 25,26,27,28,29,30,31,32,33].

Models that screen for and diagnose diseases using structured datasets and complex data such as medical images and gene expression data can assist physicians in the early management and diagnosis of glaucoma [10, 11, 31, 32].

Table 1 summarizes the related works considering model development for screening and/or diagnosis of glaucoma.

Table 1 Summary of the previous studies considering glaucoma prediction, screening and/or diagnosis

As listed in Table 1, several previous studies have proposed models for diagnosing and screening glaucoma using different features and datasets. These models can assist physicians in early diagnosis and decision-making. In most previous studies, fundus images were used as the primary input data for developing glaucoma diagnosis and screening models. In a few studies, genome data were used to improve such models, because genetics and race are two common risk factors for glaucoma identified in previous research. Although fundus images and genome data yield excellent performance for diagnosing glaucoma, these data types increase the cost and complexity of the models. Combining these two data types with structured data from other ophthalmologic examinations and clinical data can improve model performance.

This study used different features, including demographic characteristics, optometry, biometry, perimetry, and ophthalmologic examination results, for glaucoma prediction.

Materials and methods

In this study, the ‘Shahroud eye cohort study’ dataset [34] is analyzed to predict glaucoma, following the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology [35]. Figure 1 shows the main steps of this study and the proposed method for glaucoma prediction.

Fig. 1
figure 1

The main steps of the proposed method for glaucoma prediction from the Shahroud eye cohort study dataset

In this study, all preprocessing, modeling, and evaluation were performed in Python, and visualization was done in Python and Microsoft Excel.

The steps shown in Fig. 1 are described in more detail in the following subsections:

Data description

The Shahroud eye cohort study was started in 2009 to detect and diagnose visual impairments and eye diseases in Shahroud, northeast Iran [34].

In the first phase of the Shahroud eye cohort study, 6311 people aged 40-64 years were selected by random cluster sampling. Among them, 5190 individuals, including 2990 females, participated in the study. Fundus imaging and perimetry examinations were performed for 4694 participants, and ophthalmologists assessed the fundus and perimetry images to detect glaucoma. Eighty-nine participants were diagnosed with glaucoma, and the prevalence of glaucoma was estimated at 1.92% of the population [5].

Table 2 lists the demographic characteristics, optometry, ophthalmology, biometry, and perimetry examinations for the contributors in the first phase of the Shahroud eye cohort study.

Table 2 List of the considered variables in this study

However, participants with more than 30% missing values were excluded from this study. Therefore, 67 variables describing 4474 non-glaucoma persons and 87 glaucoma persons are considered in this study.

Data preprocessing

The dataset was partitioned into two non-overlapping subsets: a training dataset and a test dataset. For this purpose, K-fold Cross-Validation (C.V.) was used with K=5 and K=10. The data preprocessing and preparation steps in this study fall into two main categories: data cleaning and balanced sampling. Data cleaning comprises outlier detection and removal, missing value handling, data normalization, and one-hot encoding. For class balancing, three different scenarios were evaluated and compared: the first used the imbalanced data as-is, while the second and third applied over-sampling and under-sampling-based strategies, respectively, to balance the training dataset.
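
As an illustration, the following sketch shows how such a partition can be produced, assuming scikit-learn; stratified folds are an assumption here, since the text only states that K-fold C.V. with K=5 and K=10 was used, and X and y are illustrative names.

```python
# Minimal sketch of the train/test partitioning step (assumes scikit-learn).
from sklearn.model_selection import StratifiedKFold

def kfold_splits(X, y, k=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for K-fold cross-validation,
    keeping the rare glaucoma class represented in every fold."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    yield from skf.split(X, y)
```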

For outlier detection, the numerical variables were analysed using the interquartile range (IQR), a commonly used outlier detection method [36]. According to this method, no outliers were detected in the numerical variables.
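
A minimal sketch of this screening step, assuming pandas and the conventional 1.5*IQR fences, is:

```python
# IQR-based outlier screening for one numerical variable (assumes pandas).
import pandas as pd

def iqr_outlier_mask(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - factor*IQR, Q3 + factor*IQR] as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)
```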

Categorical nominal variables were converted to dummy binary variables. Missing value imputation was performed for variables with a missing value rate lower than 30%, and the remaining variables were excluded from the study. Missing values were replaced with the mean for numerical variables and the mode for binary variables.
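
A minimal sketch of the 30% cut-off and the mean/mode imputation rule, assuming pandas (the column groupings are illustrative), is:

```python
# Missing-value handling as described above (assumes pandas).
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Drop variables whose missing-value rate exceeds the cut-off."""
    return df.loc[:, df.isna().mean() <= max_missing]

def impute(df: pd.DataFrame, numerical_cols, binary_cols) -> pd.DataFrame:
    """Replace missing numerical values with the mean and binary values with the mode."""
    df = df.copy()
    for col in numerical_cols:
        df[col] = df[col].fillna(df[col].mean())
    for col in binary_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```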

The features in the analysed dataset have different units, come from different examinations, and span different value ranges. Some learning algorithms, such as distance-based methods like K-Nearest Neighbors (K-NN) and kernel machines like Support Vector Machines (SVM), are sensitive to feature ranges; this sensitivity can bias models toward features with larger variations. Min-Max normalization was therefore applied so that variables with wide ranges do not dominate those with narrow ranges, which also improves the evaluation metrics.
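
A minimal sketch of this normalization step, assuming scikit-learn (the feature values, e.g. axial length in mm and IOP in mmHg, are illustrative), is:

```python
# Min-Max normalization fitted on the training fold only (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[23.1, 14.0], [24.5, 21.0], [22.8, 17.0]])
X_test = np.array([[23.9, 19.0]])

scaler = MinMaxScaler()                    # rescales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse training min/max to avoid leakage
```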

Since the prevalence of glaucoma in the analysed dataset was about 1.9%, the class distribution was significantly imbalanced: about 98% of the considered persons belong to the non-glaucoma group (majority class), and only 1.9% are assigned to the glaucoma class (minority class). Previous studies have shown that classifiers trained on imbalanced datasets can achieve high accuracy on the majority class while failing to learn the minority class accurately [37].

To overcome the imbalanced class distribution in this study, three strategies applied in previous studies were considered: under-sampling, over-sampling, and a combination of over-sampling and under-sampling [37].

In this study, three scenarios were designed and compared to address the challenges of the imbalanced dataset. The first sampled from the data without balancing the class distribution (Scenario 1). The second used over-sampling of the minority class (Scenario 2). The third used the balanced bagging ensemble method, which was proposed to overcome imbalanced-dataset challenges in a previous study (Scenario 3) [38]. A bagging ensemble with a large number of estimators helps guarantee that every majority-class observation contributes to training at least one of the estimators.
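
A minimal sketch of the balanced bagging scenario, assuming the imbalanced-learn package (the default base estimator of BalancedBaggingClassifier is a decision tree, and the number of estimators mirrors the value reported in the Modeling subsection), is:

```python
# Balanced bagging for the imbalanced glaucoma dataset (assumes imbalanced-learn).
from imblearn.ensemble import BalancedBaggingClassifier

# Each bootstrap sample is re-balanced by randomly under-sampling the majority
# (non-glaucoma) class before a decision tree is fitted on it, so the rare
# glaucoma class is well represented in every estimator.
bbag = BalancedBaggingClassifier(n_estimators=300, random_state=0)
# bbag.fit(X_train, y_train); y_proba = bbag.predict_proba(X_test)
```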

Modeling

The classifiers considered in this study include the well-known Decision Tree (DT) [39], Support Vector Machines, K-Nearest Neighbors, and ensemble classifiers that use DT as their base classifier, such as Extra Trees (ET) [40] and Random Forests (RF) [41]. These classifiers have different advantages and were used as base classifiers in all three scenarios. Grid search was used to tune the hyperparameters of the classifiers and to choose the resampling ratio. For DT and the DT-based ensembles (RF and ET), the splitting criterion was information gain, and the number of trees was 200. The SVM kernel function was the Radial Basis Function (RBF) in Scenarios 1 and 2 and polynomial in Scenario 3. The number of neighbours in K-NN was seven, and the distance metric was Euclidean. In Scenario 2, the over-sampling ratio differed for each classifier and was determined by grid search. In Scenario 3, the number of estimators of the balanced bagging ensemble was 300; this number of base estimators helps guarantee that every sample in the training set contributes to training at least one of the estimators and avoids overfitting.
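
A minimal sketch of the grid-search tuning described above, assuming scikit-learn (the grids and the scoring metric are illustrative and only echo the settings reported in this subsection), is:

```python
# Hyperparameter tuning with grid search (assumes scikit-learn).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"criterion": ["entropy"], "n_estimators": [100, 200, 300]},
    scoring="roc_auc",  # the study's exact selection metric is not restated here
    cv=5,
)
svm_search = GridSearchCV(
    SVC(probability=True),
    param_grid={"kernel": ["rbf", "poly"], "C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
# rf_search.fit(X_train, y_train); best_rf = rf_search.best_estimator_
```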

Evaluation

As illustrated in Fig. 1, the evaluation tasks include the K-fold C.V. strategy for sampling from the data, choosing the best scenario for handling imbalanced data, choosing the best classifiers, identifying the important features, building the proposed stacking ensemble model, and error analysis.

As mentioned in the preprocessing subsection, the data were partitioned into training and test datasets based on the K-fold C.V. strategy with K=5 and K=10. The best imbalance-handling scenario and classifiers were chosen by comparing different combinations of scenarios and classifiers based on their performance on the validation dataset. Then, the important features were identified from the feature rankings of the best classifiers. Afterward, the proposed two-layered stacking ensemble classifier was built, in which the first layer includes the selected classifiers and the second layer uses a logistic regression model.
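
A minimal sketch of this two-layered design, assuming scikit-learn (the first-layer models shown are illustrative stand-ins for the superior classifiers selected on the validation data), is:

```python
# Two-layered stacking ensemble with logistic regression on top (assumes scikit-learn).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, criterion="entropy")),
        ("dt", DecisionTreeClassifier(criterion="entropy")),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold first-layer predictions train the second layer
)
# stack.fit(X_train, y_train)
```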

For error analysis and for evaluating and comparing the classifiers and the different balancing scenarios, several standard performance measures were calculated and reported: accuracy, sensitivity, specificity, F1-score, and the area under the receiver operating characteristic (ROC) curve (AUC), as defined in Eqs. (1)-(4).

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)
$$Sensitivity= \frac{TP}{TP+FN}$$
(2)
$$Specificity= \frac{TN}{FP+TN}$$
(3)
$$F_1\text{-}score= \frac{TP}{TP+ \frac{(FP+FN)}{2}}$$
(4)
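
For illustration, these measures can be computed from confusion-matrix counts as follows, assuming scikit-learn for the confusion matrix itself; the label vectors are illustrative.

```python
# Eqs. (1)-(4) computed from confusion-matrix counts (assumes scikit-learn).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # 1 = glaucoma, 0 = non-glaucoma
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # Eq. (1)
sensitivity = tp / (tp + fn)                    # Eq. (2)
specificity = tn / (fp + tn)                    # Eq. (3)
f1_score    = tp / (tp + (fp + fn) / 2)         # Eq. (4)
```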

Experimental results

For evaluating the proposed scenarios, every classifier in each scenario was executed 30 times, and the performance measures, including accuracy, sensitivity, specificity, and F1-Score, are reported in Table 3.

Table 3 Comparing the performance of the classifiers in each scenario

As shown in Table 3, the classifiers in the last scenario performed better in predicting both the non-glaucoma and glaucoma classes. Decision trees and random forests, which outperformed the other classifiers, were used as the base classifiers of the balanced bagging ensemble.

The ROC curves are illustrated in Fig. 2 to compare different scenarios and classifiers.

Fig. 2
figure 2

ROC curve for models in each scenario

Given the reasonable accuracy of DT and RF in the last scenario, the top-ranked features were selected based on the feature importance scores assigned to the variables by DT and RF. The 10 top-ranked variables (first quartile) for glaucoma prediction from the Shahroud eye cohort dataset, based on their average score over 300 executions of the classifiers, are listed in Table 4.
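
A minimal sketch of this ranking procedure, assuming scikit-learn (data and feature names are illustrative, and the number of runs mirrors the 300 executions mentioned above), is:

```python
# Rank variables by impurity-based importance averaged over repeated runs (assumes scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mean_importances(X, y, feature_names, n_runs=300):
    """Average feature_importances_ over n_runs differently seeded forests."""
    scores = np.zeros(X.shape[1])
    for seed in range(n_runs):
        rf = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                    random_state=seed).fit(X, y)
        scores += rf.feature_importances_
    scores /= n_runs
    return sorted(zip(feature_names, scores), key=lambda t: t[1], reverse=True)
```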

Table 4 Feature importance in the best models

As shown in Table 4, eight of the 10 top-ranked features identified by DT and RF are common to both classifiers: NVD, VCDR, WTW, Systolic BP, PCY, AST, Age, and AL, which come from different examination sources.

Finally, a novel two-layered stacking ensemble classifier is proposed, in which the first layer combines the two superior classifiers of the last scenario and the second layer uses logistic regression. Table 5 shows the performance measures of the proposed stacking ensemble classifier for glaucoma prediction.

Table 5 Performance of stacking models

Discussion

To assess the performance of the proposed stacking ensemble classifier, 30 executions of 10-fold C.V. were analysed. According to the experimental results, 3200 non-glaucoma persons (71.5%) and 59 glaucoma persons (67.8%) were correctly predicted in all executions. On the other hand, 368 non-glaucoma persons (8.2%) and 8% of the glaucoma persons were misclassified in all executions. Figure 3 shows, for each instance, in how many of the executions it was predicted correctly.

Fig. 3
figure 3

Number of true predictions for the data instances in 30 executions of 10-fold C.V. for the proposed stacking ensemble

Based on Fig. 3, a novel confusion matrix (NCM) is proposed using thresholds of at least 0%, 30%, 60%, 90%, and 100% of executions in which an instance is predicted correctly, as shown in Table 6. For example, with a threshold of 60%, TP and TN count the instances correctly classified as positive (glaucoma) and negative (non-glaucoma), respectively, in at least 18 of the 30 executions.
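
A minimal sketch of how such a threshold-based matrix can be computed, assuming the per-instance counts of correct predictions are already available (array names and the helper are illustrative), is:

```python
# Threshold-based confusion matrix (NCM) over repeated executions.
import numpy as np

def ncm(y_true, n_correct, n_runs=30, threshold=0.6):
    """Count TP/TN/FP/FN, where a 'correct' instance is one predicted correctly
    in at least `threshold` of the `n_runs` executions (positive = glaucoma)."""
    y_true = np.asarray(y_true, dtype=bool)
    ok = np.asarray(n_correct) >= threshold * n_runs
    return {
        "TP": int(np.sum(ok & y_true)),    # glaucoma, predicted correctly often enough
        "TN": int(np.sum(ok & ~y_true)),   # non-glaucoma, predicted correctly often enough
        "FN": int(np.sum(~ok & y_true)),   # glaucoma, usually misclassified
        "FP": int(np.sum(~ok & ~y_true)),  # non-glaucoma, usually misclassified
    }
```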

Table 6 The novel proposed confusion matrix (NCM) for our proposed stacking ensemble method in this study

The results shown in Table 5 are similar to the results corresponding to the threshold of 60 % in Table 6. According to the experimental results described in this section, data instances that no classifier can correctly predict negatively impact the classifier performance.

FN instances are data instances with glaucoma that the classifier has misclassified as non-glaucoma, whereas FP instances are non-glaucoma persons misclassified as glaucoma.

The averages of the top-ranked features in the non-glaucoma and glaucoma classes are compared using Student's t-test, as shown in Table 7, to investigate whether there is a significant difference between the two classes for each top-ranked feature.

Table 7 Comparing the average of top-ranked features for non-glaucoma and glaucoma classes using Student's t-test

As listed in Table 7, the averages of LT, Spherical Equivalent, AST, Systolic BP, AL, Age, VCD, and NVD differ significantly between the glaucoma and non-glaucoma groups, indicating that these variables can distinguish the two classes well in our study. However, the averages of WTW, PCY, BMI, and Diastolic BP do not differ significantly between the glaucoma and non-glaucoma groups.
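
A minimal sketch of the per-feature comparison underlying Table 7, assuming SciPy (the unequal-variance Welch form is an assumption, and the input arrays are illustrative), is:

```python
# Per-feature group comparison with a two-sample t-test (assumes SciPy).
import numpy as np
from scipy.stats import ttest_ind

def compare_feature(values_glaucoma, values_non_glaucoma):
    """Return the group means and the two-sample t-test p-value for one feature."""
    _, p_value = ttest_ind(values_glaucoma, values_non_glaucoma, equal_var=False)
    return np.mean(values_glaucoma), np.mean(values_non_glaucoma), p_value
```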

Conclusions

Early identification of persons at high risk of glaucoma allows the necessary treatment and disease monitoring to begin early and prevents the disease from progressing to its acute form. In this study, a novel stacking ensemble classifier composed of several machine learning classifiers is proposed, designed, and used for glaucoma prediction on the Shahroud eye cohort dataset. The input variables and predictors for glaucoma prediction are demographic characteristics, ophthalmology features, and biometry and perimetry descriptors for persons aged 40 to 64 years in Shahroud. Three scenarios are compared for handling the imbalanced dataset. The experimental results show that balanced bagging based on random forests and decision trees can improve the sensitivity and overall performance of glaucoma prediction, with average accuracies of 87.61 and 88.87, sensitivities of 73.80 and 72.35, specificities of 87.88 and 89.10, and AUCs of 91.04 and 94.53, respectively. The proposed stacking ensemble classifier achieves an average accuracy of 83.56, a sensitivity of 82.21, a specificity of 81.32, and an AUC of 88.54.

Previous studies used three data types (fundus images, genome data, and structured data) to develop glaucoma prediction and diagnosis models. These studies achieved high performance on fundus images or on combinations of different data types, as shown in Table 8. This study used extensive ophthalmologic examinations such as biometry and perimetry, together with some clinical data, to develop the glaucoma prediction model without fundus images or genome data. The developed model has lower performance measures than those of other studies, but it is less complex and costly and can serve as the basis of a decision support system in clinics for diagnosing and screening glaucoma.

Table 8 Comparing the performance of the proposed method in this study with the previous studies

The top-ranked features for predicting glaucoma identified using DT and RF are listed in Table 4. These top features come from different eye examinations, such as perimetry and biometry, or are demographic features that can be measured in any clinic. As discussed in the introduction, some of these top features, such as age, BMI, blood pressure, and axial length, are among the main risk factors for glaucoma identified in many studies. Others, such as NVD and VCDR, are used directly by physicians to identify glaucoma. As mentioned in a previous study [5], the top-ranked features for glaucoma prediction determined using simple and multivariate logistic regression were age, IOP, sex, diabetes, myopia, and axial length. The Venn diagram in Fig. 4 shows the top predictors of glaucoma.

Fig. 4
figure 4

The Venn diagram indicating our top-ranked features identified using DT and RF and the top-ranked features identified in a previous study

This study aims to discriminate between glaucoma patients and non-glaucoma persons. The models proposed and designed in this study are unable to diagnose the type of glaucoma. Discriminating among, predicting, and diagnosing different types of glaucoma using machine learning models is a future research direction, as is identifying the top-ranked features and risk factors for each type of glaucoma.

The first phase of the Shahroud Eye Cohort Study was used to predict glaucoma from extensive ophthalmologic examinations and demographic data, without any fundus images, achieving an average accuracy of 83.56. The Shahroud Eye Cohort Study was conducted in two further phases at five-year intervals. In future work, the glaucoma status of participants in the second phase can be used as the label for the first-phase data to develop a prognosis model that identifies people with glaucoma five years earlier, and this model can be evaluated on the third phase.

This study has several main differences compared to previous related works. Different ophthalmological features are used as the input variables of our models, such as optometric examination results, biometric and perimetric features, and ophthalmologic examinations. Moreover, the top-ranked features include variables describing ophthalmologic examination results. Using longitudinal data collected over five years allows us to assess future trends and changes in glaucoma for the people contributing to the cohort study.