# A Data Mining Framework for Glaucoma Decision Support Based on Optic Nerve Image Analysis Using Machine Learning Methods

## Abstract

Ocular imaging instruments, such as Confocal Scanning Laser Ophthalmoscopy (CSLO), capture high-quality images of the optic disc (also known as the optic nerve head) that help clinicians diagnose glaucoma. We present an integrated data analytics framework to aid clinicians in interpreting CSLO optic nerve images to diagnose and monitor the progression of glaucoma. To distinguish between healthy and glaucomatous optic discs, our framework derives shape information from CSLO images using image processing (Zernike moment method), selects salient features (hybrid feature selection), and then trains image classifiers (Multilayer Perceptron, Support Vector Machine, Bayesian Network). To monitor glaucoma progression over time, our framework uses a mathematical model of the optic disc to extract morphological features from CSLO images and applies clustering (Self-Organizing Maps) to visualize subtypes of glaucomatous optic disc damage. We contend that our data analytics framework offers an automated and objective analysis of optic nerve images that can potentially support both diagnosis and monitoring of glaucoma. We validated our framework with CSLO optic nerve images, and our data analytics approach detected glaucomatous optic discs with a sensitivity of 0.86, a specificity of 0.80, an accuracy of 0.838, and an AUROC of 0.913 with a Bayesian network classifier using the optimal subset of Zernike features (six moments). Furthermore, our framework identified, using morphological features, five clusters of CSLO images, where each cluster stands for a subtype of optic nerve damage (two healthy subtypes and three glaucoma subtypes). The characteristics of each cluster (the subtype of the image) were determined by experts who examined the morphology of the images within each cluster and assigned subtype characteristics to it.

## Keywords

Glaucoma, Machine learning, Data mining, Classification, Clustering, Confocal Scanning Laser Ophthalmoscopy

## 1 Introduction

Glaucoma is a progressive disease that causes slow, gradual damage to the optic disc and vision loss [1]. Understanding the cause, types, and natural course of glaucoma remains a challenge. Several factors, such as elevated intraocular pressure (IOP) [2], corneal thickness [3], and type 2 diabetes mellitus [4], influence the risk of developing glaucoma. Subjective evaluation of the optic disc, using ophthalmoscopy or inspection of stereophotographs, is highly dependent on the examiner's experience and skills, and specialists have been shown to disagree [5, 6]. Thus, objective imaging technologies, as a complement to traditional methods, help clinicians assess and document the status of the optic disc [7]. Several ocular imaging instruments have been developed to quantitatively measure the structure of the optic disc (also known as the Optic Nerve Head (ONH)) [8]. For instance, clinicians diagnose and monitor glaucoma and other conditions [9] using three-dimensional images of the optic disc from Confocal Scanning Laser Ophthalmoscopy (CSLO) [10].

To help clinicians diagnose glaucoma, CSLO provides advanced analytical tools, such as the Moorfields Regression Analysis (MRA) [11] and the Glaucoma Probability Score (GPS) [12]. MRA uses linear regression between the optic disc area and the log of the neuroretinal rim area and is capable of discriminating between healthy and glaucomatous eyes [13]. MRA compares the neuroretinal rim area, globally and individually in six optic disc sectors, with predicted values for a healthy subject with the same disc size and age [7]. An optic disc is classified as abnormal if the rim area lies outside the lower prediction interval for the whole optic disc or for any optic disc sector [14]. Thus, an optic disc can be classified as normal (within normal limits), borderline, or abnormal (outside normal limits). However, MRA requires manual outlining of the optic disc boundaries, and contour lines drawn by different professionals vary, sometimes significantly [7]. GPS classification alleviates this limitation by using a mathematical model with 10 parameters describing the morphology of the optic disc shape [15]. This approach uses least-squares fitting to approximate the best parameters for a model of the optic disc in the image (globally and in six sectors) [12]. GPS applies a Relevance Vector Machine (RVM) [16] to the disc model parameters to discriminate between normal and glaucomatous discs [17]. Similar to MRA, the overall outcome of GPS is determined by the classification outcomes of the disc sectors (normal, borderline, abnormal) [7]. However, it has been shown that GPS might not significantly improve the diagnostic capacity of CSLO [18, 19]. Furthermore, GPS is unable to classify optic discs when the topographical surface cannot be approximated by the disc model [7, 20]. Therefore, the current tools are insufficient, since they allow erroneous interpretation of CSLO images, fail to identify noisy images, and create variance in the diagnostic recommendations between practitioners. In order to assist practitioners in the diagnosis and therapeutic management of glaucoma, the analysis of CSLO images must be automated in an objective and quantifiable manner.

Several previous works have automated, with varying results, the analysis of optic nerve data and CSLO images. Bowd et al. [21] trained several classifiers (Multilayer Perceptron (MLP), Support Vector Machine (SVM), and linear discriminant functions) on features extracted from retinal tomography images, where relevant features were selected with forward and backward feature selection methods. Park et al. [22] trained SVM classifiers on features from optic disc data selected with correlation analysis and a forward wrapper model. Belghith et al. [23] used a Markov Random Field (MRF) to detect changes (pixels) between pairs of CSLO images and a two-layer fuzzy classifier to detect glaucomatous progression. Mardin et al. [24] used a bootstrap aggregating (bagging) ensemble of decision trees to classify glaucoma. Bowd et al. [25] used CSLO baseline measurements with Standard Automated Perimetry (SAP) to train RVM classifiers for predicting glaucomatous progression in suspect eyes. Racette et al. [26] applied backward elimination to select features from CSLO and Short-Wavelength Automated Perimetry (SWAP) measurements to train RVM classifiers that distinguish healthy from glaucomatous eyes. Horn et al. [27] used CSLO and visual field testing with Frequency Doubling Technology (FDT) measurements to train Random Forest (RF) classifiers for detecting glaucoma. Twa et al. [28] used pseudo-Zernike radial polynomials to model optic disc morphological features for training a C4.5 decision tree classifier to detect glaucoma.

In this paper, we present a data analytics framework that systematically integrates multiple machine learning methods for feature selection, classification, and clustering of CSLO images, and that aims to improve the diagnosis of glaucomatous optic discs and to visualize the progression of the disease. Automatic interpretation of CSLO images in our framework supports (a) discrimination of healthy and glaucomatous optic discs (classification of CSLO images); (b) discovery of subtypes of glaucomatous optic disc damage (clustering of CSLO images); (c) monitoring the disease progression for a patient over a period of time (visualization of temporal progression); and (d) identification of noisy CSLO images for exclusion from any diagnostic decision making. In our framework, we automatically classify CSLO tomography images to provide a gross diagnosis of whether an image is healthy or glaucomatous, based on image-defining features extracted by an image shape analysis technique based on Zernike moments. The extracted features are further reduced by applying feature selection techniques to select an optimal subset of image features for training the classifiers (Support Vector Machine, Multilayer Perceptron, and Bayesian Network) to distinguish between healthy and glaucomatous optic discs. A unique aspect of our work is the identification of subtypes of glaucomatous optic damage, which makes it possible to sub-classify glaucoma patients; from a personalized medicine perspective, this makes it possible to administer precise treatments in line with the specific morphological patterns of the optic disc damage [29]. To automatically identify subtypes of glaucomatous damage in CSLO images and monitor disease progression, we extract morphological features from mathematical models of the optic nerve head fitted to CSLO images using the approach proposed by Swindale et al. [12] for the Glaucoma Probability Score. Using the extracted morphological features, our framework applies clustering techniques to identify the different subtypes of glaucomatous damage in the CSLO image dataset. The image clusters (glaucoma subtypes) make it possible to monitor the progression of the disease over time within a patient by visualizing the temporal transitions between subtypes (clusters). Furthermore, this visualization helps identify noisy CSLO images, which occasionally arise when the image is taken and which do not respect the temporal constraints of disease progression. Previously, the proposed clustering approach was applied to Zernike moments [30]. However, visualization and interpretation of disease progression patterns are more informative with the clustering resulting from morphological features, since the clusters correspond to specific morphological subtypes and experts are able to interpret the clusters in terms of well-defined morphological features, which was not the case when clustering was performed with Zernike moments.

We validate the discrimination between healthy and glaucomatous optic discs using Zernike moment features from 1257 CSLO images from 136 subjects (51 healthy subjects and 85 glaucoma patients) taken at different time intervals. We validate the identification of glaucoma subtypes using 17 morphological features from 3479 visual field examinations for both normal control (63) and glaucomatous (100) patients. Our data analytics-based approach can detect glaucomatous optic discs with a sensitivity of 0.86, a specificity of 0.80, an accuracy of 0.838, and an AUROC of 0.913 with a Bayesian network classifier using the optimal subset of Zernike features (six moments). Furthermore, our framework identified, using morphological features, five clusters of CSLO images, where each cluster represents a subtype of optic nerve damage (two healthy subtypes and three glaucoma subtypes). The characteristics of each cluster (the subtype of the image) were determined by experts who examined the morphology of the images within each cluster and assigned subtype characteristics to it.

The rest of this paper is organized as follows. Section 2 presents the tools used by the proposed framework. Section 3 presents in detail the proposed framework to support automatic interpretation of CSLO images. Section 4 presents and discusses the results of validating the different components of the framework. Finally, we conclude the paper and suggest future perspectives of this work.

## 2 Background

This section presents the tools used by the proposed method to support automatic interpretation of CSLO images.

### 2.1 Feature Extraction Using Zernike Moments

Features extracted from an object of interest should be invariant with respect to its position, size, and orientation [31]. Moment descriptors have been used in image analysis; they provide shape characteristics of an object and are invariant to linear transformation [32]. Orthogonal circular moments, such as Zernike moments [33], are defined by mapping an image function (pixel intensity values) onto a set of orthogonal complex polynomials defined inside a unit circle [34]. Here, the unit circle corresponds to the largest circle fitting completely within the image. Zernike moments are invariant to arbitrary rotation and, after normalization, to scale and translation [35]. Zernike moments are relevant for our purpose since the optic disc is centered in the image, which avoids the need for an independent segmentation stage in which the object is explicitly identified. Zernike moments \( {Z}_p^q \) of order *p* and repetition *q*, where *p* − |*q*| is even and 0 ≤ |*q*| ≤ *p*, computed on an image intensity function capture gross shape information (low order) and high frequency information (high order) [36]. Since we use the magnitude of the Zernike moment as a feature, which can be taken as a rotation-invariant feature of the image function, we only consider moments \( {Z}_p^q \) with *q* ≥ 0 [37]. Furthermore, Zernike moments of orders 0 and 1 are not included in the feature set since they are constant for all normalized images [38]. Thus, given the maximum order *n* in the feature set *F*_{n}, the size *L* of the feature set is defined as follows [39]:

\[ L=\begin{cases}\left(\dfrac{n}{2}+1\right)^{2}-2, & n\ \text{even}\\[2ex] \dfrac{\left(n+1\right)\left(n+3\right)}{4}-2, & n\ \text{odd}\end{cases} \]

For instance, the feature set *F*_{30}, where the maximum order is *n* = 30, has 254 Zernike moment magnitude values (moments \( {Z}_2^0 \) to \( {Z}_{30}^{30} \)).
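As a quick check of this count, the following minimal Python sketch enumerates the index pairs (*p*, *q*) that satisfy the constraints stated above (2 ≤ *p* ≤ *n*, *q* ≥ 0, *p* − *q* even) and compares the enumeration with a closed-form count. The function names and the closed form are illustrative assumptions derived by counting; they are not taken from the cited references.

```python
def zernike_indices(max_order, min_order=2):
    """Return (p, q) pairs with min_order <= p <= max_order, 0 <= q <= p, p - q even."""
    return [(p, q)
            for p in range(min_order, max_order + 1)
            for q in range(p % 2, p + 1, 2)]


def feature_set_size(max_order):
    """Closed-form count of the pairs above (orders 0 and 1 excluded)."""
    if max_order % 2 == 0:
        return (max_order // 2 + 1) ** 2 - 2
    return (max_order + 1) * (max_order + 3) // 4 - 2


# Print n, the enumerated count, and the closed-form count.
for n in (2, 3, 30):
    print(n, len(zernike_indices(n)), feature_set_size(n))
```

For *n* = 30 both counts give 254, and for *n* = 2 both give 2, in agreement with the sizes quoted in the text.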

### 2.2 Feature Selection

Feature selection aims to find a subset of the original features of a dataset that describes the dataset efficiently, reduces the effects of noise or irrelevant features, and still provides good prediction results [40]. Feature selection methods are mainly classified into wrapper and filter methods. Wrapper methods for feature selection [40] use the classifier as a black box and the classifier performance as the objective function to evaluate feature subsets. For instance, we can test each possible feature subset and keep the one that minimizes the classifier error rate [41]. The main types of wrapper methods are sequential selection algorithms and heuristic search algorithms [42]. Sequential algorithms either start with an empty set and add features (Sequential Forward Selection, SFS [43]) or start with the full set and remove features (Sequential Backward Selection, SBS [44]) until the selection criterion is reached. Heuristic search algorithms, such as genetic algorithms [45] and Particle Swarm Optimization [46], generate and evaluate subsets by searching in a search space or by generating solutions to the optimization problem.
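To make the sequential strategy concrete, here is a minimal sketch of Sequential Forward Selection in Python. The `score` callback stands in for the classifier performance used by a wrapper method; the toy scoring function and its values are hypothetical and only serve to make the example runnable.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedy SFS: start empty, repeatedly add the feature that most improves score."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Evaluate every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # selection criterion: stop when no extension improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical objective standing in for classifier accuracy:
# features "a" and "c" help, and every selected feature costs a small penalty.
USEFUL = {"a": 0.30, "c": 0.25}


def toy_score(subset):
    return 0.4 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


subset, value = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
print(subset, round(value, 2))  # ['a', 'c'] 0.93
```

Sequential Backward Selection follows the same pattern, starting from the full set and removing the feature whose removal hurts the score least.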

Filter methods for feature selection [41] use feature ranking as the principal selection criterion. They assess the relevance of each feature, and low-scoring features are removed before classification [42]. Filter methods can be based on the Chi-squared test, information gain, or correlation coefficient scores [40]. A common disadvantage of filter methods is that they ignore feature dependencies, which can lead to worse classification performance [42].
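A filter of this kind can be sketched in a few lines. The example below ranks features by the absolute Pearson correlation between each feature column and the class label and keeps the top-scoring ones. The feature names and values are made up for illustration, and correlation is only one of the scores mentioned above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(columns, labels, keep):
    """Order features by |correlation with the label| and keep the top ones."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], labels)),
                    reverse=True)
    return ranked[:keep]


# Hypothetical feature columns for four images; labels: healthy (-1), glaucoma (1).
columns = {
    "f1": [0.1, 0.2, 0.8, 0.9],
    "f2": [0.5, 0.4, 0.5, 0.4],
    "f3": [0.9, 0.7, 0.3, 0.2],
}
labels = [-1, -1, 1, 1]
print(rank_features(columns, labels, keep=2))  # ['f1', 'f3']
```

Each feature is scored on its own, which illustrates the limitation noted above: dependencies between features play no role in the ranking.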

In order to alleviate the problem of ignoring feature dependencies, we use a filter method based on a Bayesian Network (BN) and its Markov Blanket (MB) [47]. The filter method uses a heuristic-search-based learning algorithm (K2) [48] to learn a BN in order to determine the causal relationships between the features in the Moment Feature Subset (MFS; see section 3.3). Then, features not included in the Markov Blanket of the class variable (glaucoma/healthy) are considered less relevant to the BN model and are removed. The Markov Blanket criterion removes only features that are truly unnecessary [49].

A Bayesian network is a directed acyclic graph where each node represents a random variable (a feature or class label) and each edge represents a conditional dependency between two variables [50]. The Markov Blanket of a node *A* is the set of nodes composed of the parents of node *A*, its children, and the other parents of its children [49]. Given its Markov Blanket, a node *A* is independent of all other nodes in the BN; the blanket shields it from the rest of the network. Thus, the Markov Blanket of a node is the only knowledge required to predict the value of that node [50]. For a classification task, where we have one node for the class label and the other nodes are the features, the optimal features are the ones inside the Markov Blanket of the class label node.
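The Markov Blanket definition translates directly into set operations on the graph. The sketch below assumes the network is given as a dictionary mapping each node to the set of its parents; the network itself (node names `z1` to `z5`) is a hypothetical example, not a structure learned from the CSLO data.

```python
def markov_blanket(parents, node):
    """Parents, children, and other parents of the children of `node`.

    `parents` maps every node of the DAG to the set of its parent nodes.
    """
    children = {child for child, ps in parents.items() if node in ps}
    co_parents = set()
    for child in children:
        co_parents |= parents[child]
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket


# Hypothetical network with a class node and five feature nodes.
parents = {
    "glaucoma": {"z1"},
    "z1": set(),
    "z2": {"glaucoma", "z3"},
    "z3": set(),
    "z4": {"z2"},
    "z5": set(),
}
print(sorted(markov_blanket(parents, "glaucoma")))  # ['z1', 'z2', 'z3']
```

In this toy network, `z4` and `z5` fall outside the blanket of the class node and would be the features removed by the filter.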

### 2.3 Self-Organizing Map

A Self-Organizing Map (SOM) is an Artificial Neural Network (ANN) trained using unsupervised learning that converts complex statistical relationships between high-dimensional data items into simple geometric relationships in a low-dimensional space [51]. A trained SOM represents a topology-preserving mapping of high-dimensional data points onto a (usually two-dimensional) grid of neurons (nodes/units) [52]. Each node represents a model of the observations (a weight vector of the same dimension as the input data vectors) [53]. SOM evaluates the models so that they optimally describe the domain of observation, where similar models are closer to each other in the grid than dissimilar ones [51]. Thus, a SOM is both a similarity graph and a clustering diagram. A new data item (input) will match a specific model (node) in the grid, and the models that lie in the neighborhood of the best model match the input better than the rest of the grid [53]. The U-matrix method can be used to visualize clusters of nodes on a map based on the distance between the weight vectors of neighboring nodes [54].

Before initializing the training process, we must select the shape and size of the SOM grid. The map shape is usually a sheet (or a cylinder or toroid if the data is circular) in which one side (rows) is longer than the other (columns). The sheet grid can be a hexagonal or rectangular lattice, where each node (neuron) can have six neighbors in a hexagonal lattice and four or eight neighbors (Von Neumann or Moore neighborhoods) in a rectangular lattice. There are two main types of training algorithms: (i) a recursive, stepwise approximation process where the input data items are applied to the training algorithm one at a time (in a periodic or random sequence) until it reaches a reasonably stable state; (ii) a batch-type process where all input data items are applied to the algorithm at the same time and all models are updated in a single concurrent operation. It should be noted that the batch training algorithm does not involve a learning rate parameter and its convergence is an order of magnitude faster and safer than stepwise learning [53].

The stepwise learning algorithm can be described as follows: (i) Initialize the weight vector of each node. A good initialization can make the algorithm converge faster to a good solution [53]. Initialization can be random, using the minimum and maximum values of each feature in the training dataset, or linear, along the two greatest eigenvectors of the covariance matrix of the training dataset. (ii) Choose a random vector from the training dataset and present it to the SOM. (iii) Find the Best-Matching Unit (BMU) (node) by evaluating the Euclidean distance between the input vector and the weight vector of each node in the SOM. The selected BMU has the smallest distance. (iv) Determine the neighborhood (nodes) of the BMU. The size of the neighborhood (radius) decreases with each iteration (exponential decay function) until it reaches a minimum value (e.g., the BMU itself). (v) Modify the BMU and its spatial neighbors in the grid so that the modified weight vectors match the input better. The rate of modification at different nodes depends on the neighborhood function. For instance, with a Gaussian neighborhood function, nodes that are closer to the BMU are influenced more than nodes that are farther away. (vi) Repeat from the second step until convergence. In the batch learning algorithm, instead of choosing one random vector from the dataset, the algorithm partitions the input data vectors into lists associated with their best-matching models (nodes). Then, the weight vectors are modified as generalized medians over the lists of the neighborhood nodes. The process is repeated, clearing the lists and matching new copies of the input vectors, until the updated model values become steady.
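The stepwise procedure (steps i to vi) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not taken from the text: a rectangular grid, a Gaussian neighborhood applied to every node, a learning rate that decays with the same exponential schedule as the radius, and a fixed number of iterations in place of a convergence test. The function and parameter names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """(iii) Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, iterations=500, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lows = [min(v[d] for v in data) for d in range(dim)]
    highs = [max(v[d] for v in data) for d in range(dim)]
    # (i) Random initialization within the min/max range of each feature.
    weights = {(r, c): [rng.uniform(lows[d], highs[d]) for d in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        decay = math.exp(-t / iterations)
        radius = max(radius0 * decay, 0.5)  # (iv) shrinking neighborhood radius
        lr = lr0 * decay                    # assumed learning-rate schedule
        x = rng.choice(data)                # (ii) random training vector
        bmu = best_matching_unit(weights, x)
        for node, w in weights.items():
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighborhood
            for d in range(dim):
                w[d] += lr * h * (x[d] - w[d])            # (v) move towards the input
    return weights


# Toy data: two groups of two-dimensional points.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=3)
for x in data:
    print(x, "->", best_matching_unit(som, x))
```

Because each update moves a weight vector part of the way towards an input vector, the weights remain inside the range spanned by the training data.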

The quality of each map can be evaluated using the average quantization error *qe* [51]. The average quantization error \( qe=\frac{1}{N}\sum \limits_{i=1}^N\left\Vert {x}_i- BMU\left({x}_i\right)\right\Vert \) is the average distance between each input data vector *x*_{i} and the weight vector of its best-matching unit (*BMU*(*x*_{i})). A smaller average quantization error indicates a better trained SOM. As previously mentioned, a trained SOM represents a topology-preserving mapping of high-dimensional data points (dataset items) onto a two-dimensional grid of units (nodes/neurons). The sum of the distances, in the high-dimensional space, between a unit *u* and the units in its neighborhood is shown on a U-matrix as a height value (U-height) for unit *u* [52]. High U-heights mean that there is a large gap in the data space, and low U-heights mean that points are close to each other in the data space. Data points that match a group of units in a valley (low U-heights) surrounded by large walls (high U-heights) lie within a distance-induced cluster structure in the data space. A SOM and its U-matrix are often used to visualize the distance structures in the high-dimensional data space. However, to divide the data space into clusters, additional clustering methods need to be applied [55].
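Both quantities can be computed directly from the weight vectors. The sketch below assumes the map is stored as a dictionary from grid position to weight vector and uses a four-neighbor (Von Neumann) neighborhood for the U-heights; the small 1 × 4 map and the data points are hypothetical.

```python
import math


def quantization_error(weights, data):
    """Average distance between each input vector and the weights of its BMU."""
    return sum(min(math.dist(w, x) for w in weights.values()) for x in data) / len(data)


def u_heights(weights):
    """Sum of data-space distances between each unit and its grid neighbors."""
    heights = {}
    for (r, c), w in weights.items():
        neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        heights[(r, c)] = sum(math.dist(w, weights[n]) for n in neighbours if n in weights)
    return heights


# Hypothetical 1 x 4 map with a gap in data space between the second and third unit.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0], (0, 2): [0.0, 5.0], (0, 3): [0.0, 6.0]}
data = [[0.0, 0.0], [0.0, 2.0], [0.0, 6.0]]

print(quantization_error(weights, data))  # (0 + 1 + 0) / 3
print(u_heights(weights))
```

The two units on either side of the gap receive the largest U-heights (5.0 each, against 1.0 for the outer units), which is the "wall" that separates the two valleys on a U-matrix.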

### 2.4 EM Clustering

A finite Gaussian mixture model is a probabilistic model that assumes that all the data points are generated from a mixture of *K* Gaussian density functions (components) with unknown parameters [56]. The mixture density function with *K* components is:

\[ p\left(x|\varTheta \right)=\sum \limits_{i=1}^K{w}_i\,{p}_i\left(x|{\theta}_i\right) \]

where *x* is a *d*-dimensional data vector (e.g., a unit weight vector) from a dataset, *p*_{i}(*x*|*θ*_{i}) is the *i*^{th} mixture component (density function), *w*_{i} is the weight of the *i*^{th} component (∑*w*_{i} = 1), *θ*_{i} are the parameters of the *i*^{th} component (mean vector and covariance matrix for a multivariate Gaussian), and Θ is the complete set of parameters (all *w*_{i} and *θ*_{i}) for the mixture model with *K* components.

Given *K* components, a set of observed data *X* (dataset), and a set of missing data *Z*, the goal of the EM algorithm is to find the optimal set of parameters Θ in order to maximize the likelihood of the model [56]. The missing data *Z* are labels for the observed data, where each label item *z*_{i} is a vector of *K* elements indicating which component (position with value 1) generated the data *x*_{i}. The EM algorithm is an iterative method that starts from an initial estimate of Θ (random) which is updated iteratively until convergence [57]. Each iteration alternates between an expectation (*E*) step and a maximization (*M*) step. The expectation step determines the conditional expected value of the log-likelihood function under the current estimate of the parameters *Θ*(*t*). Since the conditional expected value of the log-likelihood function is linear with respect to the missing data *Z*, we only need to compute the *Q*-function using the membership probabilities *p*(*z*_{ik}|*x*_{i}, *Θ*(*t*)), which estimate the probability that the observed data *x*_{i} was produced by component *k* under the current estimate of the parameters *Θ*(*t*) [56]. Using the membership probabilities, the maximization step finds the new estimate of the parameters *Θ*(*t* + 1) (component weights, means, covariance) that maximizes the expected value of the log-likelihood function. It should be noted that the EM algorithm is highly dependent on initialization, and EM can converge to a local maximum [56]. To alleviate this issue, several heuristic methods can be used to escape a local maximum. For instance, the random restart hill climbing method runs EM repeatedly, each time randomizing the initial parameters *Θ*(0), and keeps the best model produced among all the restarts.

In order to estimate the number of mixture components (clusters), the maximum likelihood (ML) criterion cannot be used, since the maximized likelihood is a non-decreasing function of *K* [56]. In order to select the model (number of components/clusters *K* and parameters Θ of the maximized likelihood function), we use the Bayesian Information Criterion (BIC) [58]. The criterion is defined as \( \mathrm{BIC}=\log \left(\widehat{\mathcal{L}}\right)-\frac{1}{2}K\log (N) \), where \( \widehat{\mathcal{L}}=p\left(X|\varTheta, M\right) \) is the maximized value of the likelihood function of the model *M* given the observed dataset *X* and the model parameter values *Θ*, \( -\frac{1}{2}K\log (N) \) is the penalty term, *K* is the number of free parameters (e.g., clusters/components), and *N* is the number of observations in *X*. The model with the highest BIC value is preferred, where the penalty term alleviates the overfitting problem by penalizing the complexity of the model [59].
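The interplay between EM and BIC can be illustrated with a small one-dimensional Gaussian mixture. The sketch below is a simplified illustration, not the configuration used in this work: it is one-dimensional, it replaces the random initial estimate with a deterministic quantile-based start so that the result is reproducible, it applies a variance floor (`min_var`) to avoid degenerate components, and it counts 3*k* − 1 free parameters for *k* components (means, variances, and *k* − 1 independent weights). All of these choices and names are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)


def em_gmm_1d(data, k, iterations=100, min_var=1e-3):
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (assumption): initial means at evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    variances = [max(sum((x - centre) ** 2 for x in data) / n, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E step: membership probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)
    return weights, means, variances, log_likelihood(data, weights, means, variances)


def bic(loglik, n_free_params, n_obs):
    """BIC as defined in the text: log-likelihood minus the complexity penalty."""
    return loglik - 0.5 * n_free_params * math.log(n_obs)


# Toy data with two well-separated groups.
data = [-0.2, 0.0, 0.1, 0.3, -0.1, 9.8, 10.0, 10.1, 10.3, 9.9]
scores = {}
for k in (1, 2):
    _, _, _, loglik = em_gmm_1d(data, k)
    scores[k] = bic(loglik, 3 * k - 1, len(data))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy dataset the two-component model attains the higher BIC: the gain in log-likelihood from separating the two groups outweighs the larger penalty term.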

## 3 Methods

### 3.1 CSLO Optic Nerve Image Datasets

The classification of CSLO images (glaucomatous or healthy optic discs) uses a dataset of 1257 CSLO images taken at different time intervals from 136 subjects (51 healthy and 85 glaucoma) [60]. The dataset is randomly divided into training data (75%) and testing data (25%). In stage 1, for each image, we extract the Zernike moments up to order 30 (254 moments) as features for classification (see section 3.2.1). We also create 29 feature subsets, where each subset *F*_{n} comprises moments from order 2 to order *n* (*n* ∈ [2, 30]). For the CSLO image clustering (glaucoma subtype), we use a dataset of 3479 visual field examinations with 17 morphological features for both normal control (63) and glaucomatous (100) patients [61]. For each CSLO image, a mathematical model of the optic nerve head is used to extract the morphological features (see section 3.2.2). Some patients were examined on both left and right eyes; others were examined on either the left or the right eye. The CSLO images were obtained at intervals of 6 months over a period of up to 9 years.

### 3.2 Stage 1: CSLO Image Processing

To support automatic interpretation of optic discs in CSLO images, we need to extract relevant features by using image analysis techniques. Two image describing features were pursued: (1) Zernike moments to describe the shape of the optic disc, which is used for classification of CSLO images (healthy or glaucomatous optic discs), and (2) morphological features to describe the inherent morphological nature of a CSLO image, which is used to cluster the CSLO images to identify glaucoma subtypes.

#### 3.2.1 Zernike Moments

For each CSLO image, we extracted Zernike moments from order 2 to order 30 for a total of 254 features. Thus, the CSLO feature set *F*_{30}, where the maximum order *n* = 30, has 254 Zernike moment magnitude values (moments \( {Z}_2^0 \) to \( {Z}_{30}^{30} \)). Furthermore, the features can be grouped in incremental order, ranging from *F*_{2} to *F*_{30} (29 groups). Since each group *F*_{n} comprises Zernike moments up to order *n*, we have *F*_{2} ⊆ *F*_{3} ⊆ ⋯ ⊆ *F*_{29} ⊆ *F*_{30}. To keep the classification of CSLO images efficient, it is important to reduce the number of features by selecting an optimal set of low order moments. This is a difficult problem for two reasons. First, the optimal number of low order moments needed to achieve high classification accuracy cannot be measured objectively. Second, there is no distinct relationship between Zernike moments that can be used to find the optimal set of features. Thus, the next task must take these difficulties into account when carrying out feature subset selection in conjunction with training CSLO image classifiers (stage 2).

#### 3.2.2 Morphological Features

We utilize 17 morphological features that were automatically extracted from mathematical models with 10 parameters describing the morphology of the optic disc shape [12]. Swindale et al. [12] utilized this approach to evaluate the Glaucoma Probability Score (GPS), since the regularities in shape of the optic nerve head (ONH) allowed the description by a reasonable mathematical model with few parameters. It uses least-squares fitting to approximate the best parameters for a model of the CSLO image (globally and six sectors). The six optic disc sectors are temporal, temporal superior, temporal inferior, nasal, nasal superior, and nasal inferior. From the mathematical model, we extract 17 morphological features: (1) the overall curvature along the nasotemporal axis, (2) the overall curvature in the vertical direction, (3) the maximum cup depth, (4–10) the cup radius from the center of the cup to the cup wall globally and for the 6 cup sectors, and (11–17) the overall steepness of the cup walls globally and for the 6 cup sectors. The next task is to use the morphological features to identify glaucoma subtypes (clusters) (stage 3).
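For readers who want to organize these features in code, the following sketch builds one possible naming scheme for the 17 features (three global shape features, seven cup radii, and seven cup wall steepness values). The identifier names are hypothetical; only the grouping and the sector list come from the description above.

```python
# Sector names follow the six optic disc sectors listed in the text.
SECTORS = [
    "temporal", "temporal_superior", "temporal_inferior",
    "nasal", "nasal_superior", "nasal_inferior",
]
REGIONS = ["global"] + SECTORS  # each per-region feature is measured globally and per sector

FEATURE_NAMES = (
    ["curvature_nasotemporal", "curvature_vertical", "max_cup_depth"]  # features 1-3
    + [f"cup_radius_{region}" for region in REGIONS]                   # features 4-10
    + [f"cup_wall_steepness_{region}" for region in REGIONS]           # features 11-17
)

print(len(FEATURE_NAMES))  # 17
```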

### 3.3 Stage 2: Classification of CSLO Images

In the previous stage, 254 Zernike moments were extracted for each CSLO image. In order to reduce the complexity of the classification process, we selected the best subset of low order moments, that is, the subset containing the fewest features that contribute the most to classification accuracy [41]. Therefore, stage 2 involves feature selection on Zernike moments and training image classifiers on the selected features. Feature selection is carried out in a two-pass approach (see Fig. 1). The first pass generates a *Moment Feature Subset* (MFS) consisting of low order moment features by applying a feature selection wrapper approach [62]. The second pass selects highly relevant moments from the MFS, i.e., the *Optimal Moment Feature Subset* (OMFS), by applying a Markov blanket filter method [47] for feature selection. The OMFS offers reasonably high image classification accuracy despite using a small number of moments. The final image classifiers are a Bayesian network resulting from the OMFS selection and an SVM and an MLP trained using the OMFS as features.

It is uncommon for a wrapper model to be used before a filter model; in hybrid models, a filter model is typically used before the wrapper model because of its efficiency. However, our feature selection methodology proposes this novel ordering for the following reasons. Each moment feature is extracted individually and is independent of the other moments; therefore, it is difficult to detect the relationships among them. The low order moments represent fundamental geometric information, and the high order moments represent the details of the digital images, so most of the information relevant to classification is contained in the low order moments. A filter model applied in this phase would consider all features equally; as a result, high order moments that are not useful or even noisy could be selected, and more important low order moments could be ignored. To avoid this problem, the first step of our feature selection uses a wrapper model to find a suitable MFS that avoids emphasizing high order moments.

#### 3.3.1 Pass I: Wrapper Method for Feature Selection

The objective of the first pass is to reduce the size of the feature set (254 moments) of a CSLO image using a wrapper method to generate the MFS. We apply a sequential feature selection strategy to determine the size of the MFS as follows: (a) generate training sets, where each new training set is generated by incrementally adding the moments (features) of the next higher order to the previous training set. As previously mentioned, given an image feature set *F*_{n} with a maximum order *n*, we have *n* − 1 feature subsets, where each subset *F*_{i} includes moments from order 2 to order *i*. If the maximum order *n* is 30 for the feature set, we have 29 feature subsets, from *F*_{2} (2 moments) to *F*_{30} (254 moments). It should be noted that *F*_{2} ⊆ *F*_{3} ⊆ ⋯ ⊆ *F*_{29} ⊆ *F*_{30}. In total, we have 29 different training sets, where the initial training set has 2 features (subset *F*_{2}) and the final training set has 254 features (set *F*_{30}) to describe CSLO images (the same applies to the testing sets); (b) train, for each feature subset, two classifiers on the associated training set in order to evaluate the performance of that feature subset; and (c) determine, for each feature subset, the classification accuracies (selection criterion) of both classifiers, using the associated testing set with the same features (moments). It should be noted that the images are partitioned so that 75% of the images are used for training and 25% for testing the classifiers. We experimented with multiple classification algorithms to select the best classifier; the prominent classifiers were MLP, SVM, and Bayesian Networks.
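The loop over nested subsets can be sketched as follows. The `evaluate` callback stands in for steps (b) and (c), training a classifier on the training split and returning its accuracy on the testing split. Keeping the subset with the highest accuracy is an assumption made for this example, and the toy accuracy curve is hypothetical.

```python
def nested_subsets(max_order):
    """Yield (n, F_n), where F_n lists the (p, q) moment indices for orders 2..n."""
    subset = []
    for p in range(2, max_order + 1):
        # Add the moments of the next higher order to the previous subset.
        subset = subset + [(p, q) for q in range(p % 2, p + 1, 2)]
        yield p, subset


def select_moment_feature_subset(max_order, evaluate):
    """Return (n, F_n, accuracy) for the nested subset with the highest accuracy."""
    best = None
    for n, subset in nested_subsets(max_order):
        accuracy = evaluate(subset)  # placeholder for train-then-test on F_n
        if best is None or accuracy > best[2]:
            best = (n, subset, accuracy)
    return best


# Hypothetical accuracy curve: it peaks near 20 features and then declines.
def toy_evaluate(subset):
    return 0.85 - 0.002 * abs(len(subset) - 20)


n, mfs, accuracy = select_moment_feature_subset(30, toy_evaluate)
print(n, len(mfs), round(accuracy, 3))  # 7 18 0.846
```

With 29 candidate subsets, the wrapper trains and tests each classifier 29 times, which is far cheaper than searching over all subsets of 254 moments.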

MLP is a feedforward neural network where nodes are connected by weighted links and organized in several layers [63]. To train the MLP classifier, we used the (iterative) backward propagation of errors (backpropagation) algorithm [64]. However, backpropagation requires an existing network, and defining the optimal architecture (number of hidden layers, number of nodes in each layer) remains a hard task [65]. For each training set, we use a three-layer perceptron whose structure depends on the number of features (moments) in the training set. In the input layer, the number of nodes is the same as the number of features. In the output layer, we have only one node, which outputs the class label glaucoma (1) or healthy (−1). Since we use a hyperbolic tangent sigmoid as the activation function, the moment features were scaled to [−1, +1]. It should be noted that a Single Layer Perceptron (SLP) is first built for an initial analysis. Although an SLP can only learn a linearly separable classification model, the predictive power of the single variable classifier [41] can help identify which moments are important for predicting the class.

SVM constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space [66]. A hyperplane is optimal when the separation margin is maximal (largest distance between the hyperplane and the nearest training data point of any class) [67]. SVM achieves high discrimination using nonlinear kernel functions to transform the input space into a multidimensional space [68]. Kernel functions, such as the Gaussian Radial Basis Function (RBF) and polynomial kernels, allow nonlinear SVM classification where the data are not linearly separable [69]. The effectiveness of the SVM classifier depends on the selection of the kernel (e.g., linear, RBF, and polynomial), the kernel’s parameters (e.g., the *γ* parameter in RBF), and the soft margin parameter *C*. It should be noted that *γ* controls the degree of nonlinearity of the model and *C* controls overfitting of the model by specifying the tolerance for misclassification [68]. The selected kernel is RBF since analysis of the training data shows that the relationship between the moments and the class label is nonlinear, and related works using SVM with Gaussian kernels for glaucoma detection have shown promising results [70, 71, 72, 73]. To identify the optimal *C* and *γ* parameters, we use a grid search with fivefold cross-validation [74]. As with MLP, the inputs for SVM need to be scaled to [−1, +1].
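
Both the MLP and the SVM inputs are rescaled to [−1, +1]. A minimal sketch of such a linear rescaling follows. Fitting the minimum and maximum on the training set and reusing them on the testing set is an assumption made here, as is the handling of constant features; the helper name `fit_minmax` is hypothetical.

```python
def fit_minmax(rows, lo=-1.0, hi=1.0):
    """Learn per-feature min/max on training rows; return a function that rescales rows."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]

    def transform(data):
        out = []
        for row in data:
            scaled = []
            for x, mn, mx in zip(row, mins, maxs):
                if mx == mn:
                    # Constant feature: map to the middle of the target range.
                    scaled.append((lo + hi) / 2.0)
                else:
                    scaled.append(lo + (hi - lo) * (x - mn) / (mx - mn))
            out.append(scaled)
        return out

    return transform


# Hypothetical feature values, for illustration only.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scale = fit_minmax(train)
print(scale(train))
```

Calling `fit_minmax(train, 0.0, 1.0)` gives the [0, 1] range used later for the morphological features in the SOM stage.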

For each of the 29 feature subsets, SLP, MLP, and SVM classifiers were trained and evaluated. The next step involved the selection of a reduced feature subset using the accuracies of the trained MLP and SVM classifiers. The goal of the feature subset selection is to maximize the accuracy (minimize the error rate) by selecting a feature subset *F*_{n} with the lowest maximum moment order *n*. In other words, given a selected feature subset *F*_{n} of maximum order *n*, all feature subsets *F*_{i} of maximum order *i*, where *F*_{n} ⊆ *F*_{i}, do not improve classification accuracy (decrease error rate). For each feature subset *F*_{n}, we take the higher of the classification accuracies of the MLP and SVM classifiers trained with this feature subset. Then, we select the feature subset *F*_{n} with the highest accuracy value. If more than one feature subset attains this value, we use the feature subset *F*_{n} with the lowest maximum order *n*. The selected feature subset *F*_{n} is labeled as the MFS.
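
The selection rule can be sketched as follows. The accuracy values are hypothetical and cover only a few subsets for illustration; the function name `select_mfs` is also an assumption.

```python
def select_mfs(acc_mlp, acc_svm):
    """Pick the maximum order n of the MFS from per-subset accuracies.

    acc_mlp, acc_svm: dicts mapping maximum order i -> test accuracy of the
    classifier trained on F_i. Returns (order, best accuracy).
    """
    # For each subset, keep the better of the two classifiers.
    best_per_subset = {i: max(acc_mlp[i], acc_svm[i]) for i in acc_mlp}
    top = max(best_per_subset.values())
    # Ties are broken in favour of the lowest maximum order.
    order = min(i for i, acc in best_per_subset.items() if acc == top)
    return order, top


# Hypothetical accuracies for a handful of subsets (illustration only).
acc_mlp = {8: 0.70, 9: 0.74, 12: 0.73, 13: 0.72}
acc_svm = {8: 0.75, 9: 0.78, 12: 0.87, 13: 0.87}
print(select_mfs(acc_mlp, acc_svm))   # (12, 0.87)
```

In this toy input, *F*_{12} and *F*_{13} tie on accuracy, so the lower order is returned.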

#### 3.3.2 Pass II: Filter Method for Feature Selection

The goal of the second pass is to generate the OMFS, which comprises only the highly salient moments, by applying a filter method to further reduce the size of the feature subset MFS. The filter method uses a heuristic search (K2) to learn a BN that captures the causal relationships between the features in MFS, and removes only the features not included in the Markov Blanket of the class variable (glaucoma/healthy), which are considered less relevant for the BN [47]. Therefore, the OMFS is the Markov Blanket of the class label node in the BN.
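
As an illustration, the Markov Blanket of a node in a directed acyclic graph consists of its parents, its children, and the other parents of its children. The sketch below extracts it from a toy network; the graph and node names are hypothetical and do not correspond to the BN learned in this work.

```python
def markov_blanket(parents, target):
    """Markov Blanket of `target` in a DAG given as {node: set of parent nodes}."""
    blanket = set(parents.get(target, set()))
    children = {node for node, ps in parents.items() if target in ps}
    blanket |= children
    for child in children:
        # Add the co-parents of the target (other parents of its children).
        blanket |= parents[child]
    blanket.discard(target)
    return blanket


# Hypothetical toy network; node names are illustrative only.
dag = {
    "class": set(),
    "z1": {"class"},
    "z2": {"class", "z3"},
    "z3": set(),
    "z4": {"z1"},
}
print(sorted(markov_blanket(dag, "class")))   # ['z1', 'z2', 'z3']
```

Here `z4` is outside the blanket, so it would be the feature removed by the filter.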

The OMFS subset is generated by following these four steps: (1) discretize the moments in MFS and remove the moments that were discretized into a single value; (2) learn a Bayesian network from the data using the K2 algorithm; (3) determine the Markov Blanket of the class label and select its features for OMFS; and (4) train classifiers using only the moments inside OMFS. Since the BN learning algorithm (K2) only supports discrete variables [48], the first step partitions the features in MFS using a discretization technique [75]. The discretization process is based on the Minimum Description Length Principle (MDLP), which uses an information entropy minimization heuristic [76]. This preprocessing step also makes it possible to remove the moments that were discretized into a single value. In the second step, the K2 algorithm is used to learn the BN structure. Given a dataset with a fixed ordering of the variables (features and class label), the K2 algorithm uses a heuristic search (hill climbing) to find the most probable BN structure [48]. Based on the order of the nodes (variables), K2 looks, for each node, for parents whose addition increases the score of the BN. Given an upper bound on the maximum number of parents, K2 finds, for every node, the most probable set of parents. However, an inadequate variable ordering may give poor results [48]. In order to optimize the variable ordering needed for K2, the Chi-squared test of independence is performed between each variable and the class label to measure the strength of the dependence relationship [47]. The variables are ranked based on the Chi-squared test score (high to low). Based on the Chi-squared ordering, K2 learns the BN structure. In step 3, we determine the Markov Blanket [49] of the class label according to the BN structure learned in step 2. As previously mentioned, the features in the Markov Blanket are the selected features for OMFS. In step 4, we use the features in OMFS to train and test three classifiers: BN, SVM, and MLP.
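
The Chi-squared ordering used to seed K2 can be sketched as below. This is a minimal illustration on discretized features that computes the Pearson statistic directly; the toy data and the function names are hypothetical.

```python
from collections import Counter


def chi2_score(feature, labels):
    """Pearson chi-squared statistic between a discrete feature and the class label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    f_tot = Counter(feature)
    c_tot = Counter(labels)
    score = 0.0
    for f in f_tot:
        for c in c_tot:
            expected = f_tot[f] * c_tot[c] / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def chi2_ordering(features, labels):
    """Feature names sorted by chi-squared score, highest first."""
    return sorted(features,
                  key=lambda name: chi2_score(features[name], labels),
                  reverse=True)


# Hypothetical discretized features and class labels (1 = glaucoma, 0 = healthy).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "a": [1, 1, 1, 1, 0, 0, 0, 0],   # fully dependent on the class
    "b": [1, 0, 1, 0, 1, 0, 1, 0],   # independent of the class
    "c": [1, 1, 1, 0, 0, 0, 0, 1],   # partially dependent
}
print(chi2_ordering(features, labels))   # ['a', 'c', 'b']
```

The resulting list is the node ordering that would be passed to the K2 structure search.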

### 3.4 Stage 3: Clustering of CSLO Images

To monitor the progression of the disease over time, we need to be able to differentiate between the different subtypes of healthy and glaucomatous optic nerves. Understanding the large variation in the appearance of the optic nerve, both within groups of healthy subjects and in patients with glaucoma, is rather tedious. Therefore, to recognize and differentiate between patterns of optic nerve damage [77], we needed to sub-classify CSLO images based on well-defined criteria (such as morphological features) so that human experts can validate the sub-classes (i.e., clusters). The manual sub-classification of optic nerve damage and the monitoring of damage progression are subjective tasks [78], giving rise to considerable levels of disagreement between trained experts [79]. We argue that clustering the images, using machine learning methods applied to the images’ morphological features, makes it possible to identify subtypes (clusters) of optic nerve damage in an objective manner. Furthermore, the clusters can be used to visualize the progression of the disease over time within a patient and even to identify noisy optic nerve images.

We mentioned earlier that the CSLO images were processed using Zernike moments and morphological features. We attempted clustering using both of these image representations. Clustering using Zernike moments could identify healthy (one cluster) and glaucomatous (three clusters) subtypes [30]. However, it was not feasible for the experts to validate these clusters, because Zernike moments cannot be interpreted directly and the clusters could not be associated with the morphological descriptions of the images that are well understood by experts. Therefore, we proceeded with CSLO image clustering using the morphological features, which resulted in clusters that correspond to specific morphological subtypes. For human interpretation, the resulting clusters were more informative and made it possible to visualize and interpret the disease progression patterns.

For CSLO image clustering, the choice of the clustering algorithm was largely driven by the need to visualize the emergent clusters to track the progression of the disease over time. We decided to use SOM as the clustering algorithm as it provides a topology preserving mapping of high-dimensional data points onto a two-dimensional grid of units. The alternative clustering approaches, based on hierarchical agglomerative clustering, were deemed unsuitable for our purpose since they do not provide a topological visualization of the clusters (they offer a clustering tree instead, i.e., a dendrogram), are not suitable for large datasets given their higher computation time, and are not robust to outliers [80]. Therefore, to cluster the SOM units, we used a distribution-based clustering (Gaussian mixture models using the Expectation-Maximization algorithm), where each new observation is assigned to the cluster (mixture component) corresponding to the highest posterior probability [80]. This two-stage procedure, where SOM produces the proto-clusters (units) that are later clustered, usually performs well and reduces computation time when compared with direct clustering of the data [55].

Using just the CSLO image’s morphological features, our image clustering strategy consists of two phases: (a) partition training images into distinct clusters using SOM [53] and (b) draw clear and distinct boundaries around the clusters using the Expectation-Maximization (EM) algorithm [57]. The clustering of CSLO images utilizes the 17 morphological features automatically extracted from mathematical models with 10 parameters describing the morphology of the optic disc shape [12].

#### 3.4.1 Phase A: Data Clustering Using SOM

We used a SOM for image clustering. Our clustering approach is to train a SOM to cluster the CSLO images based on the similarities between image shapes (morphological features), where each cluster may represent a different subtype of healthy and glaucomatous optic nerves. We train the SOM in four steps as follows: (1) pre-process the CSLO dataset to normalize the morphological features; (2) train several SOMs using different parameters; (3) select the best SOM based on quantization error; and (4) use the U-matrix method to visualize plausible discernible clusters (optic nerve damage subtypes). In step 1, the morphological features need to be scaled since SOM uses the Euclidean distance to compare attribute vectors [53]. Thus, we scale all features to [0, 1] using a simple linear transformation.

In step 2, we train several SOMs to select the best model. Before initializing the training process, we must select the shape and size of the SOM grid. Here, we use a sheet shape with a hexagonal lattice. The SOM was trained in two phases [55]: (i) a rough training phase and (ii) a fine-tuning phase. In the rough training phase, the neighborhood radius started with a large initial value, which was progressively decreased until it reached a fixed minimum value (greater than or equal to the initial radius of the fine-tuning phase). At this point, the rough organization of the map is used as the starting point for fine-tuning the SOM. In the fine-tuning phase, the neighborhood radius started with a small initial value and was progressively decreased until it reached a final radius (usually 1, i.e., the BMU only). The training length (epochs) in the fine-tuning phase was greater than that of the rough training phase. We use the stepwise training algorithm, with the learning rate parameter *α* having a value between 0.9 and 0.1 in rough training and between 0.1 and 0 in the fine-tuning phase.
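
A single iteration of the stepwise algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: the units sit on a rectangular grid (the SOMs in this work use a hexagonal lattice), the neighborhood function is Gaussian, and the decay of *α* and of the radius between iterations is left to the caller. The function name `som_step` is hypothetical.

```python
import math


def som_step(weights, coords, x, alpha, radius):
    """One stepwise SOM update; returns the index of the best-matching unit (BMU).

    weights: list of weight vectors (updated in place)
    coords:  list of (row, col) grid positions, one per unit
    """
    dists = [math.dist(w, x) for w in weights]
    bmu = dists.index(min(dists))
    for j, w in enumerate(weights):
        grid_d2 = ((coords[j][0] - coords[bmu][0]) ** 2
                   + (coords[j][1] - coords[bmu][1]) ** 2)
        h = math.exp(-grid_d2 / (2.0 * radius ** 2))   # Gaussian neighborhood
        for k in range(len(w)):
            w[k] += alpha * h * (x[k] - w[k])
    return bmu


# Hypothetical 2 x 2 map with two-dimensional weight vectors.
weights = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
bmu = som_step(weights, coords, [0.9, 0.8], alpha=0.5, radius=1.0)
print(bmu, weights[bmu])
```

The BMU moves toward the input by the fraction `alpha`, while its grid neighbors move by a smaller amount that depends on the radius.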

In order to select the best SOM models in step 3, we evaluate the quality of each map using the average quantization error *qe* [51]. Smaller average quantization error gives a better trained SOM. A SOM and its U-matrix are often used to visualize the distance structures in the high-dimensional data space. However, to divide the CSLO images (data space) into optic nerve damage subtypes (clusters), additional clustering methods need to be applied [55].
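
The average quantization error can be computed as the mean distance between each input vector and the weight vector of its best-matching unit. The sketch below assumes Euclidean distance, consistent with the scaling step above; the function names and the toy maps are hypothetical.

```python
import math


def average_quantization_error(data, weights):
    """Mean Euclidean distance between each input vector and its best-matching unit."""
    total = 0.0
    for x in data:
        total += min(math.dist(x, w) for w in weights)
    return total / len(data)


def select_best_som(candidates, data):
    """candidates: {name: list of weight vectors}; returns the name with the lowest qe."""
    return min(candidates,
               key=lambda name: average_quantization_error(data, candidates[name]))


# Hypothetical data and two toy maps.
data = [[0.0, 0.0], [1.0, 0.0]]
maps = {"A": [[0.0, 0.0], [1.0, 0.0]], "B": [[0.5, 0.0]]}
print(average_quantization_error(data, maps["B"]))   # 0.5
print(select_best_som(maps, data))                   # A
```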

#### 3.4.2 Phase B: Defining the Cluster Boundaries

While the U-matrix method helps to visualize the plausible clusters (optic nerve damage subtypes) in a SOM, we need an automatic process to objectively determine the clusters’ boundaries. After training a SOM, we can use, for instance, agglomerative clustering with the neighborhood relation as a constraint for the construction of the dendrogram [55] or Ward hierarchical clustering with fuzzy membership [81]. Thus, clustering the CSLO images is carried out in two phases, where we use SOM to produce the units (proto-clusters) that are then clustered in the second phase. The advantage over direct clustering of high-dimensional data (e.g., k-means) is the reduction of computational cost (dimension reduction) and the noise reduction of the two-level SOM [55]. For the second phase, the EM algorithm [57] is used to determine the boundaries of the optic nerve damage subtypes (clusters).

Our approach is guided by the assumption that the units (weight vectors) within the learnt SOM are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Estimation of the number of mixture components (clusters) in the Gaussian mixture model uses the EM algorithm [56]. Once we have selected the best finite mixture model according to the Bayesian Information Criterion (BIC), we assign to each unit in the map the component (cluster) label with the highest probability. Thus, the SOM is partitioned into *K* regions (clusters) according to the component label assigned to each map unit.
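
The labeling step can be sketched as below for a mixture whose parameters have already been estimated. The sketch assumes diagonal covariances for brevity, which is an assumption made here and not a stated property of the fitted model; the component parameters and function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def assign_units(units, components):
    """Label each map unit with the mixture component of highest posterior probability.

    components: list of (mixing weight, mean, variance) tuples.
    """
    labels = []
    for u in units:
        # The posterior is proportional to weight * density, so compare log scores.
        scores = [math.log(w) + log_gauss_diag(u, mean, var)
                  for w, mean, var in components]
        labels.append(scores.index(max(scores)))
    return labels


# Hypothetical two-component mixture and three map units.
components = [(0.5, [0.2, 0.2], [0.01, 0.01]),
              (0.5, [0.8, 0.8], [0.01, 0.01])]
units = [[0.25, 0.2], [0.7, 0.9], [0.1, 0.3]]
print(assign_units(units, components))   # [0, 1, 0]
```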

#### 3.4.3 Monitoring Disease Progression and Identifying Noisy CSLO Images

To monitor disease progression in a patient over time and to identify noisy CSLO images, we use the trained SOM, where map units are grouped based on the EM clustering. For each patient, we map the CSLO images to SOM units and estimate the compactness factor of the mapping cluster. The compactness factor measures the average distance between the centroid of all mapped units and each mapped unit. The lower the compactness factor for a patient, the better the clustering efficiency. The CSLO images in a patient’s series are quite similar over time: the differences are minute and do not warrant large distances between consecutive images. Thus, the CSLO images of a patient are expected to lie close together in the SOM and give a low compactness factor. For instance, if all images are mapped to units in a close neighborhood, then the compactness factor is low since the distances between the mapping cluster centroid and the mapped units are low.
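
The compactness factor described above can be sketched as follows. The text does not specify the coordinate space in which the distance is measured, so the sketch simply uses whatever vectors are supplied for the mapped units; the Euclidean distance and the function name are assumptions.

```python
import math


def compactness_factor(mapped_units):
    """Average Euclidean distance between the centroid of the mapped units and each unit.

    Returns (compactness factor, list of per-image distances to the centroid).
    """
    dims = len(mapped_units[0])
    count = len(mapped_units)
    centroid = [sum(u[d] for u in mapped_units) / count for d in range(dims)]
    distances = [math.dist(u, centroid) for u in mapped_units]
    return sum(distances) / count, distances


# Hypothetical mapped units for a series of four images.
cf, dists = compactness_factor([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(cf)
```

The per-image distances returned alongside the factor are the values one would plot over time to inspect a patient's series.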

It is possible to observe disease progression if the mapping cluster (under reasonable dispersion) spans two or more SOM clusters (optic nerve damage subtypes). Based on the temporal order of the CSLO images, it is possible to visualize the disease progression trend between an initial damage subtype and a final damage subtype, with or without additional subtypes in the trend. Furthermore, it is possible to detect noisy CSLO images in a sequence if a CSLO image is mapped to a unit far away from the mapping cluster’s centroid and falls into a different cluster, or if the temporal order of the CSLO images’ damage subtypes is not respected (e.g., the sequence (subtype I, subtype II, subtype I)). Currently, the identification of disease progression and noisy images is carried out by visualizing the mapped units in the SOM and the evolution of the distance to the mapping cluster’s centroid over the series of CSLO images.
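
The two criteria are currently assessed visually. Purely as an illustration of how they might be checked automatically, the sketch below flags distance spikes and temporal reversions. The median-based threshold, its factor of 3, the function names, and the example series are all hypothetical choices and are not part of the method described here.

```python
import statistics


def distance_spikes(distances, factor=3.0):
    """Indices of images whose distance to the centroid exceeds factor x the median."""
    threshold = factor * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


def has_reversion(subtypes):
    """True if a subtype reappears after a different subtype, e.g. (I, II, I)."""
    left_behind = set()
    previous = None
    for s in subtypes:
        if s != previous:
            if s in left_behind:
                return True
            if previous is not None:
                left_behind.add(previous)
            previous = s
    return False


# Hypothetical series of six images for one patient.
distances = [0.02, 0.03, 0.02, 0.03, 0.25, 0.03]
subtypes = [1, 1, 2, 2, 4, 2]
print(distance_spikes(distances))   # [4], i.e., the fifth image
print(has_reversion(subtypes))      # True
```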

## 4 Results and Discussion

To illustrate our approach, we use the CSLO image datasets (see Section 3.1) to detect glaucomatous optic discs using Zernike moments (stage 2) and to identify glaucoma subtypes, monitor disease progression, and identify noisy CSLO images using morphological features (stage 3).

### 4.1 Classification of CSLO Images

#### 4.1.1 Selecting the Best Feature Subset MFS

All classifiers exhibited a similar classification accuracy trend: they start with a relatively high accuracy with feature subset *F*_{2}, and then the accuracy drops with the addition of moments from the next few orders. Later, the accuracy picks up again until it reaches its maximum. After the accuracy peak, the classification accuracy of subsets with high order moments is relatively lower than the peak. Thus, adding more features beyond this threshold does not improve classification accuracy. This verifies that the idea of finding a best feature subset (MFS) that excludes high order moments is reasonable. Furthermore, the classification accuracy results show that feature subsets containing only lower order moments can produce a higher classification accuracy than the full set of moments. Therefore, feature selection for the CSLO dataset is necessary.

For SLP and MLP, the feature subset *F*_{9} containing moments up to order 9 (28 moments) generates the best classifiers (accuracy of 0.713 and 0.740, respectively). For SLP, the mean classification accuracies for feature subsets before and after *F*_{9} are 0.682 and 0.679. For MLP, the mean classification accuracies for feature subsets before and after *F*_{9} are 0.694 and 0.726. For SVM, the feature subset *F*_{12} containing moments up to order 12 (47 moments) gives us the most accurate classifier (accuracy of 0.870). The mean classification accuracies for feature subsets before and after *F*_{12} are 0.730 and 0.781, respectively. Thus, the feature subset *F*_{12} (moments \( {Z}_2^0 \) to \( {Z}_{12}^{12} \)) is selected as MFS since its accuracy (SVM classifier) is better than the accuracy for *F*_{9} (SLP and MLP classifiers). It should be noted that the MLP classifier using the feature subset *F*_{12} produced the second highest accuracy level (0.738) among all MLP classifiers. Thus, combining the results of MLP and SVM to select the best feature subset MFS can also alleviate the overfitting that results from wrapper models.

#### 4.1.2 Selecting the Optimal Feature Subset OMFS

Using the best feature subset MFS obtained from the wrapper method, we obtain the optimal feature subset OMFS by using the Bayesian filter method. We learn a Bayesian network structure based on MFS and use Markov Blanket to select features for OMFS. Then, we use OMFS to train classifiers. In the previous pass, the generated MFS has selected 47 features (moments \( {Z}_2^0 \) to \( {Z}_{12}^{12} \)). After step 1 (discretization), only 18 moments are used for training. Using the Chi-squared test, the moments (features) are ordered as follows (high to low score): \( {Z}_2^0 \), \( {Z}_{12}^4 \), \( {Z}_7^3 \), \( {Z}_9^3 \), \( {Z}_8^4 \), \( {Z}_8^8 \), \( {Z}_4^2 \), \( {Z}_4^0 \), \( {Z}_{11}^3 \), \( {Z}_2^2 \), \( {Z}_9^7 \), \( {Z}_{10}^8 \), \( {Z}_{11}^5 \), \( {Z}_4^4 \), \( {Z}_{12}^{10} \), \( {Z}_{12}^8 \), \( {Z}_{12}^6 \), and \( {Z}_6^2 \). Then, K2 uses the ordered features to learn the BN structure. According to the BN structure, the Markov Blanket of the class is used to generate the optimal feature subset OMFS. The six moments selected from the Markov Blanket are \( {Z}_2^0 \), \( {Z}_4^2 \), \( {Z}_7^3 \), \( {Z}_8^4 \), \( {Z}_{11}^5 \), and \( {Z}_{12}^{10} \).

Classification results for Bayesian classifiers using original and Chi-squared ordered features

Feature ordering | Sensitivity | Specificity | Accuracy |
---|---|---|---|
Original order | 0.80 | 0.73 | 0.772 |
Chi-squared order | 0.80 | 0.82 | 0.809 |

Classification results on the Optimal Feature Subset (OMFS)

Classifier | Sensitivity | Specificity | Accuracy | AUROC |
---|---|---|---|---|
BN | 0.86 | 0.80 | 0.838 | 0.913 |
SVM | 0.85 | 0.71 | 0.803 | 0.853 |
MLP | 0.86 | 0.61 | 0.728 | 0.804 |

Based on the six features in OMFS, the BN classifier is better (accuracy of 0.8382) than the SVM (accuracy of 0.803) and MLP (accuracy of 0.728) classifiers using the same features. The sensitivity of the BN classifier (0.86) is slightly better than that of the SVM classifier (0.85) and equal to that of the MLP classifier (0.86). However, the specificity of the BN classifier (0.80) is better than that of the SVM (0.71) and MLP (0.61) classifiers. Finally, the AUROC of the BN classifier (0.913) is better than that of SVM (0.853) and MLP (0.804). It should be noted that the BN classifier using OMFS has a slightly lower classification accuracy (0.838) than the SVM classifier (0.867) using the features in MFS (Fig. 2). However, the number of features in MFS (47 moments) is almost an order of magnitude larger than the number of features in OMFS (six moments). Thus, the BN classifier trained with OMFS makes it possible to distinguish between healthy and glaucomatous optic nerve images with far fewer features (6 vs 47 moments) without substantially compromising overall accuracy (only 0.029 difference).
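
For reference, the three threshold-based metrics compared above follow directly from the confusion matrix, with glaucoma treated as the positive class. The counts in this sketch are hypothetical and are not the counts obtained in our experiments.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts
    (glaucoma is treated as the positive class)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=43, fn=7, tn=40, fp=10)
print(round(sens, 3), round(spec, 3), round(acc, 3))   # 0.86 0.8 0.83
```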

### 4.2 Clustering of CSLO Images

We trained 10 SOMs (*M*_{1} to *M*_{10}) with different learning parameters. All SOMs use the Gaussian neighborhood function and the shape of the grid is a sheet with a hexagonal lattice. The different grid sizes (rows × columns) are 27 × 11 (*M*_{1}, *M*_{7}, and *M*_{8}), 21 × 14 (*M*_{2}, *M*_{9}, and *M*_{10}), and 20 × 15 (*M*_{3}–*M*_{6}). The SOM initialization is linear (*M*_{1}–*M*_{3} and *M*_{5}–*M*_{9}) or random (*M*_{4} and *M*_{10}). The batch training algorithm (*M*_{1}–*M*_{4}, *M*_{7}, *M*_{9}, and *M*_{10}) and the stepwise learning algorithm (*M*_{5}, *M*_{6}, and *M*_{8}) are used. For the rough training, the initial radius values are 10 (*M*_{3}–*M*_{6}), 11 (*M*_{2}, *M*_{9}, and *M*_{10}), and 14 (*M*_{1}, *M*_{7}, and *M*_{8}), the final radius values are 2 (*M*_{3}, *M*_{4}, and *M*_{5}), 2.5 (*M*_{4}), and 3 (*M*_{1}, *M*_{2}, and *M*_{7}–*M*_{10}), the training lengths (epochs) are 10 (*M*_{4}, *M*_{7}, and *M*_{8}) and 100 (*M*_{1}–*M*_{3}, *M*_{5}, *M*_{6}, *M*_{9}, and *M*_{10}), and the initial training rate *α* for the stepwise learning algorithm is 0.5 (*M*_{5}, *M*_{6}, and *M*_{8}). For the fine-tuning phase, the initial radius values are the rough training final radius values, the final radius is 1, the training lengths (epochs) are 2000 (*M*_{9}), 1000 (*M*_{1}–*M*_{3}, *M*_{5}, *M*_{6}, and *M*_{10}), and 100 (*M*_{4}, *M*_{7}, and *M*_{8}), and the initial training rate *α* for the stepwise learning algorithm is 0.1 (*M*_{5}) and 0.05 (*M*_{6} and *M*_{8}). Table 3 shows that *M*_{3} (300 units) has the lowest average quantization error (*qe*) and is selected as the final SOM.

The 10 trained SOMs

SOMs | *M*_{1} | *M*_{2} | *M*_{3} | *M*_{4} | *M*_{5} | *M*_{6} | *M*_{7} | *M*_{8} | *M*_{9} | *M*_{10} |
---|---|---|---|---|---|---|---|---|---|---|
*qe* | 0.1278 | 0.1279 | | 0.1284 | 0.1279 | 0.1290 | 0.1282 | 0.1386 | 0.1279 | 0.1279 |

*K* and observed data). We search for the model with the best *K* components (clusters) by performing EM on nine models with different numbers of components *K* (2 to 10). For each model, we initialized EM using the 10 random restarts method. To compare the models, we use the absolute difference *ΔBIC* [82] between the maximum *BIC* value and the *BIC* value of a model with *K* components (clusters). We select the model with *ΔBIC* = 0 as the finite mixture model for clustering the map units. Thus, models with *ΔBIC* close to 0 are better models than models with *ΔBIC* far from 0. It should be noted that if *ΔBIC* is very low (2 or less), then the model can also be considered as a good model [82]. Table 4 shows that the best model has five components (clusters), followed by eight and six components. Since *ΔBIC* of the models with eight and six components is more than 2, the model with five components is the only one selected. Using the finite mixture model with five components, we label each map unit in the SOM with the most probable component (cluster) that generated the map unit. Thus, we use the component labels to create the five cluster boundaries in the SOM.

Number of clusters vs *ΔBIC* values

Number of clusters | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
ΔBIC | 565 | 250 | 234 | 0 | 174 | 187 | 138 | 270 | 429 |
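
The model comparison can be sketched as follows, assuming the convention used in the text that the largest *BIC* value marks the best model. The *BIC* values below are hypothetical and are chosen only to illustrate the rule; the function names are also assumptions.

```python
def delta_bic(bic_by_k):
    """ΔBIC per model: absolute difference between the maximum BIC and each model's BIC."""
    best = max(bic_by_k.values())
    return {k: abs(best - bic) for k, bic in bic_by_k.items()}


def candidate_models(bic_by_k, tolerance=2.0):
    """Numbers of components whose ΔBIC is within the tolerance (0 marks the best model)."""
    return sorted(k for k, d in delta_bic(bic_by_k).items() if d <= tolerance)


# Hypothetical BIC values for three mixture models.
bic = {4: 1200.0, 5: 1434.0, 6: 1260.0}
print(delta_bic(bic))          # {4: 234.0, 5: 0.0, 6: 174.0}
print(candidate_models(bic))   # [5]
```

With a tolerance of 2, a second model would be retained only if its *BIC* were within 2 of the maximum.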

*Cluster 1* represents predominantly healthy discs, small to medium-sized, with small and shallow cups (healthy subtype 1); *Cluster 2* represents predominantly healthy discs with various cup appearances (healthy subtype 2); *Cluster 3* represents predominantly optic discs with severe glaucomatous damage and large and deep cupping (glaucoma subtype 3); *Cluster 4* represents predominantly small to medium-sized discs with suspicious glaucomatous appearance and moderate cupping (glaucoma subtype 4); and *Cluster 5* represents predominantly glaucomatous discs of moderate to large size with large concentric cupping (glaucoma subtype 5). It is interesting to note that the upper left cluster (i.e., cluster 1) represents healthy discs and that, going down the diagonal to the lower right end of the SOM, the clusters progressively represent a decline in the health of the optic disc.

### 4.3 Visualization of Glaucoma Progression

The visualization of the disease progression for a patient over a period of time is based on the trained SOM with its five subtypes (clusters). CSLO images taken over time for a patient are mapped onto the SOM map units. Patterns of mapped units indicate potential progression of the disease from one cluster to another, where each cluster represents optic nerve damage subtypes.

*AAB* indicates that the first and second images map to unit *A* and the third image maps to unit *B*. Figure 5a, b shows the disease progression of one glaucoma patient where the six images fall into six map units crossing three different clusters (subtypes) with a compactness factor of 0.1468. In the trained SOM (Fig. 5a), the first and second images are mapped to units *A* and *B* in cluster 1 (healthy subtype 1), the third to fifth images are mapped to units *C* to *E* in cluster 2 (healthy subtype 2), and the sixth image is mapped to unit *F* in cluster 5 (glaucoma subtype 5). While the distances to centroid (Fig. 5b) of the images in units *A*, *B*, *E*, and *F* are high, the series fits a pattern of disease progression in which the disease progresses over time, as shown by the dispersion of the six images over three different clusters. In this pattern, the images are equally distributed between the initial, mid, and final stages of the progression trend. There are a lot of changes between the initial and final images, hence the higher distances to the centroid and the higher compactness factor. We should observe a decrease followed by an increase of the distance for images between the initial and final stages of the progression, since the trend first moves closer to the centroid and then farther from it.

Figure 5c, d shows the disease progression of another patient where eight images fall into two different clusters (subtypes) with a compactness factor of 0.06478. In the trained SOM (Fig. 5c), the first image is mapped to unit *A* in cluster 2 (healthy subtype 2) and the remaining images are mapped to units *B* to *D* in cluster 4 (glaucoma subtype 4). While the distances (Fig. 5d) of the images in units *A* and *D* are relatively high, the series fits a pattern of disease progression similar to the previous one, in which the disease progresses over time, as shown by the dispersion of the images over two different clusters. Thus, we observe two main disease progression patterns. The first pattern starts in a healthy subtype (subtype 1 or 2), then goes into glaucoma subtype 5 and progresses toward the severe glaucoma subtype (subtype 3). The second pattern starts in a healthy subtype (subtype 1 or 2), then goes into glaucoma subtype 4 and progresses toward the severe glaucoma subtype (subtype 3).

### 4.4 Identifying Noisy CSLO Images

*A* and unit *C* in cluster 1 (healthy subtype 1), the second, third, and sixth images are mapped to unit *B* in cluster 2 (healthy subtype 2), and the fifth image is mapped to unit *D* in cluster 4 (glaucoma subtype 4). Five images gather in one close neighboring area (*A*, *B*, and *C*), while one image is far away from the center region and falls into a different cluster (*D*). The distance to centroid (Fig. 6b) of the mapped unit of the fifth image is relatively high, while the distances of the other mapped units are relatively low. Thus, the fifth image is identified as a single noisy image for two reasons. First, the image does not respect the temporal constraints of disease progression (once an image falls into another glaucoma subtype, the series cannot revert to a previous subtype over time). Second, the distance to centroid of the mapped unit of the fifth image is high (spike), while it is low for the other units (high proximity between the images).

Figure 6c, d shows the sequence of CSLO images of a glaucomatous patient where 15 images are dispersed over two different clusters with a compactness factor of 0.037391. In the trained SOM (Fig. 6c), the 14th image is mapped to unit *D* in cluster 4 (glaucoma subtype 4), while the remaining images are mapped to three units (*A*, *B*, *C*) in one close neighboring area in cluster 3 (glaucoma subtype 3). The distance to centroid (Fig. 6d) of the mapped unit of the 14th image is relatively high, while the distances of the other mapped units are relatively low. Thus, the 14th image is identified as a single noisy image for the same reasons as in the previous example (disease progression temporal constraints and a spike in the distance to centroid).

### 4.5 Discussion

Several works have proposed classification approaches to detect glaucomatous optic nerves from CSLO images. For instance, Adler et al. [83] evaluated several classifiers using features extracted from CSLO images. Their best classifiers have a classification performance similar to that of our approach: the specificity and sensitivity of their best model (random forest) are 0.8279 and 0.8656, respectively, which is close to our BN classifier using OMFS (0.80 and 0.86, respectively). However, unlike our automated approach to extracting features from CSLO images, they use CSLO image features that require manual outlining of the optic disc. Twa et al. [28] proposed an approach to automatically extract morphological features of the optic nerve using radial polynomials (pseudo-Zernike moments). This approach is similar to ours but uses decision trees to evaluate feature subsets. Their best decision tree is trained with three pseudo-Zernike moments and has a sensitivity of 0.69, a specificity of 0.88, and an accuracy of 0.8. While they use fewer features (three pseudo-Zernike moments) than our approach (six Zernike moments), their classifier performs worse than ours in detecting glaucomatous CSLO images (sensitivity difference of 0.17). While most approaches can classify glaucomatous CSLO images, they are unable to monitor disease progression over time. Most approaches that monitor disease progression use classification to detect whether there is progression in the current images based on the changes observed since the previous CSLO images. For instance, Belghith et al. [23] use an MRF to model the change detection map between a pair of CSLO images. Fuzzy classification is carried out on the estimated change detection map to classify CSLO images into non-progressing and progressing glaucoma classes. The classifier sensitivity (progressing) is 0.86 and its specificity (non-progressing) is 0.88. However, the classifier was validated on a dataset with only glaucomatous images (progressing and stable glaucoma), without an initial classification between glaucomatous and healthy images. In our approach, the first step is to classify glaucomatous and healthy CSLO images using Zernike moments and the second step is to train a SOM for visualizing disease progression using morphological features, where it is also possible to identify noisy images.

Recent approaches have used deep learning [84] (e.g., convolutional neural networks) to detect glaucoma in optical images with high accuracy [85]. Raghavendra et al. [86] used a convolutional neural network to extract robust features from digital fundus images and achieved an accuracy of 0.98 for classifying normal and glaucoma images. However, it has been noted that deep learning models need improvement when they are used to investigate the underlying patterns in images [87]. For our purpose, where the topological organization of the clusters is important for tracking the progression of the disease across different glaucoma subtypes, deep learning was suboptimal for the following reasons: (a) a large amount of training data is needed to achieve breakthrough improvements in feature extraction and classification performance; extracting generalized features from CSLO images would require a large dataset (e.g., 1 million images), whereas we are working with a much smaller one; (b) deep models behave like black boxes, which makes the learned models difficult to understand and interpret intuitively, whereas for medical decision making it is important that human experts can understand and validate the decision models; and (c) the output of deep models cannot be rendered on a 2D topological plane illustrating smooth boundaries between adjoining clusters.

Overall, the main advantages of our approach are: (1) automatic feature extraction from CSLO images without manual outlining of the optic disc; (2) the use of few features for classification without compromising the classification accuracy for glaucomatous and healthy CSLO images (e.g., 6 moments vs 254 moments); (3) monitoring of disease progression by visualizing CSLO images mapped onto a SOM trained with 17 morphological features, where map units are clustered into healthy and glaucomatous subtypes; and (4) identification of noisy images based on the visualization of the temporal sequence of CSLO images on the trained SOM and the distance between each mapped unit and the centroid of all mapped units. It should be noted that the CSLO image clustering can be applied to other shape-defining features. For instance, in Abidi et al. [30], the CSLO image clustering uses a subset of 47 Zernike moments as features. The resulting SOM had four clusters (one healthy subtype and three glaucoma subtypes), but visualization and interpretation of disease progression patterns are more informative with the SOM resulting from morphological features (e.g., two main disease progression patterns).

The proposed approach has some shortcomings. While the classification step can select a good feature subset with few Zernike moments, the subset is suboptimal because the selection removes all high-order moments, a few of which could improve classification accuracy. One way to address this limitation is to apply forward selection to the selected OMFS to add useful high-order moments. To improve the Bayesian filter method, other approaches can be used to learn the BN structure. Also, the feature selection process is not the most efficient, since the wrapper method trains the SVM and MLP and evaluates their accuracy for 29 feature subsets. One way to improve the feature selection is to set a classification accuracy threshold and select the first feature subset for which one of the classifiers reaches it (if the threshold is never reached, all subsets are evaluated). Finally, the current approach does not objectively identify disease progression and noisy images (it only visualizes them on the SOM). Further analysis can be done to identify outliers (noisy images) and to classify disease progression patterns.
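One possible way to make the identification of noisy images objective is to turn the centroid distance mentioned above into an explicit rule. The sketch below is an illustration under stated assumptions rather than part of the evaluated framework: the function name `flag_noisy_images`, the use of SOM grid coordinates, and the cutoff of `factor` times the median distance are all hypothetical choices.

```python
from math import dist
from statistics import mean, median


def flag_noisy_images(units, factor=2.0):
    """Flag images whose mapped SOM unit lies far from the patient's centroid.

    units  : list of (row, col) grid coordinates, one per CSLO image in a
             patient's temporal sequence (assumed representation).
    factor : hypothetical cutoff multiplier applied to the median distance.
    Returns the indices of the flagged images.
    """
    centroid = (mean(r for r, _ in units), mean(c for _, c in units))
    distances = [dist(unit, centroid) for unit in units]
    cutoff = factor * median(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Hypothetical sequence: four images map close together, the fifth far away.
sequence = [(2, 3), (2, 4), (3, 3), (3, 4), (9, 10)]
print(flag_noisy_images(sequence))  # index 4 is flagged
```

A median-based cutoff is used here because a single distant image inflates the mean and standard deviation of a short sequence. Note that such a rule cannot by itself distinguish a noisy image from genuine rapid progression, so flagged images would still need expert review.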

It should be noted that the proposed framework can use other classifiers for the feature selection process and for classifying CSLO images. However, since the objective was to evaluate each component of the proposed framework, only function-based and Bayesian classifiers were used (MLP, SVM, BN), excluding tree-based classifiers as they cannot work with Zernike moments. Also, while other features can be used to describe CSLO images, the extracted Zernike moments are invariant with respect to the position, size, and orientation of the object of interest (optic disc) and provide shape characteristics that are invariant to linear transformations. Since Zernike moments can also be used with other medical images [88] (e.g., CT and MRI), the proposed framework can also be applied to analyze other diagnostic images to provide automated interpretation and decision support.
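The orientation invariance can be illustrated with a small numerical check. The sketch below uses the standard definition of Zernike moments with a simple pixel-centre mapping onto the unit disc; it is not the extraction code used in this work, and the helper names and the 8x8 test pattern are hypothetical. It demonstrates only that the magnitude of a moment is unchanged by rotation; invariance to position and size is usually obtained by first centring and rescaling the object, which is not shown.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); needs n - |m| even and |m| <= n."""
    m = abs(m)
    return sum(
        (-1) ** s * factorial(n - s)
        / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s))
        * rho ** (n - 2 * s)
        for s in range((n - m) // 2 + 1)
    )


def zernike_moment(image, n, m):
    """Discrete Zernike moment Z_nm of a square image mapped onto the unit disc."""
    size = len(image)
    centre = (size - 1) / 2
    radius = size / 2
    total = 0j
    for row in range(size):
        for col in range(size):
            x = (col - centre) / radius
            y = (row - centre) / radius
            rho = hypot(x, y)
            if rho <= 1.0:  # pixels outside the unit disc are ignored
                theta = atan2(y, x)
                total += image[row][col] * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return (n + 1) / pi * total


def rotate90(image):
    """Rotate a square image by 90 degrees (exact on a pixel grid)."""
    return [list(r) for r in zip(*image[::-1])]


# Hypothetical test pattern: an off-centre blob on an 8x8 grid.
image = [[1.0 if (r - 2) ** 2 + (c - 5) ** 2 <= 4 else 0.0 for c in range(8)]
         for r in range(8)]
rotated = rotate90(image)

for n, m in [(1, 1), (2, 2), (3, 1)]:
    before = abs(zernike_moment(image, n, m))
    after = abs(zernike_moment(rotated, n, m))
    print(n, m, round(before, 6), round(after, 6))  # magnitudes agree
```

A 90-degree rotation is used because it maps the pixel grid onto itself exactly. For arbitrary angles, interpolation and discretization introduce small deviations in the magnitudes.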

## 5 Concluding Remarks

In this paper, we have presented a data mining framework to provide decision support to clinicians based on objective analysis of medical images. This framework was applied to provide decision support for the diagnosis and monitoring of glaucoma from CSLO images. Our framework can discriminate between healthy and glaucomatous optic discs by automatically classifying CSLO images to provide a binary diagnosis (the image is healthy or glaucomatous). Classification of CSLO images is based on shape information extracted using image processing techniques (Zernike moments). Zernike moments are automatically extracted from CSLO images, whereas the traditional approach to analyzing CSLO images (morphological features) requires interaction from clinicians (manual outlining of the optic disc boundaries) or can fail (when the topographical surface cannot be approximated by the disc model). To alleviate the curse of dimensionality, we have developed a feature selection strategy (wrapper and filter selection) that identifies the most salient image-defining features without compromising the diagnostic (classification) accuracy.

A unique aspect of the framework is the discovery and visualization of subtypes of glaucomatous optic disc damage in terms of clusters of similar images, using morphological features from mathematical models of the optic nerve head fitted to CSLO images. This makes it possible to subclassify glaucoma patients which, from a personalized medicine perspective, allows precise treatments to be administered in line with the specific morphological patterns of the optic disc damage. For each patient, we can visualize the dispersion of multiple observations (CSLO images) inside or across clusters due to changes in the optic disc over time. This visualization of temporal progression makes it possible to monitor a patient's disease progression over time and to identify noisy CSLO images for exclusion from any diagnostic decision making. The proposed discovery strategy was previously applied to Zernike moments, but visualization and interpretation of disease progression patterns were more informative with morphological features, since clusters correspond to specific morphological subtypes and can be interpreted by experts (well-defined morphological features).

The framework was validated on real data (Zernike moments and morphological features of CSLO images) to show the feasibility of our framework. The results have shown that this framework can discriminate healthy and glaucomatous CSLO images and can discover glaucoma damage subtypes. We believe that our framework is a promising step forward to support glaucoma diagnostics and monitoring by automatically and objectively analyzing optic nerve images. Our approach is generic and can be extended to analyze other diagnostic images for the purposes of automated interpretation and subsequent decision support.

## Notes

### Compliance with Ethical Standards

### Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

## References

- 1. Stamper RL, Lieberman MF, Drake MV (2009) Primary open angle glaucoma. In: Stamper RL, Lieberman MF, Drake MV (eds) Becker-Shaffer’s diagnosis and therapy of the glaucomas, 8th edn. Mosby/Elsevier, Edinburgh, pp 239–265. https://doi.org/10.1016/B978-0-323-02394-8.00017-6
- 2. Dielemans I, de Jong PTVM, Stolk R, Vingerling JR, Grobbee DE, Hofman A (1996) Primary open-angle glaucoma, intraocular pressure, and diabetes mellitus in the general elderly population. Ophthalmology 103:1271–1275. https://doi.org/10.1016/S0161-6420(96)30511-3
- 3. Brandt JD (2004) Corneal thickness in glaucoma screening, diagnosis, and management. Curr Opin Ophthalmol 15:85–89
- 4. Prum BE, Rosenberg LF, Gedde SJ, Mansberger SL, Stein JD, Moroi SE, Herndon LW, Lim MC, Williams RD (2016) Primary Open-Angle Glaucoma Preferred Practice Pattern® guidelines. Ophthalmology 123:P41–P111. https://doi.org/10.1016/j.ophtha.2015.10.053
- 5. Jampel HD, Friedman D, Quigley H, Vitale S, Miller R, Knezevich F, Ding Y (2009) Agreement among glaucoma specialists in assessing progressive disc changes from photographs in open-angle glaucoma patients. Am J Ophthalmol 147:39–44.e1. https://doi.org/10.1016/j.ajo.2008.07.023
- 6. Azuara-Blanco A, Katz LJ, Spaeth GL, Vernon SA, Spencer F, Lanzl IM (2003) Clinical agreement among glaucoma experts in the detection of glaucomatous changes of the optic disk using simultaneous stereoscopic photographs. Am J Ophthalmol 135:949–950. https://doi.org/10.1016/S0002-9394(03)00480-X
- 7. Coops A, Henson DB, Kwartz AJ, Artes PH (2006) Automated analysis of Heidelberg retina tomograph optic disc images by glaucoma probability score. Invest Ophthalmol Vis Sci 47:5348. https://doi.org/10.1167/iovs.06-0579
- 8. Kotowski J, Wollstein G, Ishikawa H, Schuman JS (2014) Imaging of the optic nerve and retinal nerve fiber layer: an essential part of glaucoma diagnosis and monitoring. Surv Ophthalmol 59:458–467. https://doi.org/10.1016/j.survophthal.2013.04.007
- 9. Mistlberger A, Liebmann JM, Greenfield DS, Pons ME, Hoh S-T, Ishikawa H, Ritch R (1999) Heidelberg retina tomography and optical coherence tomography in normal, ocular-hypertensive, and glaucomatous eyes. Ophthalmology 106:2027–2032. https://doi.org/10.1016/S0161-6420(99)90419-0
- 10. Zinser G, Wijnaendts-van-Resandt RV, Dreher AW, Weinreb RN, Harbarth U, Schroder H, Burk RO (1989) Confocal laser tomographic scanning of the eye. In: Wampler JE (ed) Proc. SPIE 1161, New Methods in Microscopy and Low Light Imaging, San Diego, August 7. International Society for Optics and Photonics, Bellingham, pp 337–344
- 11. Wollstein G, Garway-Heath DF, Hitchings RA (1998) Identification of early glaucoma cases with the scanning laser ophthalmoscope. Ophthalmology 105:1557–1563. https://doi.org/10.1016/S0161-6420(98)98047-2
- 12. Swindale NV, Stjepanovic G, Chin A, Mikelberg FS (2000) Automated analysis of normal and glaucomatous optic nerve head topography images. Invest Ophthalmol Vis Sci 41:1730–1742
- 13. Miglior S, Guareschi M, Albe’ E, Gomarasca S, Vavassori M, Orzalesi N (2003) Detection of glaucomatous visual field changes using the Moorfields regression analysis of the Heidelberg retina tomograph. Am J Ophthalmol 136:26–33. https://doi.org/10.1016/S0002-9394(03)00084-9
- 14. Wollstein G, Garway-Heath DF, Fontana L, Hitchings RA (2000) Identifying early glaucomatous changes: comparison between expert clinical assessment of optic disc photographs and confocal scanning ophthalmoscopy. Ophthalmology 107:2272–2277. https://doi.org/10.1016/S0161-6420(00)00363-8
- 15. Strouthidis NG, Garway-Heath DF (2008) New developments in Heidelberg retina tomograph for glaucoma. Curr Opin Ophthalmol 19:141–148. https://doi.org/10.1097/ICU.0b013e3282f4450b
- 16. Tipping ME (2001) Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res 1:211–244
- 17. Strouthidis NG, Demirel S, Asaoka R, Cossio-Zuniga C, Garway-Heath DF (2010) The Heidelberg retina tomograph glaucoma probability score: reproducibility and measurement of progression. Ophthalmology 117:724–729. https://doi.org/10.1016/j.ophtha.2009.09.036
- 18. Iester M, Oddone F, Prato M, Centofanti M, Fogagnolo P, Rossetti L, Vaccarezza V, Manni G, Ferreras A (2013) Linear discriminant functions to improve the glaucoma probability score analysis to detect glaucomatous optic nerve heads. J Glaucoma 22:73–79. https://doi.org/10.1097/IJG.0b013e31823298b3
- 19. Banister K, Boachie C, Bourne R, Cook J, Burr JM, Ramsay C, Garway-Heath D, Gray J, McMeekin P, Hernández R, Azuara-Blanco A (2016) Can automated imaging for optic disc and retinal nerve fiber layer analysis aid glaucoma detection? Ophthalmology 123:930–938. https://doi.org/10.1016/j.ophtha.2016.01.041
- 20. Zhu H, Poostchi A, Vernon SA, Crabb DP (2014) Detecting abnormality in optic nerve head images using a feature extraction analysis. Biomed Opt Express 5:2215–2230. https://doi.org/10.1364/BOE.5.002215
- 21. Bowd C, Chan K, Zangwill LM, Goldbaum MH, Lee T-W, Sejnowski TJ, Weinreb RN (2002) Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci 43:3444–3454
- 22. Park J-M, Reed J, Zhou Q (2002) Active feature selection in optic nerve data using support vector machine. In: Fogel DB (ed) Proc. of the 2002 International Joint Conference on Neural Networks (IJCNN’02), May 12-17, Honolulu, Hawaii. IEEE, Piscataway, pp 1178–1182
- 23. Belghith A, Balasubramanian M, Bowd C, Weinreb RN, Zangwill LM (2014) A unified framework for glaucoma progression detection using Heidelberg retina tomograph images. Comput Med Imaging Graph 38:411–420. https://doi.org/10.1016/j.compmedimag.2014.03.002
- 24. Mardin CY, Hothorn T, Peters A, Jünemann AG, Nguyen NX, Lausen B (2003) New glaucoma classification method based on standard Heidelberg retina tomograph parameters by bagging classification trees. J Glaucoma 12:340–346
- 25. Bowd C, Lee I, Goldbaum MH, Balasubramanian M, Medeiros FA, Zangwill LM, Girkin CA, Liebmann JM, Weinreb RN (2012) Predicting glaucomatous progression in glaucoma suspect eyes using relevance vector machine classifiers for combined structural and functional measurements. Invest Ophthalmol Vis Sci 53:2382–2389. https://doi.org/10.1167/iovs.11-7951
- 26. Racette L, Chiou CY, Hao J, Bowd C, Goldbaum MH, Zangwill LM, Lee T-W, Weinreb RN, Sample PA (2010) Combining functional and structural tests improves the diagnostic accuracy of relevance vector machine classifiers. J Glaucoma 19:167–175. https://doi.org/10.1097/IJG.0b013e3181a98b85
- 27. Horn FK, Lämmer R, Mardin CY, Jünemann AG, Michelson G, Lausen B, Adler W (2012) Combined evaluation of frequency doubling technology perimetry and scanning laser ophthalmoscopy for glaucoma detection using automated classification. J Glaucoma 21:27–34. https://doi.org/10.1097/IJG.0b013e3182027766
- 28. Twa MD, Parthasarathy S, Johnson CA, Bullimore MA (2012) Morphometric analysis and classification of glaucomatous optic neuropathy using radial polynomials. J Glaucoma 21:302–312. https://doi.org/10.1097/IJG.0b013e31820d7e6a
- 29. Broadway DC, Nicolela MT, Drance SM (2003) Optic disc morphology on presentation of chronic glaucoma. Eye 17:798; author reply 799. https://doi.org/10.1038/sj.eye.6700478
- 30. Abidi SSR, Artes PH, Yan S, Yu J (2007) Automated interpretation of optic nerve images: a data mining framework for glaucoma diagnostic support. In: Kuhn KA, Warren JR, Leong T-Y (eds) MEDINFO 2007: building sustainable health systems. IOS Press, Amsterdam, pp 1309–1313
- 31. Liao SX, Pawlak M (1998) On the accuracy of Zernike moments for image analysis. IEEE Trans Pattern Anal Mach Intell 20:1358–1364. https://doi.org/10.1109/34.735809
- 32. Hu M-K (1962) Visual pattern recognition by moment invariants. IEEE Trans Inf Theory 8:179–187. https://doi.org/10.1109/TIT.1962.1057692
- 33. Teague MR (1980) Image analysis via the general theory of moments. J Opt Soc Am 70:920–930. https://doi.org/10.1364/JOSA.70.000920
- 34. Hosny KM (2010) A systematic method for efficient computation of full and subsets Zernike moments. Inf Sci (NY) 180:2299–2313. https://doi.org/10.1016/j.ins.2010.02.006
- 35. Papakostas GA, Boutalis YS, Karras DA, Mertzios BG (2007) A new class of Zernike moments for computer vision applications. Inf Sci (NY) 177:2802–2819. https://doi.org/10.1016/j.ins.2007.01.010
- 36. Teh C-H, Chin RT (1988) On image analysis by the methods of moments. IEEE Trans Pattern Anal Mach Intell 10:496–513. https://doi.org/10.1109/34.3913
- 37. Khotanzad A, Hong YH (1990) Invariant image recognition by Zernike moments. IEEE Trans Pattern Anal Mach Intell 12:489–497. https://doi.org/10.1109/34.55109
- 38. Li S, Lee M-C, Pun C-M (2009) Complex Zernike moments features for shape-based image retrieval. IEEE Trans Syst Man Cybern Part A Syst Humans 39:227–237. https://doi.org/10.1109/TSMCA.2008.2007988
- 39. Singh C, Mittal N, Walia E (2011) Face recognition using Zernike and complex Zernike moment features. Pattern Recognit Image Anal 21:71–81. https://doi.org/10.1134/S1054661811010044
- 40. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024
- 41. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
- 42. Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517. https://doi.org/10.1093/bioinformatics/btm344
- 43. Whitney AW (1971) A direct method of nonparametric measurement selection. IEEE Trans Comput C-20:1100–1103. https://doi.org/10.1109/T-C.1971.223410
- 44. Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inf Theory 9:11–17. https://doi.org/10.1109/TIT.1963.1057810
- 45. Tsai C-F, Eberle W, Chu C-Y (2013) Genetic algorithms in feature and instance selection. Knowledge-Based Syst 39:240–247. https://doi.org/10.1016/j.knosys.2012.11.005
- 46. Lin S-W, Ying K-C, Chen S-C, Lee Z-J (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35:1817–1824. https://doi.org/10.1016/j.eswa.2007.08.088
- 47. Hruschka ER, Hruschka ER, Ebecken NFF (2004) Feature selection by Bayesian Networks. In: Tawfik AY, Goodwin SD (eds) Advances in Artificial Intelligence: 17th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2004, London, Ontario, Canada, May 17–19, 2004. Proceedings. Springer, Berlin, pp 370–379
- 48. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347. https://doi.org/10.1007/BF00994110
- 49. Koller D, Sahami M (1996) Toward optimal feature selection. In: Saitta L (ed) Proceedings of the Thirteenth International Conference on Machine Learning (ICML), Bari, Italy, July 3–6, 1996. Morgan Kaufmann, San Mateo, pp 284–292
- 50. Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, Burlington
- 51. Kohonen T (1990) The self-organizing map. Proc IEEE 78:1464–1480. https://doi.org/10.1109/5.58325
- 52. Lötsch J, Ultsch A (2014) Exploiting the structures of the U-matrix. In: Villmann T, Schleif F-M, Kaden M, Lange M (eds) Advances in Self-Organizing Maps and Learning Vector Quantization: Proceedings of the 10th International Workshop, WSOM 2014, Mittweida, Germany, July, 2–4, 2014. Springer, Cham, pp 249–257
- 53. Kohonen T (2013) Essentials of the self-organizing map. Neural Netw 37:52–65. https://doi.org/10.1016/j.neunet.2012.09.018
- 54. Ultsch A, Siemon HP (1990) Kohonen’s self organizing feature maps for exploratory data analysis. In: Widrow B, Angeniol B (eds) Proceedings of the International Neural Network Conference (INNC-90), July 9–13, 1990, Paris, France. Kluwer Academic Publishers, Dordrecht, pp 305–308
- 55. Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Netw 11:586–600. https://doi.org/10.1109/72.846731
- 56. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24:381–396. https://doi.org/10.1109/34.990138
- 57. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
- 58. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464. https://doi.org/10.1214/aos/1176344136
- 59. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr 19:716–723. https://doi.org/10.1109/TAC.1974.1100705
- 60. Yu J, Abidi SSR, Artes PH (2005) A hybrid feature selection strategy for image defining features: towards interpretation of optic nerve images. In: Proceedings of 2005 International Conference on Machine Learning and Cybernetics: August 18–21, 2005, Ramada Hotel, Guangzhou, China. IEEE, Los Alamitos, pp 5127–5132
- 61. Yan S, Abidi SSR, Artes PH (2005) Analyzing sub-classifications of glaucoma via SOM based clustering of optic nerve images. In: Engelbrecht R, Geissbuhler A, Lovis C, Mihalas G (eds) Connecting Medical Informatics and Bio-Informatics: Proceedings of MIE2005 The 19th International Congress of the European Federation for Medical Informatics (MIE2005), Geneva, August 28–September 1, 2005. IOS Press, Amsterdam, pp 483–488
- 62. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324. https://doi.org/10.1016/S0004-3702(97)00043-X
- 63. Riedmiller M (1994) Advanced supervised learning in multi-layer perceptrons: from backpropagation to adaptive learning algorithms. Comput Stand Interfaces 16:265–278. https://doi.org/10.1016/0920-5489(94)90017-5
- 64. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. https://doi.org/10.1038/323533a0
- 65. Thomas P, Suhner M-C (2015) A new multilayer perceptron pruning algorithm for classification and regression applications. Neural Process Lett 42:437–458. https://doi.org/10.1007/s11063-014-9366-5
- 66. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1007/BF00994018
- 67. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167. https://doi.org/10.1023/A:1009715923555
- 68. Yu W, Liu T, Valdez R, Gwinn M, Khoury MJ (2010) Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inform Decis Mak 10:16. https://doi.org/10.1186/1472-6947-10-16
- 69. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory (COLT ’92), Pittsburgh, Pennsylvania, USA, July 27–29, 1992. ACM Press, New York, pp 144–152
- 70. Bowd C, Zangwill LM, Medeiros FA, Hao J, Chan K, Lee T-W, Sejnowski TJ, Goldbaum MH, Sample PA, Crowston JG, Weinreb RN (2004) Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci 45:2255. https://doi.org/10.1167/iovs.03-1087
- 71. Bock R, Meier J, Nyúl LG, Hornegger J, Michelson G (2010) Glaucoma risk index: automated glaucoma detection from color fundus images. Med Image Anal 14:471–481. https://doi.org/10.1016/j.media.2009.12.006
- 72. Acharya UR, Dua S, Du X, Sree SV, Chua CK (2011) Automated diagnosis of glaucoma using texture and higher order spectra features. IEEE Trans Inf Technol Biomed 15:449–455. https://doi.org/10.1109/TITB.2011.2119322
- 73. Goldbaum MH, Sample PA, Chan K, Williams J, Lee T-W, Blumenthal E, Girkin CA, Zangwill LM, Bowd C, Sejnowski T, Weinreb RN (2002) Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci 43:162–169
- 74. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
- 75. Garcia S, Luengo J, Sáez JA, López V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25:734–750. https://doi.org/10.1109/TKDE.2012.35
- 76. Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. of the 13th International Joint Conference on Artificial Intelligence, vol 2, Chambery, France, August 28–September 3, 1993. Morgan Kaufmann Publishers, San Mateo, pp 1022–1027
- 77. Nicolela MT, Drance SM (1996) Various glaucomatous optic nerve appearances. Ophthalmology 103:640–649. https://doi.org/10.1016/S0161-6420(96)30640-4
- 78. Hammel N, Belghith A, Bowd C, Medeiros FA, Sharpsten L, Mendoza N, Tatham AJ, Khachatryan N, Liebmann JM, Girkin CA, Weinreb RN, Zangwill LM (2016) Rate and pattern of rim area loss in healthy and progressing glaucoma eyes. Ophthalmology 123:760–770. https://doi.org/10.1016/j.ophtha.2015.11.018
- 79. Nicolela MT, Drance SM, Broadway DC, Chauhan BC, McCormick TA, LeBlanc RP (2001) Agreement among clinicians in the recognition of patterns of optic disk damage in glaucoma. Am J Ophthalmol 132:836–844. https://doi.org/10.1016/S0002-9394(01)01254-5
- 80. Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
- 81. Sarlin P, Eklund T (2013) Financial performance analysis of European banks using a fuzzified Self-Organizing Map. Int J Knowledge-based Intell Eng Syst 17:223–234. https://doi.org/10.3233/KES-130261
- 82. Raftery AE (1995) Bayesian model selection in social research. Sociol Methodol 25:111. https://doi.org/10.2307/271063
- 83. Lausen B, Adler W, Peters A (2008) Comparison of classifiers applied to confocal scanning laser ophthalmoscopy data. Methods Inf Med 47:38–46. https://doi.org/10.3414/ME0348
- 84. Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26. https://doi.org/10.1016/J.NEUCOM.2016.12.038
- 85. Cerentini A, Welfer D, Cordeiro d’Ornellas M, Pereira Haygert CJ, Dotto GN (2017) Automatic identification of glaucoma using deep learning methods. Stud Health Technol Inform 245:318–321
- 86. Raghavendra U, Fujita H, Bhandary SV, Gudigar A, Tan JH, Acharya UR (2018) Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images. Inf Sci (NY) 441:41–49. https://doi.org/10.1016/J.INS.2018.01.051
- 87. Shen D, Wu G, Suk H-I (2017) Deep learning in medical image analysis. Annu Rev Biomed Eng 19:221–248. https://doi.org/10.1146/annurev-bioeng-071516-044442
- 88. Kumar Y, Aggarwal A, Tiwari S, Singh K (2018) An efficient and robust approach for biomedical image retrieval using Zernike moments. Biomed Signal Process Control 39:459–473. https://doi.org/10.1016/J.BSPC.2017.08.018