Introduction

The retina is the organ that enables humans to capture visuals from the real world. It is a window to the whole body, sharing physiological, embryological, and anatomical characteristics with major organs, including the brain, the heart, and the kidneys. The retina is therefore a vital source for assessing distinct pathological processes and neurological complications associated with mortality risk. The retina refers to the inner surface of the eyeball opposite the lens, including the optic disc, optic cup, macula, fovea, and blood vessels [1, 2]. Fundus images are projections of the fundus captured by a monocular camera on a 2D plane [3]. They play an important role in monitoring the health status of the human eye and of multiple organs [4], and the retina is widely regarded as a gateway for examining neurological complications. Analyzing fundus images and their association with biological traits can help prevent eye diseases and support early diagnosis. The retina allows us to visualize both vascular and neural tissues non-invasively, and its strong association with physiology and vitality suggests deeper associations with biological traits, such as age and gender. Biological traits can be determined by genes, environmental factors, or a combination of both, and can be either qualitative (such as gender or skin color) or quantitative (such as age or blood pressure) [5]. Biological traits are relevant to a variety of systemic and ocular diseases [6]; for instance, females are expected to have longer life expectancies than males in similar living environments [7,8,9,10]. With increasing age, women with reduced estrogen production are predisposed to degenerative eye diseases, including cataracts and age-related macular degeneration [11,12,13]. In contrast, males are more likely to suffer from pigment dispersion glaucoma [14], open-angle glaucoma [15], and diabetic retinopathy [16]. Associating biological traits with fundus images is challenging in clinical practice, where even experts cannot distinguish male from female fundus images or read aging information from them. This study utilizes deep learning (DL) algorithms to estimate biological traits and their association with generated fundus images.

Fundus (retinal) images have been studied for classification, disease identification, and analysis using methods ranging from conventional machine learning (ML) to recent DL [17, 18]. However, much of the work has focused on feature engineering, which involves computing explicit features specified by experts. In contrast, DL is characterized by multiple computational layers that allow an algorithm to learn appropriate predictive features from examples. DL algorithms are continually optimized and reformulated with enhanced features and improvements to address a wider range of problems [19,20,21], and they have been utilized for the classification and detection of different eye diseases, such as diabetic retinopathy and melanoma, with human-comparable results. In conventional ML approaches, the relationship between retinal morphology and systemic health has been extrapolated using multivariable regression; however, such methods show limited ability on large and complex datasets [22, 23]. DL algorithms avoid manual feature engineering and tuning, and make it possible to extract hidden features previously unexplored by conventional methods. DL models have shown significant results on previously challenging tasks, and harnessing their power has innovatively associated retinal structure with pathophysiology. DL models can extract independent features unknown to clinicians; however, they face challenges of explainability and interpretability, which existing work has attempted to address [24]. DL approaches to fundus image analysis are gaining popularity for their easy implementation and high efficiency [25]. It has been shown that DL models can capture subtle pixel-level information in terms of luminance and contrast that humans may not differentiate. These findings underscore the promising ability of DL models to exploit features hidden to humans and to be employed in medical imaging with high efficacy in clinical practice [26].

In clinical studies, experts cannot discriminate subjects based on their fundus images, which emphasizes the importance of employing DL models. The cause and effect of demographic information in fundus images are not readily apparent to domain experts. DL models, in contrast, may enable data-driven discovery of novel approaches to disease biomarker identification and biological trait association. Ophthalmoscopy has therefore been deeply associated with systemic indices of biological traits (such as aging and gender) and diseases. In previous studies, age has been estimated from distinct clinical images, such as brain MRI, facial images, and neuroimaging, using machine learning and deep learning [27,28,29,30]. For instance, brain MRI and facial images have been used for age prediction, emphasizing the potential of trait estimation from fundus images [27,28,29, 31]. The excellent performance in age prediction implies that fast, safe, cost-effective, and user-friendly deep learning models are feasible in larger populations. In addition to aging, fundus images have also been associated with sex by applying logistic regression to several features [32], including the papillomacular angle, retinal vessel angles, and retinal artery trajectory. Various studies have shown retinal morphology differences between the sexes, including retinal and choroidal thickness [33, 34]. The study [26] reported the fovea as an important region for gender classification. Gender prediction thus became possible, a task infeasible even for an ophthalmologist who has spent a whole career examining retinas [35]. Results for age and gender estimation may therefore assist in investigating physiological variations in fundus images corresponding to biological traits [17]. Age estimation and gender classification may not be clinically essential in themselves, but studying age progression based on learned biological traits hints at the potential of DL for discovering novel associations between traits and fundus images. Implementing DL models uncovers additional features from fundus images, resulting in better biological trait association [36].

The successful estimation of age and gender motivates studying age progression effects and evaluating aging status via fundus images. In the study of [17], aging effects were investigated while associating cardiovascular risk factors with fundus images; similarly, large DL models were used to classify fundus images and associate them with physiological traits dependent on patients’ health [17]. Existing algorithms mainly consider the optic disc’s features for gender prediction, consistent with the observations of Poplin et al. [17], whose large deep learning models classified sex and other physiological and behavioral traits associated with patient health based on fundus images. Fundus (retinal) images have also been closely related to age and gender traits through the definition of the ’retinal age gap’, a potential biomarker for aging and risk of mortality [37].

The variational effects of age progression can be visualized in distinct ways, including saliency maps or heat maps over fundus images, which highlight changes that are difficult for ophthalmologists to observe. Such differential visualization can also be used to distinguish male and female subjects. Following the successful classification of the gender trait from fundus images [38], our proposed model (FAG-Net) emphasizes the optic disc area and the learned features while training and learning the association with aging. The optic disc was also considered the main structure for training our deep learning approaches. Similarly, the second proposed model (FGC-Net) utilizes this knowledge to generate different fundus images from a single input fundus image with a list of ages as the label (condition). The details of the proposed models are illustrated in the methodology section.

In the current study, we first trained and successfully evaluated a DL model (FAG-Net) for estimating biological traits, namely age and gender. We then proposed a second DL model (FGC-Net) to learn aging effects and embed them for generation. FGC-Net is evaluated with different age values given a single input fundus image, and the corresponding multiple generated versions are subtracted accordingly to demonstrate the learned effects of age progression. The detailed architecture of both models is illustrated in the methodology section. The rest of the paper is organized as follows: “Introduction” outlines the existing works, “Methodology” demonstrates the methods, “Results” illustrates and analyzes the results, and “Conclusion and future directions” concludes the study with future directions.

Literature study

In previous studies, age and gender have been estimated from distinct imaging modalities, such as brain MRI, facial images, and neuroimaging, using machine learning and deep learning [27,28,29,30]. Brain MRI and facial images have been used for age prediction, emphasizing the potential of trait estimation from fundus images [27,28,29, 31]. In the work by Poplin et al. [17], large deep learning models were used to classify gender and other physiological and behavioral traits associated with patient health based on retinal fundus images. There are a number of studies in which fundus images have been used for age prediction and gender classification using machine learning [26,27,28,29, 31,32,33,34]. Most of them estimated the age and gender of either healthy or unhealthy subjects, whereas the current study examines the age and gender associations of both healthy and unhealthy subjects with fundus images. For age and gender prediction, conventional to recent deep learning-based algorithms have been employed [17, 25, 26, 39]. To our knowledge, none of them has attempted to study age progression effects in addition to age prediction and gender classification.

Clinicians are currently unaware of the distinct retinal features that vary between males and females, highlighting the importance of deep learning and model explainability. Automated machine learning (AutoML) may enable clinician-driven automated discovery of novel insights and disease biomarkers. Gender was classified in the study of [26], in which the code-free deep learning model achieved an area under the receiver operating characteristic curve of 0.93. The study [40] estimated biological age on a dataset collected for age-related macular degeneration (AMD) [41], with MAE = 3.67 and cumulative score = 0.39. Subjects screened for AMD prevalence must be above 50 years of age as an inclusion criterion, so that study could not cover all age ranges. The study [42] developed CNN age and sex prediction models for normal participants and for those with underlying vascular conditions such as hypertension, diabetes mellitus (DM), or any smoking history, with convincing age prediction results (R\(^2\) = 0.92; MAE = 3.06 years for normal subjects, 3.46 years for hypertension, 3.55 years for DM, and 2.56 years for smokers); however, R\(^2\) = 0.74 is relatively low for hypertension. The proposed model (FAG-Net) achieves higher scores on the majority of the evaluation metrics compared to the existing models for both healthy and unhealthy subjects, as tabulated in Tables 2 and 3.

ML algorithms are widely applied to analyzing biological traits across imaging modalities such as MRI, facial visuals, and footprints [43]. In conventional biological trait estimation, the study [44] proposed trait–tissue association mapping for human biological traits and diseases. The study [45] estimated subjects’ ages from MRI using PCA [46] for dimension reduction and a relevance vector machine [47], with a significant score. The study [48] applied a new automated machine learning approach to brain MRI to predict age with MAE = 4.612 years. Similarly, Valizadeh et al. [49] used neural networks [50] and support vector machines [51] to analyze five anatomical features, resulting in high prediction accuracy. Martina et al. [52] estimated brain age in the PNC (Philadelphia Neurodevelopmental Cohort; n = 1126, age range \(8-22\) years) using a cross-validation [53] framework with MAE = 2.93 years. Similarly, the study [54] used partial least squares regression [55] to classify gender based on MRI with an accuracy of 97.8%. According to [17], machine learning has been leveraged for many years for a variety of classification tasks, including the automated classification of eye disease. However, much of this work has focused on feature engineering, which involves computing explicit features specified by experts.

The relationship between retinal morphology and systemic health has been extrapolated using conventional approaches such as multivariable regression. However, such methods show limited ability on large and complex datasets [22, 23]. The advance from such algorithms to DL avoids manual feature engineering and makes it possible to extract previously unexplored hidden features. DL models have shown significant results on previously challenging tasks, and harnessing their power has innovatively associated retinal structure with pathophysiology. DL models extract independent features unknown to clinicians; however, they face challenges of explainability and interpretability, which a neuro-symbolic learning study has attempted to address [24]. Deep learning is a family of machine learning characterized by multiple, deep levels of computation and has been optimized for images. It has been applied in different domains, particularly disease diagnosis, such as melanoma and diabetic retinopathy, achieving accuracy comparable to that of human experts [56]. The RCMNet model, composed of ResNet18 with a self-attention mechanism, achieved a decent performance of 83.36% accuracy on the CAR-T cell dataset.

Deep learning approaches to automated retinal image analysis are gaining popularity for their relative ease of implementation and high efficacy [25]. It has been reported that DL models capture subtle pixel-level luminance variations that are likely indistinguishable to humans. Such findings underscore the promising ability of deep neural networks to utilize salient features in medical imaging that may remain hidden to domain experts [26]. Deep learning has shown great strength in medical image analysis. The study [57] developed a hyperdimensional computing-based algorithm [58] to classify gender from resting-state and task fMRI from the publicly available Human Connectome Project with 87% accuracy. Similarly, Jonsson [30] presented a novel deep learning approach using residual convolutional neural networks [59] to predict brain age from T1-weighted MRI with MAE = 3.39 and \(R^{2} = 0.87\); however, the study lacks a generative capability conditioned on age to evaluate the desired projection.

Most importantly, biological traits such as age and gender have been successfully predicted from fundus images with an area under the curve (AUC) of 0.97 [26]. Yamashita et al. [32] performed logistic regression on several features identified to be associated with sex, including the papillomacular angle, retinal vessel angles, and retinal artery trajectory. Various studies have shown retinal morphology differences between the sexes, including retinal and choroidal thickness [33, 34]. In previous studies, age has been estimated from clinical images via machine learning and deep learning [27,28,29,30]. The excellent performance in age prediction implies that fast, safe, cost-effective, and user-friendly deep learning models are feasible in larger populations. Motivated by recent DL concepts such as convolutional neural networks and attention mechanisms, we employ these components in the proposed model to associate biological traits with retinal visuals. The state-of-the-art (SOTA) models are limited to learning trait factors in the fed visuals, whereas the proposed model learns both the aging factor and a generative capability to accomplish the desired projection. In contrast to the SOTA works, our research seeks to demonstrate the continuous effect of aging in addition to age estimation and gender classification. By incorporating both control and healthy group subjects, specialists can include age as a condition in the model and retrieve the retinal visuals of a healthy subject. This will not only benefit experts in age estimation, as the SOTA does, but will also assist in examination and diagnosis decisions. The proposed models are elaborated in the following sections.

Methodology

This section illustrates the proposed deep learning architectures, their parameters, and hyperparameters. We also explain the rationale behind the specific structures used to achieve the intended goals.

Fig. 1 FAG-Net: for age and gender estimation using fundus images

Biological traits estimation using FAG-Net architecture

For age prediction and gender classification, we borrowed the concept of biological trait estimation from the ShoeNet model [43], which has been used for age estimation and gender classification from pairwise shoeprints. However, fundus image datasets are rarely available in pairwise form (left- and right-eye images), so the model needs special adaptation before it can be utilized for biological trait estimation. We therefore propose a model for fundus-image-based age and gender estimation (FAG-Net) (Fig. 1). The model is composed of six blocks, where blocks 1, 2, and 6 contain a spatial attention mechanism (SAB) while the remaining blocks are exempt from SAB. The first block receives input fundus images with dimensions of 512\(\times \)512\(\times \)3 (width\(\times \)height\(\times \)channels). The three-channel input fundus image first passes through a stack of convolution layers with the given number of filters (32) and kernel size (3). The SAB has been added to focus on salient spatial regions.

The attention mechanism has attracted great interest recently due to its significant performance in the literature [60]. In practice, both channel-wise (CA) and spatial-wise (SA) attention have been employed, with channel-first order. However, we only apply SA, which focuses along the spatial dimension. In SA, average pooling and max pooling are applied to the input in parallel and their outputs are concatenated. A 2D attention map is generated over each pixel for all spatial locations using a large filter size (i.e., \(K = 5\)). The convolutional output is then normalized by a non-linear sigmoid function. Finally, the normalized and direct connections are merged by elementwise multiplication to produce the attention-based output. Both average and max pooling are used in SA to balance the selection of salient features (max pooling) and global statistics (average pooling). Embedding the attention mechanism in FAG-Net focuses the model on regions of interest vulnerable to aging effects. The output from the SAB passes through batch normalization (BN) and rectified linear unit (ReLU) functions; accordingly, each block ends with BN and ReLU.
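To make the SAB concrete, the following is a minimal sketch in PyTorch (the paper publishes no code, so layer and variable names are ours); only the filter size \(K = 5\) and the avg/max-pool-then-sigmoid gating are taken from the description above:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention block (SAB): average- and max-pool along the
    channel axis, concatenate, convolve with a large kernel (K = 5),
    sigmoid-normalize, then gate the input by elementwise multiplication."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (N, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)    # global statistics
        max_map, _ = x.max(dim=1, keepdim=True)  # salient features
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                          # attention-gated output
```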

The input to block-2, received from block-1, passes through a stack of convolutions and an SAB and ends with a maxpool layer. The output of block-1 also passes through a direct connection to a convolution-maxpool (CMP) block. The convolution layer in the CMP applies to the output of block-1 with the same number of filters (64) as block-2 but a 1 \(\times \) 1 kernel, producing the same number of feature maps (64); a maxpool operation then brings them to the same spatial dimensions. The outputs of block-2 and the CMP block are concatenated along the third dimension and forwarded as input to block-3 and to a further direct connection.

The purpose of the CMP block is to retain spatial features in high-dimensional space for the deeper levels related to age progression. At an abstract level, this dense structure passes salient features forward together with those extracted by the block sequence. The feature maps increase and the spatial dimensions decrease as the network goes deeper. The accumulated output from all the blocks passes through a normal convolution layer with 8 \(\times \) 8 feature maps and 1024 filters. The final convolution layer passes the output to fully connected layers, each followed by dropout to avoid overfitting. The final output is a single neuron for age prediction or two neurons for gender classification. For age prediction, a linear activation function produces a regression value; for gender classification, a softmax layer outputs weighted scores for male and female.
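The dimension bookkeeping of the direct connection can be illustrated with a short PyTorch sketch (hypothetical; the channel counts follow the text, the spatial sizes are illustrative): the CMP path uses a 1\(\times \)1 convolution and a maxpool so that both paths concatenate cleanly.

```python
import torch
import torch.nn as nn

class CMPBlock(nn.Module):
    """Convolution-maxpool (CMP) direct connection: a 1x1 convolution
    matches the channel count of the parallel block, and a maxpool
    matches its spatial size, so both paths can be concatenated."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.proj(x))

# Direct connection around block-2 (channel sizes follow the text):
block1_out = torch.randn(1, 32, 256, 256)   # hypothetical block-1 output
block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                       nn.BatchNorm2d(64), nn.ReLU(),
                       nn.MaxPool2d(2))
merged = torch.cat([block2(block1_out), CMPBlock(32, 64)(block1_out)], dim=1)
print(merged.shape)  # torch.Size([1, 128, 128, 128])
```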

Objective function for FAG-Net

The objective function used for training FAG-Net is composed of three loss terms: \(L_1\), \(L_2\), and a regression-specific custom loss function (CLF). The accumulative loss function (ALF) is the mean of the weighted loss terms, formulated as follows:

$$\begin{aligned} ALF = (\psi *L_1+\psi *L_2+\psi *CLF)/3, \end{aligned}$$
(1)

where \(\psi \) denotes the corresponding weights that balance the loss terms. \(L_1\) and \(L_2\) are formulated as follows:

$$\begin{aligned} L_1 = \sum _{i = 0}^{n-1} \left| A_{age}^i-P_{age}^i\right| , \end{aligned}$$
(2)
$$\begin{aligned} L_2 = \sum _{i = 0}^{n-1}\left( A_{age}^i-P_{age}^i\right) ^2, \end{aligned}$$
(3)

where n is the number of samples and \(A_{age}\) and \(P_{age}\) denote the actual and predicted ages.

Furthermore, age prediction is a regression problem, so a single output value is expected. Thus, a specialized custom loss function based on mean square error (MSE) is proposed to optimize the hyperparameters during training [43]. The optimizer (Adam) fine-tunes the weights of the convolution filters to minimize the loss value. To produce regression-specific results, CLF penalizes out-of-range values more heavily, minimizing the distance between actual and predicted age in a target-oriented way. The formulation of CLF is given in the following equation:

$$\begin{aligned} \text {CLF} = \frac{\sum _{i = 1}^{n} E_i}{n};~E_i = {\left\{ \begin{array}{ll} d_i*\varphi ,&{} \text {if } d_i\le J\\ d_i^{3}+\varphi ,&{} \text {if } d_i>J \end{array}\right. } \end{aligned}$$
(4)

CLF is the mean of the differences (E) over n samples, where n = (total samples)/(input size). \(\varphi \) is a small value (0.0001 to 0.3) used to prevent the network from attaining zero difference and thus to sustain the learning process. Similarly, \(d_i = ||y-\bar{y}||\) is the absolute error between the actual age (y) and predicted age \((\bar{y})\). Furthermore, J is a natural number derived from MCS-J for the predictable age ranges. Under the second condition (\(d_i > J\)), differences greater than J penalize the weights more, with the computed loss value growing cubically (power 3). The penalization influences the optimization of the network weights and biases, directing the optimizer to tune these parameters to minimize the difference between actual and predicted age. The CLF values change abruptly around \(J = 2\) and \(J = 3\), demonstrating a high penalty, so that MCS-J is only counted within the given range of J. CLF thus not only considers the absolute error but also penalizes values adjacent to J in MCS-J more heavily. Driven by this penalization, the optimizer fine-tunes the learning weights to obtain a convincing estimation score. Adam is used as the optimizer with an L\(_2\) regularizer to tune the hyperparameters.
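A compact sketch of Eqs. (1) and (4) in PyTorch may clarify the penalization; the default values of \(J\), \(\varphi \), and \(\psi \) below are illustrative, not the paper's tuned settings:

```python
import torch

def clf_loss(y_true: torch.Tensor, y_pred: torch.Tensor,
             J: float = 3.0, phi: float = 0.01) -> torch.Tensor:
    """Custom loss function (CLF), Eq. (4): small absolute errors
    (d <= J) are scaled by phi, while errors beyond J are penalized
    cubically, steering training toward the MCS-J match window."""
    d = torch.abs(y_true - y_pred)
    e = torch.where(d <= J, d * phi, d ** 3 + phi)
    return e.mean()

def alf_loss(y_true, y_pred, psi: float = 1.0,
             J: float = 3.0, phi: float = 0.01) -> torch.Tensor:
    """Accumulative loss (ALF), Eq. (1): mean of the weighted L1, L2,
    and CLF terms (L1 and L2 summed over samples as in Eqs. (2)-(3))."""
    l1 = torch.abs(y_true - y_pred).sum()
    l2 = ((y_true - y_pred) ** 2).sum()
    return (psi * l1 + psi * l2 + psi * clf_loss(y_true, y_pred, J, phi)) / 3
```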

Evaluation metrics for FAG-Net

Besides MAE and MSE as evaluation metrics for age prediction, we apply the cumulative score (CS) and mean cumulative score (MCS) to accommodate the nature of the problem. CS and MCS follow existing studies and are used to assess accuracy over a range of age groups. CS (or CS\(_j\)) and MCS (or MCS-J) give more weight to smaller match windows. The ranges depend on the values of j and J, thresholds on the absolute difference between actual and estimated age scores [43], formulated as follows.

$$\begin{aligned} \text {MCS-J}&= \frac{\sum _{j = 0}^{J} CS_j}{J+1}\nonumber \\ \text {CS}_j&= \frac{\sum _{i = 1}^{n} \delta _i}{n}\times 100 \end{aligned}$$
(5)

where

$$\begin{aligned} \delta _i = {\left\{ \begin{array}{ll} 1,&{} \text {if } |y_i-\bar{y_i}|\le j\\ 0,&{} \text {if } |y_i-\bar{y_i}|>j\\ \end{array}\right. } \end{aligned}$$

CS\(_j\) is the percentage mean of \(\delta _i\), where \(\delta _i\) indicates whether the distance \(|y_i-\bar{y_i}|\) between the actual (\(y_i\)) and predicted (\(\bar{y_i}\)) scores falls within the threshold: it is counted as 1 for \(|y_i-\bar{y_i}|\le j\). A value of \(\delta _i = 0\) implies that the distance \(|y_i-\bar{y_i}|\) is greater than the threshold (j). The MCS score assesses prediction over various matching thresholds rather than a single one, giving a more comprehensive assessment of the challenging problem of retina-based age prediction by covering all values with \(|y_i-\bar{y_i}|\le j\) for each threshold j. This also allows us to apply different penalties with varying thresholds in the objective function of the deep learning model.
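The metrics of Eq. (5) reduce to a few lines of code; the ages in the usage example below are made up for illustration:

```python
import numpy as np

def cumulative_score(y_true: np.ndarray, y_pred: np.ndarray, j: int) -> float:
    """CS_j of Eq. (5): percentage of samples whose absolute age error
    falls within the match window j."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) <= j)

def mean_cumulative_score(y_true: np.ndarray, y_pred: np.ndarray, J: int) -> float:
    """MCS-J of Eq. (5): average of CS_0 ... CS_J."""
    return np.mean([cumulative_score(y_true, y_pred, j) for j in range(J + 1)])

# Example: MCS-2 over hypothetical actual vs. predicted ages
ages_true = np.array([25, 40, 61, 33])
ages_pred = np.array([24, 44, 60, 33])
print(mean_cumulative_score(ages_true, ages_pred, J=2))
# 58.33 (CS_0 = 25, CS_1 = 75, CS_2 = 75)
```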

Fundus image generation given age as condition

Having proposed a sophisticated DL model (FAG-Net) for age prediction and gender classification, we introduce a novel network model to predict futuristic variations in fundus images. The model generates fundus images given age as a condition (FGC-Net) (Fig. 2).

Fig. 2 FGC-Net: fundus image generation given different ages. The generator is a variational autoencoder whose bottleneck receives age as a condition prior to decoding; the discriminator receives both ground-truth and generated fundus images to learn the aging effects. Left: FGC-Net receives a fundus image as input, encodes it into the latent space, and generates it back with age embedded as a condition in the bottleneck. The generated fundus image is discriminated against the age label to learn the age embedding. Right: in the testing phase, a single fundus image can be input with multiple age labels to generate corresponding fundus images with the relevant variations. The details of the model are drawn in Fig. 3

Encoding

The encoding phase of FGC-Net first receives the input fundus images (\(X^i \in \mathbb {R}^{N\times H\times W\times C}\)) for biological trait association and learning (Fig. 3). The dimensions \(N\times H\times W\times C\) denote the batch size, height, width, and feature maps (number of channels: 1 for grayscale and 3 for color images), respectively. The encoder automatically extracts lower-dimensional features from the input data and feeds them into the latent space. The \(i^{th}\) convolutional layer (\(NC_i\)) acts as a feature extractor by encoding the salient features from \(X_i\). Starting from the input structure (e.g., \(X^h = H\), \(X^w = W\), \(X^c = C\), where \(X^h\), \(X^w\), \(X^c\) are the output height h, width w, and channels c, respectively), the encoder (e) contains six encoding blocks (EB-1 to EB-6) to sufficiently extract low-level features in the spatial dimension (e.g., \(X^h = \frac{1}{n}\times H\), \(X^w = \frac{1}{n}\times W, X^c = n\times C\), where n reflects the number of downsampling steps and deeper levels), followed by the bottleneck layer (\(Z\in \mathbb {R}^k\), where k is the spatial dimension of Z). The spatial size halves in each subsequent deep layer (EB-1 to EB-6), and the resulting loss of information is compensated for by doubling the number of filters (channels).

Fig. 3 Detailed network architecture of FGC-Net, composed of generative and discriminative modules together with a condition in the bottleneck using a variational autoencoder (VAE). The VAE bottleneck embeds age as scalar values and facilitates generating different versions of the given input. The input block has a special architecture in which a variety of filters (with different sizes) is employed. At test time, the model outputs different copies of the input given ages as the condition

In the encoding layer, the received image first passes through an input block (IB), designed to extract a variety of features by employing distinct kernel sizes, such as 1\(\times \)1, 3\(\times \)3, 5\(\times \)5, and 7\(\times \)7, after a normal convolution (512\(\times \)512, 24, 3, corresponding to dimensions, number of filters, and kernel size). The outputs of the variant-size filters are merged by elementwise summation before proceeding to the deeper blocks. The output of the IB is forwarded to EB1, which contains a strided convolution, to avoid losing information useful for generation, and a normal convolution (dimensions, number of features, kernel size, and stride) for feature extraction, followed by BN and ReLU functions. The remaining blocks (EB2 to EB6) have the same structure up to the bottleneck layer. Each EB compresses the input spatially and extends it channel-wise. The compression at the \(l^{th}\) encoding block \(EB\text{- }l\) (with \(l\) up to 6) is formulated as follows:

$$\begin{aligned} EB\text{- }l = En\Big (\left[ NC[S_t(X^{l-1})];\{op_b, op_r\} \right] ;\phi \Big ), \end{aligned}$$
(6)

where \(S_t\) and NC denote the strided (s = 2) and normal convolutions in block \(l\) over the data sample (\(X^{l-1}\)) obtained from the previous block (\(l-1\)). The output of the strided convolution \(S_t\) and normal convolution NC is forwarded to the BN (\(op_b\)) and ReLU (\(op_r\)) functions. Stacking the strided convolution (\(S_t\)) and the normal convolution (NC) avoids the loss of useful information. In addition to reducing computational operations [59], \(S_t\) enables the model to learn while downsampling [61] and to retain and pass features into subsequent layers heading into the latent space, which the decoder uses to generate the image back with age-embedded effects.
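A sketch of one encoding block under these assumptions follows (PyTorch; the 3\(\times \)3 kernels and padding are our assumptions, as the text fixes only the stride s = 2 and the BN/ReLU ordering):

```python
import torch.nn as nn

class EncodingBlock(nn.Module):
    """Encoding block (EB) of Eq. (6): a strided convolution (s = 2)
    downsamples while learning, a normal convolution extracts features,
    and BN + ReLU close the block. Channels double as resolution halves."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # S_t
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),           # NC
            nn.BatchNorm2d(out_ch),                                        # op_b
            nn.ReLU(),                                                     # op_r
        )

    def forward(self, x):
        return self.block(x)
```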

Besides encoding the input fundus image, the corresponding label information is also embedded into the latent space. After a number of experiments, we found that embedding the label information as a condition in the latent space is most effective for influencing the generation process. The embedding of age as a condition with the output of the encoding layer is carried out in the latent space: the encoder (En) passes the label (age \(L_g\)) information as a condition (\(Ec(L_g, \xi )\)), where \(\xi \) denotes the parameters learned by the encoder, into the latent space of the VAE.

Bottleneck and conditioning

The bottleneck layer is an unsupervised means of modeling complex, higher-dimensional data at a deeper level. The encoder (\(En(X^i, \phi )\)) compresses the input from the higher-dimensional space (\(X_m^H,X_m^W\)) through the network parameters (\(\phi \)) and generates a probabilistic distribution over the latent space (Z) with the lowest possible dimension (\(\frac{1}{n}\times X^H,\frac{1}{n}\times X^W\)). Similarly, \(Ec(L_g, \xi )\) passes \(L_g\) through fully connected layers with learning parameters \(\xi \), analogous to the fully connected layer of \(En(X^i, \phi )\). The decoder utilizes the embedded and compressed form (latent variables Z) and generates it back into the high-dimensional space (Y). Minimizing the gap between X and Y enables the model to learn and tune its parameter values, and the latent space enables the model to learn the joint distribution of X and \(L_g\). The outputs of EB6 and \(En(L_g, \xi )\) are passed to a fully connected layer that models the complex dimensional structure into a latent representation, flattened via 64 neurons. From each of the flattened 64-neuron vectors, both a mean (\(\mu \)) and a standard deviation (\(\sigma \)) are computed.

The encoder part (\(En\{X_{m};~\phi \}\)) generates the posterior over the latent space (\(z^i\), where i denotes the sample number) and samples (\(P^i\)) from it, which are used for decoding (generation) as \(De\{En\uplus L_g \oplus S_f;~\vartheta \}\). The latent space is obtained as follows:

$$\begin{aligned} z_i\sim \Re _i \left[ (z_0/x^i)\parallel (z_1/x^i) \right] , \end{aligned}$$
(7)

where \(\Re ()\) is the distribution over \(z_0\) and \(z_1\) given input \(x^i\). The sampling of \(z_i\) from the normal distribution \({\mathcal {N}}(\mu , \sigma )\) can be rewritten for the conditional input as follows:

$$\begin{aligned} z_i\sim {\mathcal {P}}_i(z/X)&= {\mathcal {N}}\big (\mu (X; \phi _0), \sigma (X; \phi _0)\big ) \parallel {\mathcal {N}}\big (\mu (X; \phi _1), \sigma (X; \phi _1)\big ),\nonumber \\ z_i\sim {\mathcal {P}}_i(z/X)&= {\mathcal {N}}\big ( \left[ \mu (X; \phi _0)+\mu (X; \phi _1)\right] ; \left[ \sigma (X; \phi _0)*\sigma (X; \phi _1) \right] \big ),\nonumber \\ z_i\sim {\mathcal {P}}_i(z/X)&= {\mathcal {N}}\big ( \left[ \mu (X; \phi _l)\right] ; \left[ \sigma (X; \phi _m) \right] \big ),\nonumber \\ \text {where}~\phi _l&= \phi _0+\phi _1,\quad \phi _m = \phi _0*\phi _1 \end{aligned}$$
(8)

The drawn sample (\(z_i\)), conditioned with \(X_m\) and taken from the distribution (see Eq. 8), is mapped into the dimensions required by the decoder (\(Dec(z_i, \theta )\)) for the generative process with the learning network parameters (\(\theta \)). The latent distribution must be regularized by the Kullback–Leibler (KL) divergence (see the loss function) to closely approximate the posterior (P(z/x)) and prior (P(z)) distributions. The regularization (i.e., via the Gaussian prior) holds in the latent space between the distributions in terms of \(\mu \) and \(\sigma \), which further contributes to the latent activations utilized by the decoder to produce a new retinal image. The latent distributions are centered at \(\mu \) and spread over \(\sigma \) to project the possible fundus as desired (DSp). Usually, the distance between the learned distribution \({\mathcal {N}}(\mu ,~\sigma ^2)\) and the standard normal distribution \({\mathcal {N}}(0,1)\) is quantified by the KL divergence; however, instead of the standard normal prior with a single mean (\(\mu \)) and standard deviation (\(\sigma \)), we utilize the sum of the means and the product of the standard deviations of the two branches. The detailed formulation is shown in Eqs. 8 and 10. The latent distribution and regularization are expected to have the properties of continuity and completeness. Under continuity, sampling from the latent distribution given X yields a nearby data point that, fed into the decoder, generates a fundus image with a similar structure plus the desired additional information. The decoder must generate target-oriented fundus images in a controlled fashion.
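The conditioned bottleneck of Eq. (8) can be sketched as follows (assumptions: a 64-dimensional latent space per the text, and a softplus to keep the standard deviations positive, which the paper does not specify):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedBottleneck(nn.Module):
    """VAE bottleneck of Eq. (8): image features and the scalar age label
    are each mapped to (mu, sigma); the two branches are fused as a sum
    of means and a product of standard deviations before sampling z."""
    def __init__(self, feat_dim: int, latent_dim: int = 64):
        super().__init__()
        self.img_mu = nn.Linear(feat_dim, latent_dim)
        self.img_sigma = nn.Linear(feat_dim, latent_dim)
        self.age_mu = nn.Linear(1, latent_dim)
        self.age_sigma = nn.Linear(1, latent_dim)

    def forward(self, feats, age):             # feats: (N, feat_dim), age: (N, 1)
        mu = self.img_mu(feats) + self.age_mu(age)        # phi_l = phi_0 + phi_1
        sigma = F.softplus(self.img_sigma(feats)) * \
                F.softplus(self.age_sigma(age))           # phi_m = phi_0 * phi_1
        z = mu + sigma * torch.randn_like(sigma)          # reparameterization
        return z, mu, sigma
```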

Decoding

FGC-Net generates a random sample (\(z_i\), \(i = 1, 2, \ldots , n\)) conditioned by \(L_g\), drawn from the probabilistic distribution \(P_i(z_i/X)\), on the decoding side through the decoding blocks (DB1 to DB7), and projects it to \(Y_i\):

$$\begin{aligned} Y_i = Dec\big \{[z_i\odot R_i]\oplus S_f(X); \theta \big \}, \end{aligned}$$
(9)

where \(Y_i\) is the generated fundus image corresponding to \(z_i\), with adjustable weights (\(\odot R_i\)) regularized by the objective function and merged with the contextual skipped features (\(\oplus S_f\)) using the network learning parameters (\(\theta \)).

In the decoding process, z is computed as the sum of \(\mu \) and \(\sigma \) multiplied by a noise sample (\(\varepsilon \)); the values of \(\mu \) and \(\sigma \) are computed in Eq. 8. The \(\varepsilon \) value is drawn from a normal distribution with mean \(\mu (L_g)\) and standard deviation \(\sigma (L_g)\), based on the fed scalar age values.

$$\begin{aligned} z\sim {\mathcal {N}}(\mu ,\sigma ^2);\quad z \leftarrow \mu +\sigma \cdot \epsilon ,\quad \epsilon \sim {\mathcal {N}}\big (\mu (L_g), \sigma (L_g)\big ). \end{aligned}$$
(10)

The dimension of z is reshaped and upsampled to match the dimension of the corresponding encoding layer (EB6) and merged (\(\oplus S_f\)) as an elementwise sum with the skip connection from EB6. Each block receives its input and upscales the dimensions via transposed convolution followed by BN and ReLU. Each decoding block (DB) is composed of a strided convolution, BN, and ReLU activation. The output of DB1 is concatenated with the skip connection from the IB.

Skip layer

The deeper the network, the greater the chance of losing key features due to downsampling operations and the vanishing gradient problem [62]. To avoid the loss of contextual information [63], we adopted skip connections between the encoding (\(Enc_k\{X;~\phi \}\)) and decoding (\(Dec_k\{z\oslash S_f;~ \theta \}\)) paths at particular layers (\(k\)) to transfer spatial features and global information about the input image structure.

The skip layers integrate the learned features from early levels, avoid the degradation of deeply stacked networks, and overcome gradient information loss by retaining key features during training. These connections also improve end-to-end training and play an effective role in deeper networks. The sole purpose of the adopted skip connections is to help the decoder maintain the existing input structure while generating on the decoding side, together with the synthetic information reflecting age progression. The dimensions and the merging positions with the corresponding layers, both at the bottleneck and in the decoder, are shown in Fig. 3.

After generating z given P(z/X) from the encoder (see Eq. 8), the decoder merges the data sample information from the latent space, the conditioning information (\(L_g\)), and the skip connection at a particular layer (\(k\)), formulated as follows:

$$\begin{aligned} DB_k = Dec\big (NC[S_t(Y^{k+1});\{op_b, op_r\}] \oplus S_f(X);\theta \big ), \nonumber \\ \end{aligned}$$
(11)

where \(Y^{k+1}\), \(S_f(X)\), and \(\oplus \) denote the previous tensor, the skipped features, and the elementwise-sum merging operation, respectively. Additionally, \(op_b\) and \(op_r\) denote the BN and nonlinear ReLU operations, respectively. In addition to the completeness and continuity properties of the VAE, the skip connections borrowed from U-Net control the generation process.

Discriminator

The discriminative part, borrowed from the generative adversarial network (GAN) [64], is appended at the end of FGC-Net and brings sharpness and better quality to the generated images [65]. Adversarial learning plays a min-max game to distinguish original from fake (generated or synthetic) images. FGC-Net brings the inferencing features to reason in the latent space and generates fundus images as desired [66]. However, instead of training in a min-max fashion, we use the discriminative part solely to predict a scalar (regression) value analogous to the subjects’ ages. There are six blocks receiving both the input and the generated fundus images. Each discriminative block (DsB) is composed of a strided convolution, BN, and ReLU functions. The stacked DsBs end with three fully connected layers of 512, 256, and 128 neurons, followed by dropout layers with ratios of 0.8, 0.7, and 0.6, respectively. Finally, the fully connected output passes through a linear activation function to a single neuron for age estimation. In the objective function, the two single-value outputs (from the input and generated fundus images) enter a mean square error (MSE), or \(L_2\), loss term.
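A structural sketch of this regression discriminator follows (the global average pooling before the fully connected head and the channel plan are our assumptions; the block count, neuron counts, and dropout ratios follow the text):

```python
import torch.nn as nn

def dsb(in_ch: int, out_ch: int) -> nn.Sequential:
    """Discriminative block (DsB): strided convolution, BN, ReLU."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

class Discriminator(nn.Module):
    """Six DsB blocks, then FC layers of 512/256/128 neurons with dropout
    0.8/0.7/0.6, ending in a single linear neuron estimating age."""
    def __init__(self, in_ch: int = 3, base: int = 32):
        super().__init__()
        chans = [in_ch] + [base * 2 ** i for i in range(6)]  # assumed channel plan
        self.features = nn.Sequential(*[dsb(chans[i], chans[i + 1])
                                        for i in range(6)])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(chans[-1], 512), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.7),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.6),
            nn.Linear(128, 1),                               # linear age output
        )

    def forward(self, x):
        return self.head(self.features(x))
```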

In our case, the generator maps \(X_i\), the input to the encoder of FGC-Net (Fig. 3), to \(Y_i^j\), the output of the decoder of FGC-Net (Fig. 3). The model generates a fundus image for each age value \(L_g\), where \(j = L_g\). The “discriminator” part discriminates between the actual \(X_i\) and the generated version \(Y_i^j\) as real or fake. The min-max learning game of the GAN [64] can be formulated as follows:

$$\begin{aligned} V(D,G) = \underset{G}{min}~\underset{D}{max}(D_{XY}, G_X), \end{aligned}$$
(12)

Similarly, the generative (\(G_X\)) and discriminative (\(D_{XY}\)) operations can be illustrated in mathematical forms as follows:

$$\begin{aligned} G_{X}&= G\left\{ \underbrace{En(X_i;~\phi )\rightarrow Y_i\sim Dec(Z_{i};~\theta )}_{Generative~Unit} \rightarrow \underbrace{Disc(\left[ X_{i}, Y_i\right] ;~\Phi )}_{Discriminative~Unit}\right\} \nonumber \\ G_{X}&= G(X_{i},~Y_{i};~\omega ),\quad \text {where}~\omega = \{\phi ,\theta , \Phi \} \end{aligned}$$
(13)

The discriminator plays a vital role in providing an abstract reconstruction error when a VAE is infused in the network model. It measures sample similarity [66] at both the element and feature levels. In addition, the discriminator is made stronger at distinguishing real from fake images by including the \(L_2\) loss term.

Objective function for FGC-Net

The objective loss function for FGC-Net is composed of the reconstruction loss (\(L_2\)) (Eq. 3) and the KL divergence loss [67].

The probabilistic distribution of the VAE inference model (\(q_\phi (z/x)\)) approximates the true posterior distribution (\(p_\theta (z/x)\)) in terms of the KL divergence, minimizing the gap as follows [68]:

$$\begin{aligned} \textit{KL}_{d}(q_\phi (z/x)||p_\theta (z/x))) = \mathbb {E}_{q_\phi }\left[ log \frac{q_\phi (z/x)}{p_\theta (z/x)} \right] , \end{aligned}$$
(14)

In our case, the KL-divergence between the distribution \({\mathcal {N}}(\mu _i, \sigma _i)\) of the inference model, with mean \(\mu _i\) and variance \(\sigma _i\), and the conditioned prior \({\mathcal {N}}(\mu (L_g), \sigma (L_g))\) (Eq. 10) can be formulated, after the Bayesian inference simplification [69], as follows:

$$\begin{aligned} \textit{KL}_{d}\big ({\mathcal {N}}(\mu , \sigma )\,||\,{\mathcal {N}}(\mu (L_g), \sigma (L_g))\big ) = \frac{1}{2}\sum _{i = 1}^{l}\big (\sigma _i^2+\mu _i^2-1-\log (\sigma _i^2) \big ) \end{aligned}$$
(15)

Thus, the total loss function for FGC-Net (TLF-FGC) is composed of the following terms:

$$\begin{aligned} TLF{-}FGC = (L_1+L_2+KL_d)/3 \end{aligned}$$
(16)
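Eq. (16) in code form, reusing the closed-form KL term of Eq. (15) with the log form (a sketch; summation over samples follows Eqs. (2)–(3)):

```python
import torch

def kl_divergence(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Closed-form KL term of Eq. (15) against the conditioned prior."""
    return 0.5 * torch.sum(sigma ** 2 + mu ** 2 - 1.0 - torch.log(sigma ** 2))

def tlf_fgc(x: torch.Tensor, y: torch.Tensor,
            mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Total loss of Eq. (16): mean of L1, L2 (reconstruction), and KL."""
    l1 = torch.abs(x - y).sum()
    l2 = ((x - y) ** 2).sum()
    return (l1 + l2 + kl_divergence(mu, sigma)) / 3.0
```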

Dataset preparation

To train, evaluate, and test the proposed models for biological trait estimation and trait-based futuristic analysis, we used the Ocular Disease Intelligent Recognition dataset (ODIR-5K) [70], PAPILA [71], and a longitudinal population based on 10-year progression collections (10Y-PC) [72]. There are 12,005 samples in total, split into 80% (9604 of 12,005) for training and 20% (2401 of 12,005) for testing. All three datasets contain age and gender as label information. The age label is fed to FGC-Net during training and can be used as a condition in the testing phase. Samples missing label information, such as age or gender, were discarded. The subjects ranged in age from 10 to 80 years. To build a generalized model for biological trait estimation, we utilized both cross-sectional and longitudinal populations. Furthermore, both healthy and unhealthy subjects were included so that the underlying DL model learns features invariant to abnormalities. Similarly, a variety of cameras and capture environments produced images of different qualities, supporting the modeling of robust DL networks.

Network training

Both FAG-Net and FGC-Net were trained with Adam for optimizing the network parameters, with an initial learning rate of 0.001, \(\beta _1\) = 0.9, and \(\beta _2\) = 0.999, where the learning rate was decreased by a factor of \(\frac{1}{10}\) after every 50 epochs. The batch size was 16 samples, according to the available GPU memory. The models run for up to 500 epochs and stop early when results degrade across epochs.
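This training configuration maps directly onto a standard optimizer/scheduler pair; the sketch below assumes PyTorch, with a stand-in model and a weight-decay value (for the L\(_2\) regularizer) that are our assumptions:

```python
import torch

model = torch.nn.Linear(10, 1)  # hypothetical stand-in for FAG-Net / FGC-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-4)
# Learning rate decreased by 1/10 every 50 epochs, for up to 500 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(500):
    # ... one training pass over batches of 16 samples ...
    scheduler.step()
    # early stopping would break here when validation results degrade
```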

Results

Biological traits estimation

The proposed model FAG-Net and the state-of-the-art (SOTA) models were trained with 5-fold cross-validation (FCV). The evaluation metrics MAE, MSE, MCS-2, and MCS-3 were used for testing. The MCS metrics help better assess model performance for age prediction, where age can only be predicted within a range of values rather than as a single classified value. MSE, by contrast, produces large values when outliers cause large differences between actual and predicted values, so it may not be a reliable option in such scenarios. The details of the five-fold cross-validation are shown in Table 1. Table 2 shows the results of all the underlying modalities on the corresponding evaluation metrics.

Table 1 FAG-Net scores for five-fold cross-validation: \(CS_0\), \(CS_1\), \(CS_2\), \(CS_3\), MAE, MSE, and MCS
Table 2 Comparative evaluation scores of FAG-Net and SOTA models in terms of MAE, MSE, MCS, and R\(^2\). For the values of CS\(_0\), CS\(_1\), CS\(_2\), and CS\(_3\), see the formulation in Eq. 5
Table 3 Comparative evaluation scores of FAG-Net and SOTA models in terms of MAE, MSE, and MCS

Gender classification

For biological trait estimation, we also trained the proposed model (FAG-Net) and a few SOTA models. All the classification results are shown in Table 3. After the successful classification of the gender trait from fundus images, the proposed model (FGC-Net) emphasized the optic disc area and the learned features while training for the aging association. In the study of gender classification [38], the optic disc was likewise considered the main structure by the deep learning approaches.

To evaluate the performance of our proposed model for gender classification, we randomly chose a few SOTA models and trained and tested them on the same dataset and parameters (Table 3). We used confusion-matrix metrics to evaluate the results: true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), F\(_1\) score, and accuracy. The derivation of these metrics is given in the following equations.

$$\begin{aligned} Sensitivity&= \frac{TP}{TP+FN},\nonumber \\ Specificity&= \frac{TN}{TN+FP},\nonumber \\ PPV&= \frac{TP}{TP+FP},\nonumber \\ NPV&= \frac{TN}{TN+FN},\nonumber \\ F_1&= 2\times \frac{PPV\times Sensitivity}{PPV+Sensitivity},\nonumber \\ Accuracy&= \frac{TP+TN}{TP+TN+FP+FN}. \end{aligned}$$
(17)
$$\begin{aligned} R^2&= 1 - \frac{SSR}{TSS},\nonumber \\ SSR&= \sum _{i = 1}^{n}(X_i-Y_i)^2,\nonumber \\ TSS&= \sum _{i = 1}^{n}\big (X_i-\overline{X}\big )^2 \end{aligned}$$
(18)

where SSR is the sum of squared residuals and TSS is the total sum of squares; \(X_i\) and \(Y_i\) denote the actual and predicted values, and \(\overline{X}\) is the mean of the actual values.
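For completeness, Eqs. (17) and (18) in code (a sketch; the symbols follow the corrected equations above):

```python
import numpy as np

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Confusion-matrix metrics of Eq. (17)."""
    sens = tp / (tp + fn)                     # sensitivity (recall)
    spec = tn / (tn + fp)                     # specificity
    ppv = tp / (tp + fp)                      # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)        # F1 from precision and recall
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    return dict(sensitivity=sens, specificity=spec,
                ppv=ppv, npv=npv, f1=f1, accuracy=acc)

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R^2 of Eq. (18): 1 - SSR/TSS."""
    ssr = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ssr / tss
```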

From the accumulated results, our proposed model FAG-Net outperforms the competing SOTA models, with VGG-Net-16 achieving the second-highest accuracy. These convincing results encouraged us to proceed with age prediction from fundus images and with learning the corresponding effects.

Fig. 4 Output results from distinct versions of FGC-Net. Each row corresponds to one FGC-Net version (FGC-Net-0 to FGC-Net-6). There are nine columns in total: the first column displays a randomly chosen sample, and the second to ninth columns show the subtracted results between the input (sample) and the corresponding output for the given condition

Age progression effects in fundus images

In this study, FAG-Net was utilized to estimate biological traits from fundus images. After the successful estimation of age (MAE 1.634) and gender (accuracy 91.878%), we proposed FGC-Net, a generative model conditioned on subjects’ age. To extrapolate the effects of the fed condition, we built different versions of FGC-Net to verify the changes made to fundus images with age progression. After training FGC-Net (Fig. 3) and all its versions, we randomly chose samples and fed them to the models to retrieve fundus images at different age stages (Fig. 4). There are a total of seven versions of FGC-Net, shown together with their outputs (Fig. 4). Each randomly chosen sample is fed to each model with different conditions (labels) in the range of 10–80 years. The output images are subtracted from the original (fed) fundus image, and the difference is displayed in Fig. 4 (2nd to 9th columns).

Variations were observed in the subjective evaluation. From the visualized results, three key anatomical aspects, the optic disc (OD), the area near the OD, and the size (volume), were observed to vary across the given ages. The OD region, an approximately circular, bright yellow structure, varied from early to late age in the fundus images generated by all model variants. The embedded age condition mostly influences the optic disc with age progression, which can be observed with the naked eye in the 5th to 7th rows (Fig. 4) for the corresponding models. Similarly, the thick vessels and the region near the OD also vary with aging, and the size of the fundus varies with age progression; such variations are most apparent for the FGC-Net-6 model (Fig. 4, last row). We employed the attention mechanism in all the proposed models to highlight the regions of interest while embedding and estimating biological traits. The attention mechanism also highlights pixels in the input image according to their contribution to the final evaluation, so the affected regions can be observed in the images generated by the underlying modalities. The learning of the embedded age condition occurs at an abstract level; in other words, the learning is generalized by utilizing fundus images from both healthy and unhealthy subjects, avoiding disease-specific bias. Thus, the study innovatively learns biological traits and their effects on fundus images using the cutting-edge technology of deep learning.

The ability of neural networks to use greater abstraction and tighter integration comes at the cost of lower interpretability. Saliency maps, also called heat maps or attention maps, are common model explanation tools used to visualize model reasoning by indicating the areas of local morphological change within fundus photographs that carry more weight in modifying network predictions. Algorithms mainly used the features of the optic disc for gender prediction, which is consistent with the observations made by Poplin et al. [17]. Deep learning models trained on images from the UK Biobank and EyePACS datasets primarily highlighted the optic disc, retinal vessels, and macula when soft-attention heat maps were applied, although there appeared to be a weak signal distributed throughout the retina [17].

Conclusion and future directions

In this study, we investigated biological traits from the fundus images of both healthy and unhealthy subjects and extrapolated the variational effects of age progression on fundus images. We proposed two DL models, named FAG-Net and FGC-Net. FAG-Net estimates age and classifies gender from fundus images, utilizing a dense network architecture together with attention mechanisms at distinct levels. The proposed models generalize the learning process in order to avoid the variation in the anatomical structure of fundus images caused by retinal disease. The study successfully carried out age prediction and gender classification with significant accuracy. Similarly, the attention mechanism highlighted the regions of interest that are vulnerable to aging. Furthermore, the model shows similar salient regions in ungradable input images as in gradable ones (Fig. 2). This suggests that the model is sensitive to signals in poor-quality images arising from subtle pixel-level luminance variations, which are likely indistinguishable to humans. This finding underscores the promising ability of deep neural networks to utilize salient features in medical imaging that may remain hidden to human experts. In future work, more sophisticated deep learning models with attention mechanisms can be proposed for healthy and unhealthy subjects, both in isolated and joint form.