1 Introduction

Aging is an intricate biological process that encompasses several cellular and molecular changes, ultimately leading to a gradual decline in tissue function and regenerative capacity. This complex phenomenon is driven by a range of mechanisms that contribute to the aging phenotype, including genomic instability, telomere attrition, mitochondrial dysfunction, altered intercellular communication and gene expression (Consortium, 2020; Rivero-Segura et al., 2020). While human lifespans have greatly increased in the past decades, the same cannot always be said for healthspans, the period when individuals are considered healthy (Beard et al., 2016). The quest to extend healthspan and enhance the quality of life for the elderly has prompted a surge of interest in developing interventions targeting the underlying mechanisms of age-related diseases (Rattan, 2018).

The most popular approach for age estimation is to train machine learning models on diverse biological features; these models are colloquially referred to as “aging clocks”. Currently, the most well-studied and accurate models are based on molecular data using DNA methylation. DNA methylation is vital to the developmental cycle of mammals and plays a key role in signaling and modifying the aging process (Smith & Meissner, 2013). In addition to DNA methylation, other epigenetic markers, such as histone marks, have also been associated with the aging process (Yang et al., 2010). Importantly, reversing epigenetic markers might open new avenues to therapeutically target aging-related diseases (Yang et al., 2010).

Beyond epigenetic modifications, other biological data modalities are associated with the aging process, including gene expression (Viñuela et al., 2018; Melé et al., 2015), protein and metabolite abundances (Moaddel et al., 2021; Menni et al., 2013), and telomere length (Demanelis et al., 2020), demonstrating the potential for aging models leveraging different data types. Furthermore, current research has unveiled organ-specific aging signatures in human transcriptomes (Shokhirev & Johnson, 2021; Glass et al., 2013), plasma proteomes (Oh et al., 2023), and metabolic profiles (Fang et al., 2023), emphasizing the importance of modeling age in a tissue-specific manner.

Combining different types of data, including epigenomics, proteomics, metabolomics, transcriptomics, and microbiomics, is a promising approach to unveil distinctive molecular aging patterns in individuals (Zhang, 2023). In a recent study, we tackled this issue using lung tissue and incorporated gene expression, DNA methylation, telomere length, and histological image data (Moraes et al., 2023). Here, we extend this work in several key ways: (1) we refined the previously implemented pipelines for age prediction from gene expression and methylation by improving data preprocessing and feature selection and by adding a step to deal with data imbalance, leading to an overall improvement in performance; (2) we improved age prediction from histological images by using the Hierarchical Image Pyramid Transformer (HIPT) to extract meaningful histological features; (3) we evaluated the result of integrating different combinations of data modalities using a wider range of approaches; and (4) we extended our prediction framework to include an additional tissue, the ovary. We selected the ovary because this tissue has been absent from the age prediction literature (Horvath, 2013; Hannum et al., 2013; Lima Camillo et al., 2022) and because of the importance of understanding the impact of aging in this tissue (Wang et al., 2023). Furthermore, it allows us to evaluate our prediction pipelines in both somatic and reproductive tissues. A scheme of the pipeline used in this work is represented in Fig. 1.

2 Related work

DNA methylation has been widely used for age estimation and is one of the main hallmarks of the aging process (López-Otín et al., 2013). As early as 2011, Bocklandt et al. (2011) used saliva samples to build the first age estimator, achieving a MAE (mean absolute error) of 5.2 years. Two years later, two important aging models were introduced. The first, developed by Hannum et al. (2013) (colloquially named Hannum clock), was trained using 656 blood samples and achieved a MAE of 3.9 years, greatly improving on previous work. Although this model was trained on blood tissue samples, the authors were able to adjust the model to work with different tissues, producing comparable error rates. In the same year, Horvath (2013) proposed another model (Horvath clock), trained on a multi-tissue dataset of over 8,000 samples across 51 tissues. Despite age-induced methylation being highly tissue-specific (Slieker et al., 2018), the proposed model was able to generalize to most tested tissue and cell types, resulting in a MAE of 3.6 years. For both of these models, the authors used penalized regression techniques to drastically reduce the high dimensionality of the methylation arrays (\(\sim \) 450,000 for the Hannum clock and \(\sim \) 27,000 for the Horvath clock) to 71 and 353 features, respectively (Hannum et al., 2013; Horvath, 2013). More recently, neural networks (NN) have been successfully applied to epigenetic data for age prediction, achieving a MAE of 2.7, 3.8, and 2.15 years respectively (Galkin et al., 2021; Li et al., 2022; Lima Camillo et al., 2022). The latter of these models, named AltumAge, was trained with a multi-tissue dataset and demonstrated particularly promising results. Other types of data, including RNA expression data (i.e. transcriptomics), have also been used to predict biological age (Peters et al., 2015) (7.8-years MAE). 
This model was applied together with the Horvath and Hannum clocks to a common set of samples, which revealed that the error rates of each data modality were correlated with different aging-related phenotypes. This suggests that gene expression and methylation may provide complementary information for age prediction. A number of age models have been built using other types of molecular data, including protein abundances (Sayed et al., 2021; Kristic et al., 2014; Tanaka et al., 2018; Lehallier et al., 2020), metabolite quantities (Akker et al., 2020; Robinson et al., 2020; Hertel et al., 2016), taxonomic profiling of gut microbiota (Galkin et al., 2020; Chen et al., 2022), blood biomarkers (Mamoshina et al., 2018), and chromatin accessibility (Morandini et al., 2024). Despite the large number of data modalities found in the literature, there have been very few attempts at generating aging models using multi-modal datasets. We previously focused on this problem and attempted to combine gene expression, DNA methylation, and histological images for age prediction using an ensemble approach (Moraes et al., 2023). However, this approach led to results very similar to those of the individual data modalities. As far as we know, the only other example of this class of models is Precious1GPT, which combines methylation and gene expression data (Urban et al., 2023), where the performance of the methylation models alone was better than that of the combined dataset (methylation data MAE 4.23, gene expression data MAE 6.28, combined MAE 5.62).

3 Data description and preparation

All the data used in this study was generated in the context of the GTEx (Genotype-Tissue Expression) project (GTEx, 2020) and is publicly available online. We focused on lung and ovary tissue across four data modalities: gene expression, DNA methylation, telomere length, and histological images (Fig. 2a, b).

DNA methylation data generated in Oliva et al. (2023) was downloaded from the Gene Expression Omnibus (GEO; accession number GSE213478). The processed dataset contains roughly 750,000 features corresponding to CpG sites, measured in \(\beta \)-values, with values ranging from 0 to 1. The \(\beta \)-values were converted to M-values (\(M = \log_2(\beta / (1 - \beta ))\)) because, in contrast to M-values, \(\beta \)-values have been shown to suffer from heteroscedasticity in highly methylated and unmethylated positions (values near 1 and 0 respectively) (Du et al., 2010).
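The conversion from \(\beta \)-values to M-values is the base-2 logit transform described by Du et al. (2010). The snippet below is a minimal sketch of that transform; the function name and the small clipping constant `eps` are illustrative assumptions and are not taken from the original pipeline.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert beta-values in [0, 1] to M-values: M = log2(beta / (1 - beta))."""
    # eps is a hypothetical guard that avoids infinite values at exactly 0 or 1.
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

# Values near 0 and 1 are stretched out, while 0.5 maps to 0.
m_values = beta_to_m([0.1, 0.5, 0.9])
```

The transform is symmetric around \(\beta = 0.5\), so hypo- and hypermethylated sites are spread over the negative and positive axes respectively.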

Gene expression data (both raw counts and normalized TPM values) generated using RNA-seq was downloaded from the GTEx data portal (https://www.gtexportal.org/home/). For each tissue, we selected protein-coding and long non-coding RNAs with minimum TPM of 1 in at least 20% of samples (15,746 features in lung; 14,712 features in ovary). Following filtering, the selected gene features underwent a log2(x+1) transformation.
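As an illustration of the expression filter and transformation described above, the following sketch keeps genes with TPM of at least 1 in at least 20% of samples and then applies log2(x+1). The function name and the toy matrix are hypothetical; the restriction to protein-coding and long non-coding RNAs is not shown.

```python
import numpy as np

def filter_and_log_transform(tpm, min_tpm=1.0, min_fraction=0.2):
    """Keep genes with TPM >= min_tpm in at least min_fraction of samples, then apply log2(x + 1).

    tpm: array of shape (n_samples, n_genes).
    Returns the transformed matrix and the boolean mask of retained genes.
    """
    tpm = np.asarray(tpm, dtype=float)
    expressed_fraction = (tpm >= min_tpm).mean(axis=0)  # per-gene fraction of expressing samples
    keep = expressed_fraction >= min_fraction
    return np.log2(tpm[:, keep] + 1.0), keep

# Toy TPM matrix (5 samples x 3 genes); the values are made up for illustration.
toy_tpm = np.array([
    [0.0, 3.0, 0.2],
    [0.5, 7.0, 0.0],
    [0.1, 1.0, 0.4],
    [0.0, 15.0, 2.0],
    [0.2, 0.0, 1.5],
])
log_expr, kept = filter_and_log_transform(toy_tpm)
```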

Whole slide images (WSI) were downloaded from the GTEx Histological Image Viewer (https://gtexportal.org/home/histologyPage) and divided into smaller patches of \([4096 \times 4096]\) pixels (with 35 patches per WSI on average). Then, for each patch k, we extracted two types of features, \([CLS]_{4096}^{(k)}\) and \([CLS]_{256}^{(j)}\) (where j is the index of the j-th patch of \([256 \times 256]\) within the k-th patch of \([4096 \times 4096]\), and [CLS] is the classifier token), utilizing the HIPT architecture. We further aggregated and combined information from these two feature sets, yielding three types of WSI-level features: \([CLS]_{4096}^{(WSI)}\), \([CLS]_{256}^{(WSI)}\) and \([CLS]_{4096,256}^{(WSI)}\).

Telomere length (TL) data generated in Demanelis et al. (2020) was downloaded from the GTEx portal. This data type is represented by a single feature, relative telomere length, which measures telomere repeat abundance in a DNA sample relative to a standard sample.

Fig. 1

Summary of the pipeline implemented for model training. Data was retrieved from the GTEx portal and repositories containing data generated for the companion papers. We split the data from each tissue and each modality into train-test sets, with the test samples being common to all data modalities within a tissue. Model optimization was performed using fivefold cross-validation. Feature selection was implemented to decrease the feature space in gene expression and DNA methylation. Histological features were extracted from images using the HIPT architecture. To avoid data leakage, feature selection was performed independently for each fold. Model training was performed using a variety of algorithms. The final model evaluation was performed on the common test set

4 Overview of prediction model methodology

For each tissue, we split all data modalities into train-test sets. We verified that age labels were skewed to older individuals (Fig. 2c, d). Consequently, during the train-test split, we ensured the representation of younger individuals in the dataset. Specifically, we performed a stratified partitioning of the methylation dataset with an 80-20 overall split and a 50-50 split for younger ages (< 45 years old). This ensured that younger individuals were represented in the test dataset, allowing us to evaluate the ability of the trained models to generalize in this age range. Test samples were common to all data modalities. For the remaining data modalities, samples not included in the common test set were used for model training.
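One possible reading of this age-aware partitioning is sketched below: donors younger than 45 are split 50-50, and older donors fill the remainder of a test set that amounts to 20% of all samples. This is an assumption about the procedure, not the authors' exact implementation, and the function name and toy ages are hypothetical.

```python
import numpy as np

def age_aware_split(ages, test_fraction=0.2, young_cutoff=45, young_test_fraction=0.5, seed=0):
    """Return (train_idx, test_idx): young donors are split 50-50, older donors fill the rest of the test set."""
    rng = np.random.default_rng(seed)
    ages = np.asarray(ages)
    young = rng.permutation(np.flatnonzero(ages < young_cutoff))
    older = rng.permutation(np.flatnonzero(ages >= young_cutoff))

    n_test_total = int(round(test_fraction * ages.size))
    n_test_young = int(round(young_test_fraction * young.size))
    n_test_older = max(n_test_total - n_test_young, 0)

    test_idx = np.concatenate([young[:n_test_young], older[:n_test_older]])
    train_idx = np.concatenate([young[n_test_young:], older[n_test_older:]])
    return np.sort(train_idx), np.sort(test_idx)

# Toy cohort skewed towards older donors (4 young, 36 older); ages are made up.
ages = np.array([25, 31, 38, 44] + [50 + i % 20 for i in range(36)])
train_idx, test_idx = age_aware_split(ages)
```

With this scheme the test set over-represents younger donors relative to the full cohort, which is what allows generalization in that age range to be evaluated.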

Fig. 2

Number of samples in each data modality in lung (a) and ovary data (b). The overlap between data types was not complete. Common samples represent the number of samples in common across the 4 data types. Distribution of age in the train set of each data modality in lung (c) and ovary (d) (Color figure online)

State-of-the-art age prediction models were applied to the epigenetic data in order to establish a baseline for model comparison. These models are based on either selected methylation features (Hannum et al., 2013; Horvath, 2013) or all the features contained in the Illumina Infinium HumanMethylation27 array (Lima Camillo et al., 2022). A percentage of these features was missing from our processed epigenetic dataset. Before applying each model to our data, we imputed the corresponding missing features by taking the mean of the \(\beta \)-values of all the features in each sample. We note that this processing step was only performed in the context of age prediction using these models, and the imputed features were not considered in our model training pipeline. After data imputation, each methylation model was applied to the train-test set.
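The imputation step can be illustrated with a short sketch: every CpG required by a clock but absent from the dataset receives, for each sample, the mean \(\beta \)-value over all measured features of that sample. The function name and the CpG identifiers in the example are placeholders, not real probe names.

```python
import numpy as np

def impute_missing_cpgs(beta, cpg_ids, required_ids):
    """Assemble the CpGs a clock requires; absent sites get each sample's mean beta-value.

    beta: (n_samples, n_cpgs) beta-values; cpg_ids: column identifiers of beta.
    """
    beta = np.asarray(beta, dtype=float)
    column_of = {cpg: j for j, cpg in enumerate(cpg_ids)}
    sample_means = beta.mean(axis=1)  # mean over all measured features, per sample
    out = np.empty((beta.shape[0], len(required_ids)))
    for j, cpg in enumerate(required_ids):
        out[:, j] = beta[:, column_of[cpg]] if cpg in column_of else sample_means
    return out

# Placeholder identifiers: "cg_B" is measured, "cg_X" is missing and gets imputed.
clock_input = impute_missing_cpgs(
    beta=[[0.2, 0.8], [0.4, 0.6]],
    cpg_ids=["cg_A", "cg_B"],
    required_ids=["cg_B", "cg_X"],
)
```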

The ML pipelines were optimized using the optuna package (Akiba et al., 2019) in a fivefold cross-validation (CV) strategy. Briefly, we divided the train dataset into 5 partitions (folds). We then trained the model on four folds (training folds), while the remaining fold was held out for validation. We repeated this process 5 times, with each fold serving as the validation fold once. All steps of the pipeline, including feature selection and Label Distribution Smoothing (LDS), were performed using only the training folds to avoid data leakage. The best combination of parameters was selected based on the average mean absolute error (MAE). These parameters were then applied to the whole train data to train the final model. Alternatively, nested CV could have been applied, completely separating the tasks of hyper-parameter tuning and performance evaluation.
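The leakage-free CV scheme can be sketched with scikit-learn by wrapping feature selection and the regressor in a single pipeline, so that selection is refit on the training folds only. This is a simplified illustration under stated assumptions: a plain loop over a few `alpha` values stands in for the optuna search, univariate F-test selection stands in for the limma-based selection, and the data are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 100 samples, 500 features, two of which carry the age signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
age = 50 + 5 * X[:, 0] - 4 * X[:, 1] + rng.normal(scale=2.0, size=100)

def make_pipeline(alpha):
    # Feature selection lives inside the pipeline, so it only sees training folds.
    return Pipeline([
        ("select", SelectKBest(f_regression, k=50)),
        ("model", ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000)),
    ])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mae = {}
for alpha in (0.01, 0.1, 1.0):  # stand-in for an optuna search space
    scores = cross_val_score(make_pipeline(alpha), X, age, cv=cv,
                             scoring="neg_mean_absolute_error")
    cv_mae[alpha] = -scores.mean()

# Refit on the whole train data with the best parameters.
best_alpha = min(cv_mae, key=cv_mae.get)
final_model = make_pipeline(best_alpha).fit(X, age)
```

Selecting features on the full train data before splitting would let the validation folds influence which features are kept, which is the leakage the fold-wise design avoids.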

Due to the high dimensionality of the molecular datasets (gene expression and methylation), we evaluated several feature selection methods in order to decrease the feature space (Section A.5). In particular, we identified methylation and gene expression features that significantly change with age in each tissue by performing differential expression analysis using limma (Ritchie et al., 2015). Briefly, this method evaluates changes in expression/methylation by fitting feature-wise linear models using the target variable (in this case age). This framework allows for the inclusion of other covariates in order to correct for potentially confounding effects. As such, we also included both technical (e.g. ischemic time and RIN) and biological (sex and BMI) covariates. Elastic net served as an additional feature selection method and was applied before the training of non-linear models.

Elastic Net (EN) and Gradient Boosting Trees (GBT) algorithms were used to build the predictive models for methylation and gene expression, using their implementations in the sklearn and lightgbm (Ke et al., 2017) Python packages respectively. This allowed us to test linear and non-linear algorithms for age prediction using molecular data. Furthermore, both EN and GBT are computationally efficient, easy to interpret, and have previously shown good results for prediction tasks using molecular data (Takahashi et al., 2020; Horvath, 2013; Hannum et al., 2013).

A Gaussian quantile transformation was applied prior to model training to reduce the right-skewed distribution. This process spread out the values, particularly those centered around 0, and decreased the influence of outliers. To further address age skewness, we implemented sample re-weighting using LDS (Yang et al., 2021).
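A minimal sketch of LDS-based re-weighting, following the idea in Yang et al. (2021): the empirical age histogram is smoothed with a symmetric kernel, and each sample is weighted by the inverse of the smoothed density at its age. The bin width, kernel size and `sigma` below are illustrative assumptions rather than the values used in this work.

```python
import numpy as np

def lds_weights(ages, bin_width=1.0, sigma=2.0, kernel_half_width=5):
    """Sample weights proportional to the inverse of the kernel-smoothed age density."""
    ages = np.asarray(ages, dtype=float)
    bins = np.floor((ages - ages.min()) / bin_width).astype(int)
    counts = np.bincount(bins).astype(float)

    # Normalized Gaussian kernel over neighbouring age bins.
    offsets = np.arange(-kernel_half_width, kernel_half_width + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()

    # Zero-pad so the smoothed histogram has one value per original bin.
    smoothed = np.convolve(np.pad(counts, kernel_half_width), kernel, mode="valid")
    weights = 1.0 / smoothed[bins]
    return weights * weights.size / weights.sum()  # rescale to mean 1

# Toy cohort skewed towards older donors; ages are made up for illustration.
toy_ages = np.array([25, 30] + [60] * 10 + [61] * 10 + [62] * 8)
weights = lds_weights(toy_ages)
```

The resulting weights can be passed as `sample_weight` when fitting a model, so that errors on under-represented ages count more during training.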

Regarding the histological data models, for each Whole Slide Image (WSI) we extracted features using the pre-trained HIPT model (Chen et al., 2023). Similar to molecular datasets, we used EN for age prediction. Furthermore, we also implemented a custom-designed Multi-Layer Perceptron (MLP) that served as an encoder, forcing the dimensionality of its input data to be progressively reduced throughout the forward propagation process, followed by a linear layer for age prediction.

Similarly to the molecular datasets, we applied the LDS technique to address age skewness. We assessed TL, measured using Luminex-based methods (Demanelis et al., 2020), as a predictive biomarker of age using a multiple linear regression model including this feature and other demographic variables of the donors.

In the context of multi-modal modeling, we systematically explored various permutations of data modalities and performed integration using EN, GBT, and MLP. In order to ensure comparability across different combinations of data types, we constrained model training to the samples encompassed by all four modalities (Sect. 5.4) or shared samples between gene expression and histology (Sect. 5.5). Furthermore, we repeated the training process for each data type using only the shared samples in order to assess the benefits of multi-modal data integration.

We compared the performance of data modalities and combinations of modalities using Nadeau and Bengio’s variance-corrected t-test, comparing the MAE of the tested model with that of the corresponding baseline across the five folds.
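For reference, the corrected resampled t-test of Nadeau and Bengio inflates the variance of the fold-wise differences by the factor \(1/k + n_{test}/n_{train}\) to account for overlapping training sets. The sketch below assumes a two-sided test with \(k - 1\) degrees of freedom; the fold-wise MAE values and fold sizes are hypothetical and chosen only to make the example runnable.

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(errors_model, errors_baseline, n_train, n_test):
    """Nadeau-Bengio variance-corrected t-test on paired fold-wise errors."""
    diff = np.asarray(errors_model, dtype=float) - np.asarray(errors_baseline, dtype=float)
    k = diff.size
    # The naive variance term 1/k is corrected by the test/train size ratio.
    corrected_var = (1.0 / k + n_test / n_train) * diff.var(ddof=1)
    t_stat = diff.mean() / np.sqrt(corrected_var)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k - 1)  # two-sided (assumption)
    return t_stat, p_value

# Hypothetical fold-wise MAEs for a tested model and its baseline.
mae_model = [3.6, 3.9, 3.4, 4.1, 3.8]
mae_baseline = [4.2, 4.6, 4.0, 4.5, 4.4]
t_stat, p_value = corrected_resampled_ttest(mae_model, mae_baseline, n_train=80, n_test=20)
```

A negative statistic indicates that the tested model has a lower MAE than the baseline.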

A more detailed overview of our framework can be found in Appendix A (Supplementary Methods). The code used in this work is available at https://github.com/zroger49/multi_modal_age_prediction.

5 Experimental results

5.1 State-of-the-art methylation clocks performance

Table 1 Tissue-specific performance of the best models

In order to establish the current state-of-the-art performance and a baseline for all future model comparisons, we tested the performance of three epigenetic-based models on our data: the Horvath clock (Horvath, 2013), the Hannum clock (Hannum et al., 2013) and AltumAge (Lima Camillo et al., 2022). Each model was applied independently to the train-test sets (Table 1, Supplementary Table 1). The performance obtained for the Horvath clock and AltumAge in lung data (train set MAE = 4.39 and 4.07 respectively) was within 1 year of the reported accuracy for these tissues (Horvath, 2013; Lima Camillo et al., 2022). The Hannum clock yielded the largest error (train set MAE 7.23). Although lung tissue was used to validate this model (Hannum et al., 2013), the authors did not report tissue-specific error rates. Applying the same models to ovary data yielded a MAE >10 years. Although both the Horvath clock and AltumAge have been reported to yield similar error rates in certain tissues (Horvath, 2013; Lima Camillo et al., 2022), ovary data was not present in either their train or validation datasets, and therefore a direct comparison is not possible. We further applied these models to two additional tissues: colon and prostate (Supplementary Table 1). Analyzing all four tissues we observed that, for our data:

  • The Hannum clock yields large errors (MAE >10 years) for all tissues except lung, where it still performs worse than the other two models.

  • Overall, AltumAge has lower MAE compared with the Horvath clock.

  • All molecular clocks yield large error rates (MAE >10 years) in the ovary (with the exception of AltumAge in the test set; MAE = 8.74).

The poor performance of the Hannum clock might be due to the high percentage of missing features in the processed methylation dataset (9 of 71 features, 13%). The Horvath clock and AltumAge also had missing features, although to a lesser extent (24 of 354, 6.7%, and 524 of 20,318, 2.6%, respectively). Indeed, in order to apply these models to our data, we had to perform a data imputation step (Section A.2), which might affect age estimation, since the imputed values might not completely reflect the true methylation levels.

With the results delineated above, we selected AltumAge as the baseline for comparison with our methylation models.

5.2 Model performance within individual data modalities

Fig. 3

Performance of single data modality models on the test set. Panels a–d represent the performance in lung for methylation (a), gene expression (b), histological images (c) and telomeres (d), while panels e–h represent the performance in ovary for methylation (e), gene expression (f), histological images (g) and telomeres (h). mae, mean absolute error; rmse, root mean squared error; med, median absolute error; cor, correlation; R2, coefficient of determination (Color figure online)

We developed tissue-specific aging models for each data type using several machine learning algorithms, including linear (EN) and non-linear (GBT and NN) algorithms. The performance of the best model is summarized in Table 1 and Fig. 3a–d for lung, e–h for ovary. EN models were found to be the most effective for both methylation and gene expression, while neural networks were found to be more effective for modeling histological images. A more complete version of this table, along with the performance of other tested approaches, is presented in Supplementary Table 2.

To ensure fairness in performance comparison across data modalities, we defined a common test set across the four types of data (Fig. 1). Despite being the data modality with the lowest sample size, methylation models achieved the lowest error rates both during CV and in the test set across both tissues (lung test MAE 3.36, Fig. 3a–d; ovary test MAE 4.36, Fig. 3e–h).

In order to compare the performance of our models with the state-of-the-art approach, we derived the fivefold performance of AltumAge by dividing its predicted ages into 5 sets, following the same splits used for CV during model training. The lung model performed significantly better than the AltumAge model (p value < 0.05). The ovary methylation model also performed significantly better than AltumAge in this tissue, but this comparison is less meaningful, since the multi-tissue methylation clocks did not generalize to this tissue (Supplementary Table 1). The improvement observed across models can be attributed to the fact that AltumAge was trained on a multi-tissue dataset (Lima Camillo et al., 2022), and therefore is not as accurate in capturing the tissue-specific methylation patterns observed in lung and ovary (Christensen et al., 2009).

Gene expression and histological images have very similar performances, with MAE of 5–6 years for gene expression and 5–7 years for histology. The error rates obtained for gene expression are within the range of previously reported results (Peters et al., 2015; Fleischer et al., 2018; Mamoshina et al., 2018; Xia et al., 2020; Holzscheck et al., 2021; Meyer & Schumacher, 2021). In contrast, for histological images, we observed a major improvement in performance compared with our previous work (Moraes et al., 2023) performed on the same dataset (lung test MAE 5.37 vs 8.15; lung test R2 0.69 vs 0.35). Finally, telomere length yielded the worst models (CV MAE \(\sim \) 9 years, test MAE >10 years). Nevertheless, TL showed a significant negative correlation with age in lung even after correcting for demographic variables (p value < 0.05), in line with previous results (Demanelis et al., 2020).

5.3 Model interpretation

Fig. 4

SHAP values for DNA methylation and gene expression age predictors. Results are shown for the a lung methylation model, b ovary methylation model, c lung gene expression model and d ovary gene expression model. SHAP values were computed based on the test set. Only the top 10 features (mean absolute SHAP) in each model are represented. In methylation models, the features are represented by the CpG site identifier and the nearest gene. Dots are colored according to their scaled \(\beta \)-values (methylation models) or TPM values (gene expression), with red dots representing higher values and blue dots representing lower values. The position in the X-axis corresponds to negative (decrease in age) or positive (increase in age) feature impact (Color figure online)

To identify and rank the most important features in methylation and gene expression models for each tissue, we employed SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017). We identified several features that have been previously described as related to aging in human tissues. For example, in the lung methylation model, the most important feature, cg27320127 (Fig. 4a), is one of the predictive features in at least four previously described epigenetic age models (Jones et al., 2015). Also in the lung model, cg00590036, the third most important feature, was found to be altered with age in adipose tissue (Rönn et al., 2015). In ovary, cg05708550, a site near the KDM3B gene, was the fifth most important feature, with higher methylation levels leading to higher age predictions (Fig. 4b). Although no studies have associated this site directly with aging, a decrease in KDM3B expression has been correlated with retinal disruption (An et al., 2022).

In gene expression, ZNF518B, the most important gene in the lung model, has been linked to human longevity and is negatively correlated with age (Bou Sleiman et al., 2020), consistent with the findings of this work (Fig. 4c). In contrast, the expression of the gene EDA2R has been found to be positively correlated with age in several human tissues, including lung (Jeong et al., 2020; Melé et al., 2015). In ovary tissue, the most important gene was TRIM59 (Fig. 4d), which was negatively correlated with age. Interestingly, previous epigenetic studies have identified methylation markers related to aging in this gene (Jung et al., 2019; Wezyk et al., 2018).

In summary, our findings indicate that our models effectively capture molecular age-related features in human tissues, unveiling potential aging biomarkers, some of which have not been previously reported in the literature.

5.4 Combining methylation with other data modalities has a limited effect on model performance

Table 2 Tissue-specific performance of the multi-modal models using EN

Despite having fewer samples, methylation data outperformed the other three data types in age prediction across both tissues in the test set (Table 1). We set out to investigate if integrating other data modalities with methylation would improve model performance. We trained several models with different data combinations: methylation + gene expression, methylation + histology, methylation + gene expression + histology and methylation + gene expression + histology + telomere length. To ensure a fair comparison, we trained these models using only the common samples across the four data modalities. Furthermore, as a baseline, we retrained the methylation model using the same subset of samples.

First, we evaluated the performance using EN, since this was the method that performed best for both methylation and gene expression. Combining methylation with gene expression and histology led to the best results in both tissues in the cross-validation with a marginal and non-significant difference (3.75 ± 0.56 vs 3.80 ± 0.5 MAE in lung, p value = 0.36; 4.17±0.46 vs 4.35±0.33 MAE in ovary, p value = 0.29). All the models also showed a similar performance in the test set, with the model including methylation and histology having slightly lower MAE.

We obtained similar results when integrating the data using GBT and NN (Supplementary Table 3). In the former, although the performance was better in CV when compared with EN, the MAE in the test set was higher (+0.5 to 1 year). Employing NN, we observed enhanced performance by combining methylation with gene expression and histological images compared to integrating methylation with each data type individually (p value < 0.05) in ovary tissue, while in lung tissue, integrating these three data modalities was equivalent to combining methylation with gene expression (p value > 0.05). Nevertheless, the performance of the combined models was similar to the downsampled EN methylation models (p value > 0.05).

SHAP analysis (Lundberg & Lee, 2017) was used in the EN models to quantify the contribution of each data type. Methylation was the most significant data type, with a total feature contribution ranging from 81.14% to 98.65% (Fig. 5). It should be noted that methylation also has the highest number of features per model. Telomere length did not contribute to model prediction. We analyzed the SHAP values of the model integrating methylation, gene expression and histology to get a sense of the most important features when integrating these data modalities. In lung we recapitulate several features, such as cg27320127 and cg00590036. Among the genes in the top 15 features we found H4C12 (H4 Clustered Histone 12), of particular interest, since the depletion of histones has been shown to occur during human aging (Dubey et al., 2024). In ovary we found that a long non-coding RNA (EPHA5-AS1) gene had the highest SHAP values, with an increasing expression with age. No previous study has linked this gene with aging.
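The per-modality contribution can be illustrated with a small sketch. For a linear model with features treated as independent, the SHAP value of feature j for a sample is the coefficient times the deviation of the feature from its background mean. The aggregation used here (summing mean absolute SHAP values within each modality and normalizing to 100%) is an assumption about how the percentages were derived, and all names and numbers are hypothetical.

```python
import numpy as np

def modality_contributions(coef, X_background, X_explain, modality_of_feature):
    """Percentage of total mean |SHAP| attributed to each data modality for a linear model."""
    coef = np.asarray(coef, dtype=float)
    background_mean = np.asarray(X_background, dtype=float).mean(axis=0)
    # Linear SHAP under feature independence: phi_ij = w_j * (x_ij - E[x_j]).
    shap_values = coef * (np.asarray(X_explain, dtype=float) - background_mean)
    importance = np.abs(shap_values).mean(axis=0)
    labels = np.asarray(modality_of_feature)
    total = importance.sum()
    return {str(m): 100.0 * importance[labels == m].sum() / total for m in np.unique(labels)}

# Toy model with one feature per modality; values are made up for illustration.
X_toy = np.array([[0.0, 0.0, 5.0], [2.0, 2.0, 7.0]])
contrib = modality_contributions(
    coef=[2.0, -1.0, 0.0],
    X_background=X_toy,
    X_explain=X_toy,
    modality_of_feature=["methyl", "gexp", "tl"],
)
```

Because the sum runs over features, a modality with many selected features can accumulate a large share even when individual features are only moderately important, which is why the number of features per model matters when reading these percentages.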

In summary, our results demonstrate that, for these tissues, integrating methylation data with histological and gene expression data leads to only marginal improvements in model performance.

Fig. 5

SHAP analysis for the multi-modal models using methylation data. a Percentage of contribution of each data type for the final model prediction (methyl, DNA methylation; Gexp, gene expression; Hist, histological features; TL, telomere length). b, c SHAP values for the model including methylation, gene expression and histological images in (b) lung and (c) ovary. SHAP values were computed based on the test set. Only the top 15 features (based on mean absolute SHAP) in each model are represented. Methylation features are represented by the CpG site identifier and the nearest gene. Red dots represent higher values and blue dots represent lower values. The position in the X-axis corresponds to negative (decrease in age) or positive (increase in age) feature impact (Color figure online)

5.5 Combining gene expression with histology improves their individual model performance

The small number of samples available could be hampering performance improvements in the multi-modal models. As methylation is one of the main hallmarks of aging (López-Otín et al., 2023), it is expected to require fewer samples to accurately model age compared with histology and gene expression.

We performed a second integration experiment by combining histology and gene expression, and comparing the performance of these multi-modal models with their single data type variant (Table 3, Supplementary Table 4). Similar to the previous section, we trained individual gene expression and histological models using only the common samples between these two data modalities. The EN model combining gene expression and histology performed significantly better in lung compared with the model using only gene expression (p value = 0.005) and histology (p value = \(2.363 \times 10^{-9}\)). In ovary, we verified a similar tendency. In both cases, we also verified an improvement in the test set.

Table 3 Tissue-specific performance of the multi-modal models combining gene expression and histological images

6 Discussion

In this work, we have leveraged four data types from the GTEx dataset to build age prediction models in two human tissues: lung and ovary (Fig. 1). Methylation and gene expression models performed in line with previous studies (Peters et al., 2015; Fleischer et al., 2018; Mamoshina et al., 2018; Xia et al., 2020; Holzscheck et al., 2021; Meyer & Schumacher, 2021; Horvath, 2013; Hannum et al., 2013; Lima Camillo et al., 2022). As expected, methylation models were the best performing despite having the lowest sample size, indicating the presence of a stronger signal for this data type.

The success of histological images for age prediction has been limited and requires further exploration. In a recent work, we explored this idea with lung data (Moraes et al., 2023). We hypothesized that a contributing factor to the limited performance of the histological models was the loss of contextual information caused by the selection of [256 \(\times \) 256] pixel patches, which focused solely on cellular-level features, therefore potentially overlooking broader tissue structures and contextual cues that might be useful for an accurate determination of age. To get around this problem, Chen et al. (2023) proposed a Vision Transformer (ViT)-based architecture, HIPT, to extract meaningful features from WSI considering three main resolutions: [16 \(\times \) 16] (cellular features), [256 \(\times \) 256] (cellular organization), and [4096 \(\times \) 4096] (tissue phenotypes). In this study, we applied this technique and developed an age model based on histological images which significantly improves on the previous work (test MAE from 8.15 to 6.29; test R2 from 0.35 to 0.59). Furthermore, we extended our framework beyond a single tissue and applied the same techniques in ovary.

Integrating methylation with other data types yielded limited improvement in predictive performance (Table 2). This may result from the low number of common samples, with methylation benefiting from a stronger association with aging. We performed a second experiment, focusing only on histology and gene expression data, since these modalities shared the highest number of common samples. We observed that combining these two data types significantly improved age prediction compared to their individual performances, thereby demonstrating the potential of integrating different data types in age prediction models.

This work highlighted several challenges in working with biological multi-modal datasets. First, the molecular data has more features than samples, with roughly 750,000 features for fewer than 200 samples in methylation and around 20,000 features for 128 to 527 samples in gene expression. Feature selection was an important step for substantially reducing the number of input features while maintaining or even improving model performance. Second, the datasets are skewed towards older individuals, with methylation being the most affected due to its smaller sample size. To address this issue, we implemented sample re-weighting with label distribution smoothing (Yang et al., 2021). Moreover, we stratified ages during cross-validation and the train-test split, a method demonstrated to enable fairer comparisons among individuals within similar age ranges (Li et al., 2023).

Another limitation of the analysis conducted herein is that the GTEx samples are obtained post-mortem, and the data might be skewed by a number of confounding factors, including ischemic time and cause of death, that might alter the molecular fingerprint. We attempted to account for confounding factors during feature selection through differential analysis, where we explicitly corrected for ischemic time and the Hardy scale (a classification of the circumstances and speed of death), as well as a number of other covariates. Despite this, the applicability of the developed models and methodology requires further evaluation and investigation.

The integration of multiple data types from the same cohort remains largely unexplored and therefore constitutes a strong point of our work. To date, there have been relatively few instances of aging research utilizing multi-modal models in the literature (Urban et al., 2023; Solovev et al., 2020). Our work pioneers the integration of transcriptomics, epigenomics, telomere data, and histological images from the same population, offering novel insights into aging research.