The AccelerAge framework: a new statistical approach to predict biological age based on time-to-event data

Aging is a multifaceted and intricate physiological process characterized by a gradual decline in functional capacity, leading to increased susceptibility to diseases and mortality. While chronological age serves as a strong risk factor for age-related health conditions, considerable heterogeneity exists in the aging trajectories of individuals, suggesting that biological age may provide a more nuanced understanding of the aging process. However, the concept of biological age lacks a clear operationalization, leading to the development of various biological age predictors without a solid statistical foundation. This paper addresses these limitations by proposing a comprehensive operationalization of biological age, introducing the “AccelerAge” framework for predicting biological age, and introducing previously underutilized evaluation measures for assessing the performance of biological age predictors. The AccelerAge framework, based on Accelerated Failure Time (AFT) models, directly models the effect of candidate predictors of aging on an individual’s survival time, aligning with the prevalent metaphor of aging as a clock. We compare predictors based on the AccelerAge framework to a predictor based on the GrimAge predictor, which is considered one of the best-performing biological age predictors, using simulated data as well as data from the UK Biobank and the Leiden Longevity Study. Our approach seeks to establish a robust statistical foundation for biological age clocks, enabling a more accurate and interpretable assessment of an individual’s aging status. Supplementary Information The online version contains supplementary material available at 10.1007/s10654-024-01114-8.

1 Gompertz distribution as a PH and as an AFT model Although the Gompertz distribution is usually parameterized as a Proportional Hazards (PH) model, it can also be an Accelerated Failure Time (AFT) model (but, contrary to the Weibull distribution, not at the same time).However, it requires a different parameterization to see this (Broström [2021]), as will be illustrated below.
A regression model adheres to the proportional hazards assumption if it fulfills the assumption that where h 0 (t) is the baseline hazard.This is the proportional hazards assumption: the ratio of h(t)/h 0 (t) is the constant exp(β T X).
A model adheres to the accelerated failure time assumption if it fulfills the assumption that where S 0 (t) is the baseline survival.The expression exp(β T X) is also known as the 'acceleration factor' in the context of AFT models.

Gompertz as a PH model
The usual Gompertz parametrization is the rate parametrization, where the hazard is given by h(t) = a exp(bt). (3) Here, a is generally called the rate parameter and b the shape parameter.Now we have to choose how to insert the linear predictor and show that the resulting model adheres to assumption (1).For simplicity, assume X is binary.Insert the linear predictor by reparameterizing a = exp(β T X).Then the hazard ratio (X = 1 vs. X = 0) equals such that the proportional hazards assumption of (1) is met.

Gompertz as an AFT model
To see that a Gompertz regression model can also be parameterized as an AFT model, rewrite the hazard as where σ = 1 b and τ = a b .Then the survival function can also be written in canonical form: The next step is again to insert the linear predictor somewhere and show that the resulting model adheres to 2. For simplicity, assume X is binary.Take 1 σ as the linear predictor exp(β T X).Then and Now note that which is equal to expression (7).Hence, assumption (2) is satisfied.If the linear predictor would have been inserted in a different way, for example taking σ as the linear predictor exp(β T X), then the above equation would not have been equal to (7).

GrimAge in detail
In both the simulation study as well as the real data application we fitted our own version of GrimAge, because we do not use DNAm data as predictor variables.Therefore we could not use the weights of the published GrimAge predictor, but we repeated the approach used to construct the original GrimAge predictor with different predictor variables.Here we describe our implementation and the differences with the approach to fit the original GrimAge predictor.
In the original publication [Lu et al. 2019], the GrimAge predictor is constructed following a two-step approach.In the first stage, DNAm-based surrogate biomarkers of smoking pack-years and a selection of plasma proteins are defined.In the second stage, together with chronological age and sex these surrogate biomarkers are included in an elastic net Cox Proportional Hazards model with time to all-cause mortality as outcome, after which the linear predictors of this model are transformed to an age-scale.The authors justify this two-step approach because the single-step approach (directly regressing CpG-sites on time to all-cause mortality) resulted in a less significant p-value.The transformation to an age-scale depends on the mean and standard deviation of chronological age in the dataset that the GrimAge predictor is fitted on.
Measurements of individual CpG-sites are known to be quite unreliable [Sugden et al. 2020].A solution to this was recently suggested by Higgins-Chen et al. [2021], who used principal components as predictors: the first step of defining surrogate markers is hence replaced by the step of finding the principal components.This resulted in more reliable DNAm-based clocks and has the added benefit of creating a set of uncorrelated predictors.When fitting our version of GrimAge, metabo-GrimAge, on the UK Biobank data we hence followed this approach: we did not construct surrogate markers, but did a principle component analysis on our candidate markers of aging and included these principle components as the predictor variables in step 2. (In the simulation study, we only considered two independently generated predictor variables, so no dimension reduction step was required in the first place.)Our step 2 is exactly similar to step 2 in the original GrimAge model-fitting approach.

First stage
• Perform principal component analysis on the set of predictor variables.
• Include the first i principal components that collectively explain at least 95% of the variance as predictor variables in step 2.

Second stage
• On the training data, fit a Cox PH model with follow-up time as the outcome variable (i.e.time-on-study as timescale t) and C, sex and the principal components from step 1 as the predictor variables.
• Obtain the linear predictors: βT X train .
• Determine coefficients a and b such that the mean and standard deviation of the linearly transformed linear predictors a + b(β T X train ) are the same as the mean and standard deviation of C in the training data.
• Obtain linear predictors for the data set of interest (the test data, or some new data set).
• Linearly transform these linear predictors using the values for a and b as estimated earlier.This results in the GrimAge prediction: • To get ∆: regress GrimAge on C and obtain the residuals.
3 How to draw survival times

Proportional Hazards
Following Bender et al. [2005], the survival function of a Cox PH model is given by and hence the cumulative distribution function is given by Let Y be a random variable with distribution function Let T be the survival times from the Cox model.It follows that Finally, solve (13) for T : Accelerated Failure Time The survival function of an AFT model is given by and hence the cumulative distribution function is given by Let Y be a random variable with distribution function F .Then U = F (Y ) ∼ U [0, 1].Let T be the survival times from the AFT model.It follows that Finally, solve (18) for T : 4 Additional details on study populations   Comparison survival of all UKB participants, the included UKB participants and the UK's general population

Simulation study results semiparametric and flexible parametric AFT
This section of the supplementary material contains the results of a simulation study that includes two additional predictors, based on a semiparametric and flexible parametric AccelerAge approach.All other details of the simulation study are exactly as described in the main manuscript.
Results can be found in Figures 4, 5 and 6 As mentioned, the flexible parametric AFT model is numerically delicate to fit and often had troubles to converge (in particular for the Gompertz AFT case).We excluded simulation runs for which this occurred from the results.Table 2 contain the number of simulation runs (out of n sim = 200 for which this was the case.Results are hence based on fewer simulation runs, but when comparing the figures in this section with those in the main manuscript, it can be seen that the performance of the three methods that are included in both is almost identical. To fit the semiparametric AFT, we use the approach of Stute [1993], described in more detail in the next section of this document.The flexible parametric AFT models were fitted using the function aft() from the R-package rstpm2 [Liu et al. 2018] with df = 3.Its performance is as expected, but in our implementation convergence proved to be extremely slow.6 The semiparametric AFT The semiparametric AFT model does not assume an underlying parametric baseline hazard so this needs to be estimated.We use the weighted least squares method, based on Kaplan-Meier weights, as suggested by Stute [1993].Estimates for the coefficients are found via: where log(T (i) ) is the i th ordered value of the observed response log(T ) and X (i) is the corresponding covariate.If there is no delayed entry, W contains the Kaplan-Meier weights (the successive increments of the Kaplan-Meier estimator) of T (i) .If there is, W is adjusted to: where L j is the left truncation time of individual j.In other words, the risk set is adjusted.Note that since the Kaplan-Meier estimator does not change when an individual is censored, W i /W L i is zero for each censored individual.
Figure 1: Distribution of chronological ages of the included UKB samples.

Figure 2 :
Figure 2: Distribution of chronological ages of the included LLS samples.

Figure 3 :
Figure 3: Comparison of survival curves of all UKB participants, the subset of UKB participants with metabolite measurements who were included in the analysis, and the UK general population (the period lifetable of 2018-2020) as provided by the Office for National Statistics.All curves are stratified by sex.The population survival curves were scaled such that they only start decreasing from age 40 onward, to avoid an unfair comparison due to the immortal time bias present in the UKB data.

Figure 4 :Figure 5 :Figure 6 :
Figure4: Performance of the five different biological age predictors in terms of the root-mean-square error under the Gompertz-PH data-generating mechanism.Results are reported for data sets of varying sizes (n obs = 500, 2500, 5000, 7500 and 10,000) as the average root-mean-square error over a simulation sample size of n sim = 200.

Table 1 :
Prevalence of (chronic) diseases in UK Biobank sample and Leiden Longevity Study sample.Prevalence is defined as having been diagnosed with the disease at any time prior to inclusion in one of the two studies.These data were taken from the Electronic Health records of participants, except for the category 'cancer (any type)', which was self-reported.

Table 2 :
Number of simulation runs (out of n sim = 200) for which the flexible parametric AFT model (AccelerAge-flexpar) did not properly converge and which were subsequently excluded from the results.