Background

Many studies have been conducted to quantitatively estimate biological age using measurable biomarkers since a study by Comfort in 1969 [1]. Biological age can represent a person’s aging status more appropriately compared to chronological age because, while chronological age is just a period of living which does not consider person’s health status, biological age is associated with health status, which is closely related with aging. There are no standardized or widely accepted sets of biomarkers for estimating biological age; nonetheless, several researchers have suggested some criteria that should be satisfied by biomarkers of aging [24], which included age-related change, nonlethal measurability, essential factor for health, fundamental reflection of biological or bodily process, and highly reproducibility. By considering these criteria, some clear set of requirements for biomarkers has been discussed in the research by Sharman and Zhumadilov [5]: 1) provide information about the functional condition of the body, its metabolic and regulatory systems, 2) have quantitative characteristics that correlate with age, 3) be reproducible, sensitive, and specific, 4) be suitable for use in humans as well as in laboratory animals. Since the biological age study began, by considering these criteria, various biomarkers including physical, physiological, or biochemical parameters have been used to compute biological age [610]. Some biomarkers were commonly found in these studies: body mass index (BMI), blood pressure (BP), waist circumference (WC), forced expiratory volume in 1 s (FEV1), body muscle percentage (BMP), body fat percentage (BFP), blood urea nitrogen (BUN), albumin, globulin, high density lipoprotein (HDL), low density lipoprotein (LDL), and/or triglyceride (TG), which are all relatively easily obtainable in a general health check-up. BMI, BP, and WC are associated with physical fitness, and physical strength could be represented by FEV1, BMP and/or BFP. Meanwhile, BUN, albumin, globulin, HDL, LDL, and TG are associated with biochemical factors.

Appropriate statistical algorithm is essential for computing biological age. Currently, two statistical algorithms have been introduced and widely used: multiple linear regression (MLR) and principal component analysis (PCA). MLR had been used widely in the early stage of biological age study; however, since several shortcomings, such as overestimation of biological age for younger people, and an underestimation of biological age for older people, have been raised [1113], PCA algorithm has become an alternative statistical tool to compute biological age [6, 9, 10, 14]. In recent years, a new and somewhat complicated algorithm, developed by Klemera and Doubal, [15] has been used by some researchers [16, 17].

Several studies have been conducted to use biological age as an index for work ability [16], frailty [18], or physical fitness [19]. Levine and Crimmins has shown that biological age could predict 10 year mortality more accurately than other measures, such as Allostatic Load and Framingham Risk Score using 9,942 subjects [20], and shown that differences in biological age relative to chronological age account for disparities in mortality between black and white subjects [21]. In the Mennonite community in Kansas in USA, a close relationship between biological age and mortality has been investigated using 1,009 and 568 subjects, respectively [22, 23]. Meanwhile, Brown and McDaid [24] reviewed a number of studies and identified several risk factors affecting mortality in elderly person: socioeconomic/demographic risk factors such as age, education, gender, income, marital status, occupation, race/ethnicity, religion, and behavioral risk factors such as smoking, alcohol intake, physical activity, and obesity. Based on several risk factors of socioeconomic/demographic and behavioral risk factors, a mortality prediction model has been developed using a Markov chain framework and logistic regression [25]. Zhu et al. have presented a logistic regression model to gain efficiency and effectiveness in addressing a spectrum of mortality risk assessment issues based on US insure mortality experience study using 9 risk factors including gender, smoking status, issue age, and underwriting class [26].

Methods

Study population

The Korean Metabolic Syndrome Mortality Study (KMSMS) is a retrospective cohort study based on private health examination from 1994 to 2004, which included 557,940 Koreans of 20–93 years old who received routine total health check-ups at 15 health examination centers in Korea. The causes of death in Korea were derived from the Korean Statistics Information Service (KOSIS) database. The database was confirmed and guaranteed by national death records from the Statistics Korea (KOSTAT), the national agency of Korea. The KOSTAT estimates that its registry includes more than 99% of the deaths in Korea; cause of death was available beginning in 1992. The KOSTAT records provide the residence registration number (RRN; a unique 13-digit number assigned to all Koreans), the cause of death (International Classification of Diseases, 10th Edition), and the date of death. In Korea, all people must register the birth and death by law. The death certificate including the cause of death was almost written by doctors. Therefore, the cause of death is very reliable. The study data were merged with Cause of Death Database by residence registration number up to the year 2011. The follow-up time for the study population was a minimum of 7 years and extended up to 17 years.

From the total of 15,732 deceased people, 13,106 cases were classified as intrinsic death by referencing a previous research [27]: the numbers of death caused by cancer and non-cancerous diseases were 7,250 and 5,856, respectively. Extrinsic causes of death, such as suicide, accident, infectious disease, pregnancy-related death, or unknown causes were excluded from the study (Additional file 1: Table S1). The research proposal was approved by the Institutional Review Board of Human Research of Yonsei University.

Construction of PCA model for biological age

The statistical engine for computing biological age was developed using a data set including total 469,754 Koreans (277,029 men and 192,725 women), which is different from MSMS data, based on PCA algorithm. From over 60 variables measured in the general health check-up, a total of 15 variables were selected as biomarkers, including WC, systolic blood pressure (SBP), diastolic blood pressure (DBP), FEV1, gamma GTP (G-GTP), BUN, HDL, LDL, TG, fasting blood sugar (FBS), erythrocyte sedimentation rate (ESR), BMI, BFP, BMP, and albumin/globulin ratio (AGR). The biomarker selection process was based on the stepwise variable selection method by maximizing coefficient of determination, which was computed using chronological age and biological age as the dependent variable and independent variable, respectively. More detailed explanation regarding the construction of PCA models to compute biological age is shown in the Additional file 1 (Appendix in Additional file 1). By using the constructed model, biological ages of the subjects consisting of MSMS data were computed at the baseline survey. The age increment of biological age over chronological age was defined as AgeDiff = biological age – chronological age for each subject: AgeDiff was used as a covariate for logistic regression analysis and Cox proportional hazards regression analysis to assess the absolute effect of biological age on mortality.

Statistical analysis

SAS software, Version 9.2.0 (SAS Institute Inc., Cary, NC) and R language ver. 3.0.1 (R Foundation for Statistical Computing, Vienna, Austria) were used for all statistical analyses. If there were missing values in the biomarkers, population averages of the biomarkers measured in the relevant age and gender group were used. Binary logistic regression analysis and chi-square statistical tests were used to analyze the influence of biological age on the mortality adjusted by gender and chronological age. For estimating 17 years risk of mortality affected by age increment of biological age over chronological age (i.e. AgeDiff), Cox proportional hazards regression models were constructed for total subjects and for subgroups, considering chronological age and gender. Age increments of biological age over chronological age were categorized into 3 subgroups (AgeDiff < 2, 2 ≤ AgeDiff < 5, and AgeDiff ≥ 5) to estimate the discrete effect of biological age on mortality. Kaplan-Meier analysis with log rank test was applied to analyze and compare the survival probabilities of subgroups defined by AgeDiff.

Results

Construction of PCA model

Two PCA models were constructed to compute biological ages of men and women, respectively, which consisted of four unrotated principal components with corresponding eigen values of ≥ 1.0. About 55% of total variance was explained by these four principal components for men, and 53% of total variance for women, respectively. More detailed results about the construction of PCA model is shown in the Additional file 1 (Appendix in Additional file 1).

Descriptive statistics of the biomarkers

From the total 557,940 subjects, 316,848 were men (56.8%) and 241,092 were women (43.2%). Average chronological ages of men and women at the baseline survey were 43.5 years and 43.6 years, respectively, and average death ages in men and women were 64.5 years and 65.4 years, respectively (Table 1). The baseline chronological ages were significantly different between alive and deceased subjects for both men and women (two sample t-test, p < 0.001). The weight of body muscle could not be obtained from the MSMS data, so population averages and standard deviations of BMP were estimated using a different data set (refer to Appendix in Additional file 1 for more detailed information). The proportions of the data which were used for the analysis excluding missing data were between 4.6% and 99.8%; 14.4% (WC), 99.8% (SBP), 92.4% (FEV1), 94.9% (G-GTP), 98.8% (BUN), 90.4%(HDL), 98.7% (TG), 99.7% (FBS), 90.7% (LDL), 21.2% (BFP), 0% (BMP), and 94.3% (AGR) for male data, and 16.5% (WC), 99.3% (DBP), 92.1% (FEV1), 93.5% (G-GTP), 98.6% (BUN), 88.6% (HDL), 98.5% (TG), 99.7% (FBS), 4.6% (ESR), 88.9% (LDL), 99.8% (BMI), 20.2% (BFP), 0% (BMP), and 93.0% (AGR) for female data. The overall distribution of biomarkers is described in the Additional file 1: Table S2. The numbers of deceased subjects whose cause of death by cancer and non-cancerous diseases were 7,250 and 5,856, respectively (Additional file 1: Table S3). Death by senility was regarded as death by non-cancerous disease, because too few cases were observed to be analyzed separately (127 cases in men and 80 cases in women, data not shown).

Table 1 Baseline characteristics of the study subjects

Distribution of age increment of biological age over chronological age

The biological ages of the subjects at the baseline survey and AgeDiff were computed as described in the Methods. The average biological age and chronological age were nearly the same in the alive subjects, however, biological age was larger than chronological age in the deceased subjects: in the deceased subjects, average AgeDiffs were 0.39 years both in men and women, which were significantly larger than the average AgeDiffs for the alive subjects, which were nearly zero (p value < 0.001, Table 2). AgeDiff was the largest in the subjects whose baseline chronological age was within 50–59 years (AgeDiff = 0.51 ± 1.93), and smallest in the people whose baseline chronological age was within 20–39 years (AgeDiff = 0.19 ± 2.14). In the oldest subgroup (baseline chronological age ≥ 60 years), AgeDiff was smaller than the other subgroups except 20–39 years subgroup (AgeDiff = 0.34 ± 1.66). However, the distribution of AgeDiff was different between men and women: in women, AgeDiff was smallest in the 40–49 years subgroup (AgeDiff = 0.31 ± 2.17). Meanwhile, when the cause of death was considered, AgeDiffs were computed to be larger in the subjects whose cause of death was non-cancerous disease, rather than cancer, regardless of chronological age subgroup or gender. When AgeDiff was categorized into 2 subgroups, 261,525 subjects were AgeDiff ≤ 0 (49.3%) and 268,495 were AgeDiff > 0 (50.7%) in alive subjects. In the deceased subjects, the proportions changed to 41.9% and 58.1%, respectively. When AgeDiff was categorized into 3 subgroups of AgeDiff < 2, 2 ≤ AgeDiff < 5, and AgeDiff ≥ 5, the proportions became 88.5%, 10.8%, and 0.7%, respectively in alive subjects. Meanwhile, in the deceased subjects, the proportions changed to 83.5%, 14.9%, and 1.6%, respectively: the ratios of deceased subjects relative to alive subjects significantly increased from 2.2% to 5.1% as AgeDiff became larger, whereas, ratios of alive subjects decreased from 97.8% to 94.9% as AgeDiff became larger (linear trend test, p value < 0.0001, Table 3). Similar patterns were observed when AgeDiff was categorized into 2 subgroups (AgeDiff ≤ 0 and AgeDiff > 0, data not shown). The trends of proportions separated by the cause of death and gender were presented in the Additional file 1: Figure S1.

Table 2 Average and standard deviation of AgeDiff according to gender and chronological age subgroup at the baseline survey
Table 3 Distribution of deceased or alive subjects according to gender and subgroups of AgeDiff

Influence of AgeDiff on mortality and survival analysis

Significant influence of AgeDiff on mortality was found by binary logistic regression analysis. In men, if biological age was computed to be larger than chronological age by 1 year (i.e. AgeDiff increased by 1), the mortality was increased by 17.3% (OR = 1.173, 95% CIs = 1.158-1.189, p value < 0.001, Additional file 1: Table S4). In women, the mortality was increased by 13.0% (OR = 1.130, 95% CIs = 1.110 - 1.150, p value < 0.001, Additional file 1: Table S4). The effect of AgeDiff on mortality was more evident in deceased subjects whose cause of death was non-cancerous disease, compared to cancer: OR = 1.263 for non-cancerous disease and OR = 1.083 for cancer.

Table 2 shows that averages of AgeDiff were different according to baseline chronological age subgroups. For this reason, Cox proportional hazards regression models were fitted separately for each baseline chronological age subgroup using AgeDiff as the continuous covariate (Table 4, Additional file 1: Tables S5, and S6). The hazard ratio in the people whose baseline chronological age was within 50–59 years was increased by 17% in per unit time as AgeDiff increased by 1 year, regardless of the event type: the hazard ratio was larger when death occurred by non-cancerous disease, compared to cancer (HR = 1.30 vs. HR = 1.10, adjusted by baseline chronological age and gender). The hazard ratio in the oldest chronological age subgroup (baseline chronological age ≥ 60 years) was slightly less than in the people whose baseline chronological age was within 50–59 years (HR = 1.14 vs. HR = 1.17). Except in the 20–39 years subgroup and event of death by cancer, all hazard ratios were statistically significant (p values < 0.001, Table 4). Figure 1 shows the distribution of hazard ratios for each baseline chronological age subgroup separated by gender and cause of death (means and 95% confidence intervals). In men, hazard ratios gradually increased as baseline chronological age increased from 20 years to 59 years, and decreased slightly in the oldest subjects (baseline chronological age ≥ 60). This increasing tendency was more distinct when the cause of death was non-cancerous disease, rather than cancer. Hazard ratio was largest in the subjects whose baseline chronological age was within 50–59 years (HR = 1.34, 95% CIs = 1.29 - 1.39), and smallest in the 20–39 years subgroup (HR = 1.10, 95% CIs = 1.02 - 1.18) (Additional file 1: Table S5). In women, however, no distinct pattern of increase or decrease in hazard ratio was observed across all baseline chronological age subgroups (Additional file 1: Table S6).

Table 4 Hazard ratios for total subjects (men and women) according to baseline chronological age subgroups and cause of death
Fig. 1
figure 1

Distribution of hazard ratios according to subgroups of baseline chronological age, gender, and cause of death. Means and 95% confidence intervals are plotted. a Distribution of hazard ratios when the cause of death was cancer. b Distribution of hazard ratios when the cause of death was non-cancerous disease. c Distribution of hazard ratios when the cause of death was cancer and non-cancerous disease

When AgeDiff was categorized into 3 subgroups, such as AgeDiff < 2, 2 ≤ AgeDiff < 5, and AgeDiff ≥ 5, hazard ratios became larger as AgeDiff increased across all the chronological age subgroups and gender. For example, the estimated hazard ratio in the subgroup 2 ≤ AgeDiff < 5 was increased by 77% compared to the base subgroup (AgeDiff < 2) in men whose baseline chronological age was within 50–59 years (HR = 1.77, 95% CIs = 1.60 - 1.96, Additional file 1: Table S7). The largest hazard ratio was observed in men in the subgroup AgeDiff ≥ 5, whose baseline chronological age was within 50–59 years (HR = 4.79, 95% CIs = 3.66 - 6.27, Additional file 1: Table S7). Similar results were found even if the cause of death was separated by cancer or non-cancerous disease; however, the increment pattern of hazard ratios was more obvious when the cause of death was non-cancerous disease, rather than cancer (data not shown).

Kaplan-Meier analysis was performed, and survival curves were plotted to compare survival probabilities of 3 subgroups defined by AgeDiff by considering gender, event type, or baseline chronological age subgroup (Fig. 2, Additional file 1: Figures S2, S3, S4, and S5). The survival probabilities were significantly different among the 3 subgroups: in the base subgroup (AgeDiff < 2), 95.2% men survived over 15 years, whereas 92.4% and 86.5% men in the subgroup of 2 ≤ AgeDiff < 5 and AgeDiff ≥ 5 survived over 15 years, respectively (log rank test, p value < 0.001, Fig. 2). In women, the difference of survival probabilities was smaller than men, but still significant (log rank test, p value < 0.001, Fig. 2). The difference of survival probabilities among the 3 subgroups of AgeDiff was more obvious when subjects were within 50–59 years chronological age subgroup: in the base subgroup (AgeDiff < 2), 93.4% men survived over 15 years, whereas only 69.7% men in the subgroup of AgeDiff ≥ 5 survived. In women whose baseline chronological age was within 50–59 years, survival probabilities at 15 years dropped from 97.1% to 88.2% (log rank test, p value < 0.001, Fig. 2). Even if the cause of death was separated, survival probabilities among the 3 subgroups of AgeDiff remained significantly different, except in the case that women and the cause of death was cancer (log rank test, p value = 0.234, Additional file 1: Figure S2). The difference of survival probabilities was obvious across all baseline chronological age subgroups (20–39, 40–49, 50–59, and ≥ 60 years) and gender when the event was death by non-cancerous disease (log rank test, p value < 0.002, Additional file 1: Figure S5).

Fig. 2
figure 2

Kaplan-Meier survival plots when the cause of death included both cancer and non-cancerous disease. Blue (1), red (2), and green (3) curves are for the subjects in the AgeDiff < 2, 2 ≤ AgeDiff < 5, and AgeDiff ≥ 5 subgroups, respectively: AgeDiff = biological age - chronological age; STIME = survival time (years); Log rank test p value < 0.001 for all cases; a survival plots for men; b survival plots for women; c survival plots for men and women; d survival plots for men whose baseline chronological age was within 50–59 years; e survival plots for women whose baseline chronological age was within 50–59 years; f survival plots for men and women whose baseline chronological age was within 50–59 years

Discussion

In the current study, we suggested that biological age could be used as a useful index to predict a person’s risk of death in the future. 15 variables were selected as biomarkers to compute biological age, as was described in the Methods section. Subjects’ biological ages were computed, using a PCA-based statistical algorithm. For the study purpose, cohort data for a total of 557,940 Koreans of 20–93 years old from MSMS was used (maximum follow-up time was 17 years). Some similar studies have been conducted [17, 2023, 28], in which several hundreds to near ten thousand subjects have been used to compute biological ages by applying multiple linear regression, principal component analysis, or the algorithm suggested by Klemera and Doubal [15]. Then, using the same data, Cox proportional hazards regression models were constructed to estimate the effect of biological age on mortality in the future. The data size and follow-up time used in our study surpasses these previous studies. In addition, of particular importance, we have successfully validated biological age as a useful index for mortality in the future using an independently generated data set, which is different from the data set used to construct a model for computing biological age.

The finding AgeDiff was significantly greater in the deceased subjects implies that biological age might be functioning as a latent marker for mortality. Meanwhile, the increasing pattern of AgeDiff slightly declined in the oldest subgroup (baseline chronological age ≥ 60 years), which implies that the influence of biological age on the mortality diminished as people became over 60 years old. Otherwise, the biomarkers used to compute biological age did not have sufficient impact on the mortality in the oldest subjects analyzed. In the current study, the 15 biomarkers used to compute biological age have 2 types of relationship with chronological age: 11 biomarkers, including WC, SBP, DBP, G-GTP, BUN, LDL, TG, FBS, ESR, BMI, and BFP, showed a positive correlation with chronological age, and 4 biomarkers, such as FEV1, HDL, BMP, and AGR, showed a negative correlation with chronological age. A positive correlation means that a larger value is observed in the biomarker as age increases. A negative correlation means the converse. As the correlation was an overall estimation over entire chronological age ranges, age-specific correlation within some specified age range, such as chronological age ≥ 60 could be different.

In the MSMS data, 4 biomarkers, G-GTP, HDL, TG, and FBS, showed an opposite trend in deceased men: smaller values of G-GTP, TG, and FBS, and larger value of HDL were observed in the deceased subjects whose baseline chronological age ≥ 60 compared to the people of 50–59 years subgroup, which might act as diminishing factors for computing biological age. In women, smaller values of G-GTP, ESR, and BMI were observed in deceased people whose baseline chronological age ≥ 60 compared to the people of 50–59 years subgroup (data not shown). As such, it is conjectured that if different sets of biomarkers were used to compute biological age, monotonically increasing pattern of AgeDiff could be possible as the baseline chronological age becomes larger; in any case, the influence of biological age on mortality is clear in the elderly.

In addition to these results, another interesting finding is that AgeDiff was computed to be larger when the cause of death was from non-cancerous disease, compared to cancer (p values < 0.001, data not shown), which could be explained by the type of biomarkers used to compute biological age in this study. The majority of biomarkers were associated with metabolic syndrome, WC, SBP, DBP, HDL, TG, and FBS, or physical strength, FEV1, BFP, and BMP, but not, or little, related to cancer. Cancer-related biomarkers were only G-GTP and AGR [29, 30]. Meanwhile, the pattern of average AgeDiffs separated by baseline chronological age subgroup and/or cause of death was different between men and women. One possible explanation is that there may exist a somewhat different mechanism of senescence in men and women to influence mortality; otherwise, a biased result might be induced due to the small number of deceased female subjects, compared to men. In fact, the variances of AgeDiff in the deceased women were larger than the deceased men.

In the clinical research, chronological ages were usually categorized into several subgroups: for instance, less than 30 years, 30–60 years, and more than 60 years. In a similar way, biological age could be categorized into several subgroups; however, instead of categorizing biological age directly, the age increment, AgeDiff, was categorized into 2 or 3 subgroups, because biological age is not ‘real’, only computed as a value proxy for individual aging. As such, AgeDiff could be used to assess the absolute effect of biological age on mortality, adjusted by baseline chronological age. The ratios of deceased subjects relative to alive subjects significantly increased as AgeDiff became larger, which implies that people whose biological age was computed to be larger than their baseline chronological age were at a higher risk of mortality. The increased pattern was more apparent when the cause of death was non-cancerous disease (data not shown).

For different hazard ratios according to chronological age subgroups, gender, and cause of death, we could make some conjectures: 1) the biomarkers used to compute biological age were appropriate to assess mortality for middle to early old-aged subjects, but not for relatively younger (20–39 years) subjects, 2) there is an intrinsically different influencing mechanism of biological age on mortality between men and women, 3) current biomarkers have a relatively stronger relationship with non-cancerous disease, compared to cancer. For verification of these conjectures, a different set of biomarkers should be identified and analyzed for a mortality study jointly considering gender, chronological age, and/or cause of death.

Additional file 1: Figures S3, S4, and S5 show different decreasing patterns of survival probabilities of subjects in the different subgroups of AgeDiff, separated by baseline chronological age subgroups, gender, and cause of death. When the event was defined as death by cancer, no conspicuous difference of survival probabilities were shown in some of the baseline chronological age subgroups of men or women (p value > 0.2, Additional file 1: Figure S4). On the other hand, significant decreases in survival probability were observed for subgroup AgeDiff ≥ 5 when the cause of death was non-cancerous disease (p value < 0.002, Additional file 1: Figure S5). For instance, survival probability at 15 years was 61.7% in men whose baseline chronological age ≥ 60 when their biological age were computed as 5 years more than their chronological age. However, the survival probabilities of the subjects in AgeDiff < 2 and 2 ≤ AgeDiff < 5 subgroups were 87.6% and 76.7%, respectively, significantly larger than 61.7% (log rank test, p value < 0.001). These results imply that biological age could influence mortality at discrete levels, as well as in a continuous way.

In spite of the large study population and successful validation of biological age on the mortality in this study, there are limitations in this study and some improvements remain necessary. The proportions of total variance explained by the PCA models were about 50%, which is low compared to the usual case. In general, the proportion goes beyond 67% of total variance. For instance, Cho et al. reported that 66.9% of total variance had been explained by three components of which eigen values greater than 1.0 [16]. Meanwhile, Bai et al. reported that four components with eigen values ≥ 1.0 had explained 55.6% of total variance in a study based on healthy people in China [31]. In the cases when the first principal component was used alone, the proportion varied from 20.4% to 57.6% [7, 9, 10, 14, 19]. Another consideration is that different biomarkers should be identified and analyzed to assess their influence on mortality. The biomarkers used in our study were selected mainly based on the statistical correlation with the baseline chronological age. As such, it is possible that different biomarkers may explain questions that remain following our study. Finally, different algorithms should be additionally considered to compute biological age as some researchers insisted that the algorithm developed by Klemera and Doubal [15] was superior to the other algorithms in estimating biological age used here [16, 22, 28]. Despite these shortcomings, the current study remains valid as we provide strong evidence that biological age influenced mortality and could be used as a useful index to predict future risk of death by analyzing the effect of biological age on the mortality in a manifold of ways.

Conclusions

In the current study, we suggest that biological age could be used as a useful index to predict future risk of death using a very large cohort data. Larger hazard ratios were observed in middle to old-aged people compared to younger people, which implies that biological age might be more strongly functioning as a latent marker for mortality in the senior age group. Meanwhile, larger hazard ratios were more obvious in men and when the cause of death was non-cancerous disease. However, considering the limitations and possible bias in this study, further research should be conducted in order to confirm the validity of biological age as a predictive marker for mortality.