Introduction

Estimation of skeletal age-at-death is fundamental in forensic anthropology as it is among the key aspects used in the construction of the biological profile of the individual under study. Several methods focused on the analysis of skeletal and dental structures have been proposed for biological age-at-death estimation, that is, age estimation from skeletal remains [1]. For juveniles, such methods are overall highly accurate because they are based on developmental processes [2]. In contrast, for adult individuals, age-at-death estimation is based mainly on skeletal degenerative changes, which may be affected by the individual’s activity levels, diseases, dietary patterns, and other factors. Therefore, they exhibit inter-individual and inter-population variation at their timing and rate of expression [3,4,5,6,7,8]. This issue is exacerbated in older individuals since during an individual’s lifetime, biological age tends to increasingly diverge from chronological age, that is, the amount of time that has elapsed since an individual’s birth [9].

An additional important bias is the so-called age mimicry, that is, when using regression models to predict chronological age, the age estimates, to some degree, reflect the age structure of the reference sample [10]. Attempts to overcome age mimicry have employed Transition Analysis and Bayesian statistics, which calculate the probability of an individual having made the transition from one skeletal change stage to the next at each age [11]. However, Bayesian analysis requires prior knowledge of the distribution of ages-at-death in the population, which is often not available. At the same time, several studies have found that Bayesian statistics and Transition Analysis do not improve age predictions compared to traditional methods [12, 13].

An alternative approach has been the combined use of multiple age-related traits across the skeleton [14], an approach also supported by the scholars working on the revised version of Transition Analysis [11]. In this direction, a recent paper proposed the study of several age markers across the human skeleton (up to 64 skeletal traits from seven anatomical regions), using novel recording criteria and reaching age-at-death predictions via machine learning based on deep randomized neural networks [14]. To facilitate the implementation of this method, the authors developed an open access software, DRNNAGE. According to the team who developed this initiative, the application of this method on a Portuguese assemblage resulted in a mean absolute error of the age estimation of ~6 years across the entire adult age span. These results are impressive and, if confirmed in other test assemblages, they are very promising in forensic anthropology.

Indeed, any forensic anthropological method needs to be validated by testing its accuracy on diverse assemblages. Therefore, this work aims to test the accuracy of the DRNNAGE software in age-at-death estimation using a documented modern Greek sample.

Materials and methods

The present study uses 219 adult individuals (121 males and 98 females) from the Athens Collection, housed in the Department of Animal and Human Physiology of the National and Kapodistrian University of Athens. Information on the sex, age-at-death, occupation, cause of death, and place of birth of each individual in the collection is derived from death records [13, 15]. The year of birth for all individuals ranges between 1879 and 1965, while their age-at-death ranges from 19 to 99 years old. The age distribution of the assemblage per sex is presented in Fig. 1, while Table 1 shows the age and sex distribution of the skeletons in relation to the different anatomical regions recorded. The individuals were selected so as to maximize sample sizes per sex and age group, while maintaining a balanced representation of all demographics. Nonetheless, as is common in modern documented collections, there is an over-representation of older (> 50 years) individuals. To avoid biases in the results, analyses were performed for the pooled age and sex group sample but also separately for males-females and per age group (younger or older than 50 years). Furthermore, for better evaluation of the DRNNAGE performance, analyses were performed in a pooled sex sample on a decade-based segmentation. Any individuals with evidence of pathological deformation that could have affected the recording of the age-related changes required for DRNNAGE were excluded from the assemblage.

Fig. 1
figure 1

Age-at-death distribution of the sample for the male and female subgroups. The mean age-at-death for each group is represented with a dashed and dotted line for males and females accordingly

Table 1 Sample distribution per anatomical region for each examined scenario

The DRNNAGE software can be accessed at the webpage https://osteomics.com/DRNNAGE/. It utilizes up to 64 skeletal traits (Table S1), which encode both developmental and degenerative aspects from seven anatomical regions: cranial sutures, vertebrae, upper and lower limbs, pubic symphysis, sacroiliac joint, acetabulum, clavicle, and first rib. These seven regions may be used individually for age estimation as well as in combination. The DRNNAGE software can handle missing values in the input dataset, which is particularly important given the damaged and partial preservation of many skeletons due to taphonomic factors. The researcher can choose one of the four available network algorithms (randomized network, ensembled randomized network, ensemble autoencoder—U, and ensemble autoencoder—S) to estimate the age-at-death of an unknown specimen.

The classification performance of DRNNAGE for each anatomical region individually and the combination of all was evaluated in this paper utilizing all four available network algorithms on each demographic group of the Athens Collection separately and on the pooled sample. All tests were ran using the software’s default settings regarding the parameters of the neural networks. Footnote 1 In the present article, the tables present the results of the ensembled randomized network (ERN) for age-at-death estimation. The results of the other three network algorithms are provided in the Supplementary Material.

All skeletal traits were scored by the first author (RE), and input to the DRNNAGE software for age estimation. Individuals were classified as correctly aged when their documented age-at-death lay in the age range estimated by the DRNNAGE software. Furthermore, the reliability of the estimates was tested by measures of bias and inaccuracy, where bias expresses the mean over- or under-prediction of an individual’s chronological age and is estimated as Σ(estimated age − actual age)/n, and inaccuracy expresses the average absolute error of age estimation and is estimated as Σ|estimated age − actual age|/n. In these estimates, the actual age was the documented age-at-death of each individual and the estimated age was obtained from DRNNAGE.

To assess intra- and inter-observer error, all traits were recorded on 15 randomly selected individuals from the Greek population twice by the first author (RE) and twice by Weronika Flis, from Jagiellonian University, Poland. Intra- and inter-observer error was evaluated through Kendall’s W concordance coefficient, which ranges from 0 (no agreement) to 1 (perfect agreement) [16]. The intra- and inter-observer error analysis was performed using R Statistical Software [17] via the “irr” package [18].

Results

Regarding the intra-observer error, the average Kendall’s W concordance coefficient was similar between the two observers. More specifically, for the first observer (RE), the average was 0.717, and for the second observer (WF), it was equal to 0.748. The traits with the lowest agreement (< 0.5) included “L4 Inferior Surface” (L4IS), “Humerus capitulum and trochlea” (HM04), “Ulna olecranon” (UL02), and “Radius head” (RD01). In contrast, those with the highest agreement (> 0.9) were 15, with some notable examples being the “Lamboid-pars asterica” (CRS06), “C3 Inferior Surface” (C3IS), “Ulna prox. articular facet” (UL01), “Femur lesser trochanter” (FM04), “Tibia condyles” (TB01), “1st rib’s Costal face” (RB101), and “Patella base” (PT02). With reference to the inter-observer error, the average Kendall’s W concordance coefficient was 0.615. The traits with the lowest agreement (< 0.5) included “Palatine-Posterior, Median” CRS01, “L2 body inferior surface and margin” L2IS, “Radius head” RD01, “Os coxa iliac tuberosity” OC01, “Os coxa ischial tuberosity” OC02, “Calcaneus tuberosity” CLN01, “Symphyseal topography” PSY02, “Symphyseal texture” PSY03, and “Sacroiliac texture” SAS01. The highest agreement (> 0.8) was achieved for eight traits, the “Coronal-Sagittal-pars bregmatica” (CRS03), “Coronal-pars pterica” (CRS04), “Lambdoid-pars asterica” (CRS06), “C6 body superior surface and margin” (C6SS), “S1 Superior Surface” (S1SS), “S1-S2 fusion” (S1S2F), “Proximal femur (trochanteric fossa)” (FM02), and “Patella (base)” (PT02). The average total Kendall’s W, calculated by all four observations, was 0.498. The results for intra- and inter-observed error are given in detail in Table S2 in Supplementary Material.

The results of the DRNNAGE software validity when utilizing the ensembled randomized network (ERN) option for age-at-death estimation are presented in Table 2. When combining all anatomical regions, females were more frequently correctly aged than males (52% vs. 41.3%). Regarding the anatomical regions individually, for the pooled sample, the cranial sutures exhibited the highest validity (82.9%), followed by the clavicle and 1st rib (79.9%), the acetabulum (79.3%), and the pubic symphysis (79.2%). In contrast, the vertebrae showed the lowest validity (42.6%). The above pattern also characterized males and females when each sex was examined separately, as well as the two broad age groups (below and over 50 years) when studied independently. It is striking that individuals over 50 years old were classified in their correct age group with much higher frequency (64.4–94.9%) compared to individuals under 50 years old (6.2–67.9%) in the majority of anatomical regions. Regarding the vertebrae and lower limb, although individuals over 60 years old expressed higher validity compared to those younger than 49 years old, the age group 50–59 years old showed extremely low validity. Furthermore, the upper limb appeared to under-perform as an age predictor in the ages between 40 and 59 years old, while expressing higher validity in the other age groups. Finally, the cranial sutures performed much better in the younger age groups, presenting a downward trend with increasing age.

Table 2 DRNNAGE validity when utilizing the ensembled randomized network (ERN) for age-at-death estimation

Regarding the separate anatomical regions, the greatest bias was found in the vertebrae, lower limb, pubic symphysis, and sacroiliac joint (Table 3). The lowest bias was seen in the clavicle and first rib while for the remaining anatomical regions, bias depended largely on the demographics under study. When using the vertebrae, upper limbs, lower limbs, pubic symphysis, and sacroiliac joint, DRNNAGE always overaged individuals, irrespective of their sex and age group. This is interesting because most existing methods tend to overage younger adults and underage older ones. As noted, except for the individuals under 50 years old, the method under-predicted the age-at-death of females more than males when analyzing the cranial sutures. The same applies to individuals over 50 years old when analyzing the acetabulum. Similarly, the method under-predicted the age-at-death of females over 50 years old regarding the clavicle/1st rib. In all other cases, an over-prediction of an individual’s chronological age was observed. The vertebrae exhibited the highest over-prediction of an individual’s chronological age, while the clavicle/1st rib and the acetabulum showed the lowest, followed by the pubic symphysis. Furthermore, except for the pubic symphysis, where the opposite applied, the bias was higher for males and individuals under 50 years old compared to females and individuals over 50 years old. However, for the acetabulum, the differences between the sexes were small. In addition, regarding the overaging of younger adults and the underaging of older ones, for the acetabulum there was a strong overaging of individuals younger than 50 years and a slight underaging of those older than 50 years, in the clavicle and 1st rib this pattern was only seen among females, while in cranial sutures the overaging of young adults was slight but the underaging of older ones very pronounced. Finally, when using a decade-based segmentation, the results showed an underestimation of age-at-death for individuals over 70 years old in most anatomical regions. It is worth mentioning that the cranial sutures expressed an upward trend in underestimation of age-at-death for individuals over 40 years old.

Table 3 Bias (measured in years) of the DRNNAGE software utilizing the ensembled randomized network (ERN) to predict age-at-death

The results of inaccuracy (Table 4) broadly agree with those for bias. The vertebrae, lower limbs, pubic symphysis, and sacroiliac joint, but also now the cranial sutures showed the highest inaccuracy rates. In contrast, the clavicle and first rib, followed by the acetabulum and upper limb had the lowest inaccuracy scores. The levels of inaccuracy were largely comparable between males and females. However, inaccuracy was generally higher for individuals younger than 50 years old compared to those older than 50, in agreement with the results of Table 2. The only exception to this pattern was cranial sutures, where the opposite was observed, but also the clavicle and first rib where the difference between age groups was very small.

Table 4 Inaccuracy (measured in years) of the DRNNAGE software utilizing the ensembled randomized network (ERN) to predict age-at-death

As mentioned above, the DRNNAGE software has four available network algorithms to create regression models for age-at-death estimation: randomized network algorithm, ensembled randomized network, and two different ensemble autoencoder networks. The comparison of the available network algorithms showed no major differences in the DRNNAGE validity; instead, all networks gave the same broad patterns (Tables S3-S11). However, the bias and inaccuracy when utilizing the ensembled autoencoder (S) network were higher in most anatomical regions and sample groups, a pattern particularly visible for the cranial sutures. In addition, the bias for cranial sutures when using this network was always negative, irrespective of the age group under study.

Discussion

Any forensic anthropological method must be tested for repeatability, reproducibility, and accuracy before its use is generalized and it becomes admissible to legal contexts [19, 20]. According to the DRNNAGE software developers [14], all skeletal traits employed in their method presented a very high (average value of 0.907) and statistically significant concordance coefficient regarding intra-observer error, with only exception the “Radius head” (RD01) and “Femur head” (FM01). In contrast, the average intra-observer error concordance coefficient in our study was 0.717 for the first observer and 0.748 for the second. There was great variability in the coefficient’s values among different traits; thus, certain traits should be favored and others should be used cautiously or be avoided altogether.

Inter-observer error, as expected, showed an even smaller concordance coefficient (average value 0.615). Once again, some traits showed much higher reproducibility than others and should be thus preferred, while others should be avoid. At this point, we must stress that DRNNAGE involved many traits that are binary-coded. Therefore, we would have expected better reproducibility and repeatability results since methods using a narrower scale of categories produce greater agreement among researchers [20].

The use of different network algorithms to predict age-at-death had a minimal impact on the results, though in our sample, the ensemble autoencoder S network showed higher bias and inaccuracy values, so any of the remaining three options should be preferred.

The validity of DRNNAGE was overall average in the modern Greek assemblage, both for males and females. Very interestingly, this software showed a high validity for individuals older than 50 years old, which is often a problematic category when using “traditional” skeletal age-at-death estimation methods, whereas the results were very poor for those younger than 50. Surprisingly, the cranial sutures exhibited the highest validity, even for younger adults. Among the remaining anatomical areas, those with high validity included the clavicle and 1st rib, the acetabulum, and the pubic symphysis, while the lowest validity was achieved by the vertebrae. Similarly, the greatest bias and inaccuracy were found in the vertebrae, and the lowest in the clavicle and first rib. The above observations are corroborated by the analyses performed on the decade based segmented sample. However, it is important to note that the sample size for the group of individuals older than 90 years old is very small and the respective results should be interpreted with caution.

These results are partly in agreement with another validation study performed on the Athens Collection, employing traditional age-at-death estimation methods focused on the public symphysis, iliac auricular surface, and cranial sutures [13]. In specific, Xanthopoulou and colleagues found that the iliac auricular surface when recorded using the Lovejoy et al. [21] method works satisfactorily for all age groups; cranial sutures and the pubic symphysis were found to perform satisfactorily for individuals younger than 50 years old but poorly for older ones, while the iliac auricular surface recorded using the Buckberry and Chamberlain [22] method gave the most accurate results for individuals older than 50 years. The differences in the performance of the DRNNAGE software compared to these traditional methods, even when focusing on the same anatomical areas, must be attributed to the different ways in which skeletal changes are recorded in each method but also to their different statistical treatment for age prediction.

The vertebrae, and more specifically the fusion of the superior and inferior epiphyses, have been established and popularized over the years as a viable method of skeletal age estimation in teenagers and young adults [23,24,25]. As a rather recent example, Albert et al. [26] achieved over 78% classification accuracy when studying 57 individuals aged 14–27 years. For older adults, several studies have shown that osteophyte formation could be useful for estimating the age-at-death [27,28,29]. Very recently, Sluis and colleagues [30] tested three methods based on osteophyte formation on 88 individuals from the Middenbeemster cemetery and achieved over 72.73% classification accuracy. The DRNNAGE scoring system for vertebrae covers the whole spectrum from the fusion of the epiphyseal ring to the formation of lipping. However, all these major changes from ring fusion to lipping are covered within merely three stages, which do not express sufficiently different degrees of lipping that are anticipated in middle aged and older adults Therefore, the low validity and high bias and inaccuracy observed in our study may be due to the skewed age-at-death distribution towards older people in the Athens Collection.

The cranial suture closure pattern has been studied as a potential age-at-death predictor for nearly a century [31]. Several studies since then have proposed variants of different recording schemes for age prediction based on different sutures and suture combinations [32,33,34]. In parallel, numerous validation studies have stressed the poor performance of this method (e.g. [35]). The high validity in age estimation achieved in the Athens Collection for individuals younger than 69 years old via DRNNAGE is thus surprising; however, it also aligns with a recent review that stressed the potential of this anatomical area as a useful indicator for age estimation [36].

With regard to the other anatomical areas that showed high accuracies, Kunos et al. [37] were the first to use the first rib for age-at-death estimation because it is easily identifiable, not influenced by mechanical stress in the same manner as the lower ribs and exhibits a prolonged span of remodeling into the eighth decade. In a recent study on 260 skeletons from the Raymond A. Dart Collection of Human Skeletons, Jooste and Steyn [38] concluded that the first rib can be used to make age-at-death predictions but should ideally be used in combination with other skeletal traits. The DRNNAGE software combines the first rib with the clavicle, which has the potential to aid age estimates beyond the traditional “mature adult” age category (> 46 years) [39], while its usefulness in providing precise age estimations between the ages of 16 and 30 years has been identified by several studies [2, 40,41,42,43]. Therefore, our results from the Athens Collection, showing high age-group classification rates for the clavicle and first rib are not surprising.

The acetabulum and pubic symphysis were the remaining two anatomical areas that gave overall high validity values in the Athens Collection. The adult human pelvis has been among the most useful areas for age-at-death estimation and contains different anatomical structures that have been used for this purpose: pubic symphysis, auricular surface, and acetabulum. Bony degenerative changes in these regions have been shown to correlate with age [44,45,46]. Although several studies have demonstrated that relevant methods can most commonly support age estimates between the late teens and 50–60 years, where the observations of the progressive degenerative changes reach their peak breakdown and plateau [21, 22, 47, 48], in the present study the DRNNAGE software performs better for individuals over 49 years old. In what concerns the epiphyseal union at the upper and lower limbs, this has been established as a viable method of skeletal age estimation in teenagers and young adults [42]. For more mature adults, the most commonly used age estimation methods based on the upper and lower limbs focus on degenerative changes on the articular surfaces [21, 47,48,49]. This latter approach is also the one followed at DRNNAGE. The upper and lower limbs showed a generally moderate validity in skeletal age estimation for individuals aged from 19 to 59 years old at the Athens Collection, with the exception of the upper limb performance for the age group 19–29 years old. This may be linked to the fact that articular changes (usually osteophytes and porosity) are strongly associated with mechanical stress linked to daily occupations but also body weight and other factors, besides age [50].

Finally, for a more direct comparison with the performance of the DRNNAGE software as reported by its developers [14], the variable combinations described in their work were also tested. The results of these analyses are provided in Supplementary Material (Table S12 and Figure S1). According to the comparison, the anatomical regions of the sutures, the 1st rib, and the pubic symphysis showed similar validity values in both studies, while major differences in validity were observed for the variable combinations regarding axial, appendicular, sacroiliac, and standard traits. Specifically, the DRNNAGE models severely underperformed in the Athens Collection. Similarly, the anatomical regions of the clavicle and the acetabulum underperformed in the Athens Collection sample; however, the observed differences in validity values were moderate.

Although it has been often supported in the literature that using a multivariate approach for skeletal age estimation is more proper than using any single method alone [51, 52], in the present study the lowest classification rate was obtained when combining all available anatomical regions. It is well known that the aging process is controlled by various internal and external factors [6, 7], which affect different anatomical areas differently and this can become a source of bias in age estimation. Furthermore, the validity of skeletal age estimation can be affected by within and between individuals and populations variation in the rate of senescence [53]. The DRNNAGE software was trained utilizing skeletal collections hosted at the University of Coimbra (CISC, XXI-ISC) which are composed of individuals of Portuguese ancestry. Furthermore, the age-at-death distribution of the reference sample used by the DRNNAGE developers was homogeneous across the represented age-at-death span, whereas the proportion of older individuals is higher in the Athens Collection. Therefore, the validity loss observed could be attributed to either population specificity or the different age-at-death distributions of the utilized samples.

In conclusion, the DRNNAGE software produced partly accurate age-at-death predictions in a modern Greek assemblage. This method was particularly successful for males and females older than 50 years, but it performed poorly for those younger than this threshold, with only exception the use of cranial suture closure. Moreover, different anatomical areas showed very different repeatability, reproducibility, and validity. Further evaluation studies in different assemblages are necessary in order to test the performance of this software more broadly.