Introduction

Chronological age is, together with biological sex and ethnicity, the most important human feature to be considered in anthropological and forensic studies [1]. Besides, chronological age estimation is used daily in legal procedures where the birthdate of the involved subjects can not be verified due to either the absence of birth certification or the suspicion of false documentation. This applies to migration controls or trials involving undocumented people since the attainment of legal age has many implications according to the laws of most countries. It is also important in the adoption processes of undocumented children. In all these cases, an expert performs a somatic maturity examination.

The development status of bones has been used successfully to estimate chronological age. In this regard, many skeletal parts have been used, such as pubic symphysis, auricular surface, or sternal ribs [2]. Also, it is worth noting that there is not a single method based on bone development that outperforms others systematically, as the performance of each one depends on numerous factors. For instance, there are specific age estimation methods developed for subadults and others that work better in adults [3].

One of the most widely used body part in the field of age estimation is the teeth, mainly because dental mineralisation has been reported to be less affected by external factors (e.g. genetics or environment) than bone mineralisation [4]. In this regard, dental imaging techniques represented a step forward because they allowed clinicians to assess bone development with less invasive and faster procedures, and thus enabled them to perform chronological age estimation.

The estimation of age from dental radiographic records is based on the evaluation of some characteristics such as the formation of jaw bones; the appearance of tooth germs, the degree of crown completion and its eruption, the degree of resorption of deciduous teeth; the measurement of open apices in teeth; the volume of the pulp chamber and root canals; the formation of physiological secondary dentin; the toot-to-pulp ratio; or the development and topography of the third molar [5].

It is worth noting that the panoramic X-rays (ortopantomographies or OPGs) provide the least invasive radiologic technique to estimate age, as it only requires a single image to capture the whole dentition. Besides, other bone structures can be seen, such as the mandible, the nasal fossa, or the vertebrae, which are also helpful for further examinations. In the following, a review of the main methods to estimate the age of dental radiographs has been carried out.

Material and methods

For the review purpose, a conducting protocol approved by an expert reviewer and compliant with the PRISMA guidelines for systematic reviews [6] has been established. Scopus and PubMed databases have been used to retrieve a collection of full-text studies on age estimation from dental radiographies published in the last 6 years (from 2016 to 2022). This specific period was chosen for two main reasons. First, the number of published studies is sufficiently high to report significant conclusions. Second, automatic methodologies in the field of dental age estimation have been mainly used in this period, and not before, and therefore including earlier years would have diluted their relevance in this review. Then, a study selection process has been carried out, as seen in Fig. 1.

Fig. 1
figure 1

Flow diagram of the study selection process

The query used in Scopus was:

  • TITLE-ABS-KEY ((“age estimation” OR “age assessment” OR “age regression” OR “age determination” ) AND (dental OR tooth OR teeth OR mandib* OR incisor OR canine OR premolar OR molar ) AND (x-ray OR radiolog* OR radiograph* OR opg OR orthopantomograph* OR panoramic OR ct OR cbct OR mri ) ) AND (LIMIT-TO (DOCTYPE , “ar” ) OR LIMIT-TO (DOCTYPE , “re” ) OR LIMIT-TO (DOCTYPE , “cp” ) ) AND (LIMIT-TO (PUBYEAR , 2022 ) OR LIMIT-TO (PUBYEAR , 2021 ) OR LIMIT-TO (PUBYEAR , 2020 ) OR LIMIT-TO (PUBYEAR , 2019 ) OR LIMIT-TO (PUBYEAR , 2018 ) OR LIMIT-TO (PUBYEAR , 2017 ) OR LIMIT-TO (PUBYEAR , 2016 ) ) AND (LIMIT-TO (LANGUAGE , “English” ) ) AND (EXCLUDE (SUBJAREA , “BIOC” ) ) AND (EXCLUDE (SUBJAREA , “PHAR” ) OR EXCLUDE (SUBJAREA , “PHYS” ) OR EXCLUDE (SUBJAREA , “VETE” ) OR EXCLUDE (SUBJAREA , “ARTS” ) OR EXCLUDE (SUBJAREA , “EART” ) OR EXCLUDE (SUBJAREA , “MATE” ) OR EXCLUDE (SUBJAREA , “BUSI” ) OR EXCLUDE (SUBJAREA , “CENG” ) OR EXCLUDE (SUBJAREA , “CHEM” ) OR EXCLUDE (SUBJAREA , “ENER” ) OR EXCLUDE (SUBJAREA , “PSYC” ) )

The query used in PubMed was:

  • (age estimation [Title/Abstract] OR age assessment [Title/Abstract] OR age regression [Title/Abstract] OR age determination [Title/Abstract]) AND (dental [Title/Abstract] OR tooth [Title/Abstract] OR teeth [Title/Abstract] OR incisor [Title/Abstract] OR canine [Title/Abstract] OR premolar [Title/Abstract] OR molar [Title/Abstract] OR mandib* [Title/Abstract]) AND (x-ray [Title/Abstract] OR radiograph* [Title/Abstract] OR radiolog* [Title/Abstract] OR opg [Title/Abstract] OR orthopantomograph* [Title/Abstract] OR panoramic [Title/Abstract] OR CT [Title/Abstract] OR MRI [Title/Abstract] OR CBCT [Title/Abstract]) AND 2016[DP]: 2022[DP] AND full text [SB] AND english [LA]

As it can be seen, the query is not strictly the same, as Scopus allowed also for excluding certain unwanted subject areas, such as Veterinary or Arts. As a result, a set of 537 studies were collected from Scopus and 336 from Pubmed on February 24th, 2022, which in the end represented a body of 613 unique works. The abstract of each work was reviewed to discard unwanted studies, according to the following exclusion criteria: (1) studies not aimed at chronological age estimation in humans; (2) non-radiological studies; (3) studies that use non-human samples; (4) studies relying on a sample smaller than 50 subjects or studies that do not report the sample size; (5) studies whose full text is not available.

Regarding the collection of studies aimed at evaluating the age estimation methods proposed in the literature, only those reporting at least one of the following metrics were evaluated. In terms of numerical age estimation studies, a statistic on the residual error (dental age minus chronological age) and the absolute error—mean, median, or standard deviation—, the standard error of the estimates, and/or the coefficient of determination R2. Methods geared toward age classification were required to report the accuracy, sensitivity, and/or specificity of the classification results. Although dental development is less affected by genetic or environmental factors than other bones, this process is still subject to variations, and so the age estimation methods were usually assessed in different populations and/or ethnic groups all over the world.

To reduce as much as possible the risk of bias in this work when comparing the results obtained by different methods, the collected studies were analysed to detect evidence of malpractice. As a result, five studies were discarded due to the non-compliance with basic aspects such as good wording or a comprehensive description of the experiments, as this could also indicate a problem in the peer review process. It is worth noting that only the most flagrant cases were taken into account to minimise the bias that the observer could introduce in this evaluation process. In the end, a set of 286 studies was selected for further analysis.

Age estimation methods

Tooth-based manual methods

The studies retrieved in this work relied on a wide variety of age estimation methodologies. However, as dental formation is highly correlated with chronological age and, therefore, is a key indicator for age estimation, most methods are based on dentition analysis. In this regard, the first approaches were purely manual, that is, they required experts not only to retrieve the correspondent information from the X-ray image but also to translate this information into an age value. These approaches [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39] are shown in Fig. 2.

Fig. 2
figure 2

Main manual methods for estimating chronological age from dental X-ray images

Children and young adults

Age estimation via dentition analysis has reportedly led to better results when dealing with newborns to subjects aged 22 to 25, that is, during tooth development. This makes studies aimed at estimating the age of children and/or adolescents to be more numerous than those focusing on age estimation in adults. Regarding the former, some methods aim to assess specific development milestones (such as dental eruption) to predict age [7, 40]. Though, they have proven to lead to very limited estimates, as they rely on very quick changes, from which little information can be collected.

Other methods aimed to assess the development of the teeth over a longer period. That is the case for dental Atlases, which are graphic representations of dental development and eruption that provide an easy way to estimate chronological age by comparing the status of the dentition using radiological or osteological techniques to the charts provided in the Atlas [8,9,10,11]. Other authors went a step further and developed dental scoring systems (DSS), consisting of dividing the development period of each tooth into a set of developmental stages with associated scores and using those scores to estimate the numerical chronological age. In this regard, the number of stages varied depending on the specific system. For example, Gleiser and Hunt [18] proposed 15 stages, Nolla [12] developed a division into 11 stages, Demirjian et al. [13] reported eight alphabetical stages, and Liliequist and Lundberg [21] proposed the use of seven stages, in a clear attempt to reduce the complexity of the method. Furthermore, some authors developed population-specific scoring tables on top of the Demirjian et al. [14, 16, 17] and Gleiser and Hunt’s systems [19, 20], while others mixed several staging systems to improve the overall estimation performance [22].

Cameriere et al. [23] introduced a different method for estimating age, based on tooth measurements. Specifically, the authors measured the open apices of the seven left permanent mandibular teeth. These measurements, previously normalised by tooth height, were highly and negatively correlated with chronological age. Furthermore, the number of teeth with completely closed root apices was reportedly correlated with age. These findings led the authors to develop a regression formula that depends on the sex of the subject and the normalised measurements of the seven teeth and the number of teeth whose root development is completed.

Adults

Although the development of teeth ends once the third molar is completely developed, some authors focused on other age-related changes that are radiologically observable to estimate age in older subjects. In this regard, three different families of methods can be identified. On the one hand, some authors explored the use of specific measurements or ratios between them to perform age estimations. For example, Kvaal et al. [26] proposed to measure dentin apposition indirectly through the assessment of the dental pulp radiopacity. The researchers carried out several linear measurements of both the pulp and the tooth and associated those measurements via linear ratios. Cameriere et al. [30] proposed a similar idea, but they replaced the linear measurements with area assessments. Another similar example is the Tooth Coronal Index (TCI), studied by Ikeda et al. [27], which represents a height ratio between the crown and the pulp cavity at the crown level.

On the other hand, a set of studies focused on the visibility of some structures established staging systems with which that visibility could be assessed. The structures most studied in this regard were the periodontal ligament and the root pulp, with the staging systems proposed in this regard by Olze et al. [31, 33] standing out.

Finally, some authors reported that a series of degenerative changes can be assessed through a staging system and therefore be used to estimate chronological age. In this regard, Gustafson [35] set multiple evaluable criteria, namely secondary dentin formation, periodontal recession, attrition, apical translucency, cementum apposition, and external root resorption. The degenerative stages proposed in the original work, which were intended to be applied to extracted and ground teeth, proved to be applicable to radiographic images as well, as confirmed with the methodologies proposed by other authors, such as Olze et al. [36] and Timme et al. [37].

Non-numeric age estimation

Besides age estimation methods developed for obtaining a numeric and continuous output, other authors focused on designing classification methods to estimate the probability that a subject belongs to a specific age group. Most of these studies relied on conventional numerical age estimation methods and adapted them to be used as age group classifiers. This is the case, for example, of the study of Sehrawat and Singh [41], which used the Kvaal et al.’s method [26] to perform a classification into four groups.

However, the majority of these studies are focused on a binary classification with two groups of subjects younger and older than a given threshold, which can be the legal age of maturity or any other specific age with high relevance in legal procedures. In this regard, Mincer et al. [42] relied on the staging system proposed by Demirjian et al. [13] to assess the development of the third molar, with the objective of estimating the probability of being older than a certain age for each stage. De Luca et al. [43] used measurements of the open apices to establish a cut-off value of 0.08 over the normalised Cameriere measurement of the third molar, above which the individual is considered older than the legal age. Other numeric age estimation methods were evaluated in age classification problems, such as Nolla’s [44] and Kohler et al.’s [45].

Age estimation on other radiologically observable structures

Although most age estimation methods from dental radiologic records are based on the analysis of the teeth, there are other structures whose characteristics may also be useful for age estimation. In the period covered by this systematic review (from 2016 to 2022), the number of works is very limited and all of them rely on mandibular measurements. Some examples are the approach followed by Motawei et al. [38], who established a relationship between the length of the ramus and chronological age, or the proposal by Acharya [39], in which the gonial angle was used as the main age indicator.

Automatic methods

Recent advances in image processing have allowed for automating dental age estimation methods to a greater degree and have led to the development of numerous methodologies. In this regard, the authors explored the same objectives as those covered in the traditional methods, as can be seen in Table 1: numeric age estimation [46,47,48,49,50,51,52,53,54,55] and age group classification [56,57,58,59,60], with the particular case of age thresholding [47, 61].

Table 1 Main automatic methods for estimating age. OPG, ortopantomograph; FA, fully automatic; CNN, convolutional neural network; NAS, neural architecture search; SHN, stacked hourglass network; NA, numeric age; AGC, age group classification; AT, age thresholding; Mi, i th molar; PMi, i th premolar

Among the methods developed for numerical age estimation, some authors focused on subjects with developing dentitions (up to 20 or 25 years old), such as Čular et al. [46], De Back et al. [48], and Wallraff et al. [52], while others were targeted at a wider range of age cohorts, such as Vila-Blanco et al. [49], Zheng et al. [53], Hou et al. [50], and Milošević et al. [55]. It is noticeable that Vila-Blanco et al. [54] also developed an automatic method to estimate age in children and subadults that is not focused on the entire panoramic image or the dental area but the shape of the mandible.

On the other hand, some methods were designed to perform an age group classification, with a variable number of target groups. While De Tobel et al. [56], Merdietio et al. [57], and Banar et al. [58] treated the problem as a ten-stage, Kim et al. [59] and Kahaki et al. [60] reduced the number of stages to five. Moreover, Guo et al. [61] focused on the binary classification problem of legal age detection, establishing the thresholds of 14, 16, and 18 years.

Regarding the applied methodologies, one of the first attempts to rely on image processing techniques was made by Čular et al. [46], where the authors proposed the use of an Active Appearance Model to localise the third molar and parameterise its shape and texture. In a second step, these parameters are introduced into a neural network to estimate the chronological age. As both steps do not need human intervention, the method works automatically.

As in most of the topics involving image processing, deep neural networks (DNNs) helped not only to automatise some tasks but also to improve their performance. Regarding age estimation, De Tobel et al. [56] developed a staging system for the third molar based on modified Demirjian stages and used a DNN to classify the third molar image crops into one of those stages. This method only required a minimum intervention of the expert to crop the region of interest to frame the third molar area. This approach was updated by Merdietio et al. [57] by replacing the manual crop step with a DenseNet network, which allows estimation to run automatically. Banar et al. [58] developed a similar method, with a slightly more complex third molar segmentation, in which the tooth is first localised and then segmented.

Kim et al. [59] followed a similar approach to that of De Tobel et al. [56]. The authors also developed a two-step approach which firstly requires a manual crop of the third molar, although in this case the four third molars are required. In the second step, each of the four teeth is classified into different age groups, and the classifications are merged through a majority voting system. The authors established two different age group divisions: the first grouped subjects younger than 20, subjects aged 20 to 49, and those over 50; the second split the middle group into three subgroups, namely subjects aged 20 to 29, subjects aged 30 to 39, and those aged 40 to 49.

Although deep learning methods had already been introduced in the studies mentioned above, De Back et al. [48] proposed the use of a DNN, specifically a Bayesian Convolutional Neural Network, as the only step to estimate chronological age. Therefore, the expert does not need to specify which features of the image should be taken into account, as the network focuses automatically on those regions which contributes the most to the age estimation. Furthermore, the age estimation process can proceed even if several teeth are missing.

Vila-Blanco et al. [49], following the clinical finding that dental development is different in boys and girls, developed a method to automatically integrate sexual information into the age estimation process. Thus, they proposed the use of two identical CNNs, one for age estimation and the other for sex classification, so that the sex CNN learns sexual dimorphic features and propagates them to the age CNN at intermediate points to improve the age estimation performance.

Similarly to that proposed by De Back et al. [48] and Vila-Blanco et al. [49], Hou et al. [50], Wallraff et al. [52], Milošević et al. [55], and Guo et al. [61] also developed one-stage methods that do not require the presence of specific teeth to work.

Evaluation studies

The studies retrieved in this work were categorised according to the age estimation methods they relied on. As it can be seen in Fig. 3a, where the ten most used methods are represented, Demirjian et al.’s approach [13] has been applied in more than a third of the studies (100 out of 286), with some methods derived for it also in the first positions—Willems et al.’s [14] and Chaillet and Willems’ [16] were applied in 45 and 7 studies, respectively. The first method not aimed originally at estimating the numerical age is the approach proposed by Cameriere et al. [62]. This method, focused on the classification of subjects younger or older than legal age, was used in 43 studies. On the other hand, it is noticeable that 239 studies relied on OPGs to conduct the experiments, representing 84% of all the retrieved works (Fig. 3b). The rest of the studies used CT-based techniques (such as CBCT or conventional CT) and, to a lesser extent, intraoral images, MRI, or the cephalometric view.

Fig. 3
figure 3

a Number of studies applying the ten most assessed methods for age estimation in dental radiographic images in the period evaluated in this review (2016 to 2022), where PTVR represents the works that used pulp-to-tooth volumetric ratios; b Number of studies relying on the different types of radiological dental images used in the period evaluated in this review (2016 to 2022)

Regarding the performance of the age estimation methods, a maximum of one study was evaluated for each population and each method, specifically that evaluated in the largest sample due to the greater significance of the reported results. This ensured a good representation of different ethnicities while avoiding overcrowded result tables. Following the same order as in the previous section, the approach based on tooth eruption assessment proposed by Haavikko [7] was evaluated in a wide range of populations since its development, but it clearly lost popularity in comparison to other methods. As shown in Table 2, only four studies that met the inclusion criteria have been analysed, all focused on subjects younger than 16. In terms of performance, these studies reported systematic underestimations given by residual errors (difference between estimated age and real age) with means ranging from − 0.22 to − 1.35 years and standard deviations around one year. The mean absolute errors yielded mean values of 0.33 to 1.45 years.

Table 2 Evaluation of the Haavikko’s method [7]

The atlas-based methods listed in Fig. 2 were also applied in the last few years, although only the London Atlas proposed by AlQahtani et al. [10] was tested in more than one population sample. As shown in Table 3, the work by Baylis and Bassed [63], which compared the three Atlas-based methods in a New Zealander population, reported a slight underestimation with the Schour and Massler Atlas [8] (− 0.03 to − 0.39 years of mean error), a slight overestimation with the Blenkin and Taylor method [11] (+ 0.07 to + 0.34 years), and a noticeable overestimation with the London Atlas [10] (+ 0.40 to + 0.74 years). The latter resulted in a systematic overestimation in the rest of the populations in which it was applied. In terms of absolute error, the mean values ranged from 0.43 to 1.43 years, while the standard deviations ranged between 0.36 and 1.05.

Table 3 Evaluation of Atlas-based methods to estimate age in subadults. When two values are given, they correspond to the metrics obtained in males/females

Regarding the staging methods, the one proposed by Nolla [12] was used in eight different studies, as seen in Table 4, showing mean residual errors between − 1.12 and + 0.54 years, and standard deviation values between 0.23 and 3.30. The mean absolute errors ranged from 0.66 to 1.10 years. Most of the studies were focused on subjects between five and 15 years of age, although Berkvens et al. [64] conducted a study on subjects aged up to 30.

Table 4 Evaluation of the Nolla method [12]. When two values are given, they correspond to the metrics obtained on males/females

The method developed by Demirjian et al. [13] is perhaps one of the most studied approaches for the estimation of dental age. In the analysed period of time, a set of 40 studies using Demirjian et al.’s method [13] reported any of the required metrics in different populations, as shown in Table 5. The range of ages was also wider than in the case of Nolla’s method [12], with subjects ranging from two to 30, although most of them focused on the interval between five and 23. Regarding the obtained results, a clear overestimation can be seen, being the mean errors between − 0.58 and + 2.13 years. Absolute errors indicate that the error magnitude lies between 0.13 and 1.48 years, while the reported R2 values were over 0.60 in any case.

Table 5 Evaluation of the Demirjian et al.’s method [13] and revisited Demirjian’s method [99]. When two values are given, they correspond to the metrics obtained on males/females

The modified Demirjian’s method developed by Willems et al. [14] led to numerous studies focused on testing its applicability in different populations. In this review, a set of 28 studies was analysed (Table 6). On average, the method was applied to a narrower age range, working most of the authors in the range between five and 16 years of age. Although more investigations that show overestimation than underestimation—13 vs. 12, respectively—, this trend is much less noticeable than in the case of the method by Demirjian et al. [13]. The absolute errors also tended to decrease with this method, as the values lied between 0.61 and 1.16 years. Bedek et al. [15] proposed a modification of Willems et al.’s method [14], which was evaluated in an Indian population by Sheriff et al. [65], as it can be seen in Table 7. The results showed a notable underestimation (up to − 0.55 years of mean error), but the low standard deviation values (0.05 to 0.06 years) indicated that the error was consistent between all subjects.

Table 6 Evaluation of the Willems et al,’s method [14] to estimate age in subadults. When two values are given, they correspond to the metrics obtained in males/females
Table 7 Evaluation of Bedek et al.’s method [15] to estimate age in subadults. When two values are given, they correspond to the metrics obtained in males/females

The modification of the Demirjian et al.’s method [13] proposed by Chaillet and Willems [16] was applied to four different samples in the collected studies, as presented in Table 8. The range of application, however, is narrower than in the Demirjian’s applications, as the subjects were in every case younger than 18. Unlike the systematic overestimation of the Demirjian et al.’s method [13], Chaillet and Willems’ [16] tended to underestimate age, with mean errors between − 2.79 and − 0.07 years. Absolute errors ranged from 0.66 to 1.14 years on average, with standard deviations between 0.49 and 0.52 years.

Table 8 Evaluation of Chaillet and Willems’ method [16] to estimate age in subadults. When two values are given, they correspond to the metrics obtained in males/females

Finally, the modified Demirjian’s method developed by Blenkin and Evans [17] was applied to two different populations of subjects aged six to 17 (Table 9), yielding errors with mean values ranging from − 0.05 to − 0.55 years and standard deviations up to 1.04 years. The absolute errors ranged from 0.61 to 0.91 years on average.

Table 9 Evaluation of Blenkin and Evans’ method [17] to estimate age in subadults

The tooth staging criteria proposed by Gleiser and Hunt [18] led to the development of several methods, such as those proposed by Moorrees et al. [19] and Kohler et al. [20]. Six studies applied the former in different samples of subjects aged from three to 30, as seen in Table 10, with mean errors between − 1.01 and + 0.34 and so a tendency to underestimating the age. In absolute terms, the error ranged between 0.63 and 1.42 years. On the other hand, Kohler et al.’s method [20] was applied to two different samples of subjects up to 24 years of age, reaching systematic underestimations in both cases and mean absolute errors up to 2 years.

Table 10 Evaluation of the methods based on Gleiser and Gunt criteria [18]. When two values are given, they correspond to the metrics obtained on males/females

The widely used tooth staging method proposed by Liliequist and Lundberg [21] was used in two of the retrieved studies, in Brazilian and Croatian populations, respectively. As it can be seen in Table 11, the method led to an age underestimation in both cases, though it was more noticeable in the former, with a mean error of − 0.58 years. Absolute errors were very similar in both studies, with mean values of 0.97 and 0.99 years and median values of 0.83 and 0.81, respectively.

Table 11 Evaluation of Liliequist and Lundberg’s method [21]

The method proposed by De Tobel et al. [22], which mixed both Demirjian et al. [13] and Kohler et al.’s [20] staging systems, was applied to a Belgian and Dutch sample of subjects aged 14 to 26 as presented in Table 12. The authors reported mean errors closer to zero and thus an equal tendency to overestimating or underestimating age, while the mean absolute error was between 1.70 and 2.00 years.

Table 12 Evaluation of De Tobel et al.’s method [22] to estimate age in subadults and young adults

As seen in Table 13, Cameriere et al.’s method [23] based on the evaluation of open apices led to a systematic underestimation in most evaluated populations—13 out of 14 reporting the residuals—, with a maximum mean error of − 1.36 years, although it is noticeable that in the Rivera et al. study [66], the mean error is negative while the median error of male subjects is positive. The absolute errors ranged between 0.57 and 1.60 years on average.

Table 13 Evaluation of Cameriere’s method [23] for age estimation in subadults. When two values are given, they correspond to the metrics obtained on males/females

The proposed approaches for estimating chronological age in adults produced systematically worse results than their children-orientated counterparts, and the available studies are much scarcer. Moreover, the studies that apply adult-based methods tended to report mostly the standard error and R2 values instead of the residual and absolute error measurements, in opposition to the previously presented methods. In this regard, 35 studies were collected related to the evaluation of metric methods based on a set of linear and volumetric tooth analysis, as seen in Table 14. The most common approach is the pulp-to-tooth linear, area, and volumetric ratio (PTLR, PTAR, PRVR), used in eight works each, and the tooth-to-crown index (TCI) and the pulp-to-crown volume ratio (PCVR), each one applied in three studies. It is also worth noting that most of these works relied on 3D images (such as CT-based records) instead of flat X-rays, as they allow the volume of the different tooth structures to be analysed accurately.

Table 14 Evaluation of methods based on evaluation of linear and/or volumetric dental measurements. When two values are given, they correspond to the metrics obtained in males/females. TCI, tooth coronal indexes; PLR, pulp linear ratio; PTLR, pulp-to-tooth linear ratio; PTAR, pulp-to-tooth area ratio; PTVR, pulp-to-tooth volume ratio; PV, pulp volume; PCLR, pulp-to-crown linear ratio; PCVR, pulp-to-crown volume ratio; PEVR, pulp-to-enamel volume ratio; PDVR, pulp-to-dentin volume ratio

Huge variability in the results reported by these studies is observed. For example, the mean absolute error varied not only depending on the measurement but also across the studies using the same measurement (from 5.66 to 25.85 years in the case of PTVR). This can also be seen in the standard error metric, which lied between 4.66 and 15.29 years. In terms of variance explained, the models moved between 1 and 97%. The specific method of Kvaal et al. [26], which is based on pulp-to-tooth linear ratios, yielded very different behaviour in the available studies. For example, while the study by Li et al. [67] reported a noticeable overestimation, the work by Roh et al. [68] pointed out a great underestimation, both of them evaluated in samples of adults aged around 20 to 70. In the same way, studies published on the evaluation of the Cameriere et al.’s method [30] based on the pulp-to-tooth area ratio focused on the statistical analysis of the method and reported limited information regarding the residual error of the model. Only Dabbaghi and Kazemi [69] reported the mean error, pointing out an overestimation of 8.83 years. Furthermore, it is noticeable that the lowest absolute errors were obtained in a multi-ethnic sample [70].

Methods that have associated radiographic visibility of several oral structures with chronological age usually do not aim to estimate a numerical age value, so only two of them reported estimation error metrics. As shown in Table 15, the study of Chaudhary and Liversidge [71] pointed out an overall overestimate of 7.21 years for males and 6.87 years for females, being the mean absolute error of 7.91 and 7.74 years in the same two scenarios. On the other hand, Timme et al. [72] did not report the error metrics, but a standard error of 3.55 years and the percentage of explained variance (69%).

Table 15 Evaluation of methods to estimate the age of adults based on the radiographic visibility of several structures. When two values are given, they correspond to the metrics obtained in males/females. PL, periodontal ligament; RP, root pulp

Methods based on the evaluation of degenerative tooth changes, based on Gustafson’s criteria [35], are summarised in Table 16. The results show again a high degree of variability, with standard errors reaching a minimum of 0.69 and a maximum of 10.92 years, even if in both cases the Gustafson criteria are assessed in a population of the same ethnic origin. Only two studies reported the mean absolute error, with very different values in each one (3.34 and 3.68 years in the first method and 11.08 years in the second). Finally, R2 values were in the range 0.23–0.80.

Table 16 Evaluation of methods for estimating age in adults based on Gustafson’s criteria [35]. When two values are given, they correspond to the metrics obtained on males/females

As mentioned in “Age estimation on other radiologically observable structures” section, the estimation of chronological age was also approached by mandibular bone analysis, specifically by measuring ramus length [38] and gonial angle [39]. The former produced a model that represented 62% of the data variance, while the latter led to an absolute error of 13.98 years. As it can be seen in Table 17, both studies reported different metrics, so they are not directly comparable.

Table 17 Evaluation of methods to estimate the age based on the measurement of the mandible

The most widely used methods for numeric age estimation were jointly analysed regarding the obtained underestimation or overestimation. As it can be seen in Fig. 4, two of the six methods showed a clear pattern of overestimation, namely those proposed by Demirjian et al. [13] and the London Atlas [10]. On the other hand, the methods developed by Cameriere et al. [23] and Nolla [12] led to a systematic underestimation of age. Finally, the methods based on linear and volumetric measurements of the teeth, as well as that proposed by Willems and Chaillet [14] yielded a more balanced performance, with almost the same number of studies underestimating and overestimating age.

Fig. 4
figure 4

Underestimation or overestimation produced by the methods for numeric age estimation. Only those methods reporting the mean error in at least five studies were included. The methods based on linear and volumetric measurements of the teeth (L/V measurements in the figure) were grouped for a better representativity

As mentioned in the previous section, some age estimation methods were adapted to work as a binary classifier for detecting people younger or older than the legal age. The results obtained in this regard are presented in Table 18. First, the methods based on tooth eruption presented by Haavikko [7] and Olze et al. [40] were assessed in the problem of 14-year-old detection. The former led to accuracy between 78 and 81%, while the latter yielded better performance, with 83 to 86%. Also, the method proposed by Olze et al. showed a more balanced behaviour, with similar sensitivity and specificity values.

Table 18 Evaluation of age thresholding methods. When two values are given, they correspond to males/females

There is only one study that evaluated an Atlas-based method for binary age classification. Specifically, De Moraes et al. [73] used the London Atlas [10] for classifying dental records according to the 18-year-old threshold. Although the accuracy reached a reasonable value of 80%, the methods were heavily biased, as they produced a very high sensitivity—that is, they correctly detected subjects older than 18— but very low specificity —it only correctly classified half of the subjects younger than 18.

Dental staging methods were used to a greater extent. Regarding the Nolla method [12], it was applied to the Portuguese and Montenegrin populations. In the first, the method obtained accuracies from 82 to 90%, depending on the age threshold, while in the latter the accuracy was 90% for males and 87% for females. It is noticeable that Nolla’s method was highly biased in both directions in the Portuguese population, as it obtained higher sensitivity values with the thresholds of 14 and 21 years, and better specificity values with the age thresholds of 16 and 18.

The Mincer et al.’s method [42] based on the development stages proposed by Demirjian et al. [13] led to maximum accuracies of 91, 90, and 94 % when using the legal age thresholds of 12, 14, and 18 years, respectively. Although the sensitivity and specificity were almost balanced in most studies, there are some cases where a significant bias was observed, such as the study by Mwesigwa et al. [74] with a legal age of 16—sensitivity of 69% vs. specificity of 97%. Pinchi et al. [75] also tested the Willems et al.’s scores [14] in an Italian population, reaching a slightly worse result than in the case of the original scores, especially in the sensitivity values (74–78% vs. 80–82%).

Gleiser and Hunt staging system [18] was also studied on the problem of binary age classification via the derived methods of Moorrees et al. [19] and Kohler et al. [20]. The former was applied in a Portuguese sample with 14, 16, 18, and 21 thresholds, obtaining accuracies from 83 to 90%. Again, the sensitivity and specificity values were highly unbalanced, especially when using the 14-year-old threshold (92% of sensitivity and 59% of specificity). On the other hand, the Kohler et al.’s method reached an accuracy of 91% in an Indian population and sensitivity and specificity values of 87–80% and 87–90%, respectively, in a Russian sample.

The adaptation of Cameriere’s method for legal age classification [62] was by far the most used method for binary age classification. Among all the experiments carried out with this approach, 29 out of the 40 established an age threshold of 18 years. The accuracy values ranged from 72 to 98%, although 35 studies yielded values greater than 80%. As with most methods, there are some cases where a great bias between sensitivity and specificity can be seen, the most significant example being the study by AlQahtani et al. [76], where the sensitivity was 51–52% and the specificity was 100–97%.

Finally, Olze et al.’s [31] method based on the assessment of the root pulp visibility was evaluated in Turkish and Indian samples. Only the latter study reported the accuracy—77% in males and 80% in females—yielding also a notable imbalance between sensitivity and specificity.

The two most widely applied methods for age thresholding, namely those proposed by Mincer et al. [42] and Cameriere et al. [62], were compared using a reference value of 90% accuracy. As it is shown in Fig. 5, the method of Mincer et al. obtained a performance better than the reference when establishing an age threshold of 12 years and three times out of four with a threshold of 18 years. When the threshold is set to 14 years of age, one study obtained better performance than the reference, and another work reported worse performance. Regarding the Cameriere et al’s method, all studies that set an age threshold of 12, 14, or 16 years reported accuracy values lower than the reference, while studies applying a threshold of 18 years showed a more balanced performance, with 12 studies performing better than the reference and 16 works reporting worse results.

Fig. 5
figure 5

Performance of the methods aimed at age thresholding with respect to an accuracy reference value of 90%. Only those methods used in at least five studies were included

Regarding the automatic approaches proposed for age estimation, each method was tested in a single population. In those aimed at estimating a numerical age value (Table 19), the residual error was systematically closer to zero, being the median between − 0.07 and + 0.12 years. Absolute error varied depending on the age range of the assessed sample. As reported by Vila-Blanco et al. [49], the mean absolute error in a Spanish sample ranged from 0.75 years in subjects younger than 15 years to 2.84 years in subjects younger than 90 years. In this case, the median absolute error was as low as 0.64 years. Comparatively, the work by Čular et al. [46] reported a mean absolute error of 2.28 years in Croatian subjects between 10 and 25 years of age, being 1.75 years on a German sample of subjects aged 5 to 25, as noticed by De Back et al. [48]. Performance was significantly worse in the methods tested in older subjects, as reported by Milošević et al. [55], Zheng et al. [53], and Pham et al. [51], with mean absolute errors of 3.96, 7.17, and 6.97–7.07 years, respectively. On the contrary, the method of Hou et al. [50], which was tested in a large population sample and very wide in terms of subject age (from 0 to 93 years of age), led to a very low mean absolute error, being of 1.64 years. It is also noticeable that the method of Vila-Blanco et al. [54], which relies only on the mandible shape instead of the whole dental image, yielded a mean absolute error of 1.57 years, which is comparable or even better than other methods relying on the whole dental image.

Table 19 Results of the main automatic methods for numeric age estimation, evaluated in different populations. When two values are given, they correspond to the metrics obtained in males/females

Automatic methods focusing on classifying the subject’s age into a defined set of age ranges led to different results depending on the number of classes, as shown in Table 20. For example, the approaches of De Tobel et al. [56], Merdietio et al. [57], and Banar et al. [58], which proposed to use 10 different development stages, obtained classification accuracies between 51 and 60% and Kappa values between 0.82 and 0.84. It should be noted that these three methods were assessed in the same Belgian sample. On the other hand, Kim et al. [59] and Kahaki et al. [60] reduced the classification problem to five age groups, leading to significantly higher accuracy values of 82 and 90%, respectively.

Table 20 Results of the main automatic methods for age group classification, evaluated in different populations

Finally, Štern et al. [47] and Guo et al. [61] tested their automatic age thresholding approach in Austrian and Chinese populations, respectively, as it can be seen in Table 21. It is worth noting that the latter sample consisted of more than 10,000 OPGs. Guo’s method led to accuracy values between 93 and 96%, depending on the specific age threshold, and was very consistent in terms of sensitivity and specificity. On the other hand, the method by Štern et al. reached a slightly worse accuracy value in the same scenario (85% vs. 93%), and a notable imbalance between sensitivity and specificity is observed.

Table 21 Results of the main automatic methods for age thresholding, evaluated in different populations. When two values are given, they correspond to the metrics obtained in males/females

The applicability of the methods used in the studies included in this work was assessed in terms of the age of the subjects. As Fig. 6 indicates, there are notable differences among the proposed approaches. While the methods developed by AlQahtani et al. [10], Nolla [12], Willems and Chaillet [14], and Cameriere et al. [23] focused on a very constrained group of patients aged two to 18, approximately, Demirjian et al.’s method [13] could be applied to a wider group of patients of even more than 25 years of age. The methods focused on post-developmental dental features, such as the one proposed by Kvaal et al. [26] or the pulp-to-tooth volumetric ratios, have as their natural field of application those subjects aged 18 to 70, approximately. On the other hand, the automatic methods have been proven to be applicable in a wider age range, covering both the subjects with developing dentitions and the subjects with fully developed teeth.

Fig. 6
figure 6

Application of the most widely used methods for age estimation regarding the age of the subjects included in the analysed studies. Only those methods applied in at least ten studies were included

Discussion

The oral cavity, and especially the teeth, has been used for decades as they show a high correlation with development patterns. In this regard, great efforts have been made since the late nineteenth century to develop teeth evolution Atlases, with a view not only to the formation of teeth concerning age but also to the sexual dimorphic patterns of that development. The democratisation of radiology led to a number of improvements, one of them being the collection of bigger databases to use in new studies, which in the end increases the statistical significance of the findings. Another major revolution came with the arrival of computers, which brought the possibility of acquiring X-ray images directly in a digital format and, therefore, to speed up the measuring processes.

With the aim of using the structures that are observable in the dental images to estimate the chronological age, different approaches have been followed. Only three studies collected in this work focused on the mandible bone [38, 39, 54], while the rest relied on tooth analysis methods. Among the latter, two groups of works can be noticed, namely those aimed at estimating age in children and young adults [12, 13, 23] and those focused on adults [26, 31, 36]. Moreover, some authors approached the age estimation problem as a classification task instead of a regression problem. In this regard, some studies tried to classify the age of the subjects into two groups, usually with the main purpose of estimating the chances a person has to be older than the legal age. Other works generalised this idea and applied an age classifier with more than two target classes. The usual way to classify the age in both cases was through some modification of a numeric age estimation method [42,43,44].

As tooth and bone development is reported to depend on factors such as ethnicity or environment, these methods have been evaluated to a greater or lesser extent in different populations around the world. To assess the differences between the different approaches and the population samples used, this work retrieved a corpus of studies applying age estimation methods in dental images published in the last 7 years (from 2016 to 2022). A total of 613 unique studies were obtained, of which 286 were selected after applying the inclusion criteria. The first point to highlight is the great difference in the number of studies that apply each method. For example, methods such as those proposed by Schour and Massler [8] or Blenkin and Taylor [11] were applied in one single work in the evaluated period, which decreases the significance of the results to a large extent.

Regarding the performance obtained by the analysed methods, it has been proved that the eruption-based methods are useful for a very short period (until 15 years of age, approximately), and the only method included in this work, proposed by Haavikko et al. [7], led to a systematic underestimation of the chronological age in every tested population. Similarly, Atlas-based methods assessed have also been applied to very constrained population samples in terms of age, being the London Atlas method [10] the only method evaluated in multiple populations.

Dental staging approaches have proven to be highly relevant in this period, as they accounted for nine of the ten most commonly used methods in this work. This allowed us to compare these methods to each other accurately and confirm the findings reported in previous works, such as the tendency of Nolla [12] and Demirjian et al.’s [13] methods to underestimate and overestimate the age, respectively [77], or the balanced behaviour yielded by the Willems modification of Demirjian’s method [14]. The methods based on Gleiser and Hunt stages [18] achieved slightly worse results than Nolla’s [12] or Demirjian et al.’s [13], while both Moorrees et al.’s [19] and Kohler et al.’s [20] approaches showed a significant underestimation and greater absolute errors. It should also be noted that the method proposed by De Tobel et al. [22], which combines multiple dental staging methods, showed a behaviour with no tendency to underestimation or overestimation, but greater absolute errors than those obtained with the methods it relies on individually. On the other hand, Cameriere’s method [23] based on the measurement of open apices systematically underestimated the age, while the absolute errors were highly inconsistent compared to the previously mentioned methodologies.

Estimating chronological age in adults has proven to be a much more difficult task, as confirmed in the studies retrieved in this work. None of the studies reported absolute errors lower than 2 years, and trends of under- or overestimation were also more pronounced [69, 78]. Regarding the methods using linear or volumetric dental measurements, there is a clear improvement in terms of standard error values when volumetric information is taken into account. It is also noticeable that adult-oriented methods do not report as many performance metrics as those targeted at children, which hinders the assessment of their performance. The most problematic cases are those involving the radiographic visibility methods and the mandible-based methods, where a direct comparison is not possible.

The methods proposed for detecting if someone is younger or older than a predefined threshold (usually the legal age) provided a more suitable environment for comparability purposes, as most of them reported the accuracy, sensitivity, and specificity metrics. The majority of studies yielded reasonably high accuracy, with values over 80 or even 90%, suggesting that these methods could be used either alone or combined with others to produce confident estimations. However, it is worth noting that the results are very dependent on the age cohort used to test the methodology. If subjects are not well-balanced around the threshold, that is, if most subjects are younger or older than the legal age, the performance will be abnormally good, as the classifier will tend to assign the majority class to all subjects [79]. A similar problem can occur if the testing age cohort is too wide, as the further away the subjects are from the age threshold, the easier it is to classify them and, therefore, the better the classification performance.

The inclusion of modern image processing techniques for age estimation led to a set of improvements, being the first one the time and cost saving by means of full or almost-full automation. Moreover, the performance achieved by these methods is clearly higher than that obtained with the manual methodologies. Although in some cases the errors were high due to the exclusion of children from the population sample [51, 53, 55], the overall performance was remarkable, with absolute errors lower than one year for patients younger than 20 [49] or 1.64 years in a huge sample of almost 28,000 images from patients aged zero to 93 years [50]. In this regard, the applicability of these methods is better compared to that of the classic manual approaches, with an almost avoidance of underestimation and overestimation problems and the subjectivity intrinsic to human-guided processes, as well as the possibility to run estimations in a wider age cohort.

The studies that applied automatic methodologies to perform binary age classification or multiple group age classification are not as numerous as those aimed at performing numeric age regression. Moreover, traditional methods in this regard reached very good results, with almost no room for improvement. However, the proposed automated methodologies kept the same high-performance level while providing a series of benefits such as economic and resource-saving [59, 61].

Regarding the specific computational approaches followed by the automatic methods, most studies rely on fully automatic and deep-learning-based solutions, which lead to two key enhancements. As with every end-to-end method, the dataset only needs to be annotated with the expected output, that is, the age, reducing the time of this process and, thus, making it possible to compile bigger datasets of thousands of images. Furthermore, these methods do not rely on specific bone structures designated by an expert, but rather on the image parts that the algorithm considers to be most relevant for that specific task. As there is no need for specific teeth to be present, these methods can work even if some pieces are missing.

Although these automatic methods have been shown to improve the performance and applicability of age estimation methods, their validation needs to be improved. The relative recency of deep learning techniques causes that no automatic method has been tested in populations or acquisition devices other than the original ones, which raises doubts about their generalisation to different scenarios. However, the ease to compile new databases in this regard allows these methods to be easily adapted to different situations through the application of specific domain-adaptation techniques, such as transfer learning or fine-tuning [80].

Conclusions

In this work, the current application of age estimation methods in recent years has been studied, specifically those methods using radiological dental images. Although classic methods have been thoroughly evaluated in many populations of different ethnicities all over the world, automatic methods based on deep learning techniques led to an improvement not only in terms of performance but also regarding the applicability in a real scenario. This represents a turning point in the field of chronological age estimation, since the speed at which the estimations can be applied can be significantly higher and the subjectivity inherent to the observer analysis can be completely avoided. Future work in this area should involve a deeper assessment of the proposed automatic methodologies, specifically their evaluation in samples of different ethnicities, to improve their generalisation capabilities.