1 Introduction

Non-invasive age and gender estimation from radiographs has notable roles in dental diagnostics and forensic investigations [1]. Applications outside of dental practice range from estimating the age of a corpse and helping determine the identities of the deceased after calamities such as explosions, to law enforcement and judicial proceedings that evaluate the truthfulness of an individual’s stated age at specified times of interest or handle cases of undocumented children [2, 3]. To provide scientifically backed evidence on age and biological sex, forensic dentistry estimates an individual’s age from the developmental stage of the teeth and maxillofacial arches. The development of a tooth occurs through several stages, starting from the formation of the tooth bud in the embryonic stage to the eruption and maturation of the tooth in the oral cavity. Panoramic radiography, also known as orthopantomography (OPG) or panoramic imaging, is a specialized, easily accessible, and cost-effective dental imaging technique that captures a wide-angle view of the entire oral and maxillofacial region in a single image. It provides a comprehensive two-dimensional overview of the dentition, maxillofacial and mandibular bone anatomy, temporomandibular joints (TMJs), sinuses, and other surrounding structures, and has previously been used to report on root-canal treatment progression [4].

While advanced three-dimensional techniques such as computed tomography (CT) [5, 6] and cone-beam computed tomography (CBCT) [7, 8] have recently gained popularity, panoramic radiographs remain the most commonly used modality in both dental diagnostics and deep learning applications related to dentistry [9,10,11,12]. The advent of high-resolution biosensors and the associated imaging processes has produced large quantities of data that can be examined computationally. OPGs are considered to contain most of the two-dimensional landmark information used to reach a preliminary diagnosis and are usually the first step in determining whether three-dimensional computed tomography is required. Traditional automation of dental age estimation involves phases such as image preprocessing, segmentation, feature extraction, and classification (categorical) or regression (numerical). Classification aims to assign people to age categories, whereas regression aims to estimate their exact ages.

Deep Learning (DL) methods have been applied in recent years to automate tasks on OPG images. DL techniques, most notably convolutional neural networks (CNNs), have shown promise in various applications involving digital panoramic radiographs: extracting and segmenting features within the maxilla and mandible to isolate each tooth from other objects in the image, such as the jaws [13]; detecting and classifying individual teeth, i.e., identifying and labeling each tooth within a dental image [14,15,16]; detecting previous treatment, e.g., endodontics [17]; reconstructing OPG images where a patient was badly positioned [18]; and diagnosing osteoporosis [19] and jaw tumors [20].

While there has been extensive exploration of supervised CNN-based age estimation in previous literature, the integration of unsupervised learning and explainability is a relatively nascent area in terms of both design and approach [21]. The application of unsupervised learning to radiographic assessment to reduce operator-related variability is of particular interest. In this context, our research introduces a novel unsupervised deep learning approach, termed PENViT, which combines EfficientNet and Vision Transformer (ViT) models with Additive Angular Margin Loss (ArcFace). This combination of deep learning models and loss functions aims to improve the accuracy and resilience of dental age estimation. The primary objective of the present study was to explore existing methods and devise novel strategies for advancing automated age prediction using weak and minimal supervision. To this end, the study posed the following research questions:

  a. Which model architecture can correctly estimate age and biological gender using a regression-based neural network?

  b. Do margin losses (ArcFace, TripletMarginLoss) increase the performance of OPG-based age classification compared with pure cross-entropy loss?

  c. Does hard triplet mining improve the validation accuracy of a triplet network?

  d. Can the novel PENViT model backbone perform on par, both in its general form and as a triplet-like network (TriplePENViT), with other backbones?

  e. Does a two-step semi-supervised pseudo-labelling workflow improve the validation accuracy of age estimation?

  f. Can AI interpretation produce medically sound regions of explainability on radiographs for predicting age?

2 Related literature

The anatomical form of the maxilla and mandible, along with the development of the alveolar bone region, correlates most strongly with an individual’s chronological age [22]. When classifying jaw development, striking age-related features include the development of the deciduous dentition, followed by each permanent tooth, and finally the root completion of the third molars [22]. The current study aimed to implement a series of methodologies from previous literature and generate a hybrid model that can identify biological gender and estimate age.

2.1 Regression tasks

Age estimation from orthopantomograms, or panoramic radiographs, is an application that leverages regression models [2, 3, 23]. The primary objective is to gauge an individual’s age from diverse variables, including mandible development, tooth germs, and areas of missing space within the dental arch [3, 23]. Previous inquiries have adopted the Mean Absolute Error (MAE) metric to quantify the efficacy of a regression model [10, 21]. Similarly to the present study (with results reported in a subsequent section), Fan et al. [21] identified a lower MAE compared with alternative CNN-only architectures on regression tasks. Demonstrating an automated methodology, Atas et al. modified the InceptionV3 framework to yield a novel neural network model [10].

A more accurate and relatively faster dental age estimation stemmed from curtailing the number of attributes in the devised model structure. He et al. introduced deep relation learning for regression, aiming to uncover diverse correlations within pairs of input images [24]. In parallel, Fan et al. formulated a hybrid deep neural network, termed DASE-net, combining Transformer and CNN components. This architecture targeted age prediction from dental x-rays, comparing its performance against CNNs and manual techniques executed by forensic dentistry experts [21]. A contemporaneous study also estimated gender from dental x-rays, employing a DenseNet architecture alongside comparative models [25]. The authors experimented with four distinct deep learning network structures: VGG, ResNet, EfficientNet, and DenseNet. Of these, the proposed DenseNet121 model, with fewer parameters, produced superior outcomes compared with its more parameter-laden counterparts.

An independent study introduced the lightweight SFCN model, capable of accurate age prediction with a single fully connected layer, thus minimizing parameter count compared with multi-layer counterparts [26]. After contrasting SFCN’s performance against ResNet18, ResNet50, ResNet101, and ResNet152, it was concluded that deeper models did not inherently outperform shallower ones in predicting brain age. Among the tested architectures, SFCN emerged as the top performer. Alternatively, prior literature has also documented Bayesian convolutional neural networks as a possible approach to estimating age uncertainty [27].

2.2 Classification tasks

Age estimation can alternatively be tackled as a classification task, wherein the objective is to categorize individuals into predetermined age groups or classes. The classes employed for classification in this research are indicated in Table 1; they were adapted from prevalent patterns in dentistry but in a simplified manner to ensure that the limited dataset could generate meaningful and reliable data [28, 29].

Table 1 Classification Task: Age group classification

An automated approach for determining individuals’ age groups was presented by a group of researchers, employing transfer learning techniques on two convolutional deep neural networks: AlexNet and ResNet-101 [2, 3, 11, 23]. The classification process involved utilizing decision tree (DT), k-nearest neighbor (K-NN), linear discriminant (LD), and support vector machine (SVM) methods. Another study by Vila-Blanco et al. introduced two fully automatic methods for estimating chronological age [12]. The first approach, named DANet, employed a sequential Convolutional Neural Network (CNN) for age estimation. The second approach, known as DASNet, extended this by incorporating a second CNN path to predict gender and leveraging gender-specific features to enhance age estimation performance. Comparative results indicated the superior performance of DASNet over DANet.

In a different context, Almalki et al. explored object detection using the YOLOv3 deep learning model, creating an automated tool to diagnose and classify dental abnormalities from panoramic dental radiographs. Meanwhile, Farhadian et al. employed the pulp-to-tooth ratio for age estimation [30]. Recent literature has also introduced saliency-map-enhanced age estimation techniques capable of automatically estimating age from lateral cephalometric images [31]. To identify the most suitable convolutional neural network model for automated age estimation, Milosevic et al. employed pre-trained parameters from general-purpose vision models [32]. Through ablation experiments, the authors identified the key anatomical areas within the dental system that contributed significantly to age estimation.

2.3 Pseudo labeling

Pseudo-labeling is a semi-supervised learning (SSL) technique in which the predictions of a trained model on unlabeled data are used to generate pseudo-labels, which then augment the labeled dataset for further training. Pseudo-labeling can be a useful approach when labeled data is limited but unlabeled data is abundant. In 2022, Liu et al. proposed an effective SSL algorithm for medical image analysis (MIA), called anti-curriculum pseudo-labelling (ACPL), which introduced novel selection and balancing techniques for unlabelled samples, enabling the model to handle both multi-label and multi-class problems while estimating pseudo-labels with ensemble classifiers [33].

More recently, in 2023, Xu et al. used Bayesian Pseudo Labels to illustrate the generalization of pseudo-labels under the Bayes principle [34]. By learning a threshold for choosing high-quality pseudo-labels, they offered a variational technique for approximating Bayesian pseudo-labels. A connection was drawn between pseudo-labeling and the Expectation-Maximization algorithm, partially explaining its empirical success. Rhee and Cho offered a new confidence-based weighting technique that obtains pseudo-labels with varied contributions based on confidence, along with an adaptive threshold adjustment strategy to supply sufficient and precise pseudo-labels throughout training [35]. The ambiguity of pseudo-labels for perplexing samples in SSL was then drastically reduced by the pseudo-labeling schemes of Ham et al. [36]. The investigators used an Easy-to-Forget (ETF) Sample Finder to implement their approach, Pruning for Pseudo-Label (P-PseudoLabel), comparing outputs of the model against a pruned model to find perplexing samples. Using these samples, they performed negative learning to reduce the likelihood of providing inaccurate information and to enhance performance.

2.4 Siamese or triplet networks

Zhang and Kurita proposed an age estimation method based on age periods using a Triplet Network [37]. The model extrapolates age values within an age period based on similarities across age periods; the Triplet Network records the age relationships between facial photos, and linear regression is then used to estimate each image’s age. Hajamohideen et al. more recently proposed a Siamese Convolutional Neural Network (SCNN) architecture employing the triplet loss function to represent MRI inputs as k-dimensional embeddings [38]. To map images into the embedding space, they employed both pretrained and untrained CNNs. The resulting embeddings were then used for the 4-way classification of Alzheimer’s disease. In Jeong et al.’s work, the investigators trained a convolutional neural network (CNN) using deep metric learning based on a binary-classifier Siamese network for class clustering [39].

2.5 Explainable AI

Selvaraju et al. presented a method for producing visual explanations of decisions made by a wide range of Convolutional Neural Network (CNN)-based models, improving their transparency. This method, called Gradient-weighted Class Activation Mapping (Grad-CAM), identifies the image regions most relevant to the estimation of a target concept by analyzing the gradients of that concept flowing into the final convolutional layer. Chattopadhay et al. [40] built upon this work with Grad-CAM++, a refined version that provides better visual explanations of CNN predictions, improves object localization, and attempts to explain multiple instances of a class in a single image. Later, Omeiza et al. combined SmoothGrad and Grad-CAM++ to present Smooth Grad-CAM++, a technique with improved visual sharpness and object localization that is also adept at explaining multiple occurrences of objects in a single image [41]. Recently, gradient-based visualization techniques have been criticized in academia, and there is ongoing debate about their effectiveness. Ramaswamy proposed a methodology for generating visual explanations for deep Convolutional Neural Networks (CNNs) using Ablation-based Class Activation Mapping (Ablation-CAM), which applies ablation analysis to determine the importance of individual feature-map units with respect to a particular class [42]. The authors later used ablation analysis to visualize the major components of learned representations from convolutional layers, and their Eigen-CAM technique improved explanations of CNN predictions without requiring accurate model classification [42].

Wang et al. [43] proposed developing Bayesian deep learning techniques that are both explicable and implementable, to quantify uncertainties precisely and to identify their causes and potential remedies. FullGrad has recently gained attention for its model interpretability capabilities [44]; if the highlighted areas in a FullGrad analysis are consistent with medical theory, then the current study’s model can be deemed interpretable according to explainable-AI and medical-theory concepts.

3 Methodology

This section outlines the methodology employed in this study, including the dataset description, preprocessing techniques, the workflow, the PENViT and TriplePENViT model architectures, hard triplet mining for triplet networks, and the pseudo-labelling task for semi-supervised training of neural networks.

3.1 Creation of the dataset

The current study utilized a two-part dataset comprising "Dataset A" and "Dataset B," both of which contain grayscale panoramic radiographs of the maxillofacial region.

Dataset A served as the primary labeled and deidentified dataset, with each image annotated with the patient’s age and gender at the time of data input as obtained from the history sheet. The dataset was expanded with additional radiographs during revision to provide greater support to the models when estimating age and gender across several variables. The age and gender distributions for Dataset A are reported in Table 2. The use of Dataset A was approved by the related organizations.

Table 2 Age and gender distribution in dataset a following revision

Dataset B was a publicly available dataset from Tufts University (http://tdd.ece.tufts.edu/Tufts_Dental_Database/Radiographs.zip), which lacks annotated labels for age and gender, similar to datasets seen in multicenter deep learning implementations [45]. It comprised a collection of 1000 unannotated radiographs [46]. Dataset B served as the source of unlabeled data for the semi-supervised task undertaken in the current study; Dataset A was used for all other tasks. While Dataset B lacked information on chronological age, it added geographic variation to the datasets used to investigate the pseudo-labelling technique.

The word “age” in the context of the current study has two meanings: chronological age (labels) and estimated dental age (estimated by the neural networks). The former is used when interpreting the datasets, whereas the latter is used when discussing the estimates of the neural networks.

Dataset A initially consisted of a total of 525 radiographic images, of which 101 samples either had labelling issues or exhibited distorted features and were therefore discarded, leaving 424 images. The images were subsequently split into training and validation datasets. To explore more versatile approaches, Dataset A was later expanded during revision to 706 images. All images in Dataset A had a minimum size of 2000 pixels in width and 1000 pixels in height. To ensure compatibility with the pretrained models, which have fixed input image sizes, each file was resized, depending on the model, to either 384 × 384 or 224 × 224 pixels during both training and validation.

Stratified resampling was then applied based on age labels to divide the initial dataset of 424 files into a training-validation split of 296 files for training and 129 files for validation, maintaining a ratio of 70:30 [47]. Once the age classifier was ready, the dataset was expanded further to train the gender classifier.

At this stage, the age distribution of the training samples was unbalanced, resulting in class imbalance. To address this, medical data augmentation techniques were applied only to the training set, while the validation set remained fixed throughout the study [48]. The augmentation techniques employed were horizontal flips, geometric transformations (rotation, scaling), and intensity operations (gamma contrast and linear contrast) [48]. Augmentation increased the training sample size to n = 924 and fully mitigated the class imbalance, with each class having 154 images in the training set. A minimal sketch of such a pipeline is given below.
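A minimal sketch of such an augmentation pipeline, assuming the imgaug library and illustrative parameter ranges (the study does not report exact magnitudes), might look as follows:

```python
import imgaug.augmenters as iaa

# Illustrative pipeline mirroring the operations named above;
# the parameter ranges are assumptions, not the study's values.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                 # horizontal flip with probability 0.5
    iaa.Affine(rotate=(-5, 5),       # geometric: small rotation
               scale=(0.9, 1.1)),    # geometric: small scaling
    iaa.GammaContrast((0.8, 1.2)),   # intensity: gamma contrast
    iaa.LinearContrast((0.8, 1.2)),  # intensity: linear contrast
])

# images: a NumPy batch of radiographs, e.g. shape (N, H, W, C)
# augmented = augmenter(images=images)
```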

3.2 Model architecture

In this subsection, the proposed PENViT and TriplePENViT model architectures are described, including the flow of data through the networks, input–output dimensions, layer details, and the architecture figures.

3.2.1 PENViT model architecture

In the current research, the PENViT model architecture (Fig. 1) incorporates two pretrained models, EfficientNet and Vision Transformer, used in parallel. Each model takes \(3\times 224\times 224\) images as input and produces intermediary vectors \({E}_{1}\in {\mathbb{R}}^{1000}\) and \({E}_{2}\in {\mathbb{R}}^{1000}\), respectively.

Fig. 1
figure 1

PENViT model architecture

To obtain the combined embedding vector \({E}_{c}\in {\mathbb{R}}^{2000}\), the following operation was performed:

$${E}_{c}=concat({E}_{1},{E}_{2})$$

The combined intermediary vector \({E}_{c}\) was then fed into a fully connected block consisting of a 60% dropout layer, followed by a Dense layer with 512 units, ReLU activation, another 60% dropout layer, and finally a Dense layer with 256 units. This fully connected block outputs the final embedding vector \(E\in {\mathbb{R}}^{256}\) for PENViT.
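As an illustration, a minimal PyTorch sketch of this embedding path is given below, assuming timm model identifiers (`efficientnet_b0`, `vit_large_patch32_224`) as stand-ins for the study’s pretrained, frozen backbones:

```python
import torch
import torch.nn as nn
import timm

class PENViTEmbedding(nn.Module):
    """Parallel EfficientNet + ViT embedding path as described above.
    The timm model names are assumptions standing in for the study's
    pretrained backbones; both emit 1000-d ImageNet logit vectors."""
    def __init__(self):
        super().__init__()
        self.effnet = timm.create_model('efficientnet_b0', pretrained=True)
        self.vit = timm.create_model('vit_large_patch32_224', pretrained=True)
        for p in list(self.effnet.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False  # backbones are kept frozen
        self.head = nn.Sequential(
            nn.Dropout(0.6),         # 60% dropout
            nn.Linear(2000, 512),
            nn.ReLU(),
            nn.Dropout(0.6),         # second 60% dropout
            nn.Linear(512, 256),
        )

    def forward(self, x):                 # x: (N, 3, 224, 224)
        e1 = self.effnet(x)               # E1 in R^1000
        e2 = self.vit(x)                  # E2 in R^1000
        ec = torch.cat([e1, e2], dim=1)   # Ec = concat(E1, E2) in R^2000
        return self.head(ec)              # final embedding E in R^256
```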

The authors in the current study performed \({l}_{2}\) normalization of the embedding vector \(E\). They then took the dot product with the weight matrix \(W \in {\mathbb{R}}^{256\times 6}\), also in its \({l}_{2}\)-normalized form. This dot product is equivalent to \(\mathrm{cos}({\theta }_{{y}_{i}})\) in the context of ArcFace Loss [49]. To obtain the logit projection, the authors applied the following transformations:

$${\theta }_{{y}_{i}}=\mathrm{arccos}\left(\mathit{cos}\left({\theta }_{{y}_{i}}\right)\right)$$
$$logits=s\times \mathrm{cos}({\theta }_{{y}_{i}}+m)$$

Here, \(s\) represents the scale factor, and m is the additive margin [49]. By applying the ArcFace loss, the final cross-entropy loss (also known as SoftMax loss) is computed as:

$${L}_{1}= -\frac{1}{N}\sum_{i=1}^{N}\log\frac{{e}^{s\,\cos\left({\theta }_{{y}_{i}}+m\right)}}{{e}^{s\,\cos\left({\theta }_{{y}_{i}}+m\right)}+\sum_{j=1,\, j\ne {y}_{i}}^{n}{e}^{s\,\cos{\theta }_{j}}}$$

For performance evaluation, the ArcFace component is omitted, and the logits are calculated as follows:

$$logit{s}_{validation}=E\cdot W$$

In the current study, a pretrained frozen EfficientNet B0 and a pretrained frozen Vision Transformer Large (patch size 32) were used for the proposed PENViT architecture. Both models were pretrained on ImageNet and subsequently fine-tuned on ImageNet-1k.
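An illustrative reconstruction of the ArcFace logit projection described above, applying the additive margin m to the target class in line with the loss \({L}_{1}\); the scale s below is an assumed value, while the margin matches the study’s 34.3 degrees:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace logit projection over l2-normalized embeddings and weights.
    s = 30.0 is an assumption; m = 34.3 degrees follows the study."""
    def __init__(self, emb_dim=256, n_classes=6, s=30.0, m=math.radians(34.3)):
        super().__init__()
        self.W = nn.Parameter(torch.randn(emb_dim, n_classes))
        self.s, self.m = s, m

    def forward(self, E, labels=None):
        # cos(theta) = normalized E dot normalized W, shape (N, n_classes)
        cos = F.normalize(E, dim=1) @ F.normalize(self.W, dim=0)
        if labels is None:
            return E @ self.W  # validation: plain logits E.W, as above
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        onehot = F.one_hot(labels, cos.size(1)).bool()
        # add the angular margin m only on the target class, per L1
        logits = torch.where(onehot, torch.cos(theta + self.m), cos)
        return self.s * logits  # feed into cross-entropy (SoftMax loss)
```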

3.2.2 TriplePENViT model architecture

In the current study, for the Triplet Network experiments (Fig. 2), the PENViT backbone was partially reused up to the combined intermediary vector \({E}_{c}\in {\mathbb{R}}^{2000}\). \({E}_{c}\) then underwent a 60% dropout followed by a Dense layer with 256 units, producing the backbone’s embedding vector \(E\in {\mathbb{R}}^{256}\).

Fig. 2
figure 2

Triplet network with PENViT backbone, TriplePENViT architecture

In the Triplet Network, the backbone is denoted as \(M\) with parameters \(\theta\). The same parameters \(\theta\) were shared across the three models within the triplet, denoted \({M}_{\theta }\). For each triplet of anchor, positive, and negative samples, three embeddings were produced: \({E}_{a}, {E}_{p},\) and \({E}_{n}\), each of dimension \({\mathbb{R}}^{256}\). The Triplet Margin Loss, denoted \({L}_{2}({E}_{a}, {E}_{p}, {E}_{n})\), was employed to train the triplet network:

$${L}_{2}({E}_{a}, {E}_{p}, {E}_{n})=\mathrm{ ReLU}(\mathrm{Distance}({E}_{a},{E}_{p}) -\mathrm{ Distance}({E}_{a},{E}_{n}) +\mathrm{ m})$$

Here, the distance function used was the Euclidean distance, and \(m\) represents the margin value for the triplet loss.
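This formulation corresponds to PyTorch’s built-in triplet margin loss, so a minimal usage sketch would be:

```python
import torch.nn as nn

# Euclidean distance (p=2) and margin m = 1.0, matching L2 above
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2.0)
# loss = triplet_loss(E_a, E_p, E_n)  # each embedding of shape (N, 256)
```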

For the evaluation of TriplePENViT, two methods were used. The first involved training a classification block, which consisted of a single dense layer, with the triplet network using only the anchor’s embedding. The second method involved calculating the distance between all pair embeddings and predicting the label of the current image as the label of the closest embedding. Both methods were reported in the results section.
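A sketch of the second (nearest-embedding) evaluation method, assuming a gallery of reference embeddings with known labels:

```python
import torch

def nearest_embedding_predict(query_emb, gallery_emb, gallery_labels):
    """Predict each query's label as the label of its closest gallery
    embedding under Euclidean distance, per the second method above."""
    dists = torch.cdist(query_emb, gallery_emb)  # (Q, G) pairwise distances
    nearest = dists.argmin(dim=1)                # index of closest embedding
    return gallery_labels[nearest]
```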

Additionally, when training a classifier with the anchor’s embedding alongside training the triplet network, the authors proposed a customized task-specific loss function denoted as \({L}_{3}\):

$${L}_{3}={scal{e}_{1}\times L}_{c}(logits,G.T.)+scal{e}_{2}\times {L}_{2}({E}_{a},{E}_{p},{E}_{n})$$

Here, \({L}_{c}\) is the classifier’s classification loss (cross-entropy), and \({L}_{2}\) is the triplet margin loss. The scales \(scal{e}_{1}\) and \(scal{e}_{2}\) weight each loss so that neither overpowers the other. During the initial warmup period, however, \(scal{e}_{1}\) was forcibly set to zero (0), allowing the triplet network to focus on learning to discriminate among the embeddings in the latent space before the classifier was trained. This customized loss function allowed a balanced joint optimization of the classifier and the triplet network, incorporating both the classification and embedding objectives.
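A minimal sketch of this combined loss with the warmup behaviour; the warmup length below is illustrative (the study explored 20 to 100 epochs, as noted in Sect. 3.3):

```python
import torch.nn.functional as F

def combined_loss(logits, targets, E_a, E_p, E_n, epoch,
                  warmup_epochs=50, scale1=1.0, scale2=1.0, margin=1.0):
    """L3 = scale1 * L_c + scale2 * L_2, with scale1 forced to zero during
    warmup so the triplet network first learns to separate embeddings."""
    if epoch < warmup_epochs:
        scale1 = 0.0
    l_c = F.cross_entropy(logits, targets)                     # classifier loss L_c
    l_2 = F.triplet_margin_loss(E_a, E_p, E_n, margin=margin)  # triplet loss L_2
    return scale1 * l_c + scale2 * l_2
```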

Hard triplet mining was used to choose the three samples (anchor, positive, and negative) as the input to the TriplePENViT network (Fig. 3); a batch-hard mining sketch follows the figure.

Fig. 3
figure 3

Triplet mining example from the current study: choosing the farthest positive (green) sample and the closest negative (red) sample with respect to the current anchor (blue) sample
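A batch-hard mining sketch in the spirit of Fig. 3, choosing for each anchor the farthest positive and the closest negative within a batch; this is an illustrative reconstruction, assuming every label occurs at least twice per batch:

```python
import torch

def hard_triplet_mining(embeddings, labels):
    """For each anchor, pick the farthest same-label sample (hardest
    positive) and the closest different-label sample (hardest negative)."""
    dists = torch.cdist(embeddings, embeddings)          # (N, N) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # positive-pair mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    pos_d = dists.masked_fill(~same | eye, float('-inf'))
    hardest_pos = pos_d.argmax(dim=1)                    # farthest positive

    neg_d = dists.masked_fill(same, float('inf'))
    hardest_neg = neg_d.argmin(dim=1)                    # closest negative
    return hardest_pos, hardest_neg
```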

3.3 Details of the training process

Initially, regression experiments were conducted with various backbone architectures. For regression tasks, the mean squared error (MSE) loss was used during training, while the mean absolute error (MAE) was reported for validation, in line with the metrics used in the existing literature [9, 21].

For the classification task, the continuous age values from the labels of Dataset A were converted into six classes for the neural networks to classify (an illustrative binning sketch follows this paragraph). Different loss functions were used depending on the specific task and model type: Cross Entropy (also known as SoftMax loss), ArcFace loss, and Triplet Margin Loss [49]. All triplet networks, including the proposed TriplePENViT, employ Triplet Margin Loss. All ArcFace experiments were conducted with an additive margin of 34.3 degrees, and all Triplet Margin Loss experiments used a margin of 1.0 (unless mentioned otherwise).
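An illustrative discretization, with bin edges inferred from the age groups named in Fig. 5 (the authoritative class definitions are in Table 1, so the edges below are assumptions):

```python
import numpy as np

# Hypothetical bin edges inferred from the age groups visible in Fig. 5;
# Table 1 holds the study's authoritative definitions.
BIN_EDGES = [6, 13, 20, 30, 60]  # -> classes 0-5, 6-12, 13-19, 20-29, 30-59, 60+

def age_to_class(ages):
    """Map continuous chronological ages to class indices 0..5."""
    return np.digitize(ages, BIN_EDGES)

# age_to_class([4, 25, 70]) -> array([0, 3, 5])
```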

The classification tasks were further challenged by semi-supervised pseudo-labelling experiments (Fig. 4) using Dataset B, which contains unlabeled radiographs, i.e., no chronological age information. A PENViT model was trained on Dataset A up to 68.99% validation accuracy and was later used in the workflow of Fig. 4 to complete the experiments of Table 6; a condensed sketch of the workflow follows the figure.

Fig. 4
figure 4

Semi-supervised pipeline workflow
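Below is a condensed sketch of the two-step pseudo-labelling workflow of Fig. 4, assuming a hypothetical confidence threshold for selecting pseudo-labels (the study does not report its exact selection criterion):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, threshold=0.9):
    """Step 1: label Dataset B with the PENViT model trained on Dataset A,
    keeping only confident predictions. The threshold is an assumption."""
    model.eval()
    images, labels = [], []
    for x in unlabeled_loader:
        probs = F.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf >= threshold          # retain confident samples only
        images.append(x[keep])
        labels.append(pred[keep])
    return torch.cat(images), torch.cat(labels)

# Step 2: merge the pseudo-labelled samples with Dataset A and
# retrain (fine-tune) the model on the combined set.
```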

In all experiments, an initial learning rate of \({10}^{-2}\) was used together with the “Reduce Learning Rate on Plateau” scheduler, with a patience of 5 and a factor of 0.9, to adjust the learning rate based on the validation loss. All experiments were run for a minimum of 400 epochs to evaluate the performance of the various backbones, and a weight decay of 0.9 was used. A sketch of this setup is given below.
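The corresponding PyTorch setup might look as follows; the optimizer choice and the helper functions are assumptions, while the learning rate, patience, factor, and weight decay follow the values stated above:

```python
import torch

# SGD is an assumed optimizer choice; lr and weight decay follow the text.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.9, patience=5)  # "gamma factor" of 0.9

for epoch in range(400):
    train_one_epoch(model, optimizer)  # hypothetical training helper
    val_loss = validate(model)         # hypothetical validation helper
    scheduler.step(val_loss)           # reduce LR when val loss plateaus
```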

For the classification task, the best-performing backbone from the above training process (PENViT, as denoted in Table 4 of the results section) was subjected to a three-day experiment of 3000 epochs on 1 × Nvidia Tesla M60 on the Microsoft Azure ML Compute platform. However, it is worth noting that the model converged in fewer than 600 epochs. All other experiments were conducted on 1 × Nvidia Tesla T4. More than 70 experiments were conducted during this study, with only the most significant presented in the results section.

In the case of the triplet network with a classifier (\({L}_{3}\) Loss), the authors experimented with initial warmup periods ranging from 20 to 100 epochs. During this warmup phase, the \(scal{e}_{1}\) value of \({L}_{3}\) was forcefully set to zero (0), allowing the triplet network to initially focus on learning to discriminate among the embeddings in the latent space, rather than attempting to train the classifier. Following the warmup period, the scale values of \({L}_{3}\) were set to be equal, with \(scal{e}_{1}\) and \(scal{e}_{2}\) both set to 1.0.

Throughout all the experiments, the batch size varied between 296 and 500, depending on the available GPU memory during runtime. Additionally, data augmentation techniques were applied during training to enhance the robustness and generalization of the models.

3.4 Evaluation methods

For the current study, the validation set remained the same for all tasks and experiments. Validation accuracy was therefore used as the sole performance metric for the classification tasks, whereas MAE was used as the performance metric for regression.

4 Results

This section presents the results of the regression task, the classification task, the comparison between cross-entropy and ArcFace performance, the hard triplet mining task, the PENViT backbone’s performance, the pseudo-labelling workflow’s performance, and FullGrad images for model interpretability. MAE outcomes of the regression task are highlighted in Table 3. The comparison between SoftMax loss and ArcFace loss is reported in Table 4. The outputs are described in Tables 3, 4, 5 and 6.

Table 3 Regression Task
Table 4 Classification task: pure cross entropy versus ArcFace (with and without gender classifier)
Table 5 Classification task: triplet networks and Siamese networks
Table 6 Classification task: training with Semi-supervised pseudo labelling techniques (all pseudo labelling done with PENViT trained up to 68.99% validation accuracy)

4.1 Regression: estimation of age using different model architecture

Table 3 presents the results of regression tasks using a multi-layer CNN backbone, ViT, an autoencoder, fully connected layers, and a ResNet-like backbone. Among these popular backbones, the pretrained ViT demonstrated superior performance in regression: compared with ViTs trained from scratch, it achieved a lower Mean Absolute Error (MAE) of 2.83 years.

4.2 Pure cross entropy versus ArcFace margin loss: rigorous experiments

Initially, cross-entropy loss was used during experimentation. Both ViT L32 and the novel PENViT model emerged as top performers, each achieving a validation accuracy of 68.11%.

To further enhance performance, the top five models were selected and the application of ArcFace loss was investigated. Notably, the PENViT model demonstrated superior performance, reaching a validation accuracy of 70.54%. Combining ArcFace loss with a ResNet backbone also increased validation accuracy in certain cases. However, it is important to highlight that in three of the five cases evaluated, applying ArcFace loss degraded overall performance.

These findings highlight the effectiveness of PENViT in conjunction with ArcFace loss, which consistently outperformed the other models; the synergy between the ResNet backbone and ArcFace loss was beneficial only in specific scenarios. Adding the gender classifier alongside the age classifier partially degraded validation accuracy from 68.21% to 67.44%, while nevertheless affirming that the PENViT architecture performed best on panoramic radiographs when combining age and gender classifiers.

4.3 Evaluating hard triplet mining task

Within the Siamese/triplet network family, the authors introduced the TriplePENViT model architecture, which outperformed other models by incorporating hard triplet mining and a classification block that uses only the anchor’s embedding. With its task-specific loss function, the TriplePENViT model achieved an accuracy of 67.44%.

Interestingly, increasing the margin value of the loss function \(L_3\) did not improve the validation accuracy of the TriplePENViT model. This suggests that the chosen margin was already adequate for the given task, and further adjustments did not yield significant performance gains.

4.4 PENViT backbone’s effectiveness against other backbones

In Tables 4 and 5, it is evident that the PENViT model and its variation, TriplePENViT, consistently outperformed other models, whether triplet networks or other types of neural networks. The PENViT model achieved a validation accuracy of 70.54%, while the best of the other models achieved 68.21%, an increase of 2.33 percentage points.

Similarly, TriplePENViT, using the triplet network and its classification block, surpassed the other models with a validation accuracy of 67.44%, compared with a maximum of 65.11% achieved by others, again an increase of 2.33 percentage points.

Therefore, in both cases, whether using the PENViT backbone or its triplet network variation, TriplePENViT, there was a consistent improvement of 2.33 percentage points in validation accuracy compared with other models (Table 6).

4.5 Evaluating pseudo-labelling technique

Applying the pseudo-labelling technique to medical image data, which typically offers a limited number of samples, did not increase validation accuracy beyond the performance of the best model. Despite the pseudo-labelling attempts, validation accuracy remained consistent and did not surpass that of the best model.

4.6 PENViT’s model interpretability of classification

The model’s Explainable AI results (FullGrad) are provided in Fig. 5:

Fig. 5
figure 5

A: FullGrad images for age group 0–5 (Deciduous Dentition). B: FullGrad images for age group 6–12 (Mixed Dentition). C: FullGrad images for age group 20–29 (Young Adults). D: FullGrad images for age group 30–59 (Middle Aged Individuals)

For the deciduous dentition (Fig. 5A) and mixed dentition (Fig. 5B) groups, the model took into account the developing tooth buds of the permanent teeth and their relative proximity to the overlying deciduous dentition, with some prioritization of the mandibular shape. For ages 20 to 29 (Fig. 5C), the model accounted for the permanent dentition, the alveolar bone density surrounding the formed root apices, and the root formation and eruption status of the third molars. For ages above 30 years (Fig. 5D), the model additionally noted occlusal deformity resulting from missing permanent dentition and highlighted the condylar regions, indicating that the temporomandibular joints are an important predictor for automated age estimation from 2D panoramic radiographs.

5 Discussion

To attain the stated objectives, regression tasks were first adopted to identify the deep learning models that yield the lowest Mean Absolute Error (MAE) when trained on the designated training set and validated on the fixed validation set. Notably, Vision Transformers, which use self-attention, outperformed their counterparts, signifying the suitability of Vision Transformer-based backbones, particularly for orthopantomograms (OPGs), in subsequent computer vision tasks such as classification. Consequently, the next stage involved a transition from regression to classification, wherein the continuous age labels were discretized into six classes. This served as a further test of the Vision Transformer hypothesis, reaffirming the dominance of Vision Transformers over Convolutional Neural Network (CNN)-based backbones in the present study.

With the said pattern in consideration, the authors proposed the PENViT architecture as a hybrid solution, synergizing both CNN-based and self-attention-based backbones, which resulted in a notable performance boost. In further endeavors to enhance performance, a comparative analysis of two loss functions for multi-class classification was conducted, namely, SoftMax/CrossEntropy and the ArcFace Margin Loss function. Impressively, ArcFace exhibited even greater performance improvements. Although ArcFace is conventionally employed in Face Recognition models, its geometric interpretation elucidates its role in creating a margin in the latent vector space for class differentiation, as depicted in Fig. 6.

Fig. 6
figure 6

Geometry of ArcFace margin loss

The backbone acquires the ability to differentiate samples within the latent vector space; however, this differentiation becomes more pronounced with the introduction of ArcFace. The margin imposed by ArcFace enforces greater separation among samples, thereby augmenting the discriminative prowess of the neural network and consequently elevating accuracy.

Future research could contemplate the integration of alternative data modalities alongside OPGs for weakly supervised age estimation. While the present study focused solely on OPGs for this purpose, Schmeling et al.’s [50] suggestion of including hand-wrist radiographs in sequential aging analysis, in conjunction with OPGs [51], could yield further advancements in age estimation performance.

The pursuit of age estimation holds significance in dentistry, forensic science, legal proceedings, court hearings, and related domains. A clinical application of unsupervised age estimation may lie in aiding computer-vision-based diagnostics of restorative treatment needs based on the predicted age of exfoliation [52, 53]. Within these contexts, using OPGs for the task is a cost-effective and straightforward approach compared with other methodologies [54]. While alternative techniques such as Cone Beam Computed Tomography (CBCT) and Computed Tomography (CT) exist in specialized dental practices and have been documented in the literature, it is worth noting that Yuan et al. [55] trained supervised CNNs on 1498 pelvic radiographs, while Othmani et al. [56] applied supervised CNNs to 45,000 facial photographs to attain an MAE of 2.35. In comparison, the current study attained an MAE of 2.83 using unsupervised learning on 706 radiographs.

5.1 Limitation of current study

The study lacked an evaluation of diagnostic accuracy, which could provide valuable clinical insight into the model’s performance on unseen data and could affect its clinical utility, given the reported performance of 70.54%. Secondly, the absence of a comparison between the model’s performance and that of a human diagnostician remains a research gap, although the potential for such a comparison through tools like FullGrad, GradCAM, or other explainable-AI techniques is acknowledged for future research. Despite recognizing the challenge of immediate execution, this study’s limitation lies in not implementing a comparison with human diagnostic practice, as seen in similar studies [57]. Additionally, the proposed network architectures, PENViT and TriplePENViT, may lean heavily towards computer vision and deep learning techniques rather than a clinician’s perspective, indicating a potential mismatch in orientation. The study also lacks an investigation into class imbalances and which classes are over-represented, despite its clear focus on evaluating different backbones, proposing a concatenation of the best-performing ones, and assessing the impact of margin loss and pseudo-labelling. While prioritizing these aspects over reporting diagnostic accuracy, the study acknowledges the importance of this limitation. Lastly, the dataset’s size and class imbalance (across age and gender) may yield unintended outcomes despite the artificial dataset enlargement and balancing efforts. The study recognizes that more optimal results might have been achievable with larger, better-balanced datasets, which are crucial for clinical applications.

6 Conclusion

From the current study the following answers can be inferred for the posed research questions:

  1. ViT demonstrated superior performance over CNN architectures in regression tasks.

  2. ArcFace showed mixed results, with instances of improved performance compared with pure cross-entropy loss, but also cases where it degraded performance. In contrast, Triplet Margin Loss in a triplet network consistently outperformed the other experiments, except for ViT L32 with Cross Entropy Loss, which performed slightly better.

  3. Hard triplet mining alone resulted in poor performance, but combining it with a classifier yielded results comparable to the best-performing approach.

  4. The proposed PENViT backbone consistently outperformed other backbones, achieving higher validation accuracy.

  5. Training the model with pseudo labelling did not yield satisfactory results compared with using annotated data only.

  6. The FullGrad approach to model explainability highlighted that the most influential areas for predicting age brackets were the deciduous teeth, areas of anodontia, the extent of the sinus cavities, the periodontal regions, the third molar regions, the medullary regions of the mandible, and the temporomandibular joint complex, all of which are consistent with medical explainability.