Introduction

Tooth extraction is one of the most commonly performed procedures in general dentistry and oral and maxillofacial surgery. The decision is based on the patient's records, which include the medical history, clinical evaluation, and radiographs. Given its irreversible impact on quality of life, the decision to extract should be made with great care [1,2,3]. Certain radiographic signs are pivotal in determining the necessity of tooth extraction, including compromised structural integrity of the tooth, significant alveolar bone loss, and evident root fractures. In addition, extensive periapical radiolucency may also suggest extraction. Advanced cases of internal or external resorption can likewise be identified on these radiographs, providing a clear indication for removal of the affected teeth [4].

Although the indications are clearly stated in extraction guidelines [5, 6], the decision-making process is not always easy for the practitioner in everyday clinical work [2, 4]. The decision may be confounded by many factors, such as the dentist's or specialist's own experience, the reliability of the clinical evidence, or even pressure from patients [5]. The interplay of these potentially disruptive factors in diagnostic decision-making can lead to misdiagnosis and inappropriate therapy, especially in borderline cases. For example, incorrect tooth extraction is the third most common cause of tooth loss in periodontally damaged teeth [7].

However, retaining teeth that are not worth preserving is not an option either, as such teeth can cause severe pain [1] and can even be the starting point of life-threatening fascial space abscesses in the head and neck region or of fatal endocarditis, ultimately affecting the entire organism [8, 9]. At the same time, every tooth extraction carries a risk of serious complications, such as retained root fragments, dry sockets, or damage to neighboring teeth. The indication is therefore always a balancing of competing considerations. In general, tooth extraction serves as a last resort when every other treatment option has failed or is no longer indicated [4].

Panoramic radiographs (PANs), widely used because of their easy acquisition and low radiation dose, are crucial in evaluating a patient's dental condition, providing insights into the entire dentition and surrounding structures [10]. However, accurate and comprehensive interpretation of PANs requires extensive training and considerable clinical experience. This expertise may not yet be fully developed in young practitioners, potentially leading to variability in diagnostic decisions [11]. Furthermore, even seasoned practitioners may be susceptible to cognitive and visual pitfalls in challenging cases [12].

Deep learning (DL), a subfield of artificial intelligence (AI), has revolutionized medical imaging by extending the capabilities of human practitioners. DL models are trained on vast datasets, allowing them to recognize patterns and anomalies with a precision that can rival or exceed that of human experts [13]. In the context of PANs, DL models enable the detection and segmentation of anatomical structures within seconds, with performance improving continually [14,15,16,17,18]. Building on these segmentation and recognition results, DL models can classify and number teeth systematically, as dentists do [19, 20]. Moreover, DL models can identify subtle or complex pathologies that may be overlooked by the human eye, such as caries, cysts, periodontitis, and periapical lesions, and annotate them automatically with high accuracy [21,22,23,24,25]. These advancements demonstrate the potential of DL as a powerful tool for enhancing diagnostic accuracy and efficiency.

Despite these advancements, most research has focused on lesion diagnosis [26,27,28,29,30], with limited exploration of subsequent clinical decisions such as tooth extraction. Furthermore, model predictions are often presented as bare probabilities without any explanation or reasoning, which is crucial for clinical acceptance and understanding. Explainable DL has the potential to accelerate the decision-making process, resulting in timely and more effective interventions and, ultimately, improved patient outcomes [31].

The main objective of this study is to develop and internally validate a model that predicts the need for tooth extraction from PANs and to compare its performance with that of dentists/specialists. Furthermore, we examine the effect of contextual information around each tooth on model performance and visualize the model's explainability.

Material and methods

Study design and patients

The study used retrospective PANs from 2011 to 2021 from patients who underwent tooth extraction at the Department of Oral and Maxillofacial Surgery of the University Hospital RWTH Aachen. Edentulous patients and patients without a panoramic radiograph taken within six months post-treatment were excluded. Additionally, patients whose preoperative panoramic radiographs showed significant artifacts affecting the teeth were removed from the study cohort.

The study was approved by the Ethics Committee of the University Hospital RWTH Aachen (approval number EK 068/21, chairs: Prof. Dr. G. Schmalzing and PD Dr. R. Hausmann, approval date 25.02.2021) and followed the MI-CLAIM reporting guideline for the development of AI models [32].

Dataset preparation

For the study, all PANs were exported in DICOM format from the hospital's picture archiving and communication system. If a patient had received more than one PAN within six months post-treatment, the most recent PAN was taken as the postoperative image. After the statistical summary of the cohort, all PANs were stratified by patient and converted to PNG format for anonymization.
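Conceptually, the conversion step reduces each DICOM file to its pixel data, discarding the identifying header metadata. The following is a minimal sketch of such a conversion, assuming pydicom and Pillow; the folder names and the min-max intensity rescaling are illustrative, not the study's actual export routine.

```python
# Sketch of a DICOM-to-PNG export; paths and rescaling are illustrative.
from pathlib import Path

import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dicom_path: Path, png_path: Path) -> None:
    """Convert one PAN from DICOM to an 8-bit grayscale PNG.

    Saving only the pixel data drops the DICOM header and with it
    the patient identifiers stored there.
    """
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale the (often 12- or 16-bit) intensities to 0-255.
    pixels -= pixels.min()
    pixels /= max(pixels.max(), 1.0)
    Image.fromarray((pixels * 255).astype(np.uint8)).save(png_path)

for dicom_file in Path("pans_dicom").glob("*.dcm"):  # hypothetical folder
    dicom_to_png(dicom_file, Path("pans_png") / f"{dicom_file.stem}.png")
```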

Annotation and labeling of teeth in the preoperative PANs were performed by four investigators (I.M., J.B., K.G. and B.P.) using LabelMe [33]. For this purpose, all teeth were marked with a bounding box on the preoperative image and assigned to a preserved or extracted class according to their presence in the postoperative image (Fig. 1). Implants and residual roots were marked in the same way as teeth. For quality control, the annotated images and labels were then reviewed in a second round by two investigators (I.M. and B.P.).
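For readers unfamiliar with LabelMe, its annotations are stored as JSON files in which rectangle shapes carry two corner points and a class label. A hedged sketch of reading such a file into (label, box) pairs is shown below; the class labels are those used in this study, while the parsing assumes the standard LabelMe layout and a hypothetical file name.

```python
# Sketch of parsing a LabelMe JSON file into (label, bounding box) pairs.
# Assumes the standard LabelMe format: rectangle shapes with two corner
# points; "preserved"/"extracted" are the class labels used in this study.
import json

def load_boxes(json_path: str) -> list:
    with open(json_path) as f:
        annotation = json.load(f)
    boxes = []
    for shape in annotation["shapes"]:
        (xa, ya), (xb, yb) = shape["points"]
        # Normalize the two corners to (left, top, right, bottom).
        box = (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))
        boxes.append((shape["label"], box))
    return boxes

# e.g. [("preserved", (812.0, 540.0, 901.0, 688.0)), ...]
boxes = load_boxes("patient_0001_pre.json")  # hypothetical file name
```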

Fig. 1

Pipeline to prepare the dataset. Panoramic radiographs from the same patient were compared, and teeth were annotated on the preoperative image with bounding boxes labeled as preserved (green) or extracted (yellow). Different margin factors were used to resize the bounding boxes (red) in width and height. Tooth images were then cropped from the original image with margins (-0.5% to 10%)

The bounding boxes with different margin settings were then used to crop single tooth images out of the preoperative PANs, with their class (preserved or extracted) exported simultaneously. Since distances (in mm) in PANs are not uniform and the teeth themselves vary in size, we defined the margins as a percentage of the PAN image height and width. Images were then exported with margins ranging from -0.5% to 10%, with 0% being the bounding box itself, resulting in 8 datasets. Figure 1 describes the dataset preparation pipeline.
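The margin logic itself is simple: the box is grown (or shrunk, for the negative setting) by a fixed fraction of the full image size before cropping. A minimal sketch, with hypothetical file names and coordinates:

```python
# Illustrative sketch of the margin-based cropping described above; the
# margin is a fraction of the full PAN width/height, so 0.02 corresponds
# to the 2% setting. File names and coordinates are assumptions.
from PIL import Image

def crop_tooth(pan: Image.Image, box: tuple, margin: float) -> Image.Image:
    """Crop one tooth from a PAN, expanding (or shrinking, for negative
    margins such as -0.5%) the bounding box by margin * image size."""
    x1, y1, x2, y2 = box
    dx = margin * pan.width
    dy = margin * pan.height
    # Clamp to the image borders after applying the margin.
    left = max(0, x1 - dx)
    top = max(0, y1 - dy)
    right = min(pan.width, x2 + dx)
    bottom = min(pan.height, y2 + dy)
    return pan.crop((left, top, right, bottom))

pan = Image.open("patient_0001_pre.png")             # hypothetical file
tooth = crop_tooth(pan, (812, 540, 901, 688), 0.02)  # 2% margin
```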

Model development and validation

The dataset was first stratified by patient and then randomly divided into a training set, validation set, and test set in an intended 4:1:1 ratio. All single tooth images cropped from the PANs were assigned accordingly. During training, we applied a random crop to each image, resized it to 224x224 pixels, and performed horizontal flip augmentation to enhance model generalization. Validation and test images were resized to 256x256 pixels, from which a 224x224 center crop was extracted for classification.
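Using standard torchvision transforms, the two pipelines described above could look roughly as follows; normalization is omitted, and any parameters beyond the stated sizes are assumptions.

```python
# Sketch of the training and evaluation preprocessing described above.
from torchvision import transforms

# Training: random crop resized to 224x224 plus horizontal flips.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Validation/test: deterministic resize to 256, then a 224x224 center crop.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```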

Training was conducted on a high-performance cluster at RWTH Aachen University. We adopted a ResNet-50 model pre-trained on ImageNet and used binary cross-entropy loss for the binary classification task. Training spanned 50 epochs with the SGD optimizer at a learning rate of 0.01 and a momentum of 0.9. A learning rate scheduler reduced the learning rate by a factor of 10 every 7 epochs, aiding fine-grained model tuning as training progressed. Model performance was evaluated with accuracy and ROC-AUC, with periodic checkpoints saving the best-performing model based on the highest validation ROC-AUC. Predictions were then made on the test set with these best models, evaluated, and saved. The corresponding code can be found on GitHub (https://github.com/OMFSdigital/PAN-AI-X).
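In PyTorch terms, the stated configuration corresponds roughly to the following sketch; the single-logit classification head is an assumption on our part, and the published repository linked above remains the authoritative implementation.

```python
# Sketch of the stated training configuration: ResNet-50 pre-trained on
# ImageNet, binary cross-entropy loss, SGD (lr 0.01, momentum 0.9), and a
# scheduler that divides the learning rate by 10 every 7 epochs.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)  # one logit: extract vs. preserve

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the logit
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# gamma=0.1 reduces the learning rate by a factor of 10 every 7 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
```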

Performance of dentists

In addition, the test images were evaluated by five dentists/specialists (A.P., J.B., I.M., K.X., B.P.) with different levels of experience (from first-year dentist to specialist in oral and maxillofacial surgery) to establish human performance. For this purpose, the 4,298 test images (2% margin) were randomly distributed among the investigators. Each tooth image was given a score between 0 (preserved) and 10 (extracted), reflecting the likelihood with which the investigator would recommend removal of the tooth. The 2% margin was chosen so that the dentists' performance could be compared with that of the best-performing DL model. To avoid a learning effect between the annotation of the PANs and the scoring of the individual tooth images, there was a six-month delay between initial annotation and scoring.

Model explainability

To explain the basis of the AI models' predictions, CAMERAS [34] was used. It applies class activation mapping to visualize the regions of the input image that are most important for the model's decision (Figs. 4 and 5). In our binary classification setting, where the outcomes are extraction or preservation, CAMERAS highlights the features driving the predicted class: if the model predicts extraction, the features leading to that decision are highlighted, whereas for a predicted preservation the highlighted features, or their absence, indicate why preservation was chosen. The intensity and frequency of these highlights aid the interpretation of model outputs, with more frequent or intense highlights corresponding to predictions of higher probability.
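For orientation, a simplified single-scale class activation sketch is shown below. This is not the CAMERAS method itself, which fuses activations and gradients across multiple input resolutions (see [34] and its reference implementation); it only illustrates the underlying idea of weighting feature maps by their gradients.

```python
# Minimal Grad-CAM-style sketch (single scale, for illustration only;
# CAMERAS itself aggregates maps over multiple input resolutions [34]).
import torch
import torch.nn.functional as F

def activation_map(model, layer, image):
    """image: (1, 3, 224, 224) tensor; returns a (1, 1, 224, 224) heatmap."""
    feats, grads = [], []
    fh = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bh = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logit = model(image)   # single extraction logit
    model.zero_grad()
    logit.sum().backward()
    fh.remove(); bh.remove()
    # Weight each feature map by its average gradient (channel importance).
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return cam / cam.max().clamp(min=1e-8)  # normalize to [0, 1]

# Usage on the last convolutional stage of the ResNet-50 sketched above:
# heatmap = activation_map(model.eval(), model.layer4, tooth_tensor)
```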

Statistical analysis

The statistical analysis was performed in Python (version 3.11.0) using the scikit-learn package (version 1.4.0). The performance of the AI classifiers and the dentists was assessed using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). We then calculated the maximum Youden's index for each ROC curve to obtain the optimal threshold for the corresponding model. Accuracy, specificity, precision (syn. positive predictive value), and sensitivity (syn. recall) were calculated at these thresholds, and the F1 score was derived from precision and sensitivity. We used a pair of thresholds, 0.3 and 0.7, to plot confusion matrices with clinically relevant decisions, namely extraction, monitoring, and preservation.
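The threshold selection amounts to maximizing Youden's index J = sensitivity + specificity − 1 over the ROC curve. A minimal scikit-learn sketch, with illustrative labels and scores (0 = preserved, 1 = extracted):

```python
# Sketch of the threshold selection at maximum Youden's index.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

# Illustrative labels (0 = preserved, 1 = extracted) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.2, 0.9, 0.05, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # maximum Youden's J

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # average precision as PR-AUC
y_pred = (y_score >= best_threshold).astype(int)   # binarize at the optimum
```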

Results

Patients

A total of 1,184 patients met the inclusion criteria. The average patient age was 50.0 years (range 11-99 years) with a standard deviation of 20.3 years, and the male-to-female ratio was 61:39 (722 males, 462 females). From the 1,184 preoperative PANs (one per patient), a total of 26,956 individual tooth images were cropped and exported based on the corresponding bounding boxes. Of these, 21,797 were classified as preserved and 5,159 as extracted, corresponding to a tooth extraction prevalence of 19.1% versus 80.9% preserved teeth. The demographic and clinical characteristics of the patients are described in Table 1.

Table 1 Demographic and dental characteristics of the patients and distribution across training, validation, and testing datasets

Performance of AI models

Eight ResNet-50 models were trained, one for each of the 8 datasets with margin settings from -0.5% to 10%. In each dataset, all 26,956 single tooth images cropped from PANs were stratified by patient and then split into a training set (17,874), validation set (4,784), and test set (4,298). Model performance, based on the thresholds at the maximum Youden's index, is summarized in Table 2 and Fig. 2. The model with the 2% margin setting yielded the best results in both ROC-AUC (0.901) and PR-AUC (0.749) and also performed best on all other metrics except sensitivity. Shrinking the bounding boxes (margin -0.5%) produced worse ROC-AUC and PR-AUC results than the baseline (margin 0%), and both metrics generally increased as the margin grew from -0.5% to 2%. The model with the 5% margin setting achieved the highest sensitivity (0.835); however, increasing the margin further to 10% reduced both ROC-AUC and PR-AUC, probably due to the limited input size of the ResNet (224x224 pixels). In the confusion matrices displayed in Fig. 3, with thresholds of 0.3 and 0.7 defining the monitoring category, the 2% margin model had the fewest false positives (53), while the 3% margin model had the highest accuracy (3455/4298).

Table 2 Performance of AI models with different margin settings, together with the dentists' performance. Thresholds differ between models, as each was determined at the maximum Youden's index
Fig. 2

(a) ROC curves and (b) precision-recall (PPV-sensitivity) curves of models with different margin settings. The 2% margin model performed best in both ROC-AUC (0.901) and PR-AUC (0.749); the dentists' average performance was a ROC-AUC of 0.797 and a PR-AUC of 0.589. The relationship between ROC-AUC and margin is displayed in (c), and between PR-AUC and margin in (d). Both metrics increase steeply from the -0.5% to the 2% margin and drop slightly from the 5% to the 10% margin

Fig. 3

Confusion matrices showing prediction results. The results from the AI models (a)-(f) with different margins and from the dentists (g) were split into three decisions, namely extraction, monitoring, and preservation. Teeth with prediction probabilities from 0.3 to 0.7 were recommended for monitoring, those below 0.3 for preservation, and those above 0.7 for extraction. True labels are shown on the y-axis

Performance of dentists

In contrast, the human assessment (average of the five dentists/specialists) of the 2% margin tooth images showed lower performance than the AI models, with a ROC-AUC of 0.797 and a PR-AUC of 0.589. This is also reflected in the confusion matrices, where the dentists had the most false positives (131) and the lowest accuracy (3085/4298).

Explainability

Figures 4 and 5 show the activation maps of the extraction and preservation predictions generated by CAMERAS with the 2% margin setting. In extraction cases, the model focused on areas where roots are exposed in low-density regions or where crowns are buried in bone. In preservation cases, by contrast, the alveolar ridge and periapical regions were most relevant.

Fig. 4

Activation gradient heatmap generated by CAMERAS for extracted teeth with a margin of 2%. The probability (P: 0 to 1, where 0 indicates preservation and 1 indicates extraction) of the prediction is shown in the first row. The left image in each column is the tooth image used for the prediction, the right image is the class activation mapping with CAMERAS. Blue indicates no activation and red indicates strong activation. Green and yellow are in between

Fig. 5

Activation gradient heatmap generated by CAMERAS for preserved teeth with a margin of 2%. The probability (P: 0 to 1, where 0 indicates preservation and 1 indicates extraction) of the prediction is shown in the first row. The left image in each column is the tooth image used for the prediction, the right image is the class activation mapping with CAMERAS. Blue indicates no activation and red indicates strong activation. Green and yellow are in between

Discussion

In this study, we present, to our knowledge, the first clinical prediction model that uses DL to make recommendations on tooth extraction. The main results of the study are: 1) the best model achieved a ROC-AUC of 0.901 with a PR-AUC of 0.749; 2) it thereby outperformed dentists/specialists, who on average achieved a ROC-AUC of 0.797 with a PR-AUC of 0.589; 3) additional contextual information through wider margins around the tooth led to better predictions; and 4) the visual explanations of the predictions for tooth extraction or preservation were comprehensible.

Decision aids are a useful tool in healthcare for reducing practitioners' workload, as algorithmically computed suggestions can contribute to the final decision or diagnosis and significantly speed up the process [35]. Similarly, decision aids can offer an objective perspective, especially in borderline cases where clinicians would otherwise rely on subjective judgment alone [35, 36]. In this regard, work on identifying pathologies in medical imaging such as X-ray scans has a long history: one of the first detection applications, from 1995, detected nodules in X-rays of the lungs [37]. Another object detection algorithm was developed to detect and classify several entities in chest X-rays, such as cardiomegaly, calcified granulomas, catheters, surgical instruments, and thoracic vertebrae [38]. The emergence of convolutional neural networks and DL more than a decade ago opened up completely new possibilities [39].

One recent application is described by Yoo et al., who proposed a DL model (VGG16 pre-trained on ImageNet) to predict the difficulty of extracting a mandibular third molar from PANs [40]. The model was trained to predict extraction difficulty in terms of depth, ramal relationship, and angulation, achieving accuracies of 78.9%, 82.0%, and 90.2% for these three parameters, respectively. However, that model predicts the difficulty of an extraction rather than its necessity.

In our study, we used a residual neural network (ResNet-50) pre-trained on ImageNet to develop our clinical prediction model. Compared with other convolutional neural networks, a ResNet is characterized by so-called residual skip connections, which add the input of a small block of layers to its output. These skip connections improve gradient flow during training and significantly improve the performance of very deep networks [41], as sketched below. An outstanding strength of our model was its ability to classify teeth not worthy of preservation across multiple indications, such as extractions for orthodontic space, misplaced wisdom teeth, caries-destroyed teeth, periodontally compromised teeth, or teeth from mixed dentition. Equally noteworthy was the reliable classification even in radiographs with more difficult conditions, such as anatomical superimposition effects.
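To make the skip-connection idea concrete, here is a minimal two-layer residual block; ResNet-50 itself uses three-layer bottleneck blocks, so this is a simplified illustration rather than the actual architecture.

```python
# Minimal residual block: the block's input is added to the output of its
# convolutional layers, so gradients can bypass the block during training.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection: input + output

block = ResidualBlock(64)  # e.g. for 64-channel feature maps
```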

Evidence-based medicine encourages decisions based on patient-specific clinical evidence; however, DL models often provide bare predictions without any explanation [42]. This results in low acceptance of such predictions among practitioners owing to the lack of visible evidence [31]. To address this problem, class activation maps offer a way to visualize and highlight the critical areas of the image on which the predictions are based [34, 43]. In the caries classification task of Vinayahalingam et al., for example, the areas that led to the DL model's classification were highlighted [44]. Such visual cues can then be correlated with the practitioners' established dental knowledge, which in turn explains the classification or recommendation.

We used CAMERAS, which, in contrast to methods such as Grad-CAM or NormGrad, provides high-resolution maps for ResNets and thus better insights into the explainability of DL methods [34]. This explainability can be illustrated with the examples of extracted teeth (Fig. 4) and preserved teeth (Fig. 5), including their prediction probabilities. For healthy teeth, for example, the activation lies over the bone, whereas for root remnants it lies directly on the root itself. In addition to the recommendation, this activation map could also be offered directly to the dentist.

Interestingly, however, it can also be seen that, owing to the additional context provided by the extended margin (2%) in Figs. 4 and 5, neighboring root remnants are also drawn into the classification and may lead to misclassification. This could be remedied in the future by more modern architectures that consider the entire PAN instead of individual image sections containing one tooth and the adjacent bone. This would enable a holistic approach in which the DL model first detects the teeth and then classifies them.

Beyond these technical aspects, the question arises how such a model could be translated into practice. An important challenge is that DL models are regulated as medical software, for example by the FDA or under the European Medical Device Regulation (MDR), meaning that models developed in research cannot simply be applied in clinical encounters [45]. An important next step would be external validation of the developed model [46]. In our department, the prevalence of tooth extraction was 19.1% (Table 1). This figure is influenced by the population's socioeconomic status, by the treating specialties (conservative dentistry, prosthodontics, orthodontics, oral and maxillofacial surgery), and by the pre-selection of cases, an impact that cannot be dismissed and that could introduce bias if the model were applied elsewhere. On the other hand, it can be argued that the reasons for tooth extraction are universal worldwide [3, 47]: periapical radiolucency or deep caries should not be treated much differently around the world.

Clinical prediction models such as ours usually divide cases into two treatment recommendations based on a single threshold (preserve/extract). For classifiers, this threshold is often set to 0.5 by default: if the predicted probability is above it, the tooth is extracted; if below, it is preserved. For an actual application scenario, however, the design of the decision rule is particularly crucial for clinical usefulness [48]. One option is to divide teeth into three groups based on two thresholds (0.3 and 0.7) instead of a single threshold (0.5): a low threshold (with a high negative predictive value) separates teeth that are definitely worth preserving from suspect teeth, while a higher threshold (with a high positive predictive value) separates suspect teeth from teeth that are definitely not preservable (such as residual roots). Teeth with probabilities between the two thresholds could be monitored closely, while teeth below the low threshold are left untreated and teeth above the high threshold are extracted. An example of this approach is shown in Fig. 3 and sketched after this paragraph.
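A minimal sketch of such a two-threshold triage rule, using the illustrative cut-offs from Fig. 3 (probabilities near 1 indicating extraction):

```python
# Sketch of the two-threshold triage discussed above, with the illustrative
# cut-offs 0.3 and 0.7 from Fig. 3 (0 = preservation, 1 = extraction).
def triage(p_extraction: float, low: float = 0.3, high: float = 0.7) -> str:
    if p_extraction < low:
        return "preserve"  # high negative predictive value region
    if p_extraction > high:
        return "extract"   # high positive predictive value region
    return "monitor"       # borderline cases are followed up closely

print(triage(0.12), triage(0.55), triage(0.93))  # preserve monitor extract
```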

However, a major limitation of our results is that the model does not include clinical information (pain, tooth vitality, course of disease, diagnosis). On the one hand, it is remarkable that such high accuracy, surpassing that of humans, was achieved despite the absence of any clinical information. Nevertheless, in a real clinical setting this information would be available and should be used. In the future, multimodal AI models could process additional clinical information and further improve the prediction.

Another limitation is the period of up to six months between the pre- and postoperative PANs. Significant changes can become visible within this period, so in some cases the cause of the extraction may not yet have been visible on the preoperative image and only emerged shortly before the extraction itself (for example, the involvement of teeth in a mandibular fracture).

Conclusion

In summary, our study presented, to our knowledge, the first AI model to assist dentists/specialists in making tooth extraction decisions based on radiographs alone. The developed AI models outperformed dentists, with AI performance improving as contextual information increased. Future models may integrate clinical data. This study provides a solid foundation for further research in this area. In the future, AI could help monitor at-risk teeth and reduce errors in the indication for extraction. By providing class activation maps, clinicians could understand and verify the AI's decisions.