Introduction

Early detection of external root resorption (ERR) is crucial because, in severe cases, it can progress to irreversible damage and tooth loss [1, 2]. ERR is most often discovered incidentally during radiographic examination, yet its prevalence has been reported to be as high as 28.8% [3, 4]. The periapical radiograph is one of the examinations commonly used to identify ERR. Despite its high resolution, this two-dimensional image suffers from superimposition, which may underestimate the true extent of ERR [5, 6]. Cone-beam computed tomography (CBCT) has been reported to be superior to intra-oral periapical radiographs in detecting ERR because it permits three-dimensional evaluation [6,7,8,9,10]. However, assessment of ERR on CBCT can be influenced by observer performance and viewing conditions. A computer-aided tool may improve identification and reduce the time required to identify pathologies such as ERR [11].

In dentistry, machine learning (ML), a subfield of artificial intelligence (AI), has been used to automate the identification of oral and maxillofacial pathologies such as ERR [12]. Random Forest (RF) and Support Vector Machine (SVM) classifiers are among the high-performing ML algorithms commonly applied to image classification tasks in dentistry [13,14,15]. Multilayer convolutional neural networks (CNNs) underpin deep learning (DL) methods that can learn image features and perform classification tasks [16,17,18]. However, CNNs incur high computational costs and require a considerable number of parameters to be tuned [19]. To address this, several pre-trained models with pre-defined network architectures have been established. To overcome overfitting caused by limited training data, transfer learning with CNNs has been recommended for small-sample studies [20]. Transfer learning models based on the 16-layer Visual Geometry Group network (VGG16) and EfficientNetB4 have achieved excellent performance on several image classification tasks [19, 21, 22]. Ensembles of these pre-trained architectures with ML classifiers (RF and SVM) have yielded high classification performance [19].

The development of machine learning models that classify disease from medical diagnostic images faces significant challenges arising from the complex and large number of features present in these images [23]. To address this challenge, feature selection techniques (FST) were introduced, specifically designed to extract the most relevant and significant subset from the original set of features [24]. Feature selection has been used to classify carious cavities with a reported accuracy of 96% [25]. Another study on FST for breast cancer classification found that conventional FST improved classification accuracy by 51% [26]. A novel FST wrapper method, the Boruta algorithm, has recently been used to improve the performance of RF classifiers in classification models [27]. Likewise, the recursive feature elimination (RFE) algorithm is widely used as a feature selection method to improve the performance of SVM classifiers [28, 29].

The current study used an approach that combines transfer learning based on the VGG16 and EfficientNetB4 architectures with Support Vector Machine (SVM) and Random Forest (RF) classifiers to improve the accuracy of ERR classification. Four deep learning models, each a hybrid of an ML classifier (RF or SVM) with a pre-trained DL architecture (VGG16 or EfficientNetB4), were developed for the identification and classification of ERR. Feature selection was then applied to all models to optimize classification performance, yielding four additional optimized models. This study therefore aims to (1) evaluate the accuracy of deep learning models (DLMs) in ERR identification and (2) assess the effect of FST on DLM performance.

Materials and methods

Study protocol

This study assessed the effect of feature selection techniques (FST) on DLM performance in classifying ERR lesions. In the first stage, image preprocessing was performed using a Contrast-Limited Adaptive Histogram Equalization (CLAHE) filter. Image classification was then conducted using two pre-trained deep convolutional neural networks (CNNs), EfficientNetB4 [30] and VGG16. These two deep CNNs were ensembled with two machine learning classifiers, Random Forest (RF) and Support Vector Machine (SVM), to perform ERR classification. As a result, four DLMs were developed in the first stage: RF with VGG16 (RF + VGG), RF with EfficientNetB4 (RF + EFNET), SVM with VGG16 (SVM + VGG), and SVM with EfficientNetB4 (SVM + EFNET). In the second stage, feature selection algorithms (Boruta and RFE) were employed to generate four new optimized DLMs (FS + RF + VGG, FS + RF + EFNET, FS + SVM + VGG, and FS + SVM + EFNET) [30]. The block diagram of the proposed model is given in Fig. 1. The Institutional Review Board of the Medical Ethics Committee, Faculty of Dentistry, University of Malaya (DF RD2030/0139 (L)) reviewed and approved this study protocol.
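The hybrid design described above — a pre-trained CNN as a frozen feature extractor feeding a classical ML classifier — can be sketched as follows. This is an illustrative sketch only, not the authors' code: the CNN feature-extraction step is mocked with random arrays so it runs without TensorFlow, and the array shapes and class labels are placeholders. In practice the features would come from, e.g., a pooled VGG16 or EfficientNetB4 layer.

```python
# Hedged sketch of the CNN-feature + ML-classifier pipeline (stage 1).
# The "CNN features" here are synthetic stand-ins, not real VGG16/EfficientNetB4
# outputs; only the pipeline shape matches the study's description.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for deep features extracted from 200 CBCT slices (512 features each).
features = rng.normal(size=(200, 512))
labels = rng.integers(0, 4, size=200)  # four classes: ERR depths 0.5/1.0/2.0 mm and no ERR

# Hybrid model 1: RF head on the CNN features (cf. RF + VGG / RF + EFNET).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels)

# Hybrid model 2: SVM head on the same features (cf. SVM + VGG / SVM + EFNET).
svm = SVC(kernel="rbf").fit(features, labels)

print(rf.predict(features[:5]), svm.predict(features[:5]))
```

The design choice being illustrated is that only the small classifier head is trained, so the expensive CNN weights stay fixed — the usual motivation for transfer learning on small datasets.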

Fig. 1

Proposed model block diagram

Dataset

A total of 88 extracted premolars were collected from the Faculty of Dentistry, University of Malaya. The inclusion criteria were absence of root destruction, complete root formation, absence of caries or abrasion in the cervical region, and no endodontic treatment [31]. Tungsten burs of various sizes (0.5 mm, 1.0 mm, and 2.0 mm) were used to simulate ERR of different depths on each tooth. All teeth were scanned with a CBCT machine (CS 9000 CBCT, Carestream Dental, Atlanta, GA) at 65 kVp and 5 mA, with a 10.8 s exposure, a 5 × 3 cm field of view, and a 0.076 mm isotropic voxel size. In total, 2125 two-dimensional CBCT slices were obtained. All CBCT datasets were converted to Digital Imaging and Communications in Medicine (DICOM) format. The sample size was calculated based on a previous comparable study [26] by a priori power analysis in G*Power 3.1.9.7, assuming an independent t-test design with 80% power and a 5% significance level.

Ground truth labelling

ERR detection and labelling were performed by an oral and maxillofacial radiologist with five years of experience analyzing CBCT images, and these annotations served as the ground truth. Each annotation was further classified into one of four depth groups. All CBCT data were visualized on a Dell laptop (1920 × 1080 pixels, Dell Latitude E7450; Dell, Austin, TX). The ground-truth dataset was prepared by segmenting the CBCT images (DICOM format) using a third-party AI tool (Makesense.AI) [32]. Teeth were grouped as 0 (ERR depth = 0.5 mm), 1 (ERR depth = 1.0 mm), 2 (ERR depth = 2.0 mm), or 3 (no ERR). Figure 2 shows sample images from the dataset collected from the Faculty of Dentistry, University of Malaya.

Fig. 2

External root resorption sample images

AI Network architecture and training

Image preprocessing

In Phase 1, region-of-interest (ROI) extraction and image enhancement were performed (Fig. 3). A bounding box of 160 × 320 pixels was assigned to all 2D slices, with the tooth centered in the box, and each slice was converted to Portable Network Graphics format. All sagittal slices were used for training and testing. The number of ROIs obtained from a single tooth ranged from 17 to 80 slices, giving a total of 2125 ROIs extracted from the CBCT volumes (1700 for training and validation, 425 for testing). The CLAHE filter was then applied, followed by an image-intensity adjustment, to complete the preprocessing procedure.
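As a rough illustration of the contrast-enhancement step: CLAHE itself is typically applied with OpenCV's `cv2.createCLAHE(clipLimit=..., tileGridSize=...)`; the dependency-free sketch below instead shows plain global histogram equalization on an 8-bit slice, which is the same idea that CLAHE applies per tile with a clip limit to avoid over-amplifying noise. The input image here is synthetic, and the bounding-box dimensions are taken from the text.

```python
# Simplified stand-in for the CLAHE contrast-enhancement step: plain global
# histogram equalization in NumPy (CLAHE = the same mapping computed per tile,
# with the histogram clipped before building the CDF).
import numpy as np

def equalize(img: np.ndarray) -> np.ndarray:
    """Global histogram equalization of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each grey level so the cumulative distribution becomes ~uniform.
    lut = np.round((cdf - cdf_min) / (img.size - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# Synthetic low-contrast slice the size of the study's 160 x 320 bounding box.
slice_2d = np.clip(
    np.random.default_rng(1).normal(100, 10, (320, 160)), 0, 255
).astype(np.uint8)
enhanced = equalize(slice_2d)
print(enhanced.min(), enhanced.max())  # full 0-255 range after equalization
```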

Fig. 3

CBCT data image processing

Image classification

In Phase 2, four main DLMs (RF + VGG, RF + EFNET, SVM + VGG, and SVM + EFNET) were implemented to classify ERR lesions (Fig. 1). Subsequently, all four models were optimized using FST to produce four new enhanced DLMs (FS + RF + VGG, FS + RF + EFNET, FS + SVM + VGG, and FS + SVM + EFNET). A training-to-testing ratio of 70:30 was selected as optimal for image classification, as adopted in previous DLM studies [33, 34]. Two-dimensional CBCT images of ERR were fed into the transfer-learning CNN models and randomly distributed into training (70%), validation (10%), and test (20%) datasets. The ERR lesions observed in the images were then classified as 0, 1, 2, or 3 according to lesion depth. The VGG16 and EfficientNetB4 systems used 555,328 and 18,764,579 parameters, respectively (Tables 1 and 2) [30]. Multiclass classification was performed by all models using the TensorFlow and Keras Python deep learning libraries.
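The feature-selection stage can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: RFE (used for the SVM branch) is available in scikit-learn, while Boruta (used for the RF branch) normally comes from the third-party BorutaPy package, so a simple importance-threshold rule stands in for it here. The feature matrix and labels are synthetic placeholders.

```python
# Hedged sketch of the FST stage (stage 2): RFE for the SVM models and a
# Boruta-like importance filter for the RF models. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))    # stand-in for CNN deep features
y = rng.integers(0, 4, size=300)  # ERR classes 0-3

# SVM branch: recursive feature elimination with a linear SVM,
# iteratively dropping the weakest features until 16 remain.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=16).fit(X, y)
X_svm = X[:, rfe.support_]

# RF branch: keep features whose importance beats the mean importance
# (a crude stand-in for Boruta's shadow-feature comparison).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
keep = rf.feature_importances_ > rf.feature_importances_.mean()
X_rf = X[:, keep]

print(X_svm.shape, X_rf.shape)
```

Real Boruta differs from this stand-in in that it compares each feature's importance against randomized "shadow" copies over many iterations rather than against a single mean threshold.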

Table 1 VGG16 parameters
Table 2 EfficientnetB4 parameters

Performance evaluation

Model performance was evaluated based on accuracy. A confusion matrix summarized the prediction results of the classification task [35]. Five metrics were used to characterize each classification model's performance: classification accuracy, F1-score, precision, specificity, and error rate [36]. Consequently, 70 values (7 × 10) were measured, and the mean values for each group were calculated. Because the data were not normally distributed, the non-parametric Kruskal-Wallis test was used to evaluate differences in accuracy among the DLMs. The analysis was conducted using the Statistical Package for the Social Sciences (SPSS) 27.0 (IBM Corporation, Armonk, NY, USA). Following the Kruskal-Wallis test for overall group differences, Dunn's post-hoc test was employed to identify specific pairs of groups with significant differences. Additionally, an independent t-test was conducted to assess any significant difference between the results obtained with and without FST. The metrics were evaluated according to the following formulas, using the confusion matrix in Table 3.
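The statistical comparison just described can be reproduced outside SPSS with SciPy, as in the sketch below. The per-run accuracy samples and model names are synthetic placeholders chosen only to mirror the study's setup; Dunn's post-hoc test is not in SciPy (the third-party scikit-posthocs package provides it) and is omitted here.

```python
# Hedged sketch of the Kruskal-Wallis and independent t-test steps using SciPy
# instead of SPSS. Accuracy samples are made-up placeholders (10 runs/model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc = {name: rng.normal(mu, 0.02, size=10)
       for name, mu in [("RF+VGG", 0.78), ("RF+EFNET", 0.55), ("FS+RF+VGG", 0.82)]}

# Overall difference in accuracy across the model groups.
h, p = stats.kruskal(*acc.values())

# With-FST vs. without-FST comparison for one model pair.
t, p_t = stats.ttest_ind(acc["RF+VGG"], acc["FS+RF+VGG"])

print(round(p, 4), round(p_t, 4))
```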

Table 3 Confusion matrix for binary classification
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
(1)
$$Specificity=\frac{ TN}{FP + TN}$$
(2)
$$Weighted\;accuracy=\frac{1}{\sum\nolimits_{i=1}^{c}w_{i}}\sum\nolimits_{i=1}^{c}w_{i}\times\frac{TP_{i}+TN_{i}}{Total\;population}$$
(3)
$$F1 = \frac{2 TP}{2 TP + FP + FN}$$
(4)
$$Recall=\frac{ TP}{TP + FN}$$
(5)
$$Precision=\frac{TP}{TP + FP}$$
(6)
$$Error\;Rate=\frac{\left(FP+FN\right)}{Total\;Population}$$
(7)

Where, TP = True Positive, FN = False Negative, TN = True Negative and FP = False Positive.
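The formulas above can be computed directly from a multiclass confusion matrix by treating each class one-vs-rest, as in this sketch. The 4 × 4 matrix here is made up for illustration (it is not one of the study's matrices in Fig. 5); the class weights used for Eq. (3) are assumed to be class prevalences.

```python
# Per-class metrics (Eqs. 1-7) from an illustrative 4x4 confusion matrix,
# rows = true class, columns = predicted class (classes 0-3 as in the study).
import numpy as np

cm = np.array([[50,  3,  2,  0],
               [ 4, 45,  5,  1],
               [ 2,  6, 44,  3],
               [ 0,  1,  2, 57]])
total = cm.sum()

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP  # predicted as class i but actually another class
FN = cm.sum(axis=1) - TP  # actually class i but predicted as another class
TN = total - TP - FP - FN

accuracy    = (TP + TN) / total            # Eq. (1), per class
specificity = TN / (FP + TN)               # Eq. (2)
f1          = 2 * TP / (2 * TP + FP + FN)  # Eq. (4)
recall      = TP / (TP + FN)               # Eq. (5)
precision   = TP / (TP + FP)               # Eq. (6)
error_rate  = (FP + FN) / total            # Eq. (7)

# Eq. (3) with class prevalence as the (assumed) weights w_i.
w = cm.sum(axis=1) / total
weighted_accuracy = (w * accuracy).sum() / w.sum()

overall_accuracy = TP.sum() / total
print(round(overall_accuracy, 3))
```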

Results

The performance of the multiclass classification models is presented in Table 4. The highest performance was achieved by the FS + RF + VGG model, with an overall accuracy of 81.9%, a weighted accuracy of 83%, and an F1-score, precision, and specificity of 81.9%; its error rate was 18% and its AUC 96%. In contrast, the lowest-performing model was RF + EFNET, with an overall accuracy of 55.3%, a weighted accuracy of 61%, and an F1-score, precision, and specificity of 55.3%; its error rate was 45% and its AUC 84%. Following the implementation of FST, the highest accuracy improvement was achieved by the SVM + EFNET model (4.7%) and the lowest by the SVM + VGG model (1.7%). Of all eight DLMs, the highest AUC was recorded by FS + RF + VGG (96%) and the lowest by SVM + VGG (81%) (Fig. 4). The Kruskal-Wallis test indicated a significant difference in accuracy between the models (H(7) = 19.119; p = 0.008 < 0.05) (Table 5). Pairwise comparisons using Dunn's post-hoc test (Table 6) showed a significant difference only between RF + EFNET and FS + RF + VGG (p < 0.05); all other pairwise comparisons revealed no significant differences (p > 0.05). Furthermore, the independent t-test showed no significant difference in classification accuracy among the DLMs before and after incorporating FST (Table 7). The prediction accuracies of all eight DLMs are summarized in 4 × 4 confusion matrices in Fig. 5.

Table 4 Comparison of classification performance between the deep learning-based systems
Fig. 4

AUC of eight trained models

Table 5 Kruskal Wallis test of significant difference in accuracy between different DLMs
Table 6 Dunn’s post-hoc test between models
Table 7 Independent sample t-test of accuracy improvement with Boruta (FST algorithm)
Fig. 5

Confusion matrices showing the prediction accuracy of all DLMs

Discussion

Precise detection of ERR lesions is crucial to avoid mismanagement, which may result in irreversible root surface loss, discomfort, and loss of tooth vitality [1]. In this study, the performance of four DLMs in ERR identification was assessed using five parameters (classification accuracy, F1-score, precision, specificity, and AUC), and the effect of FST on DLM performance was then evaluated. The present study provides valuable insights into the potential of advanced machine learning techniques for improving ERR identification. To the best of our knowledge, it is the first to report multiclass classification of ERR based on lesion depth on CBCT images.

Deep learning-based algorithms play a significant role in automated computer-aided diagnosis systems for medical and dental radiographic image annotation, segmentation, and classification [19, 37,38,39]. Most deep learning algorithms require balanced [40] and large [41] datasets to optimize the enormous number of weighting parameters in deep CNNs. Hence, the current study adopted a transfer learning approach using pre-trained deep CNNs to extract features from ERR lesions. Recent studies have reported that classification models incorporating pre-trained VGG16 and EfficientNetB4 architectures display robust performance in medical image analysis [19, 42]. The highest DLM accuracy in the current study was comparable to previous studies employing VGG16 for facial feature and jaw tumor classification [43, 44]. In the present study, RF + EFNET demonstrated lower accuracy (0.55) than the other tested DLMs (RF + VGG, SVM + VGG, and SVM + EFNET, all above 0.72), and even lower than previous ERR studies using panoramic radiographs [45, 46]. This may be attributable to a lack of compatibility between the RF classifier and the EfficientNetB4 features used in this study. Overall, the DLMs demonstrated promising potential in assisting the identification and classification of ERR based on lesion depth.

Feature selection techniques can improve classification performance by identifying and selecting the most informative features within the dataset [29, 47]. The use of FST, especially Boruta and RFE, has reduced the risk of overfitting and improved the interpretability of medical image analysis [23, 48, 49]. The present study observed accuracy improvements of 2–4.7% when FST was applied during the post-processing phase. Higher improvements (10% and 5.8%) were reported by previous studies using Boruta [50] and RFE [29]. The modest improvement observed in the current study might be due to imbalanced classes in the dataset, with more data in the 0.5 and 1.0 mm classes [51]. All DLMs in this study nevertheless improved in classification accuracy. A study applying FST to the classification of neurodegenerative lesions reported accuracy improvements for only some models (CfsSubsetEval, WrapperSubsetEval, ChiSquaredAttributeEval, and ClassifierAttributeEval) [52]. It can therefore be assumed that accuracy improvement is influenced by the compatibility between the FST and the hybrid DLM, as reported by Bhalaji et al. and Albashish et al. [53, 54].

In the present study, the DLM systems demonstrated considerable performance in identifying ERR. The main limitation was the small CBCT dataset. To avoid overfitting due to the small sample size, a high-quality training dataset specifically emphasizing ERR depths was used [55]. Furthermore, data augmentation [56] was performed to enlarge the training dataset, and transfer learning (VGG16 and EfficientNetB4) was implemented to enhance DLM performance [57]. This study exclusively used DLMs to identify ERR in extracted premolar teeth, so the ability of these newly developed models should be tested on clinical data before application. Although the experimental nature of this study may limit the generalizability of the DLMs to real data [58], it allowed standardized ERR preparation techniques and CBCT scanning parameters [59]. Future research should focus on three areas: expanding the dataset, exploring additional FSTs, and conducting prospective clinical trials.

Conclusions

The present study explored the potential of eight newly developed DLMs for identifying ERR on CBCT images. The application of deep learning-based algorithms to CBCT images demonstrated promising results for future automated ERR identification. Integrating a compatible FST with deep learning-based models may further enhance DLM performance in identifying ERR lesions.