Background

Since the 1990s, periodontitis has been a global public health burden. According to the 2019 Global Burden of Diseases (GBD) study, severe periodontitis has a prevalence of 10.59%, ranks 6th among 369 assessed diseases and is responsible for 7.09 million disability-adjusted life years (DALYs) [1,2,3]. Periodontitis affects both local health and systemic conditions, so proper treatment of periodontitis can reduce systemic inflammation [4,5,6,7,8]. However, manual classification based on dental images requires considerable manpower and time. Furthermore, image quality and radiographic interpretation could compromise the accuracy of classification. All these issues could be alleviated by deep learning (DL) methods [9,10,11].

Both DL and machine learning (ML) are subfields of artificial intelligence (AI). ML aims to train algorithms on existing data so that they can make predictions for new information [12]. DL is a subset of ML that is based on neural network structures and mimics the way the human brain works [13]. Recently, DL, especially convolutional neural networks (CNNs), has been widely used in various fields of medical image analysis, such as segmentation, detection, classification of abnormality, and computer-aided diagnosis [14]. CNNs identify visual patterns directly from the raw pixels of an image, which is similar to the way humans observe objects, to learn the intrinsic features or patterns of the image [14]. They are multi-layered, feed-forward neural networks trained with backpropagation algorithms, and consist of convolutional, activation, and pooling layers. Currently, CNNs are still considered the most successful method to process medical images [15].
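To make the three layer types named above concrete, the following is a minimal sketch in plain Python of one convolution, one activation (ReLU) and one max-pooling step on a toy image. The kernel, the image and the function names are illustrative assumptions and are not taken from any of the reviewed models; real CNNs learn their kernels by backpropagation and use optimised libraries.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN convolution layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Activation layer: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Pooling layer: keep the maximum of each non-overlapping size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
         for c in cols]
        for r in rows
    ]


# Toy 5x5 "image" with a vertical edge between the second and third columns.
toy_image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical hand-written kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1], [-1, 1]]

pooled = max_pool(relu(conv2d(toy_image, edge_kernel)))
print(pooled)  # [[2, 0], [2, 0]]
```

The pooled map keeps a strong response only where the edge lies, which illustrates how stacked layers turn raw pixels into compact feature maps.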

In dentistry, there are four main applications of CNNs: (1) segmentation; (2) detection; (3) classification; and (4) image quality enhancement, which are all based on dental images, including intraoral (periapical radiograph and bite-wing image) and extra-oral (panoramic X-ray and cone-beam computed tomography [CBCT]) X-rays [9, 16]. For instance, Park et al. applied CNNs to segment tooth surfaces for caries diagnosis [17], and Lee et al. proposed a computer-assisted detection system to identify impacted mandibular third molar teeth [18]. CNNs are also increasingly used in the field of periodontitis. Jaiswal et al. developed a novel Intelligent Ant Lion-based Convolution Neural Model (IALCNM) to segment affected parts and classify the wear and periodontitis using panoramic photographs [19]. Moreover, Chen et al. developed an ensembled CNN model to predict tooth position and recognize radiographic bone loss (RBL) using periapical and bitewing radiographs [20]. Furthermore, Moran et al. evaluated whether different pre-processing methods affect the result of periodontal bone loss (PBL) classification based on periapical images [21].

Although numerous studies have been conducted at the intersection of periodontitis and DL, the types of DL architecture employed in periodontitis classification, the most effective model and the comparison of model performance against oral physicians have not been systematically reported. Therefore, this study aimed to review the studies on the classification of periodontitis by evaluating various dental images using DL methods, to summarise the types of models employed, and to compare the performance of these models. This could help identify the most appropriate model for the classification of periodontitis based on oral photographs in clinical practice. Moreover, we compared the performance of the DL models with that of dental professionals to assess their reliability.

Methods

This systematic review and meta-analysis was conducted in accordance with the guidelines for Preferred Reporting Items for Systematic Reviews and Meta-analyses for Diagnostic Test Accuracy Studies (PRISMA-DTA). The study was registered at the National Institute for Health Research, International Prospective Register of Systematic Reviews (PROSPERO, registration number CRD 42022338627). Additionally, the study protocol was based on the following PIRD elements [22]:

Population

patients’ diagnostic images that illustrate the status of radiographic bone loss (RBL).

Index test

deep learning models for classification of periodontitis based on RBL.

Reference test

expert opinions according to the classification of periodontitis.

Diagnosis of interest

classification of periodontitis.

Data sources

A reviewer (XL) searched publications through EMBASE, PubMed, Web of Science, Scopus and Google Scholar databases up to November 2023 according to strategies set by two reviewers (DZ and XL). Search strategies combined terms including (1) periodontitis or periodontal disease or periodontal status; (2) image or image processing or computer-aided diagnosis or computer-based diagnosis or smart diagnosis; and (3) artificial intelligence or machine learning or deep learning or convolutional neural networks. The detailed search queries for all databases were provided in Supplementary Table 1.

Criteria for considering studies for this review

Studies that met the following criteria were considered for inclusion: (1) Study population with a dental image; (2) Diagnosing with DL technology; and (3) English publications with all statuses, including in-press and unpublished studies. The exclusion criteria were: (1) Animal experiment; (2) Without full article; (3) Without statistical data; and (4) Conference proceedings or reviews or books or patents (Table 1).

Table 1 Inclusion and exclusion criteria for this review

Study selection and data collection

After screening the titles and abstracts of all identified publications, two reviewers (XL and JXX) independently read the full text of all eligible articles and excluded inappropriate articles according to the inclusion/exclusion criteria. Disagreements between the reviewers were resolved by discussion until a consensus was reached or by consulting a third reviewer (DZ). The following data were extracted from each publication: study characteristics (first author, publication year, country), study design (data sets, modality of medical images, machine learning algorithms, study factor and its definition, algorithm application, comparison), primary outcomes, and conclusions.

Quality assessment

The quality of evidence was evaluated by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) on the following domains: study design, limitations (risk of bias), indirectness, inconsistency, imprecision, and publication bias (https://gdt.gradepro.org/) [23]. The quality of evidence was categorized into four levels: high, moderate, low and very low.

Based on the recommendation of the Cochrane Collaboration, the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool was used to evaluate the quality of all eligible articles in terms of the risk of bias and applicability [24]. The assessment was conducted by three reviewers (XL, JXX and YJL). Disagreements were resolved by discussion or by consulting an additional reviewer (DZ), who made the final decision. There were four domains for the risk of bias section: patient selection, index test, reference standard, and flow and timing; the first three of these domains formed the applicability section [25].

Statistical analysis

Summarising the quality score to define high-quality studies is not a recommended method [26]. Moreover, the overall estimate may be similar regardless of the quality of the studies, but if only high-quality studies are analyzed, incomplete reporting may arise [27]. Therefore, all articles containing true positive (TP), false positive (FP), true negative (TN) and false negative (FN) data that were either supplied in the articles or could be calculated from the information provided were used to conduct a meta-analysis using Stata 16.0 software (StataCorp LLC, College Station, TX, USA). Spearman correlation analysis was conducted to assess the threshold effect. In the absence of a threshold effect, the combined sensitivity, specificity, positive likelihood ratio (LR), negative LR and diagnostic odds ratio (DOR) were calculated directly using the random-effects inverse-variance model. A forest plot of sensitivity and specificity was generated to visually show the differences among the included studies. Statistical heterogeneity was assessed using the Chi-squared-based Q statistic and I², with significant heterogeneity indicated by P < 0.05 and I² > 50%, respectively. Influence analysis and subgroup analysis based on study factors, including article quality (high/unclear risk of bias, low risk of bias), dental image modality (periapical radiograph images, panoramic dental radiographs) and model type (single model, two-stage model), were performed to detect the source of heterogeneity. Two meta-regression models, one for sensitivity and one for specificity, were carried out to investigate whether sample size has an impact on classification outcomes. A summary receiver operating characteristic (SROC) plot (a plot of scattered sensitivity-specificity points of each potentially eligible study) was constructed, and the area under SROC (AUSROC) was computed [24].
In addition, a Fagan nomogram was drawn to describe how DL methods may help clinicians increase the probability of an effective classification of periodontitis. Publication bias was investigated by Deeks’ funnel plot asymmetry test.
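The accuracy measures named in this section all derive from each study's TP, FP, TN and FN counts. The sketch below shows those per-study formulas, together with the usual I² formula based on Cochran's Q. It is only an illustration in Python: the review itself used Stata, the counts are hypothetical, the 0.5 continuity correction for zero cells is an assumption rather than something stated in the text, and no random-effects pooling is performed.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from one study's 2x2 table (index test vs reference standard)."""
    tp: float
    fp: float
    tn: float
    fn: float


def diagnostic_metrics(t: TwoByTwo) -> dict:
    """Per-study accuracy measures; this is not a pooled estimate."""
    tp, fp, tn, fn = t.tp, t.fp, t.tn, t.fn
    # Assumption: add 0.5 to every cell when any cell is zero, a common
    # continuity correction that keeps the ratios finite.
    if 0 in (tp, fp, tn, fn):
        tp, fp, tn, fn = tp + 0.5, fp + 0.5, tn + 0.5, fn + 0.5
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    positive_lr = sensitivity / (1 - specificity)
    negative_lr = (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_lr": positive_lr,
        "negative_lr": negative_lr,
        "dor": positive_lr / negative_lr,
    }


def i_squared(q: float, n_studies: int) -> float:
    """I^2 (in %) from Cochran's Q: the share of variability beyond chance."""
    df = n_studies - 1
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q


# Hypothetical study with 100 diseased and 100 non-diseased cases.
hypothetical = TwoByTwo(tp=88, fp=18, tn=82, fn=12)
metrics = diagnostic_metrics(hypothetical)
print({name: round(value, 2) for name, value in metrics.items()})
print(i_squared(q=40.0, n_studies=12))  # 72.5, above the 50% threshold
```

A study reporting a specificity of exactly 1 has FP = 0, which is the situation the continuity correction is meant to handle.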

Results

Study selection

Figure 1 shows the study selection process and describes the reasons for full-text article exclusion. The search of the five databases (EMBASE, PubMed, Web of Science, Scopus and Google Scholar) identified 1546 potentially relevant publications, of which 279 were duplicates. After screening the titles and abstracts of the 1267 remaining studies, 49 articles were selected for full-text reading. Based on the inclusion and exclusion criteria, 27 studies were included in this systematic review [20, 21, 28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52].

Fig. 1

PRISMA flow chart of the study selection process

Methodological quality

The risk of bias and applicability were assessed using QUADAS-2 for all included articles, and the results are shown in Supplementary Fig. 1 and Supplementary Fig. 2, respectively. Nearly half of the included studies did not have clear information on whether patients were consecutively or randomly enrolled, resulting in 44.4% of the articles (12/27) showing an unclear risk of bias in the patient selection domain [20, 30, 32, 34,35,36,37,38, 42, 45, 48, 52]. Two studies were rated as having a high risk of bias, with one [29] designed as a case-control study with convenience sampling and the other [31] using inappropriate exclusion criteria. Approximately one-fourth of the studies did not mention a prespecified threshold before a test; consequently, 22.2% of the articles (6/27) were ranked as having unclear risk of bias in the index test domain [21, 35, 39, 49, 51, 52]. Four studies were unable to accurately diagnose periodontitis based on their reference tests, as these studies attempted to classify healthy cases and periodontitis only using radiographs [21, 28, 42, 49]. The other studies (85.2%, 23/27) were ranked as having a low risk of bias in the reference standard domain [20, 29,30,31,32,33,34,35,36,37,38,39,40,41, 43,44,45,46,47,48, 50,51,52]. As the diagnostic tests were conducted by DL algorithms, which do not affect the flow and timing, all articles in the present analysis were ranked as low risk in this domain. For the applicability section, all studies were rated as low concern in patient selection, and 74.1% of the included studies (20/27) were rated as low concern in the index test and reference standard [20, 29, 30, 32,33,34, 36,37,38,39,40,41,42,43,44,45,46,47,48, 52]. The study quality assessment results are presented in Supplementary Table 2.

The quality of evidence based on the GRADE analysis can be found in Supplementary Table 3. Results are shown by subgroups of model type and dental image modality. When any study in a subgroup was ranked as high or unclear risk of bias based on QUADAS-2, the subgroup’s limitation was assessed as a high risk of bias. As a result, all subgroups were considered to be at high risk of bias, leading to a downgrade of one level of evidence quality. The evidence quality was downgraded by a further two levels in the single model using periapical radiograph images and two-stage model subgroups due to inconsistency and imprecision, and by one further level in the single model using panoramic dental radiographs subgroup. Consequently, the quality of evidence was scored as very low in the single model with periapical radiograph images and the two-stage model, and low in the single model with panoramic dental radiographs.

Study characteristics

The characteristics of all included studies are summarised in Table 2. All articles were published within the last five years; the number published in 2021 was twice that of 2020, and the number published in 2022 was 1.5 times that of 2021 (Supplementary Fig. 3). Studies originated from 11 countries, most of which were in Asia. Except for one study that did not mention data splitting [20], all included studies (26/27) split the datasets or used cross-validation, an approach used to guard against model overfitting and to evaluate the generalization ability of the model. Three studies used an external dataset to evaluate the performance of the algorithms [29, 43, 48]. In addition, three studies used public databases [35,36,37]. In terms of dental image modality, the studies employed periapical radiograph images, panoramic dental radiographs, and CBCT images to classify periodontitis, among which panoramic radiographs were used the most (15/27) [28,29,30, 32, 33, 35, 36, 38, 39, 42, 47,48,49,50,51] and only one study used CBCT [44]. More than two-thirds of articles (19/27) processed images before applying DL techniques using common approaches, such as augmentation, normalisation and resizing the images [21, 28, 29, 31,32,33,34, 36, 38,39,40, 43,44,45, 47, 48, 50,51,52]. Furthermore, the DL-aided task has changed over time. In 2019 and 2020, the diagnosis of periodontitis was predominantly chosen, whereas the classification of periodontitis stages was selected in 2021 and 2022. In 2023, half of the studies chose the diagnosis task and half chose the staging task. Regarding the algorithms, the studies mainly utilised deep CNNs (DCNN), with one article involving lightweight CNNs (LCNN) [35]. Eleven studies (11/27) used a two-stage design containing a tooth-identification or segmentation stage and a periodontitis-staging step [20, 30,31,32, 35, 36, 38, 42, 44, 47, 51]. Eight (8/27) studies utilised transfer learning [20, 21, 33, 39, 41, 45, 49, 51].
Reference tests were either experts’ direct opinions of periodontitis or their annotation of regions of interest (ROIs) based on different definitions. Sixteen studies (16/27) employed the new criteria proposed in the 2017 World Workshop on the Classification of Periodontal and Peri-Implant Diseases and Conditions [20, 29,30,31,32,33,34, 36,37,38,39,40, 42, 43, 45, 48], while one study (1/27) [41] used the International Workshop for Classification of Periodontal Diseases and Conditions (1999). Three studies (3/27) [28, 47, 52] were carried out according to the World Health Organization’s standardized Community Periodontal Index (CPI), and four studies (4/27) [21, 44, 46, 49] roughly defined periodontitis based on the depth of bone resorption; the remaining two studies (2/27) [50, 51] did not mention the classification criteria. All studies compared the diagnostic performance of DL algorithms either with specialists or among different algorithms. More than two-thirds of articles (19/27) reported accuracy, while sensitivity, specificity, recall, precision, F1-score, ROC and AUROC were also reported among included studies.
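The dataset splitting and cross-validation mentioned among the study characteristics can be sketched as follows. This is a generic k-fold illustration in plain Python with a hypothetical function name and a fixed seed; it does not reproduce the splitting scheme of any included study. When several images come from the same patient, splitting at patient level rather than image level is generally advisable to avoid leakage between training and test folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train, test) index pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        splits.append((train, test))
    return splits


# Hypothetical dataset of 10 radiographs split into 5 folds.
splits = k_fold_indices(n_samples=10, k=5)
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, sorted(test))
```

Averaging a metric over the k test folds gives an internal estimate of generalisation, which is distinct from evaluation on an external dataset.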

Table 2 Characteristics of all included studies

Meta-analysis

From the 27 articles selected for the systematic review, 14 were excluded from the subsequent meta-analysis because TP, FN, FP and TN were not reported and could not be calculated. Consequently, 13 studies were included in the meta-analysis [21, 29, 33,34,35, 40, 41, 43, 47, 49,50,51,52]. The correlation analysis showed heterogeneity due to the threshold effect (r = 0.13; P = 0.02). Therefore, instead of directly combining the sensitivity and specificity to demonstrate the overall accuracy, an SROC curve was generated (Supplementary Fig. 4). The AUSROC was 0.94 (95% confidence interval [95%CI] 0.91–0.96). To investigate the source of heterogeneity, we conducted an influence analysis (Supplementary Fig. 5). Supplementary Fig. 5(c) and Supplementary Fig. 5(d) both indicated that the seventh article was an outlier [43], which can affect the stability of the results. When this article was removed, the threshold effect disappeared (r = − 0.45; P = 0.20), and the combined sensitivity, specificity, positive LR, negative LR and DOR were 0.88 (95%CI 0.82–0.92), 0.82 (95%CI 0.72–0.89), 4.9 (95%CI 3.2–7.5), 0.15 (95%CI 0.10–0.22) and 33 (95%CI 19–59), respectively.

Figure 2 illustrates the forest plot of sensitivity and specificity of the DL algorithms for the periodontitis classification. The AUSROC (Fig. 3) was 0.92 (95%CI 0.89–0.94), which implied that the diagnostic test had high accuracy. According to the Fagan nomogram (Supplementary Fig. 6), the prior probability of this diagnostic test was 50%, the positive LR was 6, the posterior probability after a positive test was 85%, and the negative LR was 0.10. The posterior probability after a negative test was 9%. The subgroup analysis results showed that heterogeneity of sensitivity was statistically significant in model type and dental image modality, and heterogeneity of specificity was statistically significant in article quality (Fig. 4). In detail, a single model would get a significantly higher sensitivity than a two-stage model (P < 0.01). Moreover, the modality of dental images may cause heterogeneity of sensitivity (P < 0.01). Diagnosis sensitivity based on periapical images was higher than that on panoramic images. Furthermore, articles scored as high or unclear risk of bias would get a significantly lower specificity than low risk of bias articles (P = 0.03). Both meta-regression results indicate that there is no statistically significant correlation between sample size and sensitivity (P = 0.069), as well as between sample size and specificity (P = 0.252) (Supplementary Fig. 7, Supplementary Fig. 8). The influence analysis demonstrated that the results were stable by removing one study at a time (Fig. 5). Deeks’ funnel plot asymmetry test illustrated no publication bias (t = 0.74, P = 0.48) (Fig. 6).
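The Fagan nomogram values quoted above follow from Bayes' theorem in odds form: post-test odds equal pre-test odds multiplied by the likelihood ratio. The short Python sketch below reproduces that arithmetic with the prior probability and LRs reported in the text; the function name is an illustrative assumption.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pre_test_probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Values reported for the Fagan nomogram: prior 50%, LR+ = 6, LR- = 0.10.
after_positive = post_test_probability(0.50, 6.0)
after_negative = post_test_probability(0.50, 0.10)
print(round(after_positive, 3), round(after_negative, 3))  # 0.857 0.091
```

These results are close to the reported 85% and 9%; the small difference for the positive test is presumably due to rounding of the likelihood ratio shown on the nomogram.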

Fig. 2

The forest plot for sensitivity and specificity of deep learning for periodontitis diagnosis

Fig. 3

The summary receiver operating characteristic curve of the diagnostic accuracy of deep learning for periodontitis, excluding the seventh article. SENS, sensitivity; SPEC, specificity; SROC, summary receiver operating characteristic; AUC, area under curve

Fig. 4

Subgroup analysis based on article quality, dental image modality and model type

Fig. 5

Influence analysis excluding the seventh article

Fig. 6

Publication bias of periodontitis diagnosis by deep learning

Discussion

In this systematic review, we compiled and evaluated studies that utilised DL methods to classify periodontitis based on dental images. With the rise of DL technology, an increasing number of articles have been published on the intersection of periodontitis classification and DL, especially in 2022. The overall quality of the included studies was limited, and more high-quality studies are urgently needed. In addition, more than half of the included articles reported that the accuracy, sensitivity, and specificity of their algorithms for classifying periodontitis were > 0.8. The SROC curve also showed the high accuracy of the DL methods for classification. The study by Lee et al. [43], which reported a specificity of 1 for distinguishing non-periodontitis individuals, was an outlier in our meta-analysis. Moreover, the Fagan nomogram indicated that when a DL method returns a positive classification, there is a high probability of periodontitis, and when the classification is negative, the probability of periodontitis is low. These findings are further discussed in the following sections.

Characteristics of dental images

There are very few large, high-quality public databases of dental radiographs. Consequently, dental radiographs must be manually labeled, which is time-consuming, and this problem needs to be addressed urgently. Random shift augmentation, oversampling, adjusting weights in the loss function, and transfer learning were used to overcome class imbalance, which degrades DL classification performance [30, 39, 41, 42, 50, 51, 53].
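Two of the imbalance remedies listed above, loss weighting and oversampling, can be illustrated with a minimal Python sketch. The inverse-frequency weighting formula and the random oversampling scheme shown here are common generic choices and are assumptions; the included studies may have used different schemes, and the labels below are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights n / (k * count): rarer classes receive larger loss weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def oversample_indices(labels, seed=0):
    """Indices that repeat minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    selected = []
    for indices in by_class.values():
        selected.extend(indices)
        # Draw the missing samples with replacement from the same class.
        selected.extend(rng.choices(indices, k=target - len(indices)))
    return selected


# Hypothetical imbalanced label set: 80 healthy, 20 periodontitis.
labels = ["healthy"] * 80 + ["periodontitis"] * 20
weights = inverse_frequency_weights(labels)
balanced = oversample_indices(labels)
print(weights)  # {'healthy': 0.625, 'periodontitis': 2.5}
print(Counter(labels[i] for i in balanced))
```

Oversampling should be applied to the training split only, otherwise duplicated images can leak into the test data.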

In terms of modalities of dental images, the studies included in our analysis predominantly used periapical images, panoramic images and CBCT images for periodontitis classification. Nine studies detected RBL in periapical radiograph images. Periapical radiograph images capture the teeth and the surrounding alveolar bone, and can therefore provide full information on RBL. However, the field of view of this modality is small, with only three to four teeth on a single image [54]. Over half of the studies in our analysis detected RBL in panoramic X-ray images, which show the whole mouth. However, as two-dimensional modalities, both periapical radiograph images and panoramic X-ray images cannot provide three-dimensional information and have problems with geometric distortion and anatomic noise [55]. All these limitations may affect the performance of periodontitis classification. Only one study in our analysis used CBCT to detect RBL [44]. Although CBCT can provide three-dimensional information, there are still some limitations caused by artifacts, noise and poor soft tissue contrast [56]. Consequently, dental image processing plays a vital role in periodontitis classification.

Processing of dental images

Two aspects should be considered for an accurate periodontitis classification. One is the quality of dental images, and the other is model performance. To deal with image quality problems, the included articles employed super-resolution and noise reduction methods. One study conducted in Brazil reconstructed high-resolution images from low-resolution images by using four conventional interpolation methods (nearest, bilinear, bicubic, Lanczos) and two DL methods (super-resolution CNN and a variation of the super-resolution generative adversarial network) [21]. Two studies used the contrast-limited adaptive histogram equalization technique for image denoising [39, 40]. Besides noise reduction, one study conducted in the USA also introduced a series of processes to precisely draw the contour of bone, tooth, and cemento-enamel junction after model prediction to improve model performance [43]. In addition, a quarter of the studies resized and normalised the images to improve model performance. Furthermore, because obtaining dental images is difficult, almost half of the included articles used data augmentation techniques to increase the number of images [48, 50, 52].
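Several of the processing steps mentioned above (normalisation, resizing with nearest-neighbour interpolation, and augmentation) can be sketched on a toy greyscale image. The code is a plain-Python illustration with hypothetical function names; min-max scaling and horizontal flipping are assumed here as representative choices, and practical pipelines would rely on an image library.

```python
def normalise(image):
    """Min-max scale pixel intensities to the range [0, 1]."""
    low = min(min(row) for row in image)
    high = max(max(row) for row in image)
    span = (high - low) or 1  # avoid division by zero on a constant image
    return [[(pixel - low) / span for pixel in row] for row in image]


def resize_nearest(image, new_height, new_width):
    """Resize by nearest-neighbour interpolation (each output pixel copies one input pixel)."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def horizontal_flip(image):
    """Simple augmentation: mirror the image left to right."""
    return [row[::-1] for row in image]


# Toy 2x2 greyscale "radiograph" with intensities in 0..255.
toy = [[0, 50], [100, 200]]
print(normalise(toy))             # [[0.0, 0.25], [0.5, 1.0]]
print(resize_nearest(toy, 4, 4))  # each pixel becomes a 2x2 block
print(horizontal_flip(toy))       # [[50, 0], [200, 100]]
```

Whether flipping is an appropriate augmentation depends on the task, since left and right are anatomically meaningful in dental radiographs.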

Classification using dental images

Regarding the task of classification using DL models, classical models such as U-Net and YOLO were often utilised in the included studies [57, 58], regardless of the specific diagnosis task chosen. For tasks involving a two-stage design, U-Net was typically used for segmenting ROIs, while YOLO was employed for object detection. U-Net has been proven to quickly and accurately identify targets in medical images and generate high-quality segmentation results [59]. Additionally, the structure of U-Net can be flexibly adjusted according to the specific needs of the task [59]. Various versions of YOLO, from YOLOv3 to YOLOv5, have been utilised based on different study purposes. Feature Pyramid Network (FPN) was also employed for the ROI segmentation stage [60]. FPN fuses multi-layered features and makes predictions at each fused feature layer; thus, it shows significant improvement in small-object detection without considerably increasing computation. Faster region-based CNN (Faster R-CNN) combines a Region Proposal Network (RPN) with a Fast R-CNN, and the two share full-image convolutional features to reduce the computational cost, which is why Faster R-CNN is popular in periodontitis diagnosis [61]. Mask R-CNN, which is an extension of Faster R-CNN, has also been employed [62]. Danks et al. employed a symmetric hourglass network that can capture information at every scale and combine it to make the final predictions [45].

Based on the included publications, transfer learning is an efficient method for training datasets with limited samples, and it can enhance the model training efficiency. In addition, using appropriate regularisation methods can improve model performance.

Strengths and limitations

Strengths

1) The strength of this review is that we systematically summarised and evaluated the studies on DL for periodontitis classification based on dental images. Moreover, we have described the development trend of DL technology in the field of periodontitis.

2) In addition, we used meta-analysis to quantitatively evaluate the threshold effect and heterogeneity of the included articles and analysed the possible sources of heterogeneity in detail.

Limitations

1) DL-based periodontitis classification is an emerging field and most studies conducted thus far have predominantly focused on Asian populations. This limited regional focus has resulted in a constrained sample representation, thereby impacting the external validity of the findings.

2) Except for three articles that utilised publicly available databases, the samples in the other studies were solely derived from hospital settings, thereby lacking representation from community-based data.

3) No study described the demographic information pertaining to the included subjects. Considering that demographic information could potentially influence the severity of periodontitis and consequently contribute to the heterogeneity observed, it is essential to address this aspect in future research.

4) Only three studies incorporated an external dataset to assess the performance of DL-based models. In contrast, all the other studies relied on training and testing datasets derived from the same source, potentially limiting the generalisability of their results.

5) Since the gold standard for periodontitis diagnosis and classification should be clinical attachment loss (CAL), classification based only on RBL may underestimate periodontal status. However, such classification is still important in clinical practice when the direct evidence (CAL) is not available.

Conclusions

In summary, the accuracy of DL is high for classifying periodontitis based on dental images. DL is an efficient approach to reducing the workload of dentists and the time consumed during clinical practice. Furthermore, the various DL models have their advantages and disadvantages, and the choice of model should be based on the specific task objectives and requirements. Future research should be designed rigorously to reflect the true performance of DL. Optimising the DL architecture can improve the performance of periodontitis classification with dental images. Moreover, improving dental image quality and performing regularisation can yield higher periodontitis diagnostic accuracy. In addition, data imbalance is an issue that needs to be considered to enhance diagnostic performance.