Background

Oral cancer is one of the major causes of cancer death globally and is the 17th most common cancer worldwide and the 11th most common in Asia. According to the World Health Organization, more than 370,000 new cases of oral cancer and over 170,000 related deaths were reported in 2020 [1]. Oral cancer comprises various types depending on its tissue of origin (carcinoma and sarcoma), but the most common type is oral squamous cell carcinoma (OSCC), which mostly arises from oral potentially malignant disorders (OPMDs). The definitive gold standard for diagnosing oral cancer and OPMDs is surgical biopsy with histopathologic evaluation [2, 3]. The treatment modalities for oral cancer are surgery, radiotherapy, and chemotherapy, either alone or in combination, generally selected according to the stage of the disease. Treatment outcomes, especially in advanced stages, are associated with high morbidity, affecting the masticatory function, facial esthetics, and quality of life of oral cancer patients [2]. Despite advances in treatment, the prognosis of oral cancer has not improved over the past decade [4]. Prognosis is based on cancer staging [5] and is significantly worse in advanced stages than in early-stage disease or OPMDs. Early diagnosis of oral cancer is therefore a crucial step toward increasing the survival rate of oral cancer patients.

Deep learning (DL), a subset of artificial intelligence (AI), is built on neural networks, biologically inspired algorithms that can learn complex representations to improve pattern recognition from raw data [6]. These algorithms are composed of multiple layers that transform input data (such as medical images) into outputs (such as diagnostic or prognostic recommendations) while automatically learning higher-level features [6, 7]. DL has proven capable of analyzing complex data and is widely applied in the medical field, including diagnostics and the detection of abnormalities in medical images [7]. Integrating DL technology into routine clinical practice requires diagnostic accuracy that is not inferior to that of healthcare professionals. In addition, it must provide other benefits, such as speed, efficiency, reduced cost, enhanced accessibility, and ethical conduct [8].
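To illustrate the layered structure described above, the minimal sketch below (our own illustration, not drawn from any reviewed study; the PyTorch framework, layer sizes, and two-class output are assumptions) shows how stacked convolutional layers transform an input image into progressively higher-level features that feed a diagnostic output.

```python
# Minimal sketch of a layered convolutional classifier (illustrative only).
import torch
import torch.nn as nn

class TinyLesionClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):  # hypothetical two-class output
        super().__init__()
        # Early layers learn low-level features (edges, textures, colors).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Later layers pool these features into a higher-level representation
        # that is mapped to class scores (e.g., "suspicious" vs. "benign").
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One 224x224 RGB image in, two class scores out.
logits = TinyLesionClassifier()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 2])
```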

DL research in oral cancer is highly dynamic and continues to grow because of its feasibility and its many potential advantages for improving the cancer survival rate through better detection, prevention, and prognostic prediction [8,9,10]. Several studies have developed mobile phone-based applications for oral cancer screening as an alternative method for early detection, achieving high accuracy in distinguishing oral lesions from clinically suspicious lesions and demonstrating the potential of computer-assisted visualization in clinical practice [11, 12]. Applying DL to oral cancer data can thus assist clinicians in the diagnosis, detection, and prognostic prediction of oral cancer in clinical practice, supporting earlier diagnosis and selection of the most appropriate treatment to increase the survival rate of patients with oral cancer.

There have been some previous systematic reviews on AI and machine learning in oral cancer [13, 14]. This study therefore focused mainly on the application of DL, i.e., neural network-based architectures able to learn complex features, to oral cancer data. The main objective of this study was to systematically analyze evaluation studies of DL applied to oral cancer data to aid in the diagnosis, detection, and prognostic prediction of oral cancer, and to compare their results with regard to the reported performance measures. In addition, this study aimed to synthesize the results and assess the robustness of the body of evidence on DL-based diagnostic and prognostic prediction models for oral cancer data.

Methods

This is a systematic review of diagnostic and prognostic prediction studies. Reporting of this study follows the PRISMA guideline [15]. The study protocol was registered at the international prospective register of systematic reviews (PROSPERO) (CRD42023425992).

Inclusion criteria and exclusion criteria

The eligible studies must have evaluated the diagnostic or prognostic significance of oral cancer using DL algorithms. Publications were selected for review if they satisfied the following inclusion criteria: full text available in English; studies using DL (of any class) to provide diagnostic or prognostic prediction information on oral cancer and OPMDs; and studies reporting model performance (diagnostic or prognostic prediction accuracy) and/or comparisons with human diagnostic performance. For DL-based diagnostic studies on clinical and radiographic images (classification, detection, or segmentation), the ground truth of the captured images had to be established by histopathologic results as the gold standard diagnosis of oral cancer and OPMDs.

Studies meeting the following criteria were excluded: diagnostic studies in which the ground truth was not explicitly confirmed; studies of machine learning (ML) applications without DL algorithms; studies without sufficient details on the data used for training and testing (e.g., dataset size, data modality); studies without a clear description of the DL model; studies that examined DL applications for normal oral mucosa, oral lesions other than cancer or OPMDs, periodontal disease, or dental caries, or that used DNA and RNA microarray gene expression, proteomics, fluorescence spectroscopy, or genetic programming; and articles in languages other than English. The details of the inclusion and exclusion criteria are presented in Fig. 1.

Fig. 1
figure 1

Flow diagram of search methodology and literature selection process

Information sources and search

An electronic search was conducted in the following electronic databases: Medline via PubMed, Google Scholar, and Scopus, covering publications from January 2000 up to 14 June 2023. Each database was searched with adapted keywords. The search query for each database is described in Table 1.

Table 1 The results of the electronic search in the various databases

Study selection

Citations were managed with EndNote 20 (Clarivate, Philadelphia, USA). After removing duplicate papers, two independent reviewers (K.W. and S.S.) performed title and abstract screening. The reviewers then evaluated the full texts of eligible studies against the inclusion and exclusion criteria. Any disagreements or discrepancies were resolved by discussion and consensus between the two reviewers.

Data collection and extraction

Two reviewers (K.W. and S.S.) independently collected data from the included studies. Any disagreements or discrepancies were resolved by discussion and consensus between the two reviewers. The following data items were extracted: bibliographic details (authors, year of publication, and country); data modality; dataset size (training/validation/test split, if given); data augmentation; DL algorithms examined; study objective (diagnostic or prognostic); ground truth identification, annotation, and task (classification, detection, or segmentation) for DL-based diagnostic studies; hyperparameters of the DL models; hardware used; and the performance metrics reported, including precision, recall, accuracy, sensitivity, specificity, F1-score, average precision (AP), Dice index, area under the receiver operating characteristic curve (AUC), concordance index (c-index), and integrated Brier score (IBS). If more than one model was used, only the best-performing model was reported.

Risk of bias and applicability

The methodological quality of the included studies was evaluated using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [16] for risk of bias assessment. The QUADAS-2 checklist consists of four risk-of-bias domains: patient selection, index test, reference standard, and flow and timing. Any disagreements between the two reviewers were resolved by discussion and consensus. Some questions were slightly modified to specifically assess studies on DL [17]. In “patient selection”, limited information about the dataset as well as unclear data allocation strategies were considered to indicate a high risk of bias. For “index test”, insufficient information on model construction, including hyperparameters, and a lack of model robustness analyses were considered to indicate a high risk of bias. For “reference standard”, a lack of information on the definition of the reference standard and the use of a single examiner to establish the reference test were considered to indicate a high risk of bias. Finally, for “flow and timing”, the use of different reference standards within the same study and inappropriate intervals between the index test and the reference standard were considered indicators of a high risk of bias. Details of the modified QUADAS-2 tool are provided in the supplemental information (Table 1S).

Statistical analysis

All statistical analyses were performed using R software version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria) and IBM SPSS Statistics version 26. Because only a few studies reported the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), this study used the diagnostic odds ratio (DOR), derived from the reported sensitivity and specificity, as the pooled outcome to determine the diagnostic accuracy of the deep learning systems [18], calculated as follows:

$$DOR=\frac{Sensitivity\times Specificity}{\left(1- Sensitivity\right)\times \left(1- Specificity\right)}$$
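As a worked example, the short sketch below (illustrative values only, not taken from any included study) applies this formula to a reported sensitivity and specificity pair.

```python
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """DOR = (sensitivity * specificity) / ((1 - sensitivity) * (1 - specificity))."""
    return (sensitivity * specificity) / ((1.0 - sensitivity) * (1.0 - specificity))

# Illustrative values: sensitivity = 0.92, specificity = 0.92.
print(round(diagnostic_odds_ratio(0.92, 0.92), 2))  # 132.25
```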

Results

Study selection and study characteristics

The search results and process of selecting articles are shown in Fig. 1. The literature search identified a total of 905 articles. Articles were excluded for the following reasons: duplicates (n = 328), studies focusing on other topics (n = 461), and studies not written in English (n = 2). A total of 112 studies were assessed in full text. After full-text assessment, 58 of these studies were excluded, including studies that did not use DL models, studies that did not report the desired outcomes or dataset size, and studies on clinical images that did not use biopsy as the gold standard.

Characteristics of relevant studies

The individual studies are summarized in Tables 2, 3, 4 and 5 with each table showing studies using DL in diagnostic studies, including classification (Table 2), object detection (Table 3), segmentation (Table 4), and prognostic prediction studies (Table 5).

Table 2 Summary of findings in the selected diagnostic studies (classification)
Table 3 Summary of findings in the selected diagnostic studies (object detection)
Table 4 Summary of findings in the selected diagnostic studies (segmentation)
Table 5 Summary of findings in the selected prognostic prediction studies

Of the 54 included studies, 51 examined the diagnostic performance of DL applications on medical images and 3 evaluated DL applications for prognostic prediction. Most studies on the application of DL techniques in oral cancer were published recently, i.e., between 2019 and 2023 (n = 52) (Fig. 2). With regard to the regions of the relevant articles, 37 studies were carried out in Asia, 9 in Europe, 2 in Africa, and 6 in the United States.

Fig. 2
figure 2

Number of DL studies for oral cancer and image type employed

Seven different types of imaging data were used in the DL-based diagnostic studies, including histopathological images (n = 30), CT images (n = 8), clinical oral images (n = 9), and other image types (n = 4), comprising confocal laser endomicroscopy images, optical coherence tomography images, and endoscopic videos. Clinicopathological and treatment data (n = 3) were used in the DL-based prognostic prediction studies. The types of oral cancer data used to develop the DL models included OSCC (n = 41), non-specific types of oral cancer (n = 5), OPMDs (n = 5), and multiclass analysis of OSCC and OPMDs (n = 3). In the diagnostic studies, 19 studies used expert annotation to establish the reference test, involving one expert (n = 7), two experts (n = 3), or three or more experts (n = 9).

Regarding the DL task, the most frequently chosen task was classification (n = 40), followed by segmentation (n = 10) and object detection (n = 5). Various DL models were used. In classification studies, most studies used multiple DL models (n = 25), including transfer learning models and multilayer perceptrons, followed by customized CNN architectures (n = 8), LeNet-5 (n = 2), AlexNet (n = 2), DenseNet121 (n = 1), EfficientNet B0 (n = 1), and Swin Transformer (n = 1). For segmentation, most studies used multiple DL models, including auto-encoder models (n = 5), customized CNN architectures (n = 3), and single auto-encoder models (e.g., U-Net) (n = 2). For object detection, one-stage detectors (e.g., YOLO) or two-stage detectors (e.g., Faster R-CNN) were used in the majority of studies (n = 5). Classification studies mainly reported precision, recall (sensitivity), F1-score, accuracy, and specificity; another outcome measure was the area under the receiver operating characteristic curve (AUC). Object detection studies mostly reported precision, recall, F1-score, average precision (AP), and AUC. Segmentation studies were more heterogeneous but additionally reported the Dice coefficient and the mean Intersection over Union (mIoU). All prognostic prediction studies consistently reported the concordance index (c-index) and integrated Brier score (IBS) (Tables 3, 4 and 5).
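To make the segmentation metrics concrete, the short sketch below (our own illustration with a toy example, using NumPy) computes the Dice coefficient and IoU for a pair of binary masks.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|); IoU = |A∩B| / |A∪B| for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    dice = 2.0 * intersection / (pred.sum() + target.sum())
    iou = intersection / np.logical_or(pred, target).sum()
    return float(dice), float(iou)

# Toy 2x2 masks: one overlapping pixel out of two predicted and two true pixels.
pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [1, 0]])
print(dice_and_iou(pred, true))  # (0.5, 0.333...)
```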

Risk of bias and applicability

Detailed information about the modified QUADAS-2 leading questions used for critical appraisal and the risk of bias assessments is presented in Tables S1–S5. Among the included studies, 13 (24.1%) had a low risk of bias in all four domains. Moreover, 40 studies (74.1%) were evaluated as having low concern regarding applicability. The most problematic domain was “reference standard”, in which only 22 studies (40.7%) were classified as low risk of bias, followed by “patient selection”, in which 32 studies (59.3%) were classified as low risk of bias.

Findings of the studies

In diagnostic studies, classification studies reported accuracies ranging from 85.0 to 100%, 78.2 to 93.62%, and 76.0 to 98.58% for classifying oral cancer on histopathological images, CT images, and clinical oral images, respectively. Object detection studies reported F1-scores ranging from 79.31 to 89.0%. Segmentation studies reported Dice coefficients ranging from 76.0 to 96.3%. In prognostic prediction studies, DL models reported c-indexes ranging from 0.78 to 0.95 and IBS values ranging from 0.04 to 0.12.
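For readers less familiar with the prognostic metrics, the sketch below (our own illustration with hypothetical data) computes Harrell's c-index: the proportion of comparable patient pairs in which the patient with the higher predicted risk experienced the event earlier.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index over comparable pairs (the earlier time is an observed
    event); ties in predicted risk count as 0.5. Simplified sketch that ignores
    tied event times."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical data: shorter survival paired with higher predicted risk.
print(concordance_index([5, 10, 15], [1, 1, 0], [0.9, 0.4, 0.1]))  # 1.0
```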

As outlined, classification and segmentation studies of oral cancer were used for further synthesis. Of these, 23 studies could be pooled, including 20 classification studies and 3 segmentation studies. The pooled sensitivity, specificity, and DOR of the classification studies were 0.92 (95% CI 0.87–0.97), 0.92 (95% CI 0.88–0.96), and 2549.08 (95% CI 410.77–4687.39), respectively (Fig. 3). The pooled sensitivity, specificity, and DOR of the segmentation studies were 0.87 (95% CI 0.72–1.02), 0.96 (95% CI 0.86–1.06), and 340.68 (95% CI -414.87 to 1096.22), respectively (Fig. 4). In addition, the majority of studies used histopathological data to develop DL-based image classification, with a high sensitivity and specificity of 0.99 (95% CI 0.98–0.99) and 0.97 (95% CI 0.94–0.99), respectively.

Fig. 3
figure 3

The various reported sensitivity and specificity of classification studies by sample size. The diameter of the circles represents the size of the dataset

Fig. 4
figure 4

The various reported sensitivity and specificity of segmentation studies by sample size. The diameter of the circles represents the size of the dataset

Discussion

Oral cancer is a life-threatening malignancy with frequent tumor metastasis and recurrence, which affect the survival rate and quality of life of patients [73,74,75]. The number of studies investigating the application of DL in oral cancer has increased in recent years; most of the studies in this systematic review were published in or after 2019. This study compiled and assessed studies applying DL to the diagnosis and prognostic prediction of oral cancer by analyzing medical data, including histopathological, CT, and clinical image data as well as clinicopathological and treatment modality data. Notably, however, the studies were of limited quality overall, and comparison between studies was impeded by heterogeneity in their conduct and reporting.

This systematic review found that most of the studies reported relatively high accuracy, sensitivity, and specificity of DL for the diagnosis of oral cancer (generally exceeding 80%). Nevertheless, heterogeneity in study conduct and reporting was high, precluding further comparisons between studies or quantitative synthesis. Many included studies lacked details on the annotation process and did not describe the separation of the test dataset or the proportions of the training, validation, and test sets, which resulted in a high risk of bias in the reference standard and patient selection domains. Additionally, in seven diagnostic studies that mentioned the annotation process, annotation was performed by a single expert, so inter-annotator agreement could not be assessed. To reduce the risk of bias, future diagnostic studies should follow minimum reporting guidelines, such as the Standards for Reporting of Diagnostic Accuracy Studies-AI (STARD-AI), a standard for diagnostic studies using AI models [76], and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [77].

Regarding the heterogeneity in DL diagnostic studies of oral cancer, most studies did not report the values of TP, TN, FP, and FN, which limited the synthesis of the results of the diagnostic studies in this systematic review. Instead, the authors pooled the reported sensitivity and specificity to calculate summary DORs as a single accuracy parameter. Moreover, the hyperparameters of DL models are essential for explaining how the models were tuned to achieve their best performance. This study found that several studies did not report the hyperparameters of their DL models, which had a significant impact on the reliability and explainability of DL model performance and led to a high risk of bias in the index test. To the best of our knowledge, there are no guidelines on reporting the hyperparameter tuning procedure or outcome for DL models used in medical diagnosis and prediction, which could explain why the hyperparameters reported in DL studies were heterogeneous.

Only three prognostic prediction studies applied DL algorithms, such as DeepSurv and DeepHit, to clinicopathologic and treatment modality data. The number of DL studies was even lower than in the machine learning (ML) era [13, 14]. Nevertheless, the predictive performance of DL was also high for this task, achieving a c-index of 0.78–0.95 [70,71,72]. The predictors were largely the same as in the ML era, namely clinicopathological and treatment modality data used to predict the prognosis and survival rate of oral cancer patients [13, 14]. Furthermore, there are no DL-based prognostic prediction studies of oral cancer that use molecular, cytological, or genomic data as predictors, especially during preoperative evaluation. Combining various types of oral cancer data in AI models could yield future prognostic prediction models that allow clinicians to decide on the most appropriate treatment plan and thereby increase the survival rate of oral cancer patients.
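As a rough illustration of how such models are structured, the sketch below outlines a DeepSurv-style network: a small multilayer perceptron that outputs a per-patient log-risk score and is trained with the negative Cox partial log-likelihood. The layer sizes are illustrative, the loss assumes no tied event times, and this is a sketch of the general approach rather than the implementation used in the cited studies.

```python
import torch
import torch.nn as nn

class DeepSurvLike(nn.Module):
    """Small MLP producing one log-risk score per patient (illustrative sizes)."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def neg_cox_partial_log_likelihood(log_risk, time, event):
    """Negative Cox partial log-likelihood (assumes no tied event times)."""
    order = torch.argsort(time, descending=True)         # risk set = all earlier positions
    log_risk, event = log_risk[order], event[order]
    log_risk_set = torch.logcumsumexp(log_risk, dim=0)   # log sum of exp(risk) over risk set
    return -torch.sum((log_risk - log_risk_set) * event) / event.sum()

# Hypothetical usage: 8 patients, 5 clinicopathological features.
x, time = torch.randn(8, 5), torch.rand(8)
event = torch.tensor([1., 0., 1., 1., 0., 1., 0., 1.])
model = DeepSurvLike(n_features=5)
loss = neg_cox_partial_log_likelihood(model(x), time, event)
loss.backward()  # gradients for one optimization step
```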

All the studies included in this systematic review highlighted that DL techniques can help clinicians make more precise, informed decisions. It should be emphasized, however, that almost all the included studies only evaluated the accuracy of the DL model, and only a few compared it against clinicians or experts. Furthermore, a fundamental requirement for the safe and efficient deployment of DL models in clinical practice is reliable generalizability, that is, maintained performance when the model is applied to external cases outside the data on which it was trained [8, 10]. International collaboration among multiple healthcare centers to collect data from multiple sources could therefore support the development of DL-based diagnostic and prognostic prediction systems with the potential for use in clinical practice. At present, there are no standard guidelines on the accuracy required of AI for clinical practice. Clinicians should understand that AI models are decision support tools to improve treatment effectiveness and efficiency, but that management decisions ultimately remain with the clinician.

This study has a number of strengths and limitations related to both the included studies and the review analysis. First, this review comprehensively and systematically appraised studies on DL for the diagnosis and prognostic prediction of oral cancer, thus allowing a narrative synthesis of the calculated DOR. Second, as a limitation, this study was restricted to the scope of DL in oral cancer, and the included studies were heterogeneous in data types and reported outcome parameters, which limited the synthesis. In addition, this systematic review did not analyze the diagnostic performance of classification studies with the receiver operating characteristic (ROC) curve, one of the most widely used methods for analyzing the diagnostic accuracy of classification models [78]. Future studies should critically define the reference test and patient selection by addressing the checklists for AI in medical diagnostic and prognostic studies [76, 77, 79], which would make it easier to assess their potential impact and clinical utility. Furthermore, many DL-based clinical image studies used image data from public databases and did not report a diagnostic biopsy of the lesions, an important ground truth for establishing the reliability of the data for pathology-related AI research. Future studies should therefore establish methods, beyond biopsy confirmation, to verify the reliability and ground truth of clinical image data from public databases for medical AI research.

Conclusions

This systematic review reveals an increasing number of DL studies in oral cancer using diverse architectures. The reported accuracy showed promising performance for diagnostic and prognostic analyses in studies of oral cancer. Furthermore, this systematic review found that different oral cancer data modalities in diagnostic studies affected the sensitivity and specificity of DL, which presents researchers with opportunities to apply DL algorithms to various data modalities. Finally, the application of DL in oral cancer appears to have potential utility in improving informed clinical decision-making and providing better diagnosis and prognosis of oral cancer. Future work to improve the explainability and interpretability of DL models and to adopt clinically applicable performance measures will be needed to translate these models into clinical practice.