Introduction

Background

Head and neck cancer is the seventh most common malignancy worldwide [1]. Tumors may affect the lips, oral cavity, nasal cavity, nasopharynx, oropharynx, hypopharynx, larynx, paranasal sinuses, and salivary glands with about 90 to 95% being of the squamous cell carcinoma variant [2, 3]. Predisposing factors for head and neck cancer are diverse including prolonged sunlight exposure; risk habits such as tobacco use, heavy alcohol consumption, and betel nut chewing; infectious agents such as human papillomavirus and Epstein Barr virus; lifestyle factors such as low fruit and vegetable consumption; and a family history of malignancies [3, 4]. Moreover, the prognosis of head and neck cancer is dismal with a five-year survival of about 50% to 60% for most affected sites [5].

To improve the screening, detection, and prognosis of head and neck cancer, artificial intelligence techniques and tools have been proposed to support clinicians in the diagnosis, decision-making, and risk stratification of the disease [6,7,8]. Specifically, machine learning (ML) models are increasingly being constructed to facilitate disease segmentation from different investigations during diagnosis, treatment planning, and treatment monitoring, as well as to predict risk status for screening, prognostic, and treatment complication outcomes with high accuracy [9, 10]. Moreover, several reports have shown that ML models outperform conventional statistical techniques in performing tasks related to head and neck cancer [6, 7, 11, 12].

One of the factors that may hamper the optimal performance of ML models and their generalizability is the quality of the data used for model construction [13, 14]. Moreover, ML models are expected to be better applied when they are developed with representative, high-quality datasets whose results can be replicated in different clinical scenarios and settings [13,14,15,16,17]. Several studies and reviews have focused mainly on the validity estimates and performance of ML models in head and neck cancer without considering whether data quality and standards are sufficient to enable the meaningful realization of the potential of ML and artificial intelligence (AI) in clinical practice [6,7,8,9,10,11,12]. To tackle this knowledge gap, this study overviews the quality of the structured and unstructured data that have been used to construct machine learning models for head and neck cancer, in order to identify current limitations to their optimal performance and generalizability based on the datasets employed.

Research questions

1. What category of datasets was mostly employed to model/predict head and neck cancer outcomes using machine learning?
Motivation: To highlight the common types of datasets and outcomes available or considered among researchers and clinicians for modeling head and neck cancer outcomes with machine learning algorithms.

2. How good are the datasets used to construct machine learning models for head and neck cancer outcomes?
Motivation: To evaluate the quality of the datasets that were used to implement machine learning models in head and neck cancer.

3. What data quality criteria were often fulfilled or deficient in datasets used to construct machine learning models for head and neck cancer outcomes?
Motivation: To determine the specific dimensions of data quality that were often met or lacking in datasets during the construction of machine learning models in head and neck cancer. The answer to this question will also assist relevant stakeholders (i.e., clinicians, health informaticians, engineers, developers, and researchers) in determining present limitations to obtaining optimal data quality in head and neck cancer machine learning prediction platforms. Knowledge of these limitations will highlight areas for improvement in future works.

4. What is the effect of data quality on the median performance metrics of the machine learning models constructed in head and neck cancer?
Motivation: To examine whether the dimensions of data quality impact the performances of machine learning models in head and neck cancer.

Related research

The utilization and implementation of AI and ML platforms in oncology have increased significantly in recent years [18,19,20,21]. Moreover, head and neck cancer represents one of the most commonly modeled malignancies using AI/ML techniques [8, 22, 23]. While this systematic literature review sought to examine the quality of the data used to construct ML models in head and neck cancer, we highlight previous relevant studies that examined machine learning models for head and neck cancer outcomes to showcase the significance and novelty of our contribution.

Patil et al. [24] reviewed and summarized reports from seven ML models and concluded that support vector machines (SVMs) were most often used with genomic datasets to predict head and neck cancer prognosis, with accuracies between 56.7% and 99.4%. Volpe et al. [25] reviewed 48 studies describing ML models in head and neck cancer from different imaging modalities to highlight their different potential clinical uses during radiation therapy. Bassani et al. [26] also reviewed 13 articles on diagnostic ML models in head and neck cancer and found that models largely had excellent accuracies above 90%; however, this study suggested that the models were based on small/restricted datasets and lacked heterogeneity. Giannitto et al. [27] recently overviewed eight studies that utilized radiomics-based machine learning models for head and neck cancer lymph node metastasis prediction and concluded that the models had sensitivities of 65.6% to 92.1% and specificities of 74% to 100%, respectively. Of note, this systematic review also assessed the methodology of the ML models and suggested that most of them had biased feature extraction and lacked external validation [27].

Other studies have also focused on ML models for specific head and neck cancer subtypes [6, 12, 28]. Alabi et al. [12], drawing on 41 studies, found that ML models could aid the diagnosis and prognosis of oral cancer with sensitivities ranging from 57% to 100% and specificities from 70% to 100%, although there were concerns about the explainability of the models. In another review focusing on deep learning models for oral cancer, Alabi et al. [28] found that the models had average AUCs of 0.96 and 0.97 for spectral datasets and computed tomography (CT) image datasets, respectively. However, ethical concerns were suggested to have limited the application of these deep learning models in oral cancer management [28]. Chiesa-Estomba et al. [29] reviewed eight studies reporting ML models for oral cancer survival and decision-making and suggested that the tools could potentially advance the prediction of these tasks in patient management, but highlighted the small amount of data available and the use of secondary datasets as limitations to their application. Of note, our group also showed that machine learning models had accuracies that ranged from 64% to 100% for different oral cancer outcomes but that the models were not streamlined enough for clinical application due to a lack of external validation, the prevalent use of small single-center datasets, and a lack of class imbalance correction [6].

Contributions and novelty

From the studies examined above, it is clear that the majority evaluated AI/ML models for head and neck cancer outcomes largely based on their performance measures and the modeling methodology/technique leading to the final model selection (a model-centric approach). Moreover, of the few studies that reported on the data quantity, feature selection, or class imbalance correction of the AI/ML models reviewed, none examined the effect of these parameters on the discriminatory or calibration performances of the models. Of note, no review comprehensively highlighted the different dimensions of data quality for assessment or questioned whether the data infrastructures and their quality were sufficient to encourage the meaningful application of AI/ML models in the management of head and neck cancer or its subtypes. Given this gap, our study sought to assess the datasets used for constructing ML models in head and neck cancer using disparate parameters for assessing data quality. Additionally, our contribution examines the relationship between the data quality assessment criteria and the performance of the ML models for head and neck cancer.

Overview of the study

The remainder of this article is broadly structured into “Review methodology”, “Results”, “Discussion”, and “Conclusion”. The “Review methodology” section details the process used to arrive at the final studies assessed in this systematic literature review. The data abstracted from the individual studies, their risk of bias ratings, and the methods for result synthesis are also detailed in this section. The “Results” section is further divided into two sub-sections, “ML models based on structured datasets” and “ML models based on unstructured datasets”, according to the type of datasets utilized for model construction. In each of the “Results” subsections (structured and unstructured data), we report the general characteristics of the studies, the risk of bias assessment findings, and the findings of the different quality assessment criteria used for data evaluation. The “Discussion” section discusses the findings of the data quality evaluation and presents the limitations of the review. Finally, the “Conclusion” section answers the aims and review questions of this study within the confines of its limitations and provides suggestions for future works.

Review methodology

Eligibility criteria

Original research articles published between January 2016 and June 2022 that reported on the use of machine learning models on custom datasets of head and neck cancer patients were sourced. The rationale for choosing this timeframe was the introduction of robust methodology and reporting standards, such as the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) and the Standards for Reporting Diagnostic Accuracy (STARD) statements, in 2015 [30, 31]. The scope of head and neck cancer included malignant tumors (carcinomas) of any histologic variant affecting the lips, oral cavity, nasal cavity, nasopharynx, oropharynx, hypopharynx, larynx, paranasal sinuses, and salivary glands (ICD-10 C00-14, C31 and C32). Machine learning algorithms were limited to conventional learning, deep learning, and deep hybrid learning models that were trained and had at least some form of internal validation using a train-test split, cross-validation, or bootstrap resampling. Likewise, structured and unstructured datasets sourced from health records (electronic or manual), clinical photographs, imaging modalities, spectroscopy techniques, digitized histology slides, and omics technologies were considered for inclusion.

Studies that included other diseases during modeling, explicitly conducted a pilot/preliminary/feasibility study as stated in the report, comprised unstructured data that were not in the form of images or spectra, or utilized nomograms to operationalize models were excluded. Further excluded were treatment planning studies that examined the utilization of ML models for the segmentation of normal structures or organs without emphasis on the tumor areas, and studies that were not full research articles. Also, short communications, letters to the editor, case series, and conference proceedings that were not peer-reviewed were not included. Studies that utilized public datasets (such as the TCGA or TCIA cohorts) during training were excluded following qualitative evaluation of related articles. The rationale for this exclusion was: (i) the preference of this study for custom datasets with clear reporting of the methods used for data collection; (ii) the abundance of duplicate studies using these cohorts, which would introduce bias during result synthesis; (iii) most studies based on public databases selected only a proportion of the total cohorts to predict head and neck cancer outcomes, which precluded the assessment of the entire database against any of the ML models; (iv) ML models using public databases focused on the ML approach and the improvement of AI techniques rather than validation and implementation; and (v) the reduced likelihood of clinical implementation for models trained using these public databases. For duplicate studies of different ML models conducted with the same custom dataset, only the first study was included, provided the dataset was not subsequently updated with new patients; otherwise, the updated study was included and the first set of models constructed was excluded.

Data sources and search strategy

Related studies were sourced from four electronic databases–PubMed, Web of Science, EMBASE, and Scopus. Search keywords were first selected based on words pertinent to the research questions and review objectives while also being in line with other literature reviews that assessed the performance of machine learning models in head and neck cancer [6, 7, 9, 12, 26, 28]. Search terms were then implemented in each database with Boolean operators as presented in Additional file 1: Table S1. Retrieved citations were exported to EndNote 20 (Clarivate Analytics, USA) for automated and manual deduplication.

Study selection

This study adopted a two-stage process for the selection of relevant articles. In the first stage, two authors independently screened the titles and abstracts of the citations to identify research articles on ML models constructed for head and neck cancer outcomes. In the second stage, the full-length texts of articles retained after screening were assessed strictly against the eligibility criteria. This was also performed in duplicate, and discordant selections were resolved through discussions between the authors. Agreement between the authors was the basis for the final study selection in this review. Following article selection, a supplementary manual search was also performed to bolster the electronic database searching and ensure that any eligible studies that had been missed were included.

Risk of bias assessment and data extraction

Quality rating of the selected studies was performed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [32], with domains evaluating the study participants, predictors, outcome, and analytical methods. These four domains were scored as high, low, or unclear using the signaling questions recommended for each aspect. The overall risk of bias was rated as ‘high’ if at least one domain was rated as high, and as ‘low’ if all the domains were rated as low. For studies where three domains were rated as low and one domain was rated as unclear, the overall risk of bias rating was deemed unclear.
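As a plain illustration of the aggregation rule described above (not part of the review’s tooling), the overall PROBAST rating can be expressed as a short function; the domain ordering and example ratings below are placeholders.

```python
# Minimal sketch of the PROBAST overall-rating rule described above.
# Each of the four domains (participants, predictors, outcome, analysis)
# receives a rating of "low", "high", or "unclear".

def overall_probast_rating(domain_ratings):
    """Aggregate four PROBAST domain ratings into an overall risk-of-bias rating."""
    if any(r == "high" for r in domain_ratings):
        return "high"      # at least one high-risk domain -> overall high
    if all(r == "low" for r in domain_ratings):
        return "low"       # all four domains low -> overall low
    return "unclear"       # e.g., three low and one unclear -> overall unclear

print(overall_probast_rating(["low", "low", "low", "unclear"]))  # -> unclear
```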

Data abstracted from the individual studies included general study characteristics such as first author names, publication year, study location, income class of the study location, number of centers, cancer type, model type, task performed, clinical utility, model outcome, data preprocessing, the use of a reporting guideline, type of validation, and model discrimination estimates (AUC and C-statistic). Data quality parameters were then extracted for all studies using the datasets that generated the results when available, or the study metadata when the dataset was absent. Data were described as structured if all the features were assigned continuous or categorical labels (including features that were extracted from images or spectra), while unstructured data referred to images or spectra that were utilized without deliberate feature extraction by deep learning or deep hybrid learning techniques. For structured data, the data type and source, multidimensionality, timing of dataset collection, class overlap, label purity, outcome class parity, feature extraction, feature relevance, collinearity, data fairness, data completeness, outlier detection, and data representativeness were assessed. In addition to some of the aforementioned parameters, image or spectra quality was assessed for studies that utilized unstructured datasets. Definitions and methods of assessing the individual data quality parameters are detailed in Additional file 1: Table S2.

Result synthesis and statistical analyses

Qualitative synthesis was adopted for summarizing the data extracted from individual studies. Descriptive statistics were computed and presented in the text and figures as proportions, medians, and interquartile ranges. Statistical differences between two or more categorical variables were assessed using Pearson’s Chi-square test or Fisher’s exact test, as appropriate. Likewise, median values of continuous variables across different categories were compared using the Mann–Whitney U test and the Kruskal–Wallis H test. SPSS v27 was used for the statistical analyses, with probability values below 0.05 indicating statistical significance. This review was conducted according to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guideline [33].
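For illustration, the comparisons described above map onto standard SciPy routines; the sketch below uses made-up numbers (not data from this review) to show how the tests would be applied.

```python
# Illustrative sketch of the statistical comparisons described above, using SciPy.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, mannwhitneyu, kruskal

# Chi-square (or Fisher's exact for small counts) for two categorical variables,
# e.g., clinical utility (risk prediction vs assisted diagnosis) by cancer subtype.
table = np.array([[22, 8], [11, 19]])
chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)   # preferred when expected counts are small

# Mann-Whitney U test comparing median AUCs between two data-quality categories,
# e.g., datasets with vs without class-imbalance correction.
auc_balanced   = [0.91, 0.88, 0.86, 0.93, 0.85]
auc_imbalanced = [0.80, 0.82, 0.79, 0.84, 0.78]
u_stat, p_mwu = mannwhitneyu(auc_balanced, auc_imbalanced, alternative="two-sided")

# Kruskal-Wallis H test when a quality parameter has more than two categories.
h_stat, p_kw = kruskal([0.84, 0.88, 0.90], [0.79, 0.81, 0.83], [0.86, 0.92, 0.89])

print(p_chi2, p_fisher, p_mwu, p_kw)   # significance threshold: p < 0.05
```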

Results

The flowchart depicting the screening and study selection process is presented in Fig. 1. Upon removing duplicates, 2851 citations were screened, and 528 articles were selected for full text evaluation. Three hundred and sixty-nine articles did not fulfil the eligibility criteria based on reasons stated in Fig. 1. Overall, 159 articles on ML models for head and neck cancer outcomes were included in this review [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192] with 106 models based on structured data while 53 articles applied the ML models on unstructured data. A list of all studies included is presented in Additional file 1: Tables S3 and S4.

Fig. 1 Flowchart detailing the search strategy and article selection process in the review

ML models based on structured datasets

General study characteristics

All 106 studies that utilized structured datasets were published between 2016 and 2022, with more studies in 2021 than in any other year (Fig. 2). The majority of the studies were conducted in China (26.4%) and the USA (16%), with only 5.7% of the studies conducted in India, which represented the only low or lower-middle income country. Based on the subtypes of head and neck cancer considered, ML models for structured datasets were used more often for oral cavity tumors (33%) than for nasopharyngeal (18.9%), oropharyngeal (14.2%), or other tumors (6.5%) (Fig. 2). Disparate head and neck cancer subtypes were considered together in 27.4% of the studies, and no studies on lip cancers were found. Models based on structured data more often utilized conventional machine learning (83%) than deep learning (15.1%) or deep hybrid learning (1.9%) algorithmic frameworks. Specifically, random forest (19.8%), regularized logistic regression (17.8%), and support vector machines (14%) were used more often than other algorithms.

Fig. 2 General characteristics and risk of bias rating for ML studies employing structured datasets. a Publication year trend. b Plot showing the number of patient datasets by the different head and neck cancer subtypes. c Plot showing the clinical utility of the models according to the different cancer subtypes. d Risk of bias rating for individual domains of the PROBAST tool

Most ML models for structured datasets were developed to perform classification tasks (96.2%) rather than segmentation (1.9%), clustering (0.9%), or regression tasks (0.9%). Likewise, the clinical utility of the models was more often risk prediction (60.4%) than assisted diagnosis (39.6%). When stratified by cancer subtype, most ML models on structured data for oropharyngeal (73.3%), nasopharyngeal (75%), and combined head and neck cancers (79.3%) were utilized for risk prediction, while models for oral cavity cancers (60%), salivary gland tumors (100%), and laryngeal cancer (100%) were mostly for assisted diagnosis (p = 0.002). Also, the cancer outcomes considered using this type of dataset included diagnosis (43.4%), prognosis (33.0%), treatment (16%), and screening (7.5%).

Risk of bias for individual studies

According to the PROBAST tool, only 3 of 106 studies had a low overall risk of bias rating (Fig. 2). Across the four domains, head and neck cancer studies fell short most often in their methodology of data analysis and ML modeling (96.2%). Common reasons for a high risk of bias rating in this domain were a low event-per-variable (EPV) ratio in the dataset used for training or a reduced sample size of patients with events during model validation. Likewise, 31.1% had a high bias rating in the ‘Predictors’ domain of the PROBAST tool, mostly because the features used for model construction were determined after the outcomes of the studies were known. Only 8.5% and 3.8% of studies had high bias ratings in the ‘Participants’ and ‘Outcome’ domains of the PROBAST tool, respectively (Fig. 2).
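As a simple illustration of the EPV criterion that drove many of these ratings, the sketch below uses hypothetical numbers; the threshold mentioned in the comment is the commonly cited rule of thumb rather than a value taken from this review.

```python
# Hypothetical sketch of the events-per-variable (EPV) check referenced above.

def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """EPV = number of outcome events divided by the number of candidate predictors."""
    return n_events / n_candidate_predictors

epv = events_per_variable(n_events=40, n_candidate_predictors=25)
print(f"EPV = {epv:.1f}")  # 1.6, well below the commonly cited minimum of ~10, suggesting overfitting risk
```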

Data quality of ML models based on structured datasets

Of note, only six models (5.7%) followed at least one reporting guideline for data and model description, four of which used the TRIPOD statement. Of the 106 ML models for head and neck cancer outcomes based on structured data, only nine had been validated on independent/external cohorts (8.5%). Sample sizes, based on the number of patients in the datasets, ranged from 12 to 27,455 with a median (IQR) of 165 (68–313) patients. Overall, 55.6% of models were based on a cohort size below 200 patients, while 85.9% of studies were based on a cohort size of less than 500 patients. According to the number of settings from which the datasets originated, the majority involved single centers (80.2%) while 11.3% were from two centers. Multi-institutional structured datasets were used in only 9 models (8.5%).

Based on the data types, most datasets comprised radiomic features (44.3%) and clinical features obtained from health records (40.6%). Molecular, pathological, and spectral datasets were used in 17%, 6.6%, and 5.7% of models, respectively. Feature extraction was performed in 66% of the ML models, and for studies that used images and reported their resolution (n = 20), relatively similar proportions of models used images below 224 × 224 pixels (45%) and images of 512 × 512 pixels and above (40%). Multidimensional data comprising different feature types were used in 17.9% of the ML models. Likewise, the pattern of data collection was more often retrospective (73.6%) than prospective (26.4%). Data preprocessing was performed and reported for most models (95.3%), while the quality of the structured datasets was deliberately assessed in few studies (14.2%).

Based on the class overlap for all studies that performed classification, most studies were satisfactory in the separation of the outcome class for all possible categories (80.2%; Fig. 3). Likewise, noisy/inconsistent instances were likely absent in 54.7% of datasets, with 36.8% adjusting for noise using dimensionality reduction techniques such as principal component analysis. However, class imbalance in the outcome category was observed in more datasets (47.1%) than limitations in class overlap or label purity in classification tasks. Feature relevance in many datasets was often adjusted using feature selection methods (72.6%), while 17% of studies that presented feature importance measures used ≥ 50% of features that were related to the outcomes of interest. Similarly, the predominant use of dimensionality reduction or feature selection techniques in structured head and neck cancer datasets also meant that redundant and correlated features were dropped or adjusted for most datasets (73.6%) during training (Fig. 3).
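To make these steps concrete, the sketch below strings together dimensionality reduction, univariate feature selection, and one common class-imbalance correction (SMOTE, via the imbalanced-learn package) on a synthetic dataset; the specific techniques, parameters, and data are illustrative assumptions, not those used in the reviewed studies.

```python
# Hedged sketch of typical preprocessing on a structured dataset:
# PCA to reduce noisy/correlated features, univariate selection of relevant
# features, and SMOTE oversampling to correct class imbalance.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

# Synthetic imbalanced dataset standing in for extracted radiomic/clinical features.
X, y = make_classification(n_samples=300, n_features=50, weights=[0.85, 0.15],
                           random_state=0)

X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)           # dimensionality reduction
X_selected = SelectKBest(f_classif, k=5).fit_transform(X_reduced, y)        # keep most outcome-related features
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X_selected, y)  # oversample minority class

print(int(sum(y)), int(sum(y_balanced)))  # minority-class count before vs after correction
```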

Fig. 3 Distribution of structured datasets according to the data quality assessment parameters

Of note, most of the datasets fell short regarding data fairness, as 95.3% of them had sample sizes that did not allow adequate representation of those with events across different levels of the predictors. Datasets were often complete (72.6%), owing to the predominant use of feature extraction techniques to generate structured data; however, missing data handling techniques were used in 23.6% of datasets, most of which involved variable exclusion. Outlier detection was performed for the datasets of three models (2.8%), and most datasets (73.6%) had outcome classes that were representative of all possible scenarios according to the aims of the studies (Fig. 3).
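For completeness, the sketch below shows one simple way an outlier check could be run on a structured feature (the interquartile-range rule); the feature values and threshold are illustrative only and are not drawn from the reviewed datasets.

```python
# Illustrative IQR-rule outlier check of the kind only a handful of datasets reported.
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

feature = np.array([2.1, 2.4, 2.2, 2.3, 9.8, 2.0, 2.5])
print(iqr_outlier_mask(feature))  # flags the 9.8 reading for review or exclusion
```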

Relating ML model performance and data quality parameters

The metrics of model discrimination in all studies ranged from 0.60 to 1.00 with a median value (IQR) of 0.85 (0.77–0.91). Table 1 shows the distribution of the median model discrimination metrics across the various data quality assessment parameters. Notably, models with a good balance in the outcome class had significantly higher median discrimination than those that did not adjust for class imbalance (0.88 vs 0.81, p = 0.014). Although not statistically significant, this review found a higher median model discrimination for datasets with less collinearity (0.86 vs 0.81) and datasets where outliers were detected and adjusted (0.90 vs 0.85).

Table 1 Crosstabulation of the data quality assessment parameters and the median model discrimination metrics obtained in individual studies with structured datasets

Datasets with a high EPV had lower median performances than those with a low EPV (0.77 vs 0.86). Likewise, data with outcome classes that were not representative of possible outcomes in clinical settings generated models with higher median performance than those with representative multiple classes (0.90 vs 0.84). As many models have not been externally validated, these findings may suggest overfitting among models with low EPV and unrepresentative outcome classes, and a reduction in performance may be expected during generalizability assessment.

ML models based on unstructured datasets

General study characteristics

The 53 ML models based on unstructured datasets were published between 2017 and 2022 (Fig. 4). Most models used datasets from Chinese populations (41.5%), and only 9.4% of studies were developed in low or lower-middle income countries (four from India and one from Pakistan). Similar to the distribution of head and neck cancer subtypes for structured datasets, oral cavity tumors were the most commonly modelled (28.3%), followed by nasopharyngeal cancer (22.6%), salivary gland tumors (11.3%), and laryngeal cancer (9.4%). Only one study was found for lip cancer and sinonasal tumors, while combined head and neck cancer subtypes were considered in 18.9% of models (Fig. 4). All of the models were deep learning models based on convolutional neural networks (CNNs). U-Net was the most common CNN framework used (15.1%), followed by Inception V3 (7.5%).

Fig. 4 General characteristics and risk of bias rating for ML studies employing unstructured datasets. a Publication year trend. b Plot showing the number of patient datasets by the different head and neck cancer subtypes. c Risk of bias rating for individual domains of the PROBAST tool

Regardless of the commonality of a segmentation CNN architecture, most of the deep learning models were developed to perform classification (81.1%) rather than segmentation tasks (18.9%). Unlike for structured datasets, the majority of the ML models for unstructured datasets were constructed for potential clinical utility in assisted diagnosis (90.6%) rather than risk prediction (9.4%). Of the models intended for risk prediction, four were related to disease prognosis while one was related to the risk of treatment complications.

Risk of bias for individual studies

The distribution of risk of bias ratings is shown in Fig. 4. Only three studies (5.7%) had a low overall risk of bias rating, while one study had an unclear overall rating owing to the lack of description of the patient cohort used. All others had a high risk of bias. Regarding the PROBAST domains, many studies had a high risk of bias in the analysis (79.2%) and predictors (79.2%) domains, mostly due to a reduced EPV in the training dataset, a lack of independent samples for validation, and the determination of predictors while the patient outcomes were known.

Data quality of ML models based on unstructured datasets

Only two studies used reporting guidelines (TRIPOD and STROBE) for data and model description. Further, only three models (5.7%) with unstructured datasets had been externally validated. For studies that reported the sample size based on patients (n = 47), cohort sizes ranged from 12 to 24,667 with a median (IQR) of 219 (102–502) patients. The unstructured data obtained for these patients in the form of images ranged from 72 to 840,000 images, with a median (IQR) of 2726 (1291–8776). Most datasets were from single centers (81.1%), with 9.4% from two centers and 9.4% from multiple centers.

Radiomics datasets were the most utilized for ML model construction (54.7%), the majority of which were either CT/PET images (30.2%) or MRI images (26.4%). Endoscopy images were used in 18.9% of the models, while histopathology slide images and clinical photographs were used in 13.2% and 9.4% of models, respectively. Raman spectra were used in two studies. Also, unstructured datasets were more often collected in a retrospective (88.7%) than a prospective (11.3%) manner. A multidimensional dataset (comprising different feature categories) was used in a single study (1.9%). Data preprocessing before model training was reported for 83% of datasets, with quality assessment reported for only 11.3% of datasets. Image resolution was mostly between 224 × 224 and 512 × 512 pixels (46.3%), with 22.9% of datasets comprising input images of 512 × 512 pixels and above.

For studies that performed classification tasks (n = 40), class overlap was absent in most datasets (93%) (Fig. 5). However, the sample size per patient and per image/spectrum was not sufficient in many models (69.8%). Some studies adjusted the sample size by incorporating different data augmentation techniques (9.4%), while 11.8% of datasets were deemed sufficient for model training and validation. Class parity was slightly more often insufficient (39.6%) than sufficient without imbalance correction (37.7%). Imbalance correction methods were introduced in 22.6% of the datasets. Also, a similar proportion of datasets had outcome classes that were and were not representative of possible clinical outcomes (50.9% vs 49.1%) (Fig. 5).
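As an illustration of the data augmentation approach some studies used to compensate for small image datasets, the sketch below shows a typical transform pipeline with torchvision; the specific operations and parameters are assumptions for demonstration, not those reported in the included studies.

```python
# Hedged sketch of common image augmentation for enlarging small unstructured datasets.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random crop resized to a 224 x 224 input
    transforms.RandomHorizontalFlip(p=0.5),                  # mirror the image half of the time
    transforms.RandomRotation(degrees=15),                   # small random rotation
    transforms.ColorJitter(brightness=0.1, contrast=0.1),    # mild intensity perturbation
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # apply per training image (pil_image: a PIL.Image instance)
```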

Fig. 5 Distribution of unstructured datasets according to the data quality assessment parameters

Relating ML model performance and data quality parameters

The model discrimination metrics in all studies with unstructured data ranged from 0.72 to 1.00 with a median value (IQR) of 0.85–0.98. According to the distribution of data assessment parameters, significant differences were observed in the median model discrimination metrics based on the image quality requirements and class overlap (Table 2). Median performance of ML models for image datasets with resolutions above 224 × 224 was higher than those models based on lower resolutions. Likewise, ML models based on datasets without class overlap had better median performance than the few with class overlap (0.95 vs 0.80).

Table 2 Crosstabulation of the data quality assessment parameters and the median model discrimination metrics obtained in individual studies with unstructured datasets

Discussion

This study overviewed structured and unstructured datasets employed for ML model construction in head and neck cancer and assessed their quality to unravel their limitations and improve future performance and the generalizability of the models.

Overall, the majority of the ML models reviewed used structured rather than unstructured datasets for their development. This greater utilization of structured over unstructured datasets may be attributed to the common choice of conventional ML architectures, such as support vector machines, k-nearest neighbors, and ensemble learners, that require highly organized data for their implementation [193]. Moreover, this finding is supported by the frequent application of feature extraction techniques based on size, shape, and texture to abstract radiomics features from raw medical images for training with these conventional classifiers [194, 195]. Likewise, electronic health records are one of the most abundant data sources in healthcare and are often obtained for ML applications as structured rather than unstructured datasets, which may contribute to this finding [196]. However, this finding also suggests a reduced application of AI architectures for the automated handling of unstructured image and spectra datasets (such as deep learning) in head and neck cancer [23, 197].

Notably, irrespective of the data type, many models had a high risk of bias that was often attributed to different issues in data sampling and resampling (i.e., cross-validation, bootstrapping) during model training. The lack of systematic ML modeling according to a standard guideline like the TRIPOD statement [31], the lack of external validation for most models, and the limited use of multidimensional features represent other factors that hamper their clinical utility in head and neck cancer management [13, 198]. Nonetheless, these issues are consistent with other reports on the limitations of AI models that are presently being constructed [27, 199, 200].

This study also noticed that, while data preprocessing seemed to be commonly performed in many studies, deliberate assessment of data quality was lacking. This may be due to a lack of awareness of the different dimensions involved in data quality evaluation according to the types of datasets available for model construction, and the lack of robust data-centric AI reports detailing the procedures involved in assessing data quality as part of data preprocessing [14]. Incorporating data quality dimensions into methodological and reporting guidelines/checklists may also help developers and investigators adopt this practice during ML model development [13].

This review observed that a lack of outlier detection, lack of data fairness, and imbalance in outcome classes represented the most common limitations of the structured datasets being utilized with ML models proposed for clinical utility in head and neck cancer. Of note, this study even observed a substantial decline in the median discriminatory performance of ML models with imbalanced outcome datasets that were not adjusted using correction techniques. As such, the current models proposed may suffer from errors during generalizability assessment, especially among minority and/or unconventional cases that are infrequently encountered. Furthermore, as most of the models based on structured datasets were developed for risk prediction, if the ML models were applied in patient cohorts or regions in which events were prevalent (dataset shift), most of these cases may be missed, resulting in poor performance of the models in real clinical scenarios [199]. Since the majority of the structured datasets were employed for classification, our findings on the effect of class parity/balance on model performance corroborate those of Budach et al. [14], who observed a moderate effect of class imbalance on classification tasks using machine learning. However, the effects of completeness, feature accuracy, and target accuracy on the performance of ML classifiers on structured datasets (depicted as data completeness and label purity in our study) were not observed. This may be due to differences in the assessment of model performance, as Budach et al. [14] utilized the F1-score to assess model performance whereas our study utilized the AUC, as it was the performance metric commonly reported in the included studies. Also, this disparity may be explained by the lower proportion of studies that had poor ratings in data completeness and label purity following quality evaluation in our study.

For unstructured datasets, this review showed that lack of data fairness, class imbalance, and the use of outcome classes that were not representative of the entire clinical spectrum of the head and neck cancer subtypes of interest were the most common limitations for ML model construction in head and neck oncology. However, we observed that these common limitations did not affect the discriminatory performance of the models. Of note, it was found that using overlapping outcome classes in unstructured datasets resulted in lower discriminatory performance. This suggests that it is preferable for the target classes of unstructured datasets to be used for multinomial/multilabel classification rather than binarizing the outcomes, which also groups the input features. For example, unstructured data from patients with benign head and neck conditions should not be combined with those of normal patients as a “non-malignant class”, and input features from patients with premalignant conditions should not be collated with those of patients with head and neck cancer as a “malignant class”. Likewise, this study observed that the image resolution of the input data may significantly affect the discriminatory performance of the model, based on a cutoff resolution of 224 × 224 pixels. This finding supports the observations of Sabottke et al. [201] and Thambawita et al. [202] on the increase in the performance of neural networks when high-resolution medical images were used for model training.

Interestingly, in both structured and unstructured datasets, data fairness represented a common limitation in obtaining quality datasets for ML model development but did not significantly affect the discriminatory performance of the models. This finding may be attributed to publication bias and the small-study effect, since models with high AUCs and C-statistics (> 0.9) were likely to be published irrespective of whether their sample sizes for training and validation were adequate [203, 204]. Future research should ensure that the quantity of data for training and validation is sufficient, and this could be supplemented using data augmentation techniques [205, 206]. Also, journals may employ/enforce the use of robust checklists pertinent to clinical model development to mitigate publication bias and allow authors to justify their methodology [30,31,32, 200].

Limitations

While this study uniquely overviews the data quality of ML models for head and neck cancer, it is not without limitations. First, a few large public databases with open-source datasets were excluded from the review. However, this was justified by the potential of these datasets to introduce bias into the study through duplication, unclear data collection methods, and the utilization of the databases for the improvement of ML techniques rather than their implementation/performance in modeling head and neck cancer outcomes. Second, some studies did not present the raw datasets used for ML modeling, and the dimensions of data quality were assessed using the metadata. However, only data quality evaluation criteria that could be reasonably determined using this information were selected in this study. Last, the majority of the studies included and assessed using the data quality dimensions utilized ML for classification tasks rather than segmentation, clustering, or regression, especially for structured datasets. Nonetheless, this reflects the real-world application of ML models as decision-support tools in head and neck cancer management.

Conclusions and recommendations

Overall, this review found that structured datasets were used more often than unstructured datasets to predict head and neck cancer outcomes using machine learning, largely due to the increased use of traditional machine learning architectures that require highly organized datasets. Data quality was infrequently assessed before the construction of ML models in head and neck cancer for both structured and unstructured datasets. Irrespective of the datasets employed for model construction, class imbalance and lack of data fairness represented the most common limitations. Also, in structured datasets, lack of outlier detection was common, while in unstructured datasets, the target classes were often not representative of the clinical spectrum of the disease conditions or health status. Class imbalance had the largest effect on the internal validation performance of the models for structured datasets, while image resolution and class overlap had the largest effects on the performance of ML models for unstructured datasets.

Based on the findings of this review and the need to bolster the implementation of ML models in contemporary head and neck cancer management, this study recommends that: (i) studies deliberately assess the quality of the structured/unstructured data used for model construction; this could be done using automated comprehensive tools (such as in [207]) or manually based on the individual data parameters assessed in this review (especially data fairness, outlier detection, class imbalance detection, influence of image resolution, and outcome class representativeness) [13, 14]; (ii) relevant stakeholders develop a standard for assessing and reporting data quality parameters for studies based on ML models; (iii) studies strive to adhere to reporting guidelines such as TRIPOD, PROBAST, or STARD to ensure standardization of the modeling approach; (iv) available models be retrained and externally validated using data of sufficient quality and quantity to fully evaluate their generalizability; and (v) avenues for model updating be provided during model development to facilitate retraining when data of sufficient quality are available in the future to ensure data fairness before implementation. Alternatively, online learning techniques may be adopted.