Introduction

Head and neck squamous cell cancers (HNSCC) constitute a diverse group of cancers arising from the head and neck region with common risk factors, natural history, and similar treatment principles. Worldwide, they are the 6th most common malignancy [1]. Despite advances in evaluation and treatment, outcomes of HNSCC continue to be poor. Radiotherapy, either as a primary treatment or in addition to surgery, is integral to the management of most patients with HNSCC. Radiotherapy delivered with curative intent for HNSCC is associated with significant toxicity in addition to suboptimal disease outcomes. However, there is substantial variability in outcomes; some patients are cured of their disease while others aren’t, and similarly, toxicities of treatment are minimal in some and excessive in others. This observation reflects the underlying heterogeneity among patients and their cancers, primarily a result of incomplete biological understanding of both the disease and the patient [2]. Several ongoing strategies are looking at addressing this inadequacy, including genomics, radiomics, mathematical models, and newer therapeutic strategies [2].

Machine learning (ML) has been increasingly utilized in recent years to advance medicine, including cancer treatment [3]. Artificial Intelligence (AI) has found application in various aspects of medicine, ranging from imaging (where it has been evaluated for screening, diagnosis, and prognostication of radiological, pathological, and other medical images) to treatment planning, execution, and follow-up. With radiation oncology in specific, AI has been explored for image segmentation, radiation dose optimization, quality assurance, and clinical decision support [4]. Regarding clinical decision-making, AI has significant potential in facilitating a better understanding of HNSCC and its treatment. A few publications have investigated the feasibility of the implementation of AI on clinical details of HNSCC patients to determine the various outcomes, such as 5-year recurrence rates [5,6,7,8]. Few studies illustrated how ML algorithms aided in predicting nodal metastasis in early oral squamous cell carcinoma using clinical and pathological data [9, 10]. Also, several studies have compared ML models and artificial neural networks to predict locoregional recurrence for HNSCC [11, 12]. These studies represent how ML models could effectively help doctors in clinical decision-making [13, 14].

Pre-processing steps are essential to make the given dataset suitable for ML algorithms. Studies have shown that techniques such as Ordinal Encoding, Onehotencoding, and Feature Hashing have been used to encode mixed data types, having numeric and categorical variables [15]. The dataset may consist of missing entries that must be dealt with beforehand. Studies have shown that various iterative imputation techniques work well on datasets with a significant amount of missing entries [16]. The dataset may contain more columns than rows. Therefore, it is essential to perform appropriate feature selection techniques. There are studies that describe methods of feature selection such as Principal Component Analysis [17], Independent Component Analysis [18], Filter based methods [19], Wrapper based approaches [19], and Embedded approaches [19]. Since we intended to identify prognostic features contributing the most to classification, and PCA converts the original dataset into its principal components, it was not considered [20, 21]. Studies have also found methods such as genetic algorithms suitable for imbalanced datasets for feature selection [22].

This study was conducted to identify if ML would be able to predict clinically pertinent outcomes in HNSCC treated with radiotherapy. This paper represents an approach to help design classification models using clinical data. The methods used to pre-process and clean the data are also elaborated upon. Additionally, a method to retain the original characteristics of the dataset in the testing dataset while performing minority class oversampling is illustrated. Finally, the classification results using the best-performing ML model are presented. Predictors having significant classification importance were identified to see if they could offer additional clinical understanding. Implementation of ML algorithms using the collected data was performed using python programming language, executed on Google Colaboratory platform.

Methodology

Collection and structuring of clinical data

A total of 311 patients with HNSCC treated at our center between 2013 and 2018 with radiotherapy were included in this study. After treatment, a patient should have had a minimum follow-up for three months. Raw data was collected from the hospital medical records. The collected dataset constituted of 401 non-mutually exclusive clinical variables. Broadly, the heads under which variables were collected are described in Table 1.

Table 1 Brief description of collected dataset

Data collected was structured using Excel sheets. Each row represents an individual patient. A single patient’s variables should not be present in multiple rows. Columns were structured such that all possible non-mutually exclusive variables (according to the possibilities found in collected 311 samples) were separated, and any given patient's data variable (samples used for validation of the designed algorithm) would be contained in this defined structure (Fig. 1).

Fig. 1
figure 1

Workflow of the study. Each model was developed in steps, as illustrated

According to the data type collected, Table 2 shows the output labels on which the input variables were meant to be mapped to the output vector using classification ML models (supervised learning). There are five models designed in this study (Table 2). Before beginning to build individual models, the column variables were segregated in such a way that every column held clinical meaning in use for the prediction of its respective output vector. For example, the events occurring after the start of treatment wouldn't be used to predict initial treatment, and late toxicities wouldn't be used to predict acute toxicity. Hence, the sample size varied while designing each model.

Table 2 Classification models designed along with their respective encoded output vectors

Figure 1 illustrates the workflow of our research. The data collection step involved collecting clinical data from medical records and structuring it into non-mutually exclusive columns. The data pre-processing step involved encoding of data, missing value imputation, and checking for class imbalance. If there was a class imbalance, then minority oversampling was performed. Further, the dataset was split into training and testing datasets, and feature scaling and feature selection were performed. The training data was made to fit on the ML algorithms, tuned on the hyperparameters. Testing dataset was used to generate performance metrics for each algorithm. Also, Shapely analysis was used to identify the list of variables contributing the most to classification.

Data encoding

The recorded input variables were of quantitative and categorical form. Quantitative attributes were recorded according to their respective measuring units. Ordinal attributes were encoded into an orderly numeric form using Label Encoder. Nominal attributes were encoded using OnehotEncoder [15, 23].

Missing value imputation using imputation algorithms

A retrospective data will have missing values by its nature because of various reasons, such as corrupt data, failure to retrieve the information, or incomplete extraction. If we attempt to delete the rows consisting of missing data, then sample size would reduce drastically, resulting in a poor machine learning classifier. The missing input dataset was imputed using Statistical Imputation, KNN Imputation, and Multiple imputation by chained equations (MICE) [16, 24]. After imputation from each algorithm, the imputed dataset was made to fit on the RF algorithm [23], and the accuracy generated was calculated as shown in Table 5, appendix. The model was evaluated using ten splits of k-fold cross-validation so that each sample would be a part of the training and testing dataset. This ensures the effectiveness of ML models in limited data. The best-performing missing value imputation algorithm was chosen for the dataset.

Handling class imbalance using minority oversampling and feature scaling

ML algorithm assumes that an approximately equal number of samples are present in each class of a dataset. But in a retrospective dataset and variable medical outcomes, class imbalances are common. To deal with such imbalance, for each model, the number of samples belonging to respective classes was counted. If the number of samples in each class were approximately equal, then oversampling step was skipped. Otherwise, the class imbalance was handled using oversampling techniques, including Random Oversampling, SMOTE, Borderline SMOTE, SVM SMOTE, and ADASYN algorithms for oversampling of minority class [25]. Performance was evaluated using each algorithm, and the best-performing metric was chosen. The shape of the new dataset was recorded to determine the dimensions. Newly added synthetic data was stored in separate variables similarly minority and majority dataset concerning classes were stored in separate variables [26].

The dataset consisting of samples only from the majority class was split into training and testing datasets. The ratio of 67:33 was maintained for training and testing. The training dataset for the model was built by adding a training split majority class dataset with the synthetic dataset. The testing dataset was built by adding the test split of the majority dataset with the minority data from the original dataset (Fig. 2). In this improvised method as against conventional technique, after performing minority oversampling on the original dataset, training and testing variables were generated such that the training samples would contain train-split of majority class samples and synthetically generated samples. Also, it was ensured that the testing dataset would contain test-split of majority and original minority samples. This customization was performed to extract more reliable performance from the testing dataset.

Fig. 2
figure 2

Flow Diagram for minority oversampling using an example of distant recurrence

The input variables present in the dataset were of different measuring scales and had different ranges of magnitude. Hence, the dataset was scaled using standard scalar function [23], such that all variables have zero mean and unit variance.

Feature selection

The collected dataset consisted of more features than the number of samples, thereby falling under the curse of dimensionality [27]. Therefore, feature selection was performed on the dataset. In this process, the variables contributing most to classification were selected. Boruta [28, 29] and Sequential forward floating selection (SFFS) [30, 31] methods were used for feature selection. Each method was evaluated using the performance metrics of three Machine Learning models- RF, KSVM [23], and XGBoost [32]. The performance results from SFFS were better and were therefore chosen as a feature selection method for the dataset. For the Boruta method, the RF algorithm was used as the base algorithm. For the SFFS method, the same algorithm was chosen to run as its base, on which the training dataset was fit. Both Boruta and SFFS were made to run in a setting where it will select k-best features out of all the features in the dataset, and precedence was given to the features having the highest accuracy.

Optimal hyperparameters for ML algorithms

The ML algorithms were made to run optimally for finding the best set of hyperparameters by fitting them on the given dataset. Using Gridsearch [23] approach, optimal hyperparameters were found for each RF, KSVM, and XGBoost model. These hyperparameters were used, both in the base algorithm while performing feature selection and on the ML algorithms while performing classification (Table 11, appendix).

The training dataset was fit on the ML model using the optimal hyperparameters derived using grid search. The new dataset with only the selected features was chosen, and the ML model was fit on the training dataset.

Testing the designed model

Testing of the designed model was done on the hold-out dataset, and its efficiency and reliability were determined using evaluation of performance metrics. This was the dataset that was not included while training the model. The results of classification prediction were compared with the output vector of the testing dataset to generate performance metrics [23]. We calculated Training and Testing Accuracy, Sensitivity, Specificity, F1 score [33] for both classes, and AUC of Receiver Operating Curve [34]. Each performance measure was recorded five times, and the average value was reported. Finally, the best performing ML algorithm for each model was made to run ten times, and the average performance measure value was reported. The most important features contributing to the classification were fetched using Shapely values [35]. The higher value on the labels yes or no indicates the stronger positive or negative correlation of a particular variable with respect to outcome. The mean AUC for each model was plotted, and the data was divided using stratified k fold cross-validation [36] to determine the best performance of the models. A higher AUC score represents that a classifier has a better performance in distinguishing between the two classes.

Results and analysis

A total of 311 patients with HNSCC treated with radiotherapy and having a follow-up of more than three months were found suitable for the study. Males constituted 257 patients, and the mean age of patients was 56.5 years. The average follow-up duration was 23 months.

During the various pre-processing steps, compared to statistical Imputation and KNN Imputation, the Iterative Imputation (MICE) algorithm gave the highest accuracy of 68.6% while running the algorithm for four iterations, having ‘ascending’ hyper-parameter with RF (Table 5; appendix).

All the results discussed below are reported as the mean of five iterations. The performance of RF, KSVM, and XGBoost were compared, and results of the best performing ML algorithm are reported.

For ‘choice of initial treatment,' no oversampling methods were performed due to insignificant class imbalance. RF gave the best performance, using 34 k-best features selected by SFFS. The mean training and testing accuracies were 94.3% and 84.58%, respectively. The sensitivity and specificity were calculated to be 85.1% and 85.7%, respectively. The algorithm tended towards overfitting, but comparatively, testing accuracy was better than the other two algorithms. Similarly, the F1 score was calculated to be the highest among the algorithms (85% for label 0 and 84.2% for label 1). Finally, the AUC was calculated to be 0.8904, signifying it to be a good classifier for the given dataset (Table 6, appendix).

For the remaining four models, all five oversampling techniques were applied, and each technique was evaluated using the three ML algorithms. ADASYN minority oversampling gave the best performance for ‘distant recurrence,' whereas the SMOTE minority oversampling technique performed best for locoregional recurrence, new primary and residual-disease predictions. The classification performance was determined using the mean accuracy of the testing dataset. The distant recurrence, locoregional recurrence, new-primary, and residual models had a testing accuracy (using KSVM) of 95.12%, 77.55%, 98.61%, and 92.25%, respectively. The respective sensitivity and specificity values were 90% and 98%, 95% and 89%, 100% and 98%, and 72% and 97%.

As a measure of precision and recall, F1 scores were calculated for both classes of each model. For distant recurrence, the mean F1-score was calculated to be 91% and 96.8% for minority and majority classes. For locoregional recurrence, they were 68.6% and 82.4%; for new primary, they were 95.4% and 98.8%; for residual disease, they were 87.6% and 94.6%.

The AUC of ROC curves for distant recurrence, locoregional recurrence, new-primary, and residual disease prediction models were 0.998, 0.9453, 0.9994, and 0.9948, respectively (Table 7, 8, 9, 10; appendix).

Finally, the best performing ML algorithm was iterated ten times to ensure the reliability of performance that is consistent for the complete dataset. The mean AUC scores were 0.977, 0.734, 0.983 and 0.993, respectively (Table 3). Also, recurring features identified after ten iterations of Shapely analysis provide the list of most contributing predictors for each model (Table 4).

Table 3 Results from test dataset for distant recurrence, locoregional recurrence, new primary and residual models using KSVM algorithm using its optimal hyperparameters
Table 4 List of common features selected by Shapely analysis

For visualizing the best performing ML algorithm for each model, ten splits of k-fold cross-validation were applied. The mean AUC and AUC for each iteration are presented (Fig. 3). The mean AUC for Choice of Initial Treatment, Distant Recurrence, Locoregional Recurrence, New Primary and Residual were 0.93 ± 0.07, 0.99 ± 0.00, 0.96 ± 0.02, 0.99 ± 0.00, and 0.98 ± 0.03, respectively.

Fig. 3
figure 3

Mean ROC plots for each model: a Choice of Initial Treatment, b Distant Recurrence, c Locoregional Recurrence, d New Primary, e Residual

Discussion

HNSCC merits significant advances in the way it is presently managed, and one of the potential strategies to achieve this is the ability to accurately predict outcomes. This study highlights the possible clinical utility of ML in the management of HNSCC. This approach aids, in addition to predicting the outcomes themselves, to also identifying the factors responsible for a particular outcome (event) occurring. For instance, fore-knowledge of whether a patient is likely to suffer a residual disease, locoregional, and/or a distant recurrence carries tremendous importance. HNSCCs are aggressive tumors, and especially in more advanced stages, recur in a substantial number of patients. Though locoregional recurrences constitute a significant proportion of recurrences, many present with distant recurrences too. It is currently difficult to predict recurrence patterns; the ability to foresee it could pave way for a greater understanding of the disease and help in initial treatment planning. For example, a patient inclined to recur at a distant site could be considered upfront for a more aggressive systemic therapy, and a patient at low risk of local recurrence could be considered for de-intensification strategies in locoregional treatment. Similarly, if the patient is expected to have residual disease following radiotherapy, alternative strategies, such as dose-escalation, chemotherapy intensification, or in some cases excluding radiotherapy completely, could greatly benefit the outcomes in terms of both tumor control and reduced toxicity. In our study, for example, the generated models could predict the risk of locoregional recurrence with reasonable accuracy of 73.45%.

Several researchers have looked into the benefit of ML in predicting treatment on a retrospective dataset of HNSCC patients, with very encouraging results. For instance, researchers from the University of Chicago reported that the ML algorithms successfully identified patients with intermediate-risk HNSCC who stood to benefit from concurrent chemotherapy from those who did not [37]. This study highlights the potential of ML in being a valuable tool for clinical use once the findings have been validated.

In addition, by looking at features that carried a higher weightage in a particular prediction, ML could also facilitate in developing a better understanding of the disease. As an example, for locoregional recurrence, among the feature weights, like presence and size of lymph nodes, presence of gross margin positive status following resection, the extent of primary and grade of tumor carried higher importance across multiple models. While these were expected factors, there were also some unusual factors having a bearing on local recurrence in some of the models, such as doses received by the organ at risk (normal structure) and use of chemotherapy- both of which could perhaps partially be explained by their likelihood of occurrence being high in more advanced disease. Similarly, most recurring factors weighing high in the prediction models for distant recurrence included the presence of residual disease, absence of locoregional recurrence, and development of new (metachronous) primary. Interestingly, in one of the models, the patients' address appeared to have a bearing on distant recurrence, with patients hailing from a particular district less prone to distant recurrence. This exercise highlights the potential of such an AI implementation in healthcare. Properly implemented, the possibilities are immense—all the way from the departmental level to the national and even international levels.

There is also the possibility of systematization of the practice of oncology. For example, multiple factors are taken into consideration in determining the initial choice of treatment for a patient with HNSCC; these include the site of primary, its locoregional extent, stage of the disease, the patient’s overall health, and presence of comorbidities, etc. However, despite established guidelines, the choice of treatment for an individual patient can be variable and is usually individualized. Decision-making in the multi-disciplinary field of oncology is therefore fairly complex and is fraught with differing opinions and a significant lack of consensus among the treating team. The potential benefits of being able to accurately predict treatment allocation include systematizing the process, in addition to gaining a better understanding of the complex interaction of different factors that make up the eventual decision.

Our study has several limitations. It included multiple sub-sites of HNSCC, which could reduce clarity. Another glaring problem was the quality and quantity of data; it investigated a single-center retrospective data, with a small number of patients with relatively poor follow-up information on outcomes. In addition to improving the accuracy, a greater sample size could have also avoided the curse of dimensionality. Similarly, a prospective implementation of such data capturing across multiple centers could help gain a much better understanding of such cancers. This work needs additional validation before it can be implemented in the clinic. Additional variables, including radiomics and advanced pathological data could also be incorporated to make a robust tool that can be confidently utilized by clinicians in decision-making.

Conclusion

With the data available for analysis in our study, the iterative imputation algorithm gave the best accuracy compared to Statistical and KNN Imputation while evaluating with RF. SFFS feature selection algorithm was used since its performance metrics were better than Boruta while evaluating with RF, KSVM, and XGBoost for all models. SMOTE and ADASYN gave the best minority oversampling performance metrics using the three ML algorithms. Intrinsic pre-processing, as applied in this research, greatly facilitated the performance of the designed ML models.

Thus, ML was able to predict several clinically relevant outcomes of patients with HNSCC receiving radiotherapy, using clinicopathological data, with reasonable accuracy. This could significantly impact the way patients are managed, leading to a better understanding of disease and improved outcomes. However, such findings need to be validated prospectively and across multiple centers before they can be introduced into routine clinical use.