Introduction

Oropharyngeal squamous cell carcinoma (OPSCC) is one of the most frequent head and neck cancers and is closely related to human papillomavirus (HPV) infection in the majority of cases [1]. Despite sharing the same anatomical location, HPV-positive and HPV-negative OPSCCs present crucial differences that must be taken into account by oncologists: 1) clinical presentation, as HPV-positive OPSCC symptoms are related to a neck mass due to nodal spread of disease, whereas patients with HPV-negative lesions present symptoms related to local growth of the primary tumour, such as odynophagia and dysphagia; 2) prognosis, as HPV-negative OPSCCs have lower survival and response rates to radio-chemotherapy than HPV-positive ones; 3) age, as patients affected by HPV-positive OPSCC are often younger than those with HPV-negative OPSCC [2]. Therefore, HPV status determines the appropriate therapy and follow-up plan. In patients affected by OPSCC, HPV status is routinely assessed on biopsied tissue by p16 immunostaining. However, surgical biopsy exposes patients to surgery-related complications, such as bleeding [3], and co-existing inflammatory changes in the specimen might decrease the sensitivity of the immunostaining [4].

Although several studies have described different imaging features useful to predict HPV status [5,6,7], this approach is not sufficiently reliable because radiological characteristics overlap [8]. To overcome the limitations of subjective medical image interpretation, several authors have investigated the potential utility of texture analysis in HPV status assessment [9, 10], since one of the aims of radiomics and machine learning (ML) is the conversion of medical images into quantitative, reader-independent data for predictive modelling [11, 12].

Radiomics refers to the analysis of large amounts of quantitative features extracted from medical images. These features include pixel grey level distribution parameters and data derived from texture analysis, which evaluates grey level value patterns in images. ML is a subfield of artificial intelligence that may be adopted to build classification or regression models from radiomics data through automated recognition of patterns in the data space, implementing predictive algorithms [11, 13].

Given this potential, the number of radiomic studies has recently grown dramatically, especially in oncological imaging [14, 15]. However, despite these efforts, these tools have not yet entered routine clinical use, for example because of the lack of technique standardization and external validation [14]. The increasing attention given to ML and radiomics has also resulted in a growing availability of quality assessment checklists, such as the Radiomic Quality Score (RQS) [12, 16, 17]. The strength of the RQS lies in its evaluation of different aspects of radiomic studies, ranging from image acquisition protocol to data sharing, grouped into six domains (protocol quality and reproducibility, feature selection and validation, biologic/clinical validation and utility, model performance index, high level of evidence, and open science and data). Each item contributes to a final percentage score for the paper, allowing for a quantitative assessment of methodological quality. The value of the RQS is also confirmed by its use across various topics in the recent literature [18,19,20]. An additional advantage of the RQS is that its final score can be used in statistical analyses with other variables. As also described in a previous report [19], radiomic studies are published in peer-reviewed journals specialized not only in radiology but in a variety of fields, demonstrating a widespread interest among the research community.

In the present systematic review, we aimed to assess the quality of the literature with the RQS and to evaluate the relationship between study quality and journal characteristics. In particular, our focus was on the current applications of radiomics in OPSCC imaging for the prediction of HPV status and on the association between study quality and indices commonly accepted as a proxy for research quality [21].

Methods

Article search strategy

The included studies were selected through a detailed search in the field of radiomics in head and neck oncology, conducted according to PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines. The study consisted of a systematic search of three electronic databases (PubMed, Web of Science and Scopus) using the following search terms in all possible combinations: radiomics, texture analysis, artificial intelligence, oropharyngeal cancer, Human papillomavirus. The search was finalized on September 1st, 2021. Additional details of the literature search are reported in the supplementary materials. As described in Fig. 1, letters, editorials, reviews, duplicates, and articles published in languages other than English were excluded from the analysis.

Fig. 1 Study selection process flowchart

Data extraction and analysis

The RQS is a scoring system used to assess the quality of radiomic analysis methodology by assigning a score for each item satisfied in the articles, divided into six domains (image protocol, radiomics features extraction, data analysis and statistics, model validation, clinical validity, and open science). The final score, which ranges from -8 to 36, is then converted to a percentage score (where 36 is equivalent to 100%) [12]. An overview of RQS items and respective scores is provided in Table 1.
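The conversion from total score to percentage can be illustrated with a minimal Python sketch. It assumes a simple linear mapping with 36 points equal to 100% and, as noted among the limitations of this review, all totals below zero mapped to 0%; the function name `rqs_percentage` is hypothetical and is not part of the RQS definition.

```python
RQS_MIN = -8   # lowest possible total score
RQS_MAX = 36   # highest possible total score, equivalent to 100%


def rqs_percentage(total_score: int) -> float:
    """Convert an RQS total score to a percentage score.

    Assumption: negative totals are floored to 0% before scaling.
    """
    if not RQS_MIN <= total_score <= RQS_MAX:
        raise ValueError(f"RQS total must lie in [{RQS_MIN}, {RQS_MAX}]")
    return max(total_score, 0) / RQS_MAX * 100


if __name__ == "__main__":
    # Totals taken from this review: lowest (-2), median (12), highest (15).
    for total in (-2, 12, 15):
        print(total, round(rqs_percentage(total)))
```

Under these assumptions, a total of 12 corresponds to about 33% and a total of 15 to about 42%, while a total of -2 is reported as 0%.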

Table 1 Overview of Radiomic Quality Score items and mode of the respective scores in the reviewed studies

The included full-text articles were independently evaluated using the RQS by three raters experienced in artificial intelligence and head and neck cancer (BLINDED: 5 years of experience, BLINDED and BLINDED: 2 years each). Inter-reader intraclass correlation coefficient (ICC) was assessed for both the RQS total and percentage scores, using a two-way, random-effects, single-rater, absolute agreement model.
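For readers unfamiliar with this ICC form, the following Python sketch shows how a two-way, random-effects, single-rater, absolute agreement coefficient (often written ICC(2,1)) can be derived from the mean squares of a two-way ANOVA without replication. It is an illustration only, not the analysis actually run for this review; the function name `icc_2_1` and the example scores are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single rater, absolute agreement.

    ratings: list of rows, one row per rated article, one column per rater.
    """
    n = len(ratings)          # number of rated articles
    k = len(ratings[0])       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 rows and >= 2 raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-article mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Hypothetical RQS totals from three raters for five articles.
    scores = [[12, 11, 13], [-2, 0, -1], [15, 14, 15], [8, 9, 7], [5, 5, 6]]
    print(round(icc_2_1(scores), 3))
```

Because the absolute agreement form keeps the between-rater mean square in the denominator, a rater who scores systematically higher or lower than the others lowers the coefficient, which a consistency-type ICC would not capture.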

Furthermore, the included studies were classified based on the following journal characteristics to assess their potential relation to the total RQS: 1) impact factor (JIF) quartile; 2) citation index (JCI) quartile; 3) publication year; 4) JIF; 5) clinical or imaging journal domain ("clinical" or "imaging", as attributed by Web of Science and described in [19]). JIF and JCI quartiles were obtained from Web of Science.

Statistical analysis

All the analyses were performed using the RQS percentage score obtained by the most experienced rater. The main analysis evaluated the relationship between study quality and journal features (JIF quartile, JCI quartile, publication year, JIF, clinical or imaging journal category). Normality of the data distribution was assessed with the Kolmogorov–Smirnov test. To compare groups of non-normally distributed variables, a Mann-Whitney U test was performed. Relationships between continuous variables with a non-normal distribution were examined using Spearman correlation (ρ). Continuous variables are presented as median and interquartile range (IQR), categorical ones as count and percentage. All statistical analyses were performed using SPSS (SPSS version 27; SPSS, Chicago, IL). A p value < 0.05 was considered statistically significant.
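The two rank-based statistics named above can be sketched in plain Python as follows. This is an illustrative sketch only: the analyses in this review were run in SPSS, the helper names (`average_ranks`, `mann_whitney_u`, `spearman_rho`) and the example values are hypothetical, and p-values are deliberately not computed here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = tied_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return the Mann-Whitney U statistic (the smaller of U1 and U2)."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u1 = rank_sum_x - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    rx, ry = average_ranks(list(x)), average_ranks(list(y))
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical RQS% values for two journal categories and hypothetical JIFs.
    clinical = [39, 42, 36, 33]
    imaging = [14, 33, 25, 19, 0]
    print("U =", mann_whitney_u(clinical, imaging))
    print("rho =", spearman_rho([2.1, 3.5, 4.0, 6.2], [14, 33, 25, 39]))
```

Both statistics depend only on the ordering of the values, which is why they are suited to scores that are not normally distributed.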

Results

Literature review

In total, 289 articles were obtained from the initial search, of which 119 were duplicates. Of the remaining 170, 151 were rejected based on the selection criteria. Finally, 19 articles were included in the systematic review. The selection flowchart is shown in Fig. 1, and the included articles are summarized in Table 2.

Table 2 Characteristics of included articles

RQS assessment

For both RQS total and percentage scores, ICC showed high agreement (89%; detailed information in supplementary materials). Supplementary tables S1-S3 report RQS item and total scores assigned by each rater to all the included articles. The quality of the included studies was very low: the median total score was 12 (range -2 to 15), corresponding to a median percentage score of 33 (14; 39) (Figure S1). Overall, 16 out of 19 (84%) studies included a well-documented imaging protocol, but none used a public protocol. Only 6 articles (31%) included multiple segmentations by different physicians/algorithms/software in the radiomic pipeline. No study included a phantom study or imaging at multiple time points. More than half of the studies performed feature reduction (13/19, 68%), while in 5 articles validation was completely missing. Although only 2 (10%) radiomic analyses included non-radiomic features, 13/19 (68%) demonstrated phenotypic differences, and all were compared with the current gold standard method. The main limitations observed in the included articles were the rare re-application of a previously published cut-off and the rare use of calibration statistics, performed in only 2 studies (10%) and 1 study (5%), respectively. However, almost all studies (17/19, 89%) employed discrimination statistics. No researchers registered a prospective study in a trial database or performed a cost-effectiveness analysis, but 4 investigators did share data or code. Elhalawani shared the dataset generated for the study in the Figshare repository [22], other authors shared their code [23, 24], while Suh shared the datasets and the analysis on reasonable request [25].

Subgroup analysis

Journal characteristics are summarized in Table 2, with 11/19 (57%) articles published in imaging journals. Journal quality indicators varied widely (mean JIF 4.03 ± 1.86); about half of the articles were published in high-profile journals in their research field (10/19, 52% in Q1 journals according to JIF; 9/19, 47% in Q1 journals by JCI). Most papers were published in 2018 or later (15/19, 78%). More than half were based on CT images (13/19, 68%), 4/19 (21%) on MRI, and 2/19 (11%) on PET/CT. A deep learning (DL) analysis was carried out for OPSCC HPV status assessment in only 2 cases. The subgroup analysis according to journal type demonstrated a higher RQS in articles published in clinical journals (Fig. 2). However, no association was observed between any single item score and the journal type (clinical or radiological). Furthermore, no significant correlations were found between RQS and JIF, JIF/JCI quartile or year of publication (details reported in supplementary material). Figs. 3 and 4 show the median RQS% per year and the distribution of RQS% across all articles, respectively. The RQS% of the 19 studies according to the six key domains is reported in Fig. 5.

Fig. 2 Distribution of the RQS in clinical and imaging journals

Fig. 3 Distribution of median RQS% per year

Fig. 4 Normed histogram density distribution plot (bin value = 10) and kernel density plot of RQS% scores of the included articles

Fig. 5 RQS% of the 19 studies according to the six key domains

Discussion

In the present systematic review, 19 radiomics and ML investigations on OPSCC published in the recent literature were evaluated. In this setting, one of the most crucial issues in clinical practice is HPV status evaluation [26], and conventional imaging cannot yet reliably replace the current gold standard (expression of p16 protein assessed via immunohistochemistry on tissue specimens [27]), despite various attempts [28,29,30]. The quality of included studies was very low (median total score of 12, corresponding to a median percentage score of 33), with the highest RQS equal to 15 (42%). A significantly higher RQS was found in clinical journals compared to radiological ones, while no correlations were observed between RQS and other journal characteristics (JIF, JIF/JCI quartile or year of publication).

Radiomics and texture analysis have been used in an attempt to fill gaps in oral oncology and improve the performance of medical imaging. In recent years, several authors have attempted radiomic modelling for HPV status prediction based on different imaging techniques, namely CT, MRI, or PET/CT. Interestingly, only a minority of papers employed DL, despite the increasing attention given to this approach [31]. Although MRI has demonstrated its usefulness in HPV status assessment [32, 33], most of the studies in our review focused on CT images. This could be due to two factors: 1) the wider availability of CT in most hospitals [34]; 2) the greater variability of MRI, which depends on acquisition parameters as well as on the scanner used [35]. Radiomic features extracted from CT images have shown potential for HPV status prediction in both internal [8, 35,36,37,38] and external validation [39]. Other authors employed T1-weighted post-contrast [25, 40, 41] and ADC [25, 42] images on MRI, and a combination of PET-based and CT-based radiomics on PET/CT [23]. In some cases, specific steps within the radiomic pipeline were also explored, such as the comparison between 2D and 3D segmentation [43] and variations due to different CT scanners [44]. These investigations are valuable, as limitations in the reproducibility of radiomics have been reported due to different CT reconstruction algorithms and image noise [45].

The RQS is one of the best-known quality assessment checklists in the field of radiomics and was used to evaluate each paper's strengths and weaknesses. Proposed by Lambin et al. [12], this score is composed of various items designed to reflect commonly employed steps in radiomic analysis pipelines, allowing quantitative and reproducible evaluation by peers. Although this score may be excessively strict when considering the practical issues of medical imaging research, it still represents a valuable and well-known tool [19]. As in other systematic reviews in other oncological imaging fields [20, 46, 47], the quality of included studies was very low, with the RQS ranging from -2 to 15 (0 to 42% when expressed as a percentage). In line with previous investigations [20, 46], some RQS items were satisfied to a greater extent than others. More than half of the articles performed feature reduction, decreasing the risk of overfitting, and included non-radiomic features in a multivariable analysis. To demonstrate the utility of radiomics, all studies included a comparison to the current gold standard method, which is not always the case in other RQS systematic reviews. Some commonly missing steps were also recognised: fewer than 15% of radiomics pipelines comprised a cut-off analysis based on previously published reports, and no cost-effectiveness analyses, phantom studies, or multiple time-point imaging were available. Open science implementation was also limited, with publicly available datasets used in very few cases, despite the advantages they might provide for testing the reproducibility of proposed radiomics-based predictive models. Furthermore, in over a quarter of the included articles final model validation was entirely missing. Actual clinical implementation of these results will require more robust validation and possibly studies dedicated to this task.

Additional proxy quality indicators were included in our study. As journal quality indices, JIF and JIF quartiles were selected. The JIF is an index of the relevance of a journal in its field of research, calculated as the average number of citations received in a given year by articles the journal published during the previous two years [48]. Quartile rankings by JIF and JCI were used as further journal performance indicators. These provide additional insights; the JCI in particular reflects a 3-year citation window and, unlike the JIF, is a field-normalized citation metric [49]. Since Lambin et al. proposed the total score expressed as a percentage for quality assessment, the association between this score and journal characteristics was evaluated. Interestingly, a significantly higher RQS was found in clinical journals compared to radiological ones. Some RQS items, such as multivariable analysis with non-radiomic features and comparison to a gold standard, probably benefit from the inclusion of a clinical researcher among the authors.

On the other hand, no significant correlation was found between RQS and either JIF or JIF quartile. We also did not find a significant increase in RQS in relation to the year of publication. Similarly, the highest score was not associated with the higher-ranked journals in terms of JIF or JIF ranking. These results can be interpreted in different ways: i) a lack of uniformity in the evaluation of radiomics/ML quality by reviewers, supported by the absence of association between excellence of the investigation and performance of the journal; ii) as hypothesized by another systematic review [19], RQS items may not reflect journal or reviewer points of focus, such as patient selection criteria and the topic of the analysis; iii) the items proposed by Lambin et al. [12] in the RQS may be too technical for general peer review. The results of our analysis confirmed findings from previous reviews: not only was the median RQS in OPSCC articles in line with that reported by other authors [20, 50], but the JIF was also not related to the quality of radiomic analysis [19]. However, our findings suggested that a higher RQS was found in clinical journals, contrary to what was reported in our previous work [19] and by Park et al. [50], although the latter described a non-significant trend for higher RQS in clinical journals [50].

The present review has some limitations. First, the number of included studies was small, and they were heterogeneous in terms of design and imaging modalities (MRI, CT, PET/CT). Second, we included journal quality indicators, such as JIF, JIF quartile and JCI, which are themselves influenced by potential biases [51], although JIF is universally recognised as a valuable indicator [52]. Third, as proposed by Lambin et al. [12], the RQS should be expressed as a percentage score, but scores less than zero are all converted to 0%, losing the differences between all scores ranging from -8 to 0.

In conclusion, radiomics and ML studies for the prediction of HPV status in OPSCC have demonstrated low overall quality according to the RQS. While study quality was not related to journal quality, the articles with the highest RQS were found in clinical journals. Future investigations in this field should take into account the issues highlighted in this review in order to improve upon previous experiences and facilitate the translation of promising research results to real-world clinical practice.