Introduction

Head and neck squamous cell carcinoma (HNSCC) is the sixth leading cancer by incidence worldwide and accounts for >90% of head and neck cancers, with an annual incidence of over 550,000 cases and around 300,000 deaths each year1.

The 5-year overall survival for HNSCC is 40–50% and more than two thirds of patients present with locally advanced disease mandating accurate staging2. Recurrence rates as high as 60% within 2 years of treatment have been reported with 20–30% of patients developing distant metastatic disease3. For loco-regionally advanced HNSCC, chemoradiotherapy (CRT) has increasingly become a standard of care.

Fluorine-18 fluorodeoxyglucose (FDG) positron emission tomography – computed tomography (PET-CT) is central to characterising loco-regional and distant disease at initial staging and has an increasing role in post-treatment response assessment4. Randomised controlled trial data have shown that PET-CT performed post CRT is an accurate and cost-effective technique for assessing response and can spare 80% of patients from unnecessary neck dissection5. However, post-treatment changes in the neck can make assessment difficult in some cases. In addition, evidence suggests that human papilloma virus (HPV)-positive HNSCC behaves differently to HPV-negative disease, and the specific test characteristics of PET-CT for assessing treatment response in HPV-negative HNSCC remain unclear5,6.

Semi-quantitative methods of treatment response assessment using standardised uptake value (SUV) have not been shown to be accurate at predicting patient outcome, which has led to the development of more reproducible qualitative interpretative criteria (IC) to assess post-treatment response7,8,9,10,11. Heterogeneity in the criteria used for assessment also limits comparison between different response assessment studies.

More recently, qualitative IC such as the Porceddu, Hopkins and Deauville scoring systems (Table 1) have been developed and validated in HNSCC response assessment12,13,14. These rely on visual inspection of the relative difference in tumour metabolism compared to surrounding normal tissue and/or background uptake, which is the internal jugular vein for Hopkins and the mediastinal blood pool for Deauville. Both Hopkins and Deauville criteria use 5-point scales; however, scores 1 and 2 in both systems effectively represent a complete metabolic response. Porceddu criteria employ a 3-point scale which classifies scans as positive, negative or equivocal based on whether there is FDG activity greater than adjacent normal tissues and/or liver12.

Table 1 Response interpretation criteria and explanation of each category.

Several studies have reported that qualitative assessment methods are useful for predicting regional control and can help minimise the number of equivocal scan results13,15,16. In 2016, the American College of Radiology convened a Neck Imaging Reporting and Data Systems (NI-RADS) Committee, which developed a template to help distinguish benign post-treatment changes from residual or recurrent tumour17.

Currently there is no clear consensus regarding the optimal IC to use in this clinical scenario. The classification of ‘equivocal’ cases varies depending on which IC is used, and differences remain in how these patients are subsequently managed (for example, invasive neck dissection versus further follow-up imaging and clinical examination), given the difficulty in differentiating a benign post-treatment response from residual/recurrent tumour18.

The primary objective of this study was to assess comparative accuracy and prognostic ability of the 4 different IC (NI-RADS, Porceddu, Hopkins and Deauville) in a large cohort of HNSCC patients treated with curative-intent (chemo)radiotherapy for predicting local and regional disease control and progression free survival (PFS).

Methods

Patient cohort

The study involved retrospective analysis of a prospective database performed under a waiver of informed consent and ethics approval by the Institutional Review Board. Prospective consent was obtained from all patients for use of their PET-CT imaging data in research and service development projects. Consecutive patients with histologically confirmed HNSCC treated at a tertiary referral centre between August 2008 and May 2017 with curative-intent non-surgical treatment (radiotherapy alone or chemoradiotherapy) who had undergone baseline and response assessment FDG PET-CT were included. Our institutional protocol is for response assessment PET-CT to be performed approximately 4 months after treatment. Demographics, baseline characteristics, staging, treatment and outcome details were retrieved from the institutional electronic patient record (PPM+, Leeds, United Kingdom). Exclusion criteria were: nasopharyngeal carcinoma; previous resection of primary or nodal disease; prior radiotherapy; and FDG PET-CT performed only at baseline or only for response assessment.

Treatment

Patients were treated with either three-dimensional (3D)-conformal radiotherapy or intensity-modulated radiotherapy (IMRT), which was gradually introduced into routine clinical practice from 2010. The 3D-conformal radiotherapy technique19 and IMRT20 have been previously described. Institutional protocols were followed with a radical treatment dose of 70 Gy in 35 fractions over 7 weeks or 65 Gy in 30 fractions over 6 weeks, with lower doses to prophylactic dose regions (54–63 Gy in 35 fractions over 7 weeks).

Induction chemotherapy with docetaxel, cisplatin and 5-fluorouracil (TPF) or cisplatin and 5-fluorouracil (PF) was delivered to a proportion of patients as previously described21. Concurrent chemotherapy routinely consisted of cisplatin 100 mg/m² on days 1 and 29.

Response assessment and follow-up

Tumour response was routinely assessed by clinical examination, naso-endoscopy where appropriate and FDG PET-CT approximately 4 months after completing treatment. Examination under anaesthetic and biopsies were performed at clinical discretion following response assessment. In general, patients who achieved a complete metabolic response did not undergo biopsy. Patients with less than a complete response were managed on an individual basis based upon discussion at a multidisciplinary team meeting. Subsequently, patients were followed up with physical examination and flexible endoscopy every 6–8 weeks in the first year after treatment, every 3 months for an additional 2 years and every 6 months until discharge at 5 years22.

PET-CT technique

FDG PET-CT examinations prior to June 2010 were performed on a 16-slice Discovery STE PET-CT scanner (GE Healthcare, Chicago, Illinois, USA) and from June 2010 to October 2015 on a 64-slice Gemini TF64 scanner (Philips Healthcare, Best, Netherlands). After October 2015 all scans were performed on a 64-slice Discovery 710 scanner (GE Healthcare, Chicago, Illinois, USA). Serum blood glucose was routinely checked and scanning was not performed if it was >10 mmol/L. Patients fasted for 6 hours prior to intravenous Fluorine-18 FDG injection (dose varied according to patient body weight). PET acquisition from skull vertex to upper thighs was performed 60 minutes after tracer injection. A silence protocol was employed in the uptake period following tracer injection to minimize physiological tracer activity within the head and neck region. The CT component was performed according to a standardized protocol (without the use of iodinated contrast medium) with the following settings: 120 kV; auto-modulated mAs; tube rotation time, 0.5 seconds per rotation; pitch, 6; section thickness, 2.5 mm (to match the PET section thickness).

Patients maintained normal shallow respiration during the CT acquisition. Images were reconstructed using a standard ordered subset expectation maximization (OSEM) algorithm with CT for attenuation correction. Both non-attenuation-corrected and attenuation corrected datasets were reconstructed.

Image analysis

All response assessment PET-CT studies were evaluated by a trainee radiologist under supervision of a dual-accredited Radiologist & Nuclear Medicine Physician with 15 years’ experience of reporting oncological PET-CT, using specialised software (Advantage Windows Version 4.5, GE Healthcare, Chicago, Illinois, USA), and each of the four IC was applied. To allow direct comparison of all four response assessment scales, each scale was re-classified into a 4-point scale as shown in Table 2, with complete response, partial response, indeterminate and progressive disease categories. Representative examples of these 4 categories are shown in Fig. 1.

Table 2 Harmonisation process of each interpretative criteria into standardized 4-point scales.
Figure 1

Representative cases illustrating post-harmonisation interpretative categories (1 to 4) pre and post-treatment. Row 1 – Complete response, Row 2 – Indeterminate, Row 3 – Partial response, Row 4 – Progressive disease.

Clinical follow-up

Follow-up was defined from final fraction of radiotherapy treatment. Disease status post-treatment was determined from pathology and/or radiology correlation with review of electronic patient records for clinical outcome. In patients who did not receive a biopsy/surgical intervention, serial negative physical examinations over the follow-up period and any relevant imaging investigations were used as confirmation of disease-free status.

Statistical analysis

Survival and recurrence times were defined from the final fraction of radiotherapy treatment. The following diagnostic performance metrics were calculated for each IC, applied to both primary tumour and nodes: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and overall accuracy. Performance was also analysed in sub-groups comprising HPV-positive oropharyngeal cancers (OPC), HPV-negative OPC and hypopharynx/larynx cancers.
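The five metrics listed above follow directly from a 2×2 table of scan result against eventual disease status. The following Python sketch shows the standard definitions only; it is not the SPSS procedure used in this study. The names `Counts` and `diagnostic_metrics` are hypothetical, and the counts in the usage line are invented for illustration and are not study data.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Counts:
    """2x2 counts; 'positive' means the scan suggests residual/recurrent disease."""
    tp: int  # scan positive, disease confirmed during follow-up
    fp: int  # scan positive, patient remained disease-free
    tn: int  # scan negative, patient remained disease-free
    fn: int  # scan negative, disease confirmed during follow-up


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a sub-group has no cases in a margin.
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(c: Counts) -> dict:
    """Standard diagnostic performance metrics from 2x2 counts."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "accuracy": _ratio(c.tp + c.tn, total),
    }


# Invented counts, for illustration only (not taken from the study).
example = diagnostic_metrics(Counts(tp=8, fp=2, tn=18, fn=2))
```

Because NPV depends on the false-negative count, any change in how long after the scan a recurrence still counts as a false negative changes the NPV without any change in the scans themselves.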

Univariate association between recurrence (local and/or regional and/or distant) and each adjusted response assessment score (1–4) was estimated by the Chi-squared test. Kaplan-Meier analysis and Cox proportional hazards regression analyses were performed for each IC to assess cumulative progression free survival (PFS) and time to death (overall survival, OS) or progression. Log-rank testing was used to compare survival between the response categories within each IC. Receiver-operating characteristic (ROC) curve analysis was performed for each IC. The statistical significance level was set at P < 0.05. All statistical tests were performed using SPSS for Windows software (Version 21.0; IBM Corp., Armonk, New York, USA).
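For readers less familiar with the Kaplan-Meier method, the product-limit estimate can be sketched in a few lines of Python. This is a minimal illustration of the estimator, not the SPSS implementation used for the analysis; the function name `kaplan_meier` and the toy follow-up data in the usage line are assumptions made for the example.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up times (e.g. months from the final radiotherapy fraction)
    events : 1 if the event (progression or death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    observed = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # remove events and censored cases after time t
    return curve


# Toy data for illustration only: five patients, two of them censored.
toy_curve = kaplan_meier([2, 4, 4, 6, 8], [1, 1, 0, 0, 1])
```

Applying the function separately to the patients in each response category gives the stepped curves that the log-rank test then compares.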

Results

Patient characteristics

A total of 562 patients were included in analysis. Detailed patient characteristics are provided in Table 3. Median age was 58 years (range 24–84). The median (range) of baseline tumour SUVmax and nodal SUVmax were 11.0 (0–53) and 8.1 (0–34) respectively. Median response tumour SUVmax was 1.7 (0–14.3) and nodal SUVmax was 1.0 (0–12.5).

Table 3 Patient characteristics.

Outcomes

Median follow-up period was 26 months (range 3–148 months). Median time from end of treatment to response assessment PET-CT was 17 weeks (range 6–31 weeks). 2-year survival outcomes were as follows: PFS 73%; OS 79%; local PFS 89%; regional PFS 85%; distant PFS 88%.

130 patients (23%) died in the study period with 432 patients (77%) alive at the time of analysis. 13 patients (2%) died within 6 months of treatment: one from a sudden cardiac event, two from tumour haemorrhage and 10 from disease progression. During follow-up, 156 patients (28%) developed progressive disease: 31 (20%) at the primary tumour site only (local failure), 42 (27%) at a regional nodal site only (regional failure), 16 (10%) at both the primary tumour and nodal site (loco-regional failure) without distant metastases and 35 (22%) had distant metastases only. 32 patients (21%) had local and/or regional failure with distant metastases. 11 cases (7%) of progressive disease were biopsy proven, 144 (92%) were based on radiology and 1 was a clinical diagnosis. 22 of 35 patients who developed distant metastases had these detected on response assessment PET-CT; the remaining 13 patients developed metastatic disease subsequently. Median time to loco-regional recurrence was 4 months (range 2–53).
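The failure-pattern percentages above are expressed relative to the 156 patients with progressive disease rather than the full cohort of 562. A short Python check, using only the counts reported in the text, confirms that the five patterns are mutually exclusive and sum to 156. The helper name `percentage_shares` is hypothetical.

```python
def percentage_shares(counts, total):
    """Round each count to a whole-number percentage of `total`."""
    return {name: round(100 * n / total) for name, n in counts.items()}


COHORT = 562      # patients included in the analysis
PROGRESSED = 156  # patients who developed progressive disease

# Counts as reported in the text.
failure_patterns = {
    "local only": 31,
    "regional only": 42,
    "loco-regional without distant": 16,
    "distant only": 35,
    "local and/or regional with distant": 32,
}

# The five patterns account for every patient with progressive disease.
patterns_total = sum(failure_patterns.values())

pattern_shares = percentage_shares(failure_patterns, PROGRESSED)
cohort_shares = percentage_shares(
    {"progressed": PROGRESSED, "died": 130, "alive": 432}, COHORT
)
```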

Kaplan-Meier analysis, the log-rank test and Cox proportional hazards regression analyses showed significant differences in PFS and OS between response categories classified by each of the four IC (p < 0.0001). Pairwise log-rank results are provided in the supplementary information. The survival curves pre and post harmonisation are shown in Figs. 2 and 3.

Figure 2

Kaplan-Meier plots of progression free survival based on each primary interpretative criteria before harmonisation. NI-RADS criteria (Categories 1–4), Porceddu criteria (Categories 1–3), Hopkins criteria (Categories 1–5) and Deauville criteria (Categories 1–5).

Figure 3

Kaplan-Meier plots of progression free survival based on each interpretative criteria post-harmonisation into 4-point scales.

Indeterminate cases

The number of indeterminate scores varied for each IC as shown in Table 4. With regard to primary tumour, NI-RADS classified 91 patients as indeterminate compared to 25 for Porceddu, 20 for Deauville and 13 for Hopkins. Overall, the NI-RADS IC scored more cases as indeterminate (i.e. equivocal) than the other 3 IC combined. Hopkins scored the fewest indeterminate cases.

Table 4 Indeterminate scores as categorised according to different interpretative criteria for each analysis group (All primary tumour, all node, HPV-positive OPC, HPV-negative OPC and hypopharynx/larynx sub-groups).

Diagnostic performance of interpretation criteria

The diagnostic performance of each IC in predicting disease control with regard to primary tumour, nodal disease, HPV-positive OPC, HPV-negative OPC and hypopharynx/larynx sub-groups are displayed in Table 5. The performance of each IC in predicting complete response and progressive disease in the indeterminate groups is shown in Table 6.

Table 5 Diagnostic performance of interpretative criteria for prediction of complete response and progressive disease applied to all primary tumours, all nodal disease, HPV-positive OPC, HPV-negative OPC and Hypopharynx/Larynx sub-groups. Mean values for each diagnostic performance metric across all 4 IC are also provided. () = number of cases provided.
Table 6 Diagnostic performance of interpretative criteria for prediction of complete response and progressive disease for indeterminate scores applied to all primary tumours, all nodal disease, HPV-positive OPC, HPV-negative OPC and hypopharynx/larynx sub-groups. () = number of indeterminate cases.

The ROC analysis (Fig. 4) established that the four IC were similar in their ability to predict disease outcome, with areas under the curve (AUC) of 0.76 (NI-RADS), 0.76 (Porceddu), 0.75 (Hopkins) and 0.76 (Deauville).

Figure 4

Receiver operating characteristic (ROC) curves for the four interpretative criteria.
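For an ordinal score such as the harmonised 4-point scale, the AUC is the probability that a randomly chosen patient with recurrence has a higher score than a randomly chosen patient without recurrence, with ties counted as one half. The Python sketch below illustrates that definition; it is not the SPSS routine used here, and the function name `auc_from_scores` and the toy scores are assumptions made for the example.

```python
def auc_from_scores(scores, recurred):
    """AUC via the Mann-Whitney formulation: P(pos > neg) + 0.5 * P(pos == neg).

    scores   : ordinal response scores (higher = more suspicious for disease)
    recurred : truthy where recurrence was confirmed during follow-up
    """
    pos = [s for s, r in zip(scores, recurred) if r]
    neg = [s for s, r in zip(scores, recurred) if not r]
    if not pos or not neg:
        raise ValueError("need at least one case with and one without recurrence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties are common on a 4-point scale
    return wins / (len(pos) * len(neg))


# Toy scores for illustration only (not study data).
toy_auc = auc_from_scores([1, 1, 2, 3, 4, 4], [0, 0, 0, 1, 1, 0])
```

With only four possible score values the ROC curve has at most three interior points, so small AUC differences between IC, such as 0.75 versus 0.76, should be interpreted cautiously.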

Discussion

In our large patient cohort, qualitative assessment of post-treatment FDG PET-CT in HNSCC using four previously validated criteria (NI-RADS, Porceddu, Hopkins and Deauville) was highly predictive of PFS and OS. All 4 adjusted IC demonstrated good discriminatory ability in predicting disease outcome, with high specificity, PPV and NPV, which could help clinical decision making by stratifying patients into different management streams including continued observation, biopsy or salvage surgery.

Compared to the existing literature, the PPV values of our study (83–95%) are slightly higher than other reported rates of 51–78%14,15,23. Diagnostic accuracy of response assessment PET-CT is affected by the time interval between treatment and follow-up imaging, and the later median time-point of imaging post radiotherapy (17 weeks) compared to other studies may account for the slightly higher PPV values in this study. Conversely, the NPV is lower (84–86%) than in multiple other studies (86–97%) with smaller cohort sizes (largest 214 patients)12,13,23,24,25. PET-CT was categorised as false-negative if recurrent cancer was diagnosed at any stage during follow-up, and the longest time to progression recorded was over 50 months from the end of treatment, whereas other studies limited this period to 6 months after the response assessment PET and had a higher NPV14,23. A comparable study assessing Deauville criteria for nodal response assessment post CRT in 105 HNSCC patients using the same methodology for false-negatives (any time during follow-up) had a similar NPV (86.4%)13. By restricting false negatives to those with recurrence developing within 6 months, the NPV of NI-RADS, as an example, increases from 85% to 94% in our cohort.

There was considerable variation in the number of cases classified as indeterminate between different IC, with far more scores in this category when applying the NI-RADS IC. This likely reflects the subjective nature of the NI-RADS indeterminate group, which includes all cases with focal mild to moderate mucosal FDG uptake without specifying a reference area of uptake such as the internal jugular vein (Hopkins) or mediastinum (Deauville), thereby making these cases more difficult to separate than with the other IC17,26. The overall mean recurrence rate of 53% (range 42–69%) in NI-RADS category 2 (low suspicion for recurrence) patients in this study is also much higher than the previously reported figure of 17.2%, highlighting that more work in large cohort studies is required to validate this26. One advantage identified for the Hopkins IC is the low number of indeterminate cases; however, the NPV was lower, particularly for HPV-positive (87.6%) and HPV-negative (77.4%) groups. Porceddu and Deauville provided the best trade-off, minimising indeterminate scores whilst maintaining a high NPV. Individual centres should apply one IC consistently across all patients to facilitate more standardised reporting and allow for future comparisons between institutions.

Interestingly, the NPV for HPV-positive OPC patients was higher than for the HPV-negative sub-group. Fakhry et al. previously reported that HPV-positive status was a good prognostic indicator with better CRT sensitivity and patient outcome27. This is relevant in indeterminate cases, where use of these IC may provide more information on guiding optimal management between neck dissection or surveillance. Previous research has demonstrated no association between HPV status and other semi-quantitative imaging markers in relation to predicting recurrence28. The higher NPV in HPV-positive patients may be potentially useful for clinicians when considering additional treatments such as neck dissection.

The prognostic value of PET is more uncertain when FDG uptake is equivocal/indeterminate across all four IC, with a low PPV. This observation is limited by the relatively low number of cases in this sub-group: a median of 22 for all tumour and node cases, and as few as one in the HPV status and hypopharynx/larynx subgroup analyses. The ability to distinguish more accurately between benign post-treatment inflammation and residual disease remains of paramount clinical importance as each scenario would require significantly different patient management. In longitudinal PET studies assessing lymphoma, equivocal scans have proved to represent a good rather than bad prognosis29. In the meantime, as advocated by IC such as NI-RADS, indeterminate cases may be best followed up non-invasively with imaging in the form of a contrast-enhanced CT or PET17. One option is to perform a second interval PET-CT response assessment. Porceddu et al. recommend a further repeat PET-CT 4–6 weeks later (16 weeks post treatment) if the first one shows indeterminate response, with no subsequent cases of nodal failure12. Similarly, a recent publication from our group highlighted that a second-look PET-CT a median of 13 weeks after the first response assessment PET-CT (median 30 weeks post treatment) found that the majority of incomplete response cases convert to a complete metabolic response30. Follow-up imaging at an earlier time point results in a higher number of false positive results31. This warrants future evaluation in a larger prospective cohort.

Inter-observer agreement of IC was not assessed in this study, mainly because previous work has shown these IC to be highly reproducible14,16. Limitations include the retrospective study design, the heterogeneous patient cohort with different sites of HNSCC and the slight difference in treatment, with the majority having CRT but a small group having radiotherapy only.

Emerging studies exploring the utility of radiomic features extracted from head and neck cancers highlight the potential for more accurate prediction of disease progression using novel imaging signatures which could be augmented by artificial intelligence techniques32,33,34. Although there is no current clinical implementation of a radiomic-based decision-support system in this clinical scenario, in the future this may emerge and could result in better patient stratification and personalization of treatment34. Some challenges remain ahead of this including a need for greater data transparency, multi-centre collaborations for cross-validation and to confirm reproducibility of radiomic analysis methods34.

Conclusion

Assessment with FDG PET-CT post-treatment in HNSCC is accurate for prediction of complete response or disease progression. All four analysed IC have similar diagnostic performance characteristics; however, Porceddu and Deauville provide the best trade-off, minimising indeterminate scores whilst maintaining a high NPV.