Utility and limitations of metabolic parameters in head and neck cancer: finding a practical segmentation method

Purpose Although metabolic tumor volume (MTV) and total lesion glycolysis (TLG) have shown good prognostic value in head and neck cancer (HNC), there are still many issues to resolve before their potential application in standard clinical practice. The purpose of this study was to compare the discrimination ability of two relevant segmentation methods in HNC and to evaluate the potential benefit of adding lymph nodes’ metabolism (LNM) to the measurements. Methods We retrospectively analyzed a recently published database of 62 patients with HNC treated with chemoradiotherapy. MTV and TLG were measured using an absolute threshold of SUV2.5. Comparison analysis with previously published background-level threshold (BLT) results was done through Concordance index (C-index) in eight prognostic models. Results BLT obtained better C-index values in five out of the eight models. The addition of LNM improved C-index values in six of the prognostic models. Conclusion We found a potential benefit in adding LNM to the main tumor measurements, as well as in using a BLT for MTV segmentation compared to the most commonly used SUV2.5 threshold. Despite its limitations, this study suggests a practical and simple manner to use these parameters in standard clinical practice, aiming to help elaborate a general consensus.


Introduction
The number of publications analyzing the prognostic value of fluorine-18-fluorodeoxyglucose (18F-FDG) metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in head and neck squamous cell carcinoma (HNSCC) has been increasing in recent decades. Most studies find good prognostic capability in these variables, as they provide valuable pretreatment information on a higher risk of recurrence [1][2][3][4][5]. Although the maximum standardized uptake value (SUVmax) is widely used, its results are more heterogeneous, and the literature is conflicting regarding its prognostic potential [1,2,6,7].
With the currently available data, MTV and TLG seem to have good prognostic value in HNSCC. Nevertheless, there are still several limitations to address before they can be implemented in standard clinical practice. Although most studies show good prognostic results, their methodologies vary considerably, as there is no consensus yet on how to standardize the calculations. First, studies differ in the volume of interest (VOI); some use the primary tumor as the VOI, and some add the metabolism of the affected lymph nodes to the primary tumor measurements. Second, studies differ in the segmentation method, which directly affects the calculation and reproducibility of the metabolic parameters.
A wide variety of methods have been developed for the segmentation of VOIs in positron emission tomography (PET) scans. In general, these methods can be classified into two large groups: threshold-based and algorithm-based methods. Because of their practicality and simplicity, threshold-based methods are the most widely used in HNSCC studies. They define the burden of the VOIs based on a pre-established threshold using one of four different approaches: an absolute threshold (using all volumes above a certain SUV level [usually 2.5]), a relative threshold (using, in general, 40-50% of the SUVmax of the lesion), a background-level threshold (BLT) (using the SUV of a reference region, e.g., liver or mediastinum SUV), and adaptive thresholds (adjusting calculations to different measurable variables of the images, with several approaches to obtain a threshold). Algorithm-based methods constitute a heterogeneous and more complex group of methods that require specific software for segmentation of the VOIs [8,9].
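As a rough illustration of how the first three threshold families differ, the following Python sketch derives a threshold for each one and applies it to a short list of voxel SUVs. All function names and sample values are hypothetical. In particular, taking the plain mean of the reference region as the background-level threshold is a simplifying assumption; published BLT implementations may define it differently.

```python
from statistics import mean


def absolute_threshold(level=2.5):
    """Fixed SUV cutoff, independent of the lesion."""
    return level


def relative_threshold(lesion_suvs, fraction=0.40):
    """Fraction of the lesion's SUVmax (40-50% is typical)."""
    return fraction * max(lesion_suvs)


def background_level_threshold(reference_suvs):
    """Assumption: mean SUV of the reference region (e.g., liver)."""
    return mean(reference_suvs)


def segment(lesion_suvs, threshold):
    """Keep the voxels whose SUV is at or above the threshold."""
    return [s for s in lesion_suvs if s >= threshold]


# Hypothetical voxel SUVs, for illustration only.
lesion = [1.8, 2.4, 2.6, 3.9, 6.0, 10.0]
liver = [2.0, 2.2, 2.4]

thresholds = {
    "absolute": absolute_threshold(),
    "relative": relative_threshold(lesion),
    "background": background_level_threshold(liver),
}
voxel_counts = {name: len(segment(lesion, t)) for name, t in thresholds.items()}
```

With these made-up values the three thresholds keep different numbers of voxels from the same lesion, which is the source of the methodological variability discussed above.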
To compare two relevant segmentation methods, this study evaluated the discrimination power of MTV and TLG delineated using an absolute threshold of SUV 2.5, the method used in most HNSCC papers, in a combined analysis with our recently reported results, in which we segmented using a BLT with the liver as the reference region [10]. Furthermore, we analyzed the potential benefit of adding the affected lymph nodes' metabolism (LNM) to the MTV and TLG measurements.

Study population
We analyzed the population published in a recent report [10]. After approval by the Institutional Ethics Committee of our hospital, we conducted a retrospective examination of the oncological records of patients treated at Puerta de Hierro University Hospital, Majadahonda, Spain, between January 2012 and December 2018. We included patients over 18 years of age diagnosed with primary stage III-IV HNSCC without distant metastasis who had been treated with radiotherapy alone or concurrent chemoradiotherapy and who had undergone pretreatment 18F-FDG PET-CT for initial staging and therapy planning, with no more than 6 weeks between image acquisition and treatment initiation. We excluded patients with synchronous or metachronous lesions, patients who had undergone previous treatments, and patients with nasopharyngeal cancer. Informed consent was waived because of the retrospective design.

Radiation therapy and follow-up
Patients were immobilized with a thermoplastic head-shoulders mask and simulated in the supine position. A planning contrast-enhanced computed tomography (CT) scan with 3 mm slice thickness was obtained for all patients. These images were combined with pretreatment 18F-FDG PET for better delineation of treatment volumes. Clinical target volumes (CTV) were as follows: CTV1: primary and affected lymph nodes gross tumor volumes; CTV2: CTV1 and the "high risk" first uninvolved lymph node region; CTV3: elective bilateral lymph nodes in accordance with international guidelines [11]. By adding 0.5 mm to the CTVs, we generated the corresponding planning target volumes (PTVs). Intensity-modulated radiation therapy was administered in a Tomo-Therapy HiArt unit equipped with an image-guided radiation therapy system and was delivered in 32 fractions using a simultaneous integrated boost, up to doses for PTV1, PTV2, and PTV3 of 69.12, 57.6 and 53.12 Gray (Gy), respectively. Dose-limiting constraints for organs at risk were the following: Dmax of 45 Gy for spinal cord; V28 < 50% for parotid glands; V65 < 10% for mandible; Dmax of 55 Gy for brain stem; and Dmean < 50 Gy for constrictor muscles.
Most patients received one of two concomitant chemotherapy protocols: cisplatin (two courses of 20 mg/m2/d on days 1-4 and 29-32) plus oral tegafur (two courses of 1200 mg/d on days 1-14 and 29-43) [12], or six cycles of weekly cisplatin (40 mg/m2). Patients who were not fit for a cisplatin-based protocol were treated with cetuximab (400 mg/m2 initial dose, followed by seven weekly doses of 250 mg/m2). Patients whose physical condition was not suitable for chemotherapy were treated with radiotherapy alone.
Two to four months after treatment completion, an 18F-FDG PET-CT was performed in all patients to assess treatment response. Follow-up was done every 6 months for the first 3 years and once a year for 2 more years with head and neck imaging (CT or magnetic resonance imaging). Salvage surgery or palliative treatment was indicated in case of recurrent or persistent disease.

18F-FDG PET-CT imaging
Prior to the 18F-FDG i.v. injection (approximately 350-400 MBq), patients were required to have a serum glucose concentration below 180 mg/dL after fasting for at least 6 h. A helical multidetector CT was used to perform non-contrast-enhanced scans from the top of the skull to the mid-thigh with the following parameters: 110 kVp, a maximum modulated milliamperage of 85 mAs, and six slices with a 5.0 mm thickness. CT images were then used for image fusion and attenuation correction. A multimodality Siemens Biograph 6 scanner (Biograph; Siemens, Erlangen, Germany) was used to acquire the PET images 50-60 min after the 18F-FDG i.v. injection with the following parameters: 4-min scan per bed position × 7-8 positions and ordered-subset expectation maximization reconstruction (four iterations, eight subsets). 18F-FDG PET images were analyzed by an experienced nuclear medicine practitioner (J.M.) using a Siemens Leonardo reading station and the program syngo.via (Siemens Healthineers, Muenchen, Germany). Patients' clinical outcomes were unknown during the analysis. To obtain semiquantitative metabolic parameters, image pixels were converted into standardized uptake values. Then, we set an SUV of 2.5 for VOI segmentation of the tumor and the affected lymph nodes. All 18F-FDG-avid lesions were then manually encircled in three imaging planes, and the volumes with SUV ≥ 2.5 were defined as the VOIs. Tumor and lymph nodes' MTV were automatically calculated by the software, and those two volumes were then analyzed individually and combined. TLG was obtained using the formula: MTV × SUVmean.
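The relation between the SUV 2.5 threshold, MTV, and TLG can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the vendor software's algorithm: the function name, the voxel volume, and the SUV values are hypothetical, and summing tumor and lymph node values is one plausible reading of "combined".

```python
def metabolic_parameters(voxel_suvs, voxel_volume_ml, threshold=2.5):
    """Return (MTV in ml, SUVmean, TLG) for voxels with SUV >= threshold.

    TLG follows the formula in the text: MTV x SUVmean.
    """
    selected = [s for s in voxel_suvs if s >= threshold]
    if not selected:
        return 0.0, 0.0, 0.0
    mtv = len(selected) * voxel_volume_ml
    suv_mean = sum(selected) / len(selected)
    return mtv, suv_mean, mtv * suv_mean


# Hypothetical voxel SUVs inside the manually drawn regions.
tumor_voxels = [2.0, 3.0, 5.0, 8.0]
node_voxels = [1.0, 2.5, 3.5]
VOXEL_VOLUME_ML = 0.5  # assumed voxel size, for illustration only

tumor_mtv, tumor_mean, tumor_tlg = metabolic_parameters(tumor_voxels, VOXEL_VOLUME_ML)
node_mtv, node_mean, node_tlg = metabolic_parameters(node_voxels, VOXEL_VOLUME_ML)

# Assumption: the combined measurement is the sum of both components.
combined_mtv = tumor_mtv + node_mtv
combined_tlg = tumor_tlg + node_tlg
```

Voxels below the threshold (2.0 in the tumor, 1.0 in the node) contribute to neither the volume nor the mean, so the choice of threshold changes both factors of TLG at once.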

Statistical analysis
First, we made a descriptive analysis representing categorical variables with absolute and relative frequencies. Numerical variables were evaluated through medians with percentiles 25 and 75 (P25-P75), means with standard deviation, and minimum and maximum values.
For the survival analysis, overall survival (OS) was defined as the time between treatment initiation and death from any cause. Disease-free survival (DFS) was defined as the time between treatment initiation and relapse or death from any cause, whichever happened first. Patients who were followed up until the end of the study with no events, as well as patients lost to follow-up with no events, were censored at the date of last contact. Kaplan-Meier curves were estimated, and differences between survival curves were compared using the log-rank test. Survival analysis was performed with the combination of lymph nodes and main tumor MTV and TLG.
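For readers less familiar with the Kaplan-Meier estimator, a minimal Python sketch is given here. The function name and the follow-up data are hypothetical, and patients censored at a given time are assumed to be still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time of each patient
    events -- 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        event_count = 0
        leaving = 0
        # Group every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            event_count += data[i][1]
            leaving += 1
            i += 1
        if event_count:
            survival *= 1 - event_count / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months; 0 marks a censored patient.
followup = [6, 12, 12, 20, 30]
status = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(followup, status)
```

Censored patients never lower the curve themselves; they only shrink the number at risk for later event times.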
To obtain the optimal cutoff points, receiver-operating characteristic (ROC) curves with the survival data were developed [13], and the points were also evaluated with the Youden [14] and Liu [15] methods. To assess the association of MTV and TLG measured by SUV2.5 with OS and DFS, Cox proportional hazards models were fitted. We tested each variable in two ways: as a continuous variable (per 10-ml increment for MTV and per 25-g increment for TLG) and as a dichotomized variable with the obtained cutoff points. We then developed a univariable approach and a multivariable model adjusting for age, T-classification, and N-classification. The proportional hazards assumption was checked using Schoenfeld residuals. Hazard ratios (HR) with their corresponding 95% confidence intervals are shown.
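The Youden criterion selects the cutoff that maximizes J = sensitivity + specificity − 1. The Python sketch below shows the idea on a plain binary outcome; it is a simplification that ignores censoring, unlike the time-dependent ROC analysis used in this study, and the function name and data are hypothetical.

```python
def youden_cutoff(values, events):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A patient is classified as high risk when value >= cutoff.
    Assumes at least one patient with and one without the event.
    """
    positives = sum(events)
    negatives = len(events) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(values)):
        true_pos = sum(1 for v, e in zip(values, events) if v >= cutoff and e == 1)
        true_neg = sum(1 for v, e in zip(values, events) if v < cutoff and e == 0)
        j = true_pos / positives + true_neg / negatives - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical MTV values (ml) and event status at a fixed time point.
mtv_values = [10, 20, 35, 50, 65, 80]
event_status = [0, 0, 0, 1, 0, 1]
cutoff, j_statistic = youden_cutoff(mtv_values, event_status)
```

The Liu method is a close relative that maximizes the product of sensitivity and specificity instead of their sum.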
To test the discrimination ability of the models, Harrell's concordance index (C-index) was estimated. C-index values near 0.5 indicate that the discrimination ability is no better than chance. Values near 1 indicate that the model is good at determining which of two patients will have an event first [16].
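A minimal Python sketch of Harrell's C-index may help make the definition concrete. It assumes that a higher risk score predicts an earlier event, counts ties in risk score as half-concordant, and skips pairs with tied follow-up times; the function name and data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the model ranks correctly.

    A pair is comparable when the patient with the shorter follow-up
    had an event (events[i] == 1); otherwise the order is unknown.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up (months), event status, and model risk scores.
followup = [5, 10, 15, 20]
status = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.2, 0.7]
c_index = harrell_c_index(followup, status, risk)
```

Here four pairs are comparable and the model orders three of them correctly, so a single misranked pair already moves the index from 1.0 to 0.75; with small samples, differences between C-index values should be read with that granularity in mind.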
To evaluate the effect of lymph nodes' metabolism on the discrimination ability of the variables, we created several univariable regression models for tumor, lymph nodes', and tumor + lymph nodes' MTV and TLG with both the absolute threshold method (SUV2.5) and our previously published data obtained with a background-level threshold method [10]. To compare the latter two segmentation methods, we analyzed eight prognostic models using the combined data from both studies: MTV and TLG continuous and dichotomized models with the two segmentation methods mentioned before.
Significance level was set at 0.05. Software used was

Demographics
Seventy-nine oncological records were assessed from patients diagnosed with stage III and IV HNSCC. Two patients had an interval longer than 6 weeks between PET acquisition and treatment initiation and thus were excluded. In 15 patients, the pretreatment 18F-FDG PET-CT was performed in a different center, and these patients were also excluded. Therefore, 62 records were fully assessed for the analysis. The median age was 65 years (range, 38-87 years) and the female/male distribution was 13/49. Forty-nine patients were smokers. Dis

PET parameters' determination
The median values for TLG and MTV were 176.92 g (range 5.5-2,567.04 g) and 29.94 ml (range 1.66-220.87 ml), respectively. Cutoff values for TLG and MTV were estimated at a 3-year time point using a time-dependent AUC-ROC. The same values were obtained using the Youden and Liu methods. The cutoff points were 284 g for TLG (AUC = 0.68) and 60 ml for MTV (AUC = 0.70). Subsequently, Cox regression analyses were performed with the variables dichotomized at the corresponding cutoff values.

Survival analysis
The median follow-up was 59.3 months (95% CI, 48.6-68.5 months). Locoregional recurrence occurred in 21 patients and distant metastasis was found in 13 patients.
Kaplan-Meier curves for DFS and OS using the dichotomized models of MTV and TLG are shown in Fig. 1.

Comparison analysis
Univariable analyses per unit increment of lymph nodes', tumor, and the combination of tumor and lymph nodes' MTV found a significant association with OS and DFS with both the background-level and the absolute SUV2.5 segmentation methods (Table 2). The same happened with TLG, except for tumor TLG measured with the BLT method, which did not have a significant association with OS or DFS (Table 3). The determination of tumor + lymph nodes' TLG obtained better C-index values than the determination of tumor TLG alone with both segmentation methods. The same happened with MTV measurements using the absolute SUV2.5 method. In contrast, C-index values were lower with the addition of LNM to tumor MTV measurements with the background-level method for both OS and DFS (Tables 2-3).
Among the eight models we created to compare the two segmentation methods, BLT had a better C-index in five. All four dichotomized models performed better with the BLT method. Among the continuous models, the SUV 2.5 models performed better at predicting DFS. When predicting OS, only the continuous SUV 2.5 MTV model had a better C-index value than its BLT counterpart (Table 4).

Discussion
Although MTV and TLG have consistently shown good prognostic value in patients with HNSCC in many studies [1][2][3][4][5], there are still several factors limiting their application in standard clinical practice. Because of the variability in the methodology used in the different publications, it is not possible to establish adequate or generalizable cutoff values. For their correct implementation, it is first necessary to standardize one segmentation method among the many available, and to establish whether the metabolism of the affected lymph nodes should be incorporated in the measurements.
In this study, we evaluated the prognostic value of MTV and TLG delineated using an absolute threshold of SUV 2.5 in our recently reported database (where we segmented using a BLT with the liver as a reference region) [10]. Subsequently, we created eight prognostic models using the results of both studies to assess the discrimination power of the two methods through their C-index values. Our results found an independent statistical association with survival for both methods in this population (Table 1), and the BLT obtained better C-index values in five out of the eight prognostic models, including all dichotomized models (Table 4).
We have found several studies comparing threshold-based and algorithm-based segmentation methods, and the latter seem to perform better at discriminating patients at risk of recurrence [17][18][19][20][21][22]. It is to be expected that more complex and specific methods would provide a better prognosis prediction. However, considering the many different algorithms available and the limited number of studies on each of them, it is difficult to imagine any of these methods having a widespread application at the moment. The same happens with adaptive thresholds, where there is no consensus yet on which factors to consider for the calculations [8]. Thus, threshold methods seem to be the most practical and reproducible segmentation methods with current PET scanners, and they have systematically provided good prognostic information in HNSCC.
Among the five studies we have found comparing fixed absolute, fixed relative, and background-level threshold methods, four obtained a better result with the latter, in agreement with our results [19,23,24,25]. One study did not find a statistically significant result for the background-level method [26]. Absolute and relative threshold methods depend on the standardized uptake value (SUV) and, therefore, are directly affected by SUV variations. Patients' preparation and characteristics, PET scanner quality, and differences in image analysis protocols have a direct impact on SUV calculations [27][28][29]. To reduce this variability, some authors suggest using either a background-level threshold [24,28,30,31] or a gradient-based method [18,20,21].
To analyze the potential benefit of adding lymph nodes' metabolism to the measurements, we created eight prognostic models in a combined analysis (Tables 2-3). We found an association between tumor and lymph nodes' MTV and TLG with survival in six of them. Only tumor TLG using the   [32] and this study, most papers do not find a statistical association between LNM and survival in HNSCC [33][34][35][36][37][38][39][40], although it may play a role in patients with HPV+ oropharyngeal cancer [41][42][43][44] and in hypopharyngeal patients with a low MTV [35]. Hoshikawa et al. [45] share the same findings we obtained in our BLT tumor TLG models, where statistical significance was achieved only after the addition of the LNM. To the best of our knowledge, this is the first study testing the possible benefit of combining both tumor and lymph nodes' measurements in HNSCC.
This study has numerous limitations. The potential biases of a retrospective data compilation in a small single-center population with different disease locations limit the generalization of the results. Although there was a tendency toward better performance of the BLT method, the differences in C-index values were minimal. Even so, and with all due limitations, this study addresses major concerns in the implementation of MTV and TLG in today's clinical practice, hoping to help elaborate a consensus on how to measure these variables.
Although PET's metabolic parameters seem to offer valuable prognostic information and, perhaps in the future, could aid in treatment selection [46] and in dynamic radiation dose modification [47], more prospective and multicenter studies with rigorous standardization in PET preparation are still needed to obtain quality evidence regarding how and when to use these parameters in HNSCC. Our results found a possible statistical benefit in adding lymph nodes' metabolism to the measurements, as well as in using a BLT for segmentation compared to the most used SUV2.5 threshold. Considering this method adds no more than a minute of work and can be measured with current PET scanners with no specific software, it may be

Conclusions
Out of the eight prognostic models created, the BLT method had better prediction power than SUV2.5 in five, and the addition of the affected lymph nodes' metabolism to MTV and TLG measurements improved their prediction power in six of the eight models. Despite the limitations of this study, our results suggest a practical and simple way in which these metabolic parameters can be implemented in today's clinical practice.