Comparison between 1973 and 2004/2016 World Health Organization grading in upper tract urothelial carcinoma treated with radical nephroureterectomy

Aims The European Association of Urology guideline for upper tract urothelial carcinoma (UTUC) relies on two grading systems: 1973 World Health Organization (WHO) and 2004/2016 WHO. No consensus has been reached on which classification should supersede the other, and both are recommended in clinical practice. We hypothesized that one may be superior to the other. Methods Newly diagnosed non-metastatic UTUC patients treated with radical nephroureterectomy were abstracted from the Surveillance, Epidemiology, and End Results database (2010–2016). Kaplan–Meier plots and multivariable Cox regression models (CRMs) tested cancer-specific mortality (CSM), according to the 1973 WHO (G1 vs. G2 vs. G3) or the 2004/2016 WHO (low-grade vs. high-grade) grading system. Heagerty's C-index quantified accuracy. Results Of 4271 patients, according to the 1973 WHO grading system, 134 (3.1%) were G1, 436 (10.2%) were G2 and 3701 (86.7%) were G3, while according to the 2004/2016 WHO grading system, 508 (11.9%) were low grade vs. 3763 (88.1%) high grade. In multivariable CRMs, high grade predicted higher CSM (hazard ratio: 1.70, p < 0.001). Conversely, neither G2 (p = 0.8) nor G3 (p = 0.1) was an independent predictor of worse survival. The multivariable models without consideration of either grading system were 74% accurate in predicting 5-year CSM. Accuracy increased to 76% after addition of either the 1973 WHO or the 2004/2016 WHO grade. Conclusions From a statistical standpoint, either the 1973 WHO or the 2004/2016 WHO grading system improves the accuracy of CSM prediction to the same extent. In consequence, other considerations such as intra- and interobserver variability may represent additional metrics to consider in deciding which grading system is better. Supplementary Information The online version contains supplementary material available at 10.1007/s10147-021-01941-9.


Introduction
Upper tract urothelial carcinoma (UTUC) is a rare and aggressive malignancy, with an estimated annual incidence in Western countries of almost two cases per 100,000 inhabitants [1] and with non-organ-confined stage in two-thirds of newly diagnosed patients [2][3][4][5]. After stage, tumor grade is the most important predictor of cancer-specific mortality (CSM) in UTUC patients [6][7][8][9][10]. The most recent European Association of Urology (EAU) UTUC guideline relies on and recommends the use of two different grading systems. These consist of the 1973 World Health Organization (WHO) and the 2004/2016 WHO classification. Specifically, the 1973 WHO grading system [11] is based on three tiers. Grade 1 applies to tumors with the least degree of cellular anaplasia. Grade 3 applies to tumors with the most severe degree of cellular anaplasia. Finally, grade 2 lies in between. Conversely, the 2004/2016 WHO grading system [12,13] is based on two tiers. It relies on more detailed histological criteria. Low grade applies to tumors with predominantly ordered cell organization, mainly round-oval nuclear shape and mild nuclear chromatin variation. High grade applies to tumors with predominantly disordered cell organization with loss of polarity, moderate to marked nuclear pleomorphism and mainly hyperchromasia [14]. Since there is no consensus on which of the two grading systems should be used in everyday clinical practice [12,15] and since both are recommended [2], we hypothesized that one may be better. To test this hypothesis, we examined the ability of either the 1973 or the 2004/2016 WHO grading system to predict CSM in a contemporary cohort of non-metastatic UTUC patients treated with radical nephroureterectomy (RNU), identified within a large-scale database, namely the Surveillance, Epidemiology and End Results (SEER) database, from 2010 to 2016.

Variables definition
Tumor grade was defined according to both the 1973 WHO grading system [grade 1 (G1) vs. grade 2 (G2) vs. grade 3 (G3)] and the 2004/2016 WHO grading system (low grade vs. high grade). Covariables consisted of age, sex, primary site (renal pelvis, ureter), T-stage (T1 vs. T2 vs. T3 vs. T4), N-stage (N0 vs. N+ vs. Nx) and chemotherapy administration (yes vs. no/unknown). CSM was defined as deaths related to UTUC, according to SEER mortality code [17] and represented the endpoint of interest.
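As an illustration of how the variables above could be prepared for a regression model, the following minimal sketch dummy-codes each categorical covariable against its first level and derives a (time, event) endpoint. All field names, category labels and the `Patient` structure are hypothetical and are not SEER variable names; the reference levels are assumptions chosen to match the comparisons reported in the text.

```python
from dataclasses import dataclass

# Hypothetical category labels; the first level of each tuple is the reference.
T_STAGES = ("T1", "T2", "T3", "T4")
N_STAGES = ("N0", "N+", "Nx")
WHO_1973 = ("G1", "G2", "G3")
WHO_2004 = ("low", "high")


@dataclass
class Patient:
    age: float
    sex: str            # "male" or "female"
    site: str           # "renal pelvis" or "ureter"
    t_stage: str
    n_stage: str
    chemo: bool         # True = yes, False = no/unknown
    grade_1973: str
    grade_2004: str
    died_of_utuc: bool  # True if death was attributed to UTUC
    months: float       # follow-up time


def dummies(value, levels):
    """Indicator columns for every level except the first (the reference)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == lvl else 0 for lvl in levels[1:]]


def covariate_row(p, grading=None):
    """Build one design-matrix row; grading is None, "1973" or "2004"."""
    row = [p.age, 1 if p.sex == "male" else 0, 1 if p.site == "ureter" else 0]
    row += dummies(p.t_stage, T_STAGES) + dummies(p.n_stage, N_STAGES)
    row.append(1 if p.chemo else 0)
    if grading == "1973":
        row += dummies(p.grade_1973, WHO_1973)
    elif grading == "2004":
        row += dummies(p.grade_2004, WHO_2004)
    return row


def endpoint(p):
    """(time, event): event is 1 for cancer-specific death, 0 if censored."""
    return p.months, 1 if p.died_of_utuc else 0
```

Calling `covariate_row(p)`, `covariate_row(p, "1973")` and `covariate_row(p, "2004")` on the same patient gives three rows that differ only in the trailing grade columns.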

Statistical analyses
Kaplan–Meier plots and multivariable Cox regression models predicting CSM were fitted. These models relied on T-stage, N-stage, chemotherapy administration and primary site, without including grade. Subsequently, the models were refitted with all previously included variables in addition to the 1973 WHO grading system. Finally, the models were refitted again, this time with the 2004/2016 WHO grading system instead. Within Cox models, independent predictor status of each WHO grading system was tested. Sensitivity analyses testing the effect of grade (1973 and 2004/2016 WHO grading systems) on CSM were performed in UTUC patients with T1 stage and in UTUC patients with T2 or lower stage. Finally, the effect of the 2004/2016 WHO grading system on CSM was tested in UTUC patients with G2 grade, according to the 1973 WHO grading system. Subsequently, accuracy of 5-year CSM predictions was quantified based on multivariable models without consideration of WHO grading system, as well as with consideration of either the 1973 or the 2004/2016 WHO grading system. Heagerty's C-index quantified accuracy. All statistical tests were two sided, with a level of significance set at p < 0.05. Statistical analyses were performed using the R software environment for statistical computing and graphics, version 4.0.0 (available at: http:// www. rproj ect).
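The Kaplan–Meier step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the study, and it assumes that deaths from other causes are treated as censored observations, which the text does not state explicitly. The function names `kaplan_meier` and `mortality_at` are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up durations; events: 1 = cancer-specific death,
    0 = censored (assumption: other-cause deaths count as censored).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def mortality_at(curve, horizon):
    """Cumulative mortality 1 - S(horizon) read off a Kaplan-Meier curve."""
    surv = 1.0
    for t, s in curve:
        if t > horizon:
            break
        surv = s
    return 1 - surv
```

With follow-up measured in months, `mortality_at(kaplan_meier(times, events), 60)` gives a 5-year CSM estimate for one grade stratum; repeating the call per stratum reproduces the kind of comparison shown in a Kaplan–Meier plot.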
In multivariable Cox regression models focusing on CSM (Table 2), relative to G1, neither G2 [hazard ratio (HR) 1.07, p = 0.8] nor G3 (HR 1.65, p = 0.1) represented an independent predictor. When sensitivity analyses were performed (Supplementary Table 1), the results were confirmed in the multivariable Cox regression models focusing on CSM in patients with T1 stage (relative to G1, G2: HR 1.00, p = 1.0 and G3: HR 1.82, p = 0.2) and T2 or lower stage (relative to G1, G2: HR 0.99, p = 0.9 and G3: HR 1.38, p = 0.4). The accuracy of the multivariable model (Table 4) that included the 1973 WHO grading system was 76%. Conversely, the accuracy of the multivariable model without consideration of the 1973 WHO grading system was 74%.
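The accuracy figures reported here are concordance indices. As a rough illustration of what such an index measures, the sketch below computes Harrell's C, a simpler and more widely known concordance measure, and not the time-dependent Heagerty estimator used in the study. The function name `harrell_c` and its inputs are hypothetical; `risk` stands for any model-derived risk score, such as a Cox linear predictor.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when subject i had the event and a strictly
    shorter follow-up than subject j. The pair is concordant when the
    model assigned subject i the higher risk; ties in risk count as 0.5.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking, so moving from 0.74 to 0.76 means that a slightly larger share of usable patient pairs is ranked correctly.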
In the overall population, according to the 2004/2016 WHO grading system (Fig. 2B), 5-year CSM rates were 13.4% and 30.2% for low grade and high grade, respectively. In multivariable Cox regression models focusing on CSM (Table 3), relative to low grade, high grade (HR 1.70, p < 0.001) achieved independent predictor status. When sensitivity analyses were performed (Supplementary Table 1), the results were confirmed in the multivariable Cox regression models focusing on CSM in patients with T1 stage (relative to low grade, high grade: HR 1.76, p = 0.04), T2 or lower stage (relative to low grade, high grade: HR 1.65, p = 0.02) and G2 grade (relative to low grade, high grade: HR 2.19, p = 0.02). The accuracy of the multivariable model (Table 4) that included the 2004/2016 WHO grading system was 76%. Conversely, the accuracy of the multivariable model without consideration of the 2004/2016 WHO grading system was 74%.
First, of all RNU patients examined in the current study (n = 4271), approximately 90% harbored the highest grade level, regardless of which grading system was used. Specifically, 86.7% harbored G3 according to the 1973 WHO grading system and 88.1% harbored high grade according to the 2004/2016 WHO grading system. These elevated rates of high-grade UTUC may be explained by the nature of the study population. Specifically, all patients harbored stage T1 or higher [18]. Moreover, all patients were treated with RNU. In consequence, a selection bias towards higher grade was operational, relative to studies that also included noninvasive (stages Ta and Tis) UTUC patients treated with less
Second, the current analyses demonstrated marginal discrimination between G1 and G2 with respect to CSM. Within the three-tier grading system, independent predictor status of G2 and G3, relative to G1, could not be established. These results were confirmed in RNU patients with T1 stage and with T2 or lower stage.
The combination of these observations suggested limited discrimination ability of the three-tier grading system. Nonetheless, the addition of the 1973 WHO grading system resulted in a 2% accuracy gain, relative to multivariable models without consideration of the three-tier grading system. However, a 2% gain may be considered marginal. Specifically, this figure implies that within a cohort of 1000 individuals, the use of the three-tier grading system would improve CSM prediction in 20 patients. This gain is important in large-scale prospective trials or in large-scale epidemiological analyses. However, a 2% gain in predictive accuracy may not be clinically meaningful in everyday clinical practice.
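The arithmetic behind the 20-per-1000 figure can be written out explicitly. It takes the two accuracy values reported in the results and follows the reading given in the text, in which the accuracy gain is applied directly to a cohort of 1000 patients:

```latex
\Delta_{\text{accuracy}} = 0.76 - 0.74 = 0.02,
\qquad
1000 \times \Delta_{\text{accuracy}} = 1000 \times 0.02 = 20 \ \text{patients}
```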
In the second part of the analyses, we focused on the two-tier WHO grading system. Here, we validated the independent predictor status of high grade relative to low grade. Specifically, high-grade UTUC had 1.70-fold, 1.76-fold, 1.65-fold, and 2.19-fold higher risk of CSM, relative to low-grade UTUC, in the overall population and in T1, T2 or lower and G2 patients, respectively. Finally, we also recorded a 2% accuracy gain when the 2004/2016 WHO grading system was added to the multivariable model in which grade was previously not considered. In consequence, based on accuracy, the added benefit of the 2004/2016 WHO grading system was exactly the same as for the 1973 WHO grading system. However, the discrimination of CSM rates appeared more practical with the two-tier grading system, where high-grade patients exhibited a nearly twofold higher CSM rate and reached independent predictor status. In consequence, it appears that, based on the statistical criteria used in the current analyses, the two-tier grading system has a slight advantage over its three-tier counterpart.
Additional considerations may be required to decide which grading system should be included in everyday clinical practice and which may be abandoned. Several investigators compared intra- and interobserver variability of the two- vs. three-tier grading system in bladder cancer [12,23–29]. Unfortunately, such analyses did not focus on UTUC. However, based on methodological considerations, a system that relies on two tiers is invariably more likely to result in a lower intra- and interobserver variability than a system with more than two levels. This notion rests on the effect of chance. In consequence, based on similar predictive accuracy, superiority of discrimination in univariable and multivariable models, and on methodological consideration of intra- and interobserver variability, it appears that the two-tier grading system might represent a better alternative.
However, specific expert intra- and interobserver variability testing in UTUC patients should ideally complement the findings of our study.
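The effect of chance on agreement can be illustrated with a small calculation. The sketch below is not an analysis performed in this study: it assumes two hypothetical raters who assign grades independently of each other, each following the grade distribution observed in the cohort, and computes how often they would agree purely by chance. The function name `chance_agreement` is hypothetical.

```python
def chance_agreement(counts):
    """Probability that two independent raters pick the same category.

    Assumption: both raters assign categories at random with
    probabilities equal to the observed category proportions.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return sum((c / total) ** 2 for c in counts)


# Grade counts reported for the 4271 patients in this cohort.
three_tier = chance_agreement([134, 436, 3701])   # G1, G2, G3
two_tier = chance_agreement([508, 3763])          # low grade, high grade

# With equally likely categories the expected agreement is 1/k.
uniform_two = chance_agreement([1, 1])
uniform_three = chance_agreement([1, 1, 1])
```

Under these assumptions, chance agreement is higher with two categories than with three, both for equally likely categories and for the skewed distribution of this cohort. Chance-corrected measures such as Cohen's kappa exist to separate this effect from true rater reliability, which is why dedicated variability testing remains necessary.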
To the best of our knowledge, we are the first to examine the ability of either the 1973 or the 2004/2016 WHO grading classification to predict CSM in UTUC patients identified within a large-scale population-based database. Only one group of investigators [30] examined grade assignment differences according to the 1973 vs. the 2004/2016 grading system, in a smaller cohort (n = 458) of UTUC patients treated with RNU at a single Chinese institution between 2008 and 2013. Unfortunately, the complexity of the methodology used by Guan et al. renders comparisons with our methodology practically impossible.
Our work is not devoid of limitations and should be interpreted in the context of its retrospective and population-based design. First, the SEER database focuses on invasive UTUC, since Tis and Ta patients are not included. In consequence, our observations are based on more advanced stage and grade distribution and are not directly comparable with studies that used the entire UTUC population as reference. However, Tis and Ta patients should ideally not be treated with RNU. In consequence, their exclusion from the SEER database does not represent an important limitation for studies that focus on RNU. Second, disease progression or disease recurrence data are not available in the SEER database. In consequence, they cannot be examined as endpoints. Third, the SEER database does not allow ascertainment of either type or duration of chemotherapy. Fourth, due to the short median follow-up, future studies with longer follow-up are needed to confirm or refute our results. Fifth, our study did not benefit from central pathology review. Sixth, our analyses could not assess intra- and interobserver variability, which are essential in clinical practice. Finally, the SEER database represents a proportion of the United States population. In consequence, our findings are only applicable to patients from the United States and may not be generalizable to patients from other parts of the world. However, these limitations apply to this and to all other studies based on the SEER database.

Conclusion
From a statistical standpoint, either the 1973 WHO or the 2004/2016 WHO grading system improves the accuracy of CSM prediction to the same extent. In consequence, other considerations such as intra- and interobserver variability may represent additional metrics to consider in deciding which grading system is better.