Rule-based versus probabilistic selection for active surveillance using three definitions of insignificant prostate cancer

Purpose To study whether probabilistic selection using a nomogram could improve patient selection for active surveillance (AS) compared with the various sets of rule-based AS inclusion criteria currently used. Methods We studied Dutch and Swedish patients participating in the European Randomized study of Screening for Prostate Cancer (ERSPC). We explored which men who were initially diagnosed with cT1-2, Gleason 6 (Gleason pattern ≤3 + 3) prostate cancer (PCa) had histopathologically indolent PCa at radical prostatectomy (RP) [defined as pT2, Gleason pattern ≤3, and tumour volume (TV) ≤0.5 ml (TV0.5), TV ≤1.3 ml (TV1.3), or TV not part of the criteria (NoTV)]. Rule-based selection was according to the Prostate cancer Research International: Active Surveillance (PRIAS), Klotz, and Johns Hopkins criteria. An existing nomogram to define probability-based selection for AS was refitted for the TV1.3 and NoTV indolent PCa definitions. Results 619 of 864 men undergoing RP had cT1-2, Gleason 6 disease at diagnosis and were analysed. Median follow-up was 8.9 years. 229 (37 %), 356 (58 %), and 410 (66 %) fulfilled the TV0.5, TV1.3, and NoTV indolent PCa criteria at RP. Discrimination between indolent and significant disease, expressed as area under the curve (AUC), was: TV0.5: 0.658 (PRIAS), 0.523 (Klotz), 0.642 (Hopkins), 0.685 (nomogram). TV1.3: 0.630 (PRIAS), 0.550 (Klotz), 0.615 (Hopkins), 0.646 (nomogram). NoTV: 0.603 (PRIAS), 0.530 (Klotz), 0.589 (Hopkins), 0.608 (nomogram). Conclusions The performance of a nomogram and of the Johns Hopkins and PRIAS rule-based criteria is comparable. Because the nomogram allows individual trade-offs, it could be a good alternative to rigid rule-based criteria. Electronic supplementary material The online version of this article (doi:10.1007/s00345-015-1628-y) contains supplementary material, which is available to authorized users.


Introduction
Early detection of prostate cancer (PCa) has increased the detection of indolent tumours, i.e. tumours that are unlikely to become symptomatic during life. The ability to predict indolent PCa is needed to avoid overtreatment [1]. Active surveillance (AS) has emerged as a feasible strategy to decrease the overtreatment of low-risk PCa. With AS, men with low-risk PCa are strictly monitored over time, and if risk reclassification or disease progression occurs, they can opt for curative therapy. Hence, the aim of AS is to safely delay or completely avoid the side effects of active therapy [2]. There are 16 unique AS cohorts worldwide, each with its own protocol, and these protocols vary widely [3]. So far, published results from AS study cohorts worldwide are encouraging with respect to biochemical recurrence (BCR) rates and disease-specific mortality [4]. Long-term effects are as yet unknown. Research on how to improve the existing AS protocols is, however, needed, as misclassification at diagnosis and subsequent reclassification after one-year repeat biopsy are not uncommon [5]. For example, 28 % of men within the Prostate cancer Research International: Active Surveillance (PRIAS) study were reclassified after one or more repeat biopsies [6].
Currently, all existing AS cohorts apply relatively simple combinations of inclusion criteria for patient selection ("rule-based selection"). More refined risk stratification through a nomogram ("probability-based selection") may be preferable, especially in the light of individualised medicine and shared decision-making [7]. We aimed to assess how well the inclusion criteria used in several prospective AS protocols identify indolent cancer at radical prostatectomy (RP), and to assess the follow-up outcomes of men who received immediate RP but were also suitable for AS. For comparison, we used a previously developed and externally validated nomogram that predicts indolent disease [8,9]. We hypothesised that probabilistic selection with a nomogram that incorporates multiple patient characteristics may select patients better than rule-based criteria.

Patients
Men included in this study were participants in the screening arm of the European Randomized study of Screening for Prostate Cancer (ERSPC). Data cohorts of the Swedish and Dutch sections of ERSPC were combined. All men were diagnosed with screen-detected PCa and underwent RP as primary treatment. Details on both Dutch and Swedish screening protocols were previously published [10,11].

Methods
Men with T3-4 or Gleason ≥7 PCa at diagnostic biopsy or an unknown tumour volume were excluded from this analysis, as were men with positive lymph nodes or distant metastases at the time of diagnosis or at the time of surgery. A multiple imputation model was used to fill in missing data. We used the first imputation of a multiple imputation procedure with the impute function in SPSS software (IBM Corp. Released 2012. IBM SPSS Statistics for Windows, version 21.0. Armonk, NY: IBM Corp). A total of 936 confounder values were missing, comprising 13.5 % of all values. Filling in these values through imputation allowed us to include in the analysis the 382 (44 %) patients with at least one missing value. All tumour characteristics were used for the multiple imputation.
We first assessed the frequency of indolent PCa at RP according to the classic definition of pT2, tumour volume ≤0.5 ml (TV0.5), and pathological Gleason pattern ≤3 [12]. Men not fulfilling these criteria for indolent PCa (TV > 0.5 ml and/or pathological Gleason pattern >3) were categorised as having significant PCa.
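The classification step can be illustrated with a minimal Python sketch. It assumes the three definitions of indolent PCa given in the abstract (pT2, Gleason pattern ≤3, and TV ≤0.5 ml, TV ≤1.3 ml, or no TV criterion). The record layout and all names (`RPSpecimen`, `is_indolent`, `TV_CUTOFFS`) are hypothetical and not taken from the study's own analysis code.

```python
from dataclasses import dataclass

# Tumour-volume cut-offs (ml) for the three definitions named in the abstract.
# None means tumour volume is not part of the definition (NoTV).
TV_CUTOFFS = {"TV0.5": 0.5, "TV1.3": 1.3, "NoTV": None}


@dataclass
class RPSpecimen:
    # Hypothetical record layout; field names are illustrative only.
    organ_confined: bool       # True if pathological stage is pT2
    max_gleason_pattern: int   # highest Gleason pattern found at RP
    tumour_volume_ml: float    # tumour volume at RP in ml


def is_indolent(specimen: RPSpecimen, definition: str) -> bool:
    """Return True if the specimen meets the named indolent-PCa definition."""
    cutoff = TV_CUTOFFS[definition]
    if not specimen.organ_confined or specimen.max_gleason_pattern > 3:
        return False  # anything beyond pT2 or pattern 3 counts as significant
    return cutoff is None or specimen.tumour_volume_ml <= cutoff


# A pT2, pattern-3 tumour of 0.9 ml is significant under TV0.5
# but indolent under the two broader definitions.
specimen = RPSpecimen(organ_confined=True, max_gleason_pattern=3, tumour_volume_ml=0.9)
labels = {name: is_indolent(specimen, name) for name in TV_CUTOFFS}
print(labels)
```

Because the definitions are nested, every man who is indolent under TV0.5 is also indolent under TV1.3 and NoTV, which is why the indolent fraction can only grow as the definition broadens.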
Third, we explored the use of a nomogram to estimate the risk of indolent PCa at RP [13]. We assessed the effect of applying various eligibility criteria for the nomogram (T1c-T2a, PSA ≤ 20 ng/ml, Gleason ≤3 + 3, ≤50 % positive cores, ≤20 mm PCa and ≥40 mm benign tissue in all cores) and of different thresholds in the predicted chance of harbouring indolent PCa (referred to as Pind) on the number of men remaining suitable for AS at diagnosis.
Because follow-up data were available, we were able to assess BCR after RP. The criteria proposed by Freedland et al. [17] were used to define BCR, i.e. one PSA value after RP > 0.2 ng/ml. The different sets of rule-based selection criteria and Pind cut-off points were compared using the Kaplan-Meier method and the log-rank test.
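As an illustration of the BCR definition and the Kaplan-Meier method, the following is a minimal, dependency-free Python sketch. The function names and the toy data are assumptions made for this example. The study's own analysis was presumably run in a statistics package, and the log-rank test is not shown here.

```python
def has_bcr(psa_values_after_rp, threshold=0.2):
    """BCR as defined in the text: one PSA value after RP above 0.2 ng/ml."""
    return any(value > threshold for value in psa_values_after_rp)


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time per man (any consistent unit)
    events -- 1 if BCR was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # BCR events observed at time t
        n_leaving = 0  # men leaving the risk set at time t (event or censored)
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Toy data (not from the study): five men, follow-up in years.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 0, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t}: BCR-free fraction {s:.2f}")
```

Censored men (here, those without BCR at last follow-up) contribute to the risk set until they leave it but never lower the curve themselves, which is what distinguishes this estimate from a simple proportion.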
We finally applied decision curve analysis (DCA) [18] to evaluate the potential clinical usefulness of rule-based selection and probability-based selection models. We estimated a net benefit (NB) for the four models by summing the benefits (true-positive indolent PCa) and subtracting the harms (false-positive indolent PCa). The harms were weighted by a factor related to the relative harm of being unjustly included on AS versus being directly curatively treated while suitable for AS. This weighting was derived from the threshold probability at which a patient would opt for AS. This threshold varies between men and urologists. Clinical practice currently uses a threshold probability of 50-70 % [19]. The interpretation of a decision curve is rather straightforward: the model with the highest NB at a particular threshold should be chosen over alternative models.
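The net benefit calculation can be sketched as follows. This minimal Python example assumes the standard decision curve formula, NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability, a true positive is a man selected for AS who has indolent PCa, and a false positive is a man selected for AS who has significant PCa. The function name and the toy data are illustrative assumptions, not study data.

```python
def net_benefit(selected, indolent, threshold):
    """Net benefit of a selection strategy at a given threshold probability.

    selected  -- True/1 for each man the strategy would put on AS
    indolent  -- True/1 for each man with indolent PCa at RP
    threshold -- threshold probability pt (0 < pt < 1)
    """
    n = len(selected)
    true_pos = sum(1 for s, y in zip(selected, indolent) if s and y)
    false_pos = sum(1 for s, y in zip(selected, indolent) if s and not y)
    weight = threshold / (1 - threshold)  # relative harm of a false positive
    return true_pos / n - weight * false_pos / n


# Toy data (not from the study): predicted Pind and true status for six men.
p_ind = [0.9, 0.8, 0.65, 0.55, 0.4, 0.3]
truth = [1, 1, 0, 1, 0, 0]
pt = 0.6

# Probability-based selection: AS if predicted Pind reaches the threshold.
nomogram_selected = [p >= pt for p in p_ind]
nb_nomogram = net_benefit(nomogram_selected, truth, pt)

# Reference strategy: put every man on AS regardless of risk.
nb_all = net_benefit([True] * len(truth), truth, pt)

print(f"nomogram: {nb_nomogram:.3f}, AS for all: {nb_all:.3f}")
```

A rule-based criterion yields one fixed selection, so only the weight changes as pt moves along the curve, whereas a nomogram re-selects men at every threshold. Repeating the calculation across pt = 0.5 to 0.7 gives the section of the decision curve that the text describes as clinically most relevant.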

Results
Our study cohort consisted of 864 men, of whom 619 had cT1-2, Gleason 6 disease at diagnosis and were therefore eligible for analysis. Median follow-up time after diagnosis was 8.9 years. Table 1 presents the study cohort characteristics at diagnosis and outcomes after RP. Table 2 presents the number of men who experienced BCR after RP according to the three definitions of indolent disease, for the different sets of rule-based criteria and for the nomogram-suitable cohort. A log-rank test showed that the number of men experiencing BCR did not differ significantly between the groups. However, the distribution of BCR over the indolent and significant groups changes, with a rising percentage of BCR in the indolent group (TV0.5 = 3.4 %, TV1.3 = 4.9 %, NoTV = 6.3 %). In ROC analysis (Appendix Fig. 1), the nomogram (TV0.5) had a slightly better balance of sensitivity and specificity than the PRIAS rules. The AUC for the nomogram (TV0.5) was 0.610, for PRIAS 0.584, for Klotz 0.524, for Johns Hopkins 0.615, for the refitted TV1.3 nomogram 0.595, and for the refitted NoTV nomogram 0.570.
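To make the comparison of AUCs between a yes/no rule and a continuous nomogram probability concrete, the sketch below computes the AUC as the probability that a randomly chosen case with the outcome scores higher than one without, counting ties as one half. For a binary rule this reduces to (sensitivity + specificity) / 2. The source does not state how the AUCs were computed, so the function and the toy data are illustrative assumptions only.

```python
def auc(scores, labels):
    """AUC via pairwise comparison (Mann-Whitney), ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the study): outcome status for six men.
outcome = [1, 1, 1, 0, 0, 0]

# Rule-based selection gives only 0/1 "scores".
rule_positive = [1, 1, 0, 1, 0, 0]
auc_rule = auc(rule_positive, outcome)

# For a binary rule the AUC equals the mean of sensitivity and specificity.
sensitivity = sum(r and y for r, y in zip(rule_positive, outcome)) / sum(outcome)
specificity = sum((not r) and (not y) for r, y in zip(rule_positive, outcome)) / (
    len(outcome) - sum(outcome)
)

# A nomogram gives a continuous probability, so more pairs can be ranked.
predicted = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc_nomogram = auc(predicted, outcome)

print(f"rule: {auc_rule:.3f}, nomogram: {auc_nomogram:.3f}")
```

This is one reason why AUCs of rule-based criteria and nomograms should be compared with some caution: a rule contributes a single point on the ROC curve, whereas a nomogram traces the whole curve.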
In terms of clinical usefulness, DCA (Appendix Fig. 2a-c) showed no large differences in NB between the models for threshold probabilities of 50-70 %, which are clinically most relevant.

Discussion
In our cohort of Dutch and Swedish screen-detected PCa patients who all underwent initial RP, 37 % fulfilled the TV0.5 indolent PCa criteria at RP, increasing to 58 % for the TV1.3 indolent PCa criteria and 66 % for the NoTV indolent PCa definition. More stringent rule-based AS inclusion criteria and stricter nomogram probability thresholds decrease the rate of misclassified tumours in a rather similar fashion, but both do so at the cost of a substantial number of patients no longer being considered suitable for AS.
The nomogram based on TV0.5 had slightly better sensitivity and specificity with respect to BCR outcome than the PRIAS and Klotz criteria. Compared with the TV0.5 nomogram, the Johns Hopkins criteria performed better, but at the cost of including fewer patients and thereby curatively treating patients who might also have been suitable for AS.
On the basis of a Kaplan-Meier analysis (curves not shown), we cannot conclude that the use of the TV0.5 nomogram is preferred over the use of rule-based selection or vice versa. However, for BCR the TV0.5 nomogram outperformed the PRIAS and Klotz criteria, while it performed slightly worse than the Johns Hopkins criteria. If we choose a slightly lower Pind threshold, thereby allowing more men to be included on AS, the sensitivity and specificity of the TV0.5 nomogram are still acceptable. This flexibility in application is a property of using a nomogram for selection rather than a strict set of rules, and it is desirable in the light of individualised medicine and shared decision-making.
Because the classic definition of a pathologically indolent PCa may be too restrictive [14], we also used two more recent definitions of indolent PCa. When the models were compared, the TV0.5 nomogram (AUC 0.685) was slightly better at discriminating indolent from significant PCa than the PRIAS (AUC 0.658), Johns Hopkins (AUC 0.642), and Klotz (AUC 0.523) criteria. This trend of the nomogram predicting slightly better is also seen for the refitted TV1.3 and NoTV nomograms.
Perfect patient selection for AS, using either rule-based selection criteria or a nomogram, seems difficult at present. The AUCs illustrate that both approaches are currently suboptimal in differentiating indolent from non-indolent disease at RP in a group of men with already low-risk features at diagnosis. This is confirmed by the study of Wang et al. [20], who, in a group of 273 AS patients who underwent multiple biopsies and/or delayed RP, found that nomograms designed to predict indolent tumours have only a modest ability to predict biopsy progression and any progression on either biopsy or surgery in men choosing an AS management strategy. Wang et al. furthermore concluded that, in a subgroup of 58 men, none of the various nomograms was able to predict surgical progression at RP [20]. Since AS is incorporated into many guidelines (AUA, NCCN, EAU, etc.) as a viable management strategy for men with either very low-risk or low-risk PCa, it is expected that more men will elect AS as their primary therapy. The optimisation of both rule-based selection and probability-based selection is therefore warranted.
Over the past few years, magnetic resonance imaging (MRI) has emerged as a tool which may be able to determine the risk of significant disease and of disease progression over time more accurately by improving sampling through targeted biopsies [21]. MRI may therefore also help to select AS candidates better [22]. Several studies have shown the additional value of MRI in an AS protocol [21-23]. Stamatakis et al. [22] combined MRI-based factors into a nomogram which generates a probability for confirmed AS candidacy. They found that three MRI-based factors, i.e. number of lesions, lesion suspicion, and lesion density, were associated with confirmatory biopsy outcome and reclassification. According to Stamatakis et al. [22], a nomogram built on these factors has promising predictive accuracy. Adding such factors to the currently existing rule-based selection criteria or to the nomogram could improve sensitivity and specificity and thereby AS patient selection.
A first limitation of our study is that men in our cohort were diagnosed with sextant biopsies. Sextant biopsy no longer reflects clinical practice, which nowadays relies on 8-18 core biopsies. Studies that applied more extended biopsy schemes argue that with a sextant biopsy protocol, 10-30 % of cancers are missed [24]. Several studies reported that when 8-12 cores were taken, the PCa detection rate in a clinical setting might increase [24,25]. We validated the previously developed nomogram in multiple other populations in which more extended biopsy schemes were used. Results of these validation studies showed that the nomogram predicted indolent PCa with good discrimination, indicating that it can be broadly applied in contemporary urological practice [26,27]. In addition, we extracted correction factors with which the nomogram can be adjusted for contemporary extended biopsy schemes [28]. A second limitation is that the follow-up time of our study cohort is too short to assess mortality outcomes and relate these to baseline selection criteria. The lack of mortality outcomes was also the reason for choosing BCR as an endpoint instead. Many men with BCR, however, will never develop metastasised disease or die from PCa [29]. Thirdly, patients underwent RP in different centres in either Sweden or the Netherlands. They were operated on by different surgeons using different techniques for RP, which might influence outcomes. Finally, 247 cases included in this analysis were also used in the validation and construction of the nomogram. This may lead to an overestimated performance of the nomogram and Pind. The strength of this study lies in the fact that all men were diagnosed with PCa within ERSPC (Sweden and the Netherlands), resulting in standardised pathological examination of biopsy specimens and structured data follow-up [30].
In conclusion, in our cohort of Dutch and Swedish screen-detected PCa patients who all underwent initial RP, 37 % had TV0.5 indolent PCa at RP, increasing to 58 % for the TV1.3 definition and 66 % for the NoTV definition. The performance of the ERSPC-based TV0.5 nomogram and of rule-based selection by the Johns Hopkins and PRIAS criteria is comparable. Because the nomogram allows individual trade-offs, it could be a good alternative to applying rigid rule-based criteria. Furthermore, a nomogram can accommodate the continuous improvement of risk assessment by incorporating newly emerging risk criteria, including imaging modalities.