Introduction

Early detection of prostate cancer (PCa) has increased the detection of indolent tumours, i.e. tumours that are unlikely to become symptomatic during life. The ability to predict indolent PCa is needed to avoid overtreatment [1]. Active surveillance (AS) has emerged as a feasible strategy to decrease the overtreatment of low-risk PCa. With AS, men with low-risk PCa are strictly monitored over time, and if risk reclassification or disease progression occurs, they can opt for curative therapy. Hence, the aim of AS is to safely delay or completely avoid the side effects of active therapy [2]. Worldwide, there are 16 unique AS cohorts, each with its own protocol, and these protocols vary considerably [3]. So far, published results from AS cohorts are encouraging with respect to biochemical recurrence (BCR) rates and disease-specific mortality [4]. Long-term effects are as yet unknown. However, research on how to improve the existing AS protocols is needed, as misclassification at diagnosis and subsequent reclassification after the one-year repeat biopsy are not uncommon [5]. For example, 28 % of men within the Prostate cancer Research International: Active Surveillance (PRIAS) study were reclassified after one or more repeat biopsies [6].

Currently, all existing AS cohorts apply relatively simple combinations of inclusion criteria for patient selection (“rule-based selection”). More refined risk stratification through a nomogram may be preferable, especially in the light of individualised medicine and shared decision-making (“probability-based selection”) [7]. We aimed to assess how well the inclusion criteria used in several prospective AS protocols identify indolent cancer at radical prostatectomy (RP), and to assess follow-up outcomes of men who received immediate RP but were also suitable for AS. For comparison, we used a previously developed and externally validated nomogram that predicts indolent disease [8, 9]. We hypothesised that probability-based selection with a nomogram that incorporates multiple patient characteristics may select patients better than rule-based selection.

Materials and methods

Patients

Men included in this study were participants in the screening arm of the European Randomized study of Screening for Prostate Cancer (ERSPC). Data cohorts of the Swedish and Dutch sections of ERSPC were combined. All men were diagnosed with screen-detected PCa and underwent RP as primary treatment. Details on both Dutch and Swedish screening protocols were previously published [10, 11].

Methods

Men with T3-4 disease or Gleason ≥7 PCa at diagnostic biopsy, or with an unknown tumour volume, were excluded from this analysis, as were men with positive lymph nodes or distant metastases at the time of diagnosis or surgery. Missing data were filled in with a multiple imputation model; we used the first imputation of a multiple imputation procedure with the impute function in SPSS software (IBM Corp. Released 2012. IBM SPSS Statistics for Windows, version 21.0. Armonk, NY: IBM Corp). A total of 936 confounder values were missing, comprising 13.5 % of all values. Filling in these values through imputation allowed us to include the 382 (44 %) patients with any missing value in the analysis. All tumour characteristics were used for the multiple imputation.

We first assessed the frequency of indolent PCa at RP according to the classic definition of pT2, tumour volume <0.5 ml (TV0.5), and pathological Gleason pattern ≤3 [12]. Men not fulfilling these criteria for indolent PCa (TV > 0.5 ml and/or pathological Gleason pattern >3) were categorised as having significant PCa.

Second, we selected men from our study cohort with low-risk PCa at diagnosis defined according to the PRIAS (T1c-T2, PSA ≤ 10 ng/ml; PSA density <0.20 ng/ml/cc, Gleason ≤3 + 3, ≤2 positive cores), Klotz (T1b-T2b; PSA ≤ 10 ng/ml; Gleason ≤6), and Johns Hopkins criteria (T1c, PSA density <0.15 ng/ml/cc, Gleason ≤6, ≤2 positive cores, ≤50 % single core involvement). The frequencies of indolent PCa at RP in these groups were studied.
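The three rule sets listed above can be read as simple predicates. The Python sketch below is illustrative only and is not the code used in the study: the `Patient` record, its field names, and the `STAGES` ordering are hypothetical. It assumes that "T2" in the PRIAS criteria covers T2a to T2c and that "Gleason ≤3 + 3" can be approximated by a Gleason score of at most 6.

```python
from dataclasses import dataclass

# Ordered clinical stages (assumed ordering); list position is used for range checks.
STAGES = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3", "T4"]


@dataclass
class Patient:
    """Hypothetical record of diagnostic characteristics."""
    stage: str                   # clinical stage, e.g. "T1c"
    psa: float                   # ng/ml
    psa_density: float           # ng/ml/cc
    gleason_score: int           # biopsy Gleason score, 6 for 3 + 3
    positive_cores: int          # number of positive biopsy cores
    max_core_involvement: float  # largest single-core involvement, percent


def stage_between(stage, low, high):
    """True if `stage` lies within the inclusive range [low, high]."""
    return STAGES.index(low) <= STAGES.index(stage) <= STAGES.index(high)


def prias(p):
    return (stage_between(p.stage, "T1c", "T2c") and p.psa <= 10
            and p.psa_density < 0.20 and p.gleason_score <= 6
            and p.positive_cores <= 2)


def klotz(p):
    return (stage_between(p.stage, "T1b", "T2b") and p.psa <= 10
            and p.gleason_score <= 6)


def johns_hopkins(p):
    return (p.stage == "T1c" and p.psa_density < 0.15
            and p.gleason_score <= 6 and p.positive_cores <= 2
            and p.max_core_involvement <= 50)


# Invented example patient: stage T2a, PSA 6.0, PSA density 0.17, two positive cores.
example = Patient("T2a", 6.0, 0.17, 6, 2, 40.0)
eligibility = {rule.__name__: rule(example) for rule in (prias, klotz, johns_hopkins)}
```

For this invented patient, the PRIAS and Klotz predicates hold, whereas the Johns Hopkins predicate fails on both clinical stage and PSA density. This shows how the stricter rule set excludes men whom the other two would accept.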

Third, we explored the use of a nomogram to estimate risk for indolent PCa at RP [13]. We assessed the effect of applying various eligibility criteria for the nomogram (T1c-T2a, PSA ≤ 20 ng/ml; Gleason ≤3 + 3, ≤50 % positive cores, 20 mm PCa, 40 mm benign tissue in all cores) and of different thresholds in the predicted chance of harbouring indolent PCa (referred to as Pind) on the number of men remaining suitable for AS at diagnosis.

The classic definition of a pathologically indolent PCa (pT2, TV0.5, and Gleason pattern ≤3) might be too restrictive and might not reflect biology well [14]. Therefore, we repeated steps one to three with two updated, more recent definitions of indolent PCa: (1) pT2, tumour volume <1.3 ml (TV1.3) and Gleason pattern ≤3 + 3 [14–16]; (2) pT2 and Gleason pattern ≤3 + 3, with tumour volume not part of the definition (NoTV) [15]. For step three, the nomogram was refitted twice using the original data [13], to account for the adjusted definitions of an indolent PCa.
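The three pathological definitions differ only in the tumour volume limit, which a short Python sketch can make explicit. The function name, argument names, and the string encoding of the pathological stage are assumptions made for illustration. "Gleason pattern ≤3 + 3" is simplified here to a highest pattern of at most 3.

```python
# Volume limits in ml per definition; None means volume is not part of the definition.
VOLUME_LIMIT_ML = {"TV0.5": 0.5, "TV1.3": 1.3, "NoTV": None}


def is_indolent(pt_stage, volume_ml, highest_gleason_pattern, definition="TV0.5"):
    """Return True if an RP specimen counts as indolent under `definition`."""
    limit = VOLUME_LIMIT_ML[definition]
    organ_confined = pt_stage.startswith("pT2")
    low_grade = highest_gleason_pattern <= 3
    small_enough = limit is None or volume_ml < limit
    return organ_confined and low_grade and small_enough


# Invented specimen: pT2, 0.9 ml, highest Gleason pattern 3.
labels = {d: is_indolent("pT2", 0.9, 3, d) for d in VOLUME_LIMIT_ML}
```

The invented 0.9 ml specimen is significant under TV0.5 but indolent under TV1.3 and NoTV. This is the mechanism by which the proportion of indolent tumours rises as the definition is relaxed.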

Because follow-up data were available, we were able to assess BCR after RP. The criteria proposed by Freedland et al. [17] were used to define BCR, i.e. one PSA value after RP > 0.2 ng/ml. The different sets of rule-based selection criteria and Pind cut-off points were compared using the Kaplan–Meier method and the log-rank test.
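As a minimal sketch of the BCR analysis, the Python code below flags BCR from a series of post-RP PSA values and computes a Kaplan–Meier (product-limit) estimate of BCR-free survival. It is a didactic implementation with invented data, not the SPSS or R procedure used in the study. The function names are hypothetical, and the log-rank test is not included.

```python
def has_bcr(post_rp_psa_values, threshold=0.2):
    """BCR as defined in the text: any PSA value after RP above 0.2 ng/ml."""
    return any(value > threshold for value in post_rp_psa_values)


def kaplan_meier(times, events):
    """Product-limit estimate of BCR-free survival.

    times  -- follow-up time per man (any consistent unit)
    events -- True if BCR was observed at that time, False if censored
    Returns (time, survival) pairs, one per distinct time with an event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [event for (time, event) in data if time == t]
        n_events = sum(tied)
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)  # events and censored men both leave the risk set
        i += len(tied)
    return curve


# Invented follow-up data: five men, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, False, True, False, True])
```

With these invented data, the estimated BCR-free survival drops to 0.8 at time 1, to about 0.6 at time 2, and to 0 at time 4. Censored men reduce the number at risk without producing a step in the curve.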

We finally applied decision curve analysis (DCA) [18] to evaluate the potential clinical usefulness of rule-based selection and probability-based selection models. We estimated a net benefit (NB) for the four models by summing the benefits (true-positive indolent PCa) and subtracting the harms (false-positive indolent PCa). The harms were weighted by a factor related to the relative harm of being unjustly included in AS versus being directly curatively treated while suitable for AS. This weighting was derived from the threshold probability at which a patient would opt for AS. This threshold varies between men and urologists. Clinical practice currently uses a threshold probability of 50–70 % [19]. The interpretation of a decision curve is rather straightforward; the model with the highest NB at a particular threshold should be chosen over alternative models.
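In the standard formulation of DCA, the net benefit at a threshold probability pt is NB = TP/n − (FP/n) × pt/(1 − pt). Here, TP and FP are the numbers of men selected for AS who have indolent and significant PCa at RP, respectively. The Python sketch below applies that formula. That the study used exactly this form is an assumption, and the function names and counts are invented for illustration.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Standard decision-curve net benefit at one threshold probability."""
    weight = threshold / (1 - threshold)  # relative harm of a false positive
    return true_pos / n - (false_pos / n) * weight


def net_benefit_for_nomogram(pind, indolent, threshold):
    """Select men with predicted probability >= threshold, then compute NB.

    pind     -- predicted probabilities of indolent PCa
    indolent -- observed outcome at RP (True if indolent)
    """
    selected = [p >= threshold for p in pind]
    tp = sum(s and y for s, y in zip(selected, indolent))
    fp = sum(s and not y for s, y in zip(selected, indolent))
    return net_benefit(tp, fp, len(pind), threshold)


# Invented counts: 100 men, 40 true positives and 20 false positives.
nb_50 = net_benefit(40, 20, 100, 0.50)
nb_70 = net_benefit(40, 20, 100, 0.70)
```

With these invented counts, NB is 0.20 at a 50 % threshold and becomes slightly negative at 70 %, because each false positive is then weighted by 0.7/0.3 ≈ 2.3. A fixed rule set selects the same men at every threshold, whereas the nomogram's selection changes with the threshold.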

P values (two-sided) <0.05 were considered statistically significant. For statistical analysis, we used the Statistical Package for the Social Sciences (SPSS) version 21 (IBM Corp. Released 2012. IBM SPSS Statistics for Windows, version 21.0. Armonk, NY: IBM Corp) and R version 2.15.2 (R Foundation for Statistical Computing, Vienna, Austria).

Results

Our study cohort consisted of 864 men, of whom 619 had cT1-2, Gleason 6 disease at diagnosis and were therefore eligible for analyses. Median follow-up time after diagnosis was 8.9 years. Table 1 presents the study cohort characteristics at diagnosis and outcomes after RP. With the TV0.5 cut-off, a total of 229 (37 %) tumours at RP could be defined as indolent versus 390 (63 %) as significant. When applying the TV1.3 and NoTV indolent PCa definitions, the number of indolent PCa increased to 356 (58 %) and 410 (66 %), respectively. Pind could be calculated for 455 (74 %) men meeting the nomogram inclusion criteria.

Table 1 Study cohort characteristics at diagnosis and outcomes after radical prostatectomy (n = 619)

Table 2 presents the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for all three indolent PCa definitions (TV0.5, TV1.3, NoTV) at RP of the rule-based selection and nomogram-based selection approaches. The table also contains the effect of applying different thresholds of the nomogram calculated risk of harbouring indolent PCa, i.e. Pind.

Table 2 Sensitivity, specificity, and BCR for several sets of rule-based and nomogram-based AS inclusion criteria for selecting indolent PCa at RP (n = 619)

The area under the curve (AUC) for the TV0.5 indolent definition was 0.658 for PRIAS, 0.523 for Klotz, 0.642 for Johns Hopkins, and 0.685 for the nomogram. For the TV1.3 indolent definition, the AUC for PRIAS was 0.630, for Klotz 0.550, for Johns Hopkins 0.615, and for the refitted nomogram 0.646. For the NoTV indolent definition, the AUC for PRIAS was 0.603, for Klotz 0.530, for Johns Hopkins 0.589, and for the refitted nomogram 0.608.
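The measures in Table 2 follow from a 2 × 2 cross-tabulation of AS eligibility against indolent PCa at RP, as the Python sketch below shows. For a rule set that gives only a yes/no answer, the ROC curve has a single operating point, and the trapezoidal AUC then equals (sensitivity + specificity)/2. Whether the reported AUCs for the rule sets were obtained this way is an assumption. The function name and data are invented.

```python
def diagnostic_metrics(selected, indolent):
    """Accuracy measures for a yes/no selection rule.

    selected -- True if the man meets the AS inclusion criteria
    indolent -- True if indolent PCa was found at RP
    Assumes every cell of the 2 x 2 table is non-empty.
    """
    pairs = list(zip(selected, indolent))
    tp = sum(s and y for s, y in pairs)
    fp = sum(s and not y for s, y in pairs)
    fn = sum(not s and y for s, y in pairs)
    tn = sum(not s and not y for s, y in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # AUC of a single operating point (trapezoidal rule)
        "auc": (sensitivity + specificity) / 2,
    }


# Invented example with eight men.
selected = [True, True, True, False, False, False, False, False]
indolent = [True, True, False, True, False, False, False, False]
metrics = diagnostic_metrics(selected, indolent)
```

In this invented example, sensitivity is 2/3 and specificity is 0.8, which gives an AUC of about 0.73. A continuous predictor such as Pind traces a full ROC curve instead of a single point.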

Table 2 furthermore presents the number of men who experienced BCR after RP according to the three definitions of indolent disease in the different sets of rule-based criteria and the nomogram suitable cohort. A log-rank test showed that the number of men experiencing BCR did not differ significantly between the groups. However, the distribution of BCR over the indolent and significant groups changed, with a rising percentage of BCR in the indolent group (TV0.5 = 3.4 %, TV1.3 = 4.9 %, NoTV = 6.3 %). We found that in ROC analysis (Appendix Fig. 1), the nomogram (TV0.5) had a slightly better sensitivity-to-specificity ratio than the PRIAS rules. The AUC for the nomogram (TV0.5) was 0.610, for PRIAS 0.584, for Klotz 0.524, for Johns Hopkins 0.615, for the refitted TV1.3 nomogram 0.595, and for the refitted NoTV nomogram 0.570.

In terms of clinical usefulness, the DCA (Appendix Fig. 2a–c) showed no large differences in NB for threshold probabilities of 50–70 %, which are clinically most relevant.

Discussion

In our cohort of Dutch and Swedish screen-detected PCa patients who all underwent initial RP, 37 % fulfilled the TV0.5 indolent PCa criteria at RP, increasing to 58 % for the TV1.3 indolent PCa criteria and 66 % for the NoTV indolent PCa definition. More stringent rule-based AS inclusion criteria as well as stricter nomogram probability thresholds decrease the rate of misclassified tumours in a rather similar fashion, but both at the cost of a substantial number of patients no longer being considered suitable for AS. The nomogram based on TV0.5 had slightly better sensitivity and specificity with respect to BCR outcome than the PRIAS and Klotz criteria. Compared with the TV0.5 nomogram, the Johns Hopkins criteria performed better, but at the cost of including fewer patients and thereby curatively treating patients who might also have been suitable for AS.

On the basis of a Kaplan–Meier analysis (curves not shown), we cannot conclude that the use of the TV0.5 nomogram is preferred over the use of rule-based selection or vice versa. However, for BCR the TV0.5 nomogram outperformed the PRIAS and Klotz criteria, although it performed slightly worse than the Johns Hopkins criteria. If we choose a slightly lower Pind threshold, thereby allowing more men to be included in AS, the sensitivity and specificity of the TV0.5 nomogram remain acceptable. This flexibility in application is a property of using a nomogram for selection rather than a strict set of rules, and it is desirable in the light of individualised medicine and shared decision-making.

Because the classic definition of a pathologically indolent PCa may be too restrictive [14], we also used two more updated definitions of an indolent PCa. When juxtaposing the models, the TV0.5 nomogram (AUC 0.685) was slightly better in discriminating indolent from significant PCa than the PRIAS (AUC 0.658), Johns Hopkins (AUC 0.642), and Klotz (AUC 0.523) criteria. This trend of the nomogram predicting slightly better is also seen for the refitted TV1.3 and NoTV nomograms.

Perfect patient selection for AS, using either rule-based selection criteria or a nomogram, seems difficult at present. The AUCs illustrate that both approaches are currently suboptimal in differentiating indolent from non-indolent disease at RP in a group of men with already low-risk features at diagnosis. This is in line with the study of Wang et al. [20], who found in a group of 273 AS patients who underwent multiple biopsies and/or delayed RP that nomograms designed to predict indolent tumours have only a modest ability to predict biopsy progression and any progression on either biopsy or surgery in men choosing an AS management strategy. Wang et al. furthermore concluded that in a subgroup of 58 men, none of the various nomograms was able to predict surgical progression at RP [20]. Since AS is incorporated into many guidelines (AUA, NCCN, EAU, etc.) as a viable management strategy for men with either very low-risk or low-risk PCa, it is expected that more men will elect AS as their primary therapy. The optimisation of both rule-based selection and probability-based selection is therefore warranted.

Over the past few years, magnetic resonance imaging (MRI) has emerged as a tool that may determine the risk of significant disease and of disease progression over time more accurately by improving sampling through targeted biopsies [21]. MRI may therefore also help to better select AS candidates [22]. Several studies have shown the additional value of MRI in an AS protocol [21–23]. Stamatakis et al. [22] combined MRI-based factors into a nomogram which generates a probability for confirmed AS candidacy. They found that three MRI-based factors, i.e. number of lesions, lesion suspicion, and lesion density, were associated with confirmatory biopsy outcome and reclassification. A nomogram built on these factors has promising predictive accuracy, according to Stamatakis et al. [22]. Adding such factors to the existing rule-based selection criteria or to the nomogram could improve sensitivity and specificity and thereby AS patient selection.

A first limitation of our study is that men in our cohort were diagnosed with sextant biopsies. Sextant biopsy no longer reflects current clinical practice, which relies on 8–18 core biopsies. Studies that applied more extended biopsy schemes argue that with a sextant biopsy protocol, 10–30 % of cancers are missed [24]. Several studies reported that when 8–12 cores were taken, the PCa detection rate in a clinical setting might increase [24, 25]. We validated the previously developed nomogram in multiple other populations in which more extended biopsy schemes were used. Results of these validation studies showed that the nomogram predicted indolent PCa with good discrimination, indicating that it can be broadly applied in contemporary urological practice [26, 27]. In addition, we extracted correction factors for the adjustment of the nomogram with which contemporary extended biopsy schemes can be addressed [28]. Another limitation is that the follow-up time of our study cohort is too short to assess mortality outcomes and relate these to baseline selection criteria. The lack of mortality outcomes was also the reason to choose BCR as an endpoint instead. Many men with BCR, however, will never develop metastasised disease or die from PCa [29]. Thirdly, patients underwent RP in different centres in either Sweden or the Netherlands. They were operated on by different surgeons using different techniques for RP, which might influence outcomes. Finally, 247 cases included in this analysis were also used in the construction and validation of the nomogram. This may lead to an overestimated performance of the nomogram and Pind. A strength of this study is that all men were diagnosed with PCa within ERSPC (Sweden and the Netherlands), resulting in standardised pathological examination of biopsy specimens and structured data follow-up [30].

In conclusion, in our cohort of Dutch and Swedish screen-detected PCa patients who all underwent initial RP, 37 % had TV0.5 indolent PCa at RP, increasing to 58 % for the TV1.3 indolent PCa criteria and 66 % for the NoTV indolent PCa definition. The performance of an ERSPC-based TV0.5 nomogram and of rule-based selection by the Johns Hopkins and PRIAS criteria is comparable. Because the nomogram allows individual trade-offs, it could be a good alternative to applying rigid rule-based criteria. Furthermore, a nomogram can accommodate the continuous improvement of risk assessment through newly emerging risk criteria, including imaging modalities.