Introduction

Interpretation of prostate multiparametric magnetic resonance imaging (MRI) is challenging because of potential discordance between findings from the different pulse sequences and substantial overlap between the appearance of benign and malignant conditions. These difficulties led to the creation of the Prostate Imaging-Reporting and Data System (PI-RADS). For each pulse sequence, semi-objective descriptors are used to classify lesions into specific categories. These categories are then combined into a final score assessing the likelihood of clinically significant prostate cancer (csPCa). PI-RADS version 2 (PI-RADSv2) showed good performance but moderate inter-reader agreement [1,2,3,4,5,6,7,8,9]. Version 2.1 (PI-RADSv2.1) was published in 2019 to address PI-RADSv2 limitations and improve reproducibility by clarifying some descriptors [10].

Although PI-RADSv2.1 has been extensively evaluated [11,12,13,14,15,16,17,18,19,20,21,22], meta-analyses yielded discordant results on the relative diagnostic performance of PI-RADSv2 and PI-RADSv2.1 [23,24,25]. In particular, whether PI-RADSv2.1 improves inter-reader agreement remains unclear.

MRI interpretation can be broken down into two phases: the detection phase, in which the radiologist sees the lesion, and the characterization phase, in which they assess its degree of suspicion. Each phase contributes to the scoring performance and variability.

In this study, we focussed on the characterization phase by asking 21 readers with varying experience to assess, using PI-RADSv2.1 and PI-RADSv2 descriptors, the same set of MRI lesions with known histology. Our primary objective was to determine whether these descriptors were precise enough to allow readers to assign similar scores to the same lesions.

Materials and methods

Prospective biopsy database

Since September 2008, consecutive patients undergoing prostate MRI and subsequent biopsy at our institution have been included in a prospective database after signing institutional review board-approved consent forms [26]. MRIs combined T2-weighted (T2w), diffusion-weighted (Dw) and dynamic contrast-enhanced (DCE) imaging at 1.5 T or 3 T. Transrectal biopsies combined systematic and targeted cores obtained under cognitive or MRI/ultrasound fusion (Urostation, Koelis) guidance, depending on the lesions’ location and the operator’s preference. Two to five targeted cores were taken from each lesion and at least two systematic cores (one paramedian, one lateral) from each peripheral zone (PZ) sextant. The operator could omit systematic cores from PZ sextants with lesions targeted at biopsy. The transition zone (TZ) was biopsied only if it contained suspicious lesions.

Readers

Twenty-one radiologists (14 seniors, 7 juniors) from nine different private and public hospitals participated in the study. Seven seniors (experienced seniors) had more than 5 years of experience and seven (less experienced seniors) had less than 5 years. Four juniors had completed a 6-month rotation in a department of uroradiology, three had passed an advanced diploma in genitourinary imaging, and two had no experience in prostate imaging (Additional file 1: I). Before starting the study, juniors took a 2-h class on PI-RADS scoring. Then, all readers attended a meeting during which representative cases were presented and differences between PI-RADSv2 and PI-RADSv2.1 were discussed.

Study sample

Consecutive biopsy-naïve patients included in the biopsy database between September 2015 and July 2016 were retrospectively selected. September 2015 corresponded to the date of implementation of PI-RADSv2 guidelines at our institution (Additional file 1: II). July 2016 was chosen to allow for at least four years of follow-up. These dates were also chosen because during that period, biopsy operators were instructed to target all focal lesions, even those with a low degree of suspicion, resulting in a large variety of targeted lesions.

Readers were given a four-month period (September-December 2019) to interpret the MRIs of the study sample. They were blinded to clinical and histological data, and to each other’s assessment.

Predefined lesions

First, readers assessed the ‘predefined lesions’, i.e. the MRI lesions targeted at biopsy. These were indicated on one T2w image. Readers were informed that, at the time the sample was acquired, biopsy operators were instructed to target all focal lesions, and thus, that a substantial proportion of the predefined lesions was expected to be benign. Nonetheless, the proportion of benign lesions and csPCas in the sample was not disclosed.

Readers noted the lesions’ maximal diameter, side and location (PZ, TZ or central zone (CZ)). When lesions extended into several zones, the zone in which most of the lesion was located was selected.

Then, readers defined the lesions’ PI-RADSv2 and PI-RADSv2.1 categories, for each pulse sequence, following as closely as possible the manual definitions of these categories (Additional file 1: II). The lesions’ final PI-RADSv2 and PI-RADSv2.1 scores were automatically calculated based on their location, size and pulse sequence categories.
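The automatic calculation can be illustrated for PI-RADSv2.1 with a minimal Python sketch. It assumes the combination rules of the published PI-RADSv2.1 guidelines [10] for PZ and TZ lesions only. The function name `pirads_v21_score` and its arguments are hypothetical, CZ lesions are not handled, and lesion size is assumed to be already reflected in the choice between categories 4 and 5. It is not the software used in this study.

```python
def pirads_v21_score(zone: str, t2w: int, dwi: int, dce_positive: bool) -> int:
    """Combine pulse sequence categories into an overall PI-RADSv2.1 score.

    Illustrative only: covers PZ and TZ lesions, not CZ lesions.
    """
    if not (1 <= t2w <= 5 and 1 <= dwi <= 5):
        raise ValueError("T2w and Dw categories must be between 1 and 5")
    if zone == "PZ":
        # Dw imaging is the dominant sequence in the peripheral zone;
        # DCE only changes the score of Dw category 3 lesions.
        if dwi == 3 and dce_positive:
            return 4
        return dwi
    if zone == "TZ":
        # T2w imaging is the dominant sequence in the transition zone;
        # Dw imaging can upgrade T2w categories 2 and 3.
        if t2w == 2 and dwi >= 4:
            return 3
        if t2w == 3 and dwi == 5:
            return 4
        return t2w
    raise ValueError("this sketch only handles 'PZ' and 'TZ' lesions")


# A PZ lesion with Dw category 3 is scored 3 or 4 depending on DCE.
print(pirads_v21_score("PZ", t2w=2, dwi=3, dce_positive=False))
print(pirads_v21_score("PZ", t2w=2, dwi=3, dce_positive=True))
```

Because the dominant sequence depends on the zone, the same pulse sequence categories can lead to different overall scores when two readers place a lesion in different zones.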

Additional lesions

If needed, readers could note additional lesions that had not been targeted at biopsy. They defined, for each ‘additional lesion’, its location, diameter and pulse sequence categories according to PI-RADSv2 and PI-RADSv2.1 manual definitions. The overall scores were automatically calculated.

Per-lobe and per-patient scores

The PI-RADSv2 and PI-RADSv2.1 scores of each prostate lobe/patient were computed by selecting the highest scores of the predefined and additional lesions described in this lobe/patient. Lobes or patients with no lesion received default PI-RADSv2 and PI-RADSv2.1 scores of 1 (Additional file 1: III).
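The aggregation rule described above can be sketched in a few lines of Python. The data layout (a dictionary mapping each lobe to the overall scores of the lesions a reader described in it) and the names `highest_score` and `lesions_by_lobe` are hypothetical and chosen for illustration only.

```python
from typing import Iterable

DEFAULT_SCORE = 1  # score given to lobes or patients with no described lesion


def highest_score(lesion_scores: Iterable[int]) -> int:
    """Return the highest lesion score, or the default score if there is none."""
    return max(lesion_scores, default=DEFAULT_SCORE)


# Hypothetical output of one reader for one patient:
# overall scores of predefined and additional lesions, grouped by lobe.
lesions_by_lobe = {"left": [2, 4], "right": []}

lobe_scores = {lobe: highest_score(s) for lobe, s in lesions_by_lobe.items()}
patient_score = highest_score(
    score for scores in lesions_by_lobe.values() for score in scores
)

print(lobe_scores)
print(patient_score)
```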

Follow-up

Follow-up data were retrieved in June–September 2020. The medical files of the patients without csPCa at initial biopsy were searched for any additional prostate biopsy performed during follow-up. Patients without follow-up at our institution were contacted by telephone or through their general practitioner.

Reference standard and csPCa definition

For characterizing predefined lesions, targeted biopsy findings were used as reference standard. For per-lobe and per-patient analysis that took into account predefined and additional lesions, combined targeted and systematic biopsy findings were used as reference standard. csPCa was defined as International Society of Urological Pathology (ISUP) grade ≥ 2 cancer.

Statistical analysis

Quantitative characteristics were described using medians and interquartile ranges (IQRs). Qualitative characteristics were described using absolute and relative frequencies.

A mixed probit regression corresponding to the binormal model was used to model the receiver operating characteristic (ROC) curves according to the reader’s experience, with the reader as random effect [27, 28]. Regression coefficients for experienced and less experienced seniors in comparison to juniors made it possible to quantify and test the effect of the reader’s experience on the diagnostic performance of the scores. The model was also used to predict the ROC curve for each category of readers. Areas under the curve (AUCs) were estimated using the binormal method [28]. A stratified bootstrap, with sampling at the level of patients within strata defined by the presence or absence of csPCa, was used to build 95% confidence intervals (CIs) for the AUCs. A logistic mixed model was used to model sensitivity and specificity according to the reader’s experience, with the reader as random effect. Sensitivities and specificities were estimated with their 95% CIs for predefined thresholds of PI-RADS scores of ≥ 3 and ≥ 4. Inter-reader agreement was estimated using Cohen’s kappa coefficient (κ) for location and DCE categories, the concordance correlation coefficient (CCC) for size, and weighted κ for T2w and Dw categories and overall scores. Coefficients of ≤ 0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80 and > 0.80 were considered to indicate poor, fair, moderate, good and excellent agreement, respectively.
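For readers unfamiliar with these agreement coefficients, the following Python sketch computes them for a single pair of readers. It is an illustration under stated assumptions, not the R code used for the analysis: linear weights are assumed for the weighted κ (the weighting scheme is not specified above), the way pairwise values were pooled across the 21 readers is not reproduced, and all function names and example ratings are hypothetical.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(r1, r2, categories, weighted=False):
    """Cohen's kappa for two readers (linear weights if weighted=True).

    Needs at least two categories and at least some expected disagreement.
    """
    n, k = len(r1), len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    def disagreement(a, b):
        if weighted:  # assumption: linear weights
            return abs(idx[a] - idx[b]) / (k - 1)
        return 0.0 if a == b else 1.0

    pairs, m1, m2 = Counter(zip(r1, r2)), Counter(r1), Counter(r2)
    observed = sum(disagreement(a, b) * c / n for (a, b), c in pairs.items())
    expected = sum(
        disagreement(a, b) * m1[a] * m2[b] / n**2
        for a in categories
        for b in categories
    )
    return 1 - observed / expected


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    mx, my = mean(x), mean(y)
    vx = mean((a - mx) ** 2 for a in x)
    vy = mean((b - my) ** 2 for b in y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def agreement_label(coef):
    """Map a coefficient to the agreement bands used in this study."""
    for upper, label in [(0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")]:
        if coef <= upper:
            return label
    return "excellent"


# Hypothetical overall scores given by two readers to five lesions.
reader_a = [1, 2, 3, 4, 5]
reader_b = [2, 2, 3, 4, 5]
kappa = cohen_kappa(reader_a, reader_b, [1, 2, 3, 4, 5])
kappa_w = cohen_kappa(reader_a, reader_b, [1, 2, 3, 4, 5], weighted=True)
print(round(kappa, 2), round(kappa_w, 2), agreement_label(kappa_w))
```

The weighted coefficient penalizes a disagreement between scores of 1 and 2 less than a disagreement between scores of 1 and 5, which is why it suits ordinal categories better than the unweighted κ.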

Similar analyses were performed at lobe and patient level. R software, version 3.6.1 (https://cran.r-project.org) was used for analysis. This study is registered with ClinicalTrials.gov, number NCT04299997.
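The patient-level stratified bootstrap mentioned above can be sketched as follows. This is a simplified Python illustration, not the analysis code: it replaces the binormal AUC with the empirical (Mann-Whitney) AUC, ignores the reader random effect, and uses a hypothetical toy data set in which every patient with csPCa has at least one csPCa lesion and every patient without csPCa has at least one benign lesion.

```python
import random


def empirical_auc(pos, neg):
    """Probability that a csPCa lesion outscores a non-csPCa lesion (ties = 1/2)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def lesion_auc(patients):
    """Per-lesion AUC computed from all lesions of the given patients."""
    pos = [s for pt in patients for s, is_cspca in pt["lesions"] if is_cspca]
    neg = [s for pt in patients for s, is_cspca in pt["lesions"] if not is_cspca]
    return empirical_auc(pos, neg)


def stratified_bootstrap_ci(patients, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI; patients are resampled within csPCa-positive/negative strata."""
    rng = random.Random(seed)
    strata = [
        [pt for pt in patients if pt["cspca"]],
        [pt for pt in patients if not pt["cspca"]],
    ]
    aucs = []
    for _ in range(n_boot):
        # Draw as many patients as each stratum contains, with replacement,
        # so that all lesions of a drawn patient stay together.
        resample = [rng.choice(stratum) for stratum in strata for _ in stratum]
        aucs.append(lesion_auc(resample))
    aucs.sort()
    lower = aucs[int(alpha / 2 * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: each lesion is (overall score, csPCa at targeted biopsy).
patients = [
    {"cspca": True, "lesions": [(4, True), (2, False)]},
    {"cspca": True, "lesions": [(5, True)]},
    {"cspca": True, "lesions": [(3, True)]},
    {"cspca": False, "lesions": [(2, False), (3, False)]},
    {"cspca": False, "lesions": [(4, False)]},
    {"cspca": False, "lesions": [(1, False)]},
]
point = lesion_auc(patients)
ci = stratified_bootstrap_ci(patients)
print(round(point, 2), ci)
```

Resampling patients rather than lesions keeps lesions from the same patient together, and stratifying by csPCa status keeps the number of patients with and without csPCa fixed in every resample.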

Results

Study sample

A total of 159 patients imaged at 1.5 T (n = 77) or 3 T (n = 82) were included (Fig. 1, Table 1). MRI scanners and protocols are detailed in Additional file 1: IV. Twelve patients had normal MRI, and 240 lesions were targeted in the 147 remaining patients. These 240 lesions constituted the ‘predefined lesions’ corpus.

Fig. 1

Standards for reporting of diagnostic accuracy (STARD) flow diagram. MR Magnetic resonance, PACS Picture archiving communication system, ISUP International society of urological pathology

Table 1 Patients’ characteristics

Predefined lesions

Agreement on location, size and PI-RADS categories

Agreement on lesions’ location was moderate-to-good (κ = 0.60–0.73), with experienced seniors obtaining the highest κ (Table 2). Perfect agreement across all readers was reached in only 142/240 lesions (PZ, n = 133; TZ, n = 9; CZ, n = 0). Depending on the reader, a median number of 204 (IQR, 202–210), 26 (IQR, 23–28) and 10 (IQR, 6–12) lesions were localized in PZ, TZ and CZ, respectively (Additional file 1: V.1). Agreement on size was excellent (CCC ≥ 0.80) for all groups of readers (Table 2).

Table 2 Inter-reader agreement (analysis of the 240 predefined lesions)

Agreement on PI-RADSv2.1 T2w and Dw categories was moderate (κ = 0.42–0.58 and κ = 0.48–0.57, respectively) and tended to increase with experience. For DCE categories, agreement was fair (κ = 0.30–0.38) for all groups of readers. Similar findings were obtained with PI-RADSv2 categories (Table 2).

PI-RADS scores

Inter-reader agreement for PI-RADSv2.1 scoring was moderate for seniors (κ = 0.43–0.47) and fair for juniors (κ = 0.39; Table 2). Using PI-RADSv2.1, juniors obtained a significantly lower AUC (0.74 [95%CI, 0.70–0.79]) than experienced seniors (0.80 [95%CI, 0.76–0.84], p = 0.008), but not than less experienced seniors (0.74 [95%CI, 0.70–0.78], p = 0.75). Experienced seniors tended to show higher specificity, but the difference was not statistically significant (Tables 3–4, Additional file 1: V.2–V.5).

Table 3 PI-RADSv2.1 and PI-RADSv2 scores assigned by the three groups of readers
Table 4 Sensitivities and specificities obtained by the three groups of readers using PI-RADSv2.1 and PI-RADSv2 scoring

Similar findings were obtained with PI-RADSv2 (Tables 2–4, Additional file 1: V.2–V.5). All groups of readers tended to assign lower scores to non-csPCa lesions using PI-RADSv2.1 than using PI-RADSv2. As compared to PI-RADSv2, PI-RADSv2.1 downgraded a median number of 17 lesions per reader (IQR, 6–29), of which 2 (IQR, 1–3) were csPCa. It upgraded a median number of 4 lesions per reader (IQR, 2–7), of which 1 (IQR, 0–2) was csPCa. The most frequent downgradings were from PI-RADS scores of 3 to 2 and 4 to 2. In TZ, a median number of 2 lesions (IQR, 0–2) were downgraded from a score of 3 to 2, and a median number of 1 lesion (IQR, 0–2) was upgraded from a score of 2 to 3 (Additional file 1: V.6–V.7).

Additional lesions

Readers described a median number of 60 ‘additional lesions’ (IQR, 25–73; Additional file 1: VI.1).

Per-lobe and per-patient scores

At per-lobe analysis, after taking into consideration predefined and additional lesions, inter-reader agreement for PI-RADSv2.1 scoring was moderate-to-good (κ = 0.54–0.63; Table 5). Using PI-RADSv2.1, juniors obtained a significantly lower AUC (0.79 [95%CI, 0.75–0.83]) than experienced seniors (0.82 [95%CI, 0.79–0.86], p = 0.03), but not than less experienced seniors (0.79 [95%CI, 0.76–0.83], p = 0.71). Experienced seniors tended to show higher specificity, but the difference was not statistically significant (Table 5, Additional file 1: VI.2–VI.5).

Table 5 Inter-reader agreement (per-lobe analysis)

Similar findings were obtained with PI-RADSv2 (Tables 3–5, Additional file 1: VI.2–VI.5). As compared to PI-RADSv2, PI-RADSv2.1 downgraded a median number of 66 lobes per reader (IQR, 35–94), of which 6 (IQR, 1–11) contained csPCa at biopsy (Fig. 2). It upgraded a median number of 5 lobes per reader (IQR, 2–8), of which 1 (IQR, 0–2) contained csPCa. The most frequent downgradings were from PI-RADS scores of 2 to 1, 4 to 2 and 3 to 2 (Additional file 1: VI.6).

Fig. 2

Axial images obtained in a 62-year-old patient with prostate-specific antigen (PSA) level of 8.1 ng/mL and normal digital rectal examination. Prostate multiparametric magnetic resonance imaging (a, T2-weighted image; b, apparent diffusion coefficient map; c, diffusion-weighted trace image obtained with b value of 2000 s/mm²; and d, dynamic contrast-enhanced image) showed a 13-mm linear lesion parallel to the capsule in the peripheral zone of the left base (a–d, arrowheads). Using PI-RADSv2 descriptors, 17 readers assigned to the lesion a T2-weighted imaging (T2WI) category of 2 (‘Linear, wedge-shaped or diffuse mild hypointensity, usually indistinct margin’), two readers a T2WI category of 3 (‘Heterogeneous signal intensity or non-circumscribed, rounded, moderate hypointensity’) and two readers a T2WI category of 4 (‘Circumscribed, homogeneous moderate hypointense focus/mass confined to prostate and < 1.5 cm in greatest dimension’). Two readers assigned a diffusion-weighted imaging (DWI) category of 2 (‘Indistinct hypointense on ADC’), fifteen readers a DWI category of 3 (‘Focal mildly/moderately hypointense on ADC and isointense/mildly hyperintense on high b value DWI’) and three readers a DWI category of 4 (‘Focal markedly hypointense on ADC and markedly hyperintense on high b value DWI < 1.5 cm on axial images’). Seventeen readers judged the lesion as positive at dynamic contrast-enhanced (DCE) imaging (‘Focal, AND earlier than or contemporaneously with enhancement of adjacent tissues, AND corresponds to suspicious findings on T2WI and/or DWI’), and four readers judged it as negative (‘No early enhancement, OR diffuse enhancement not corresponding to a focal finding on T2W and/or DWI, OR focal enhancement corresponding to a lesion demonstrating features of BPH on T2W’). The final PI-RADSv2 score was 2 for three readers, 3 for four readers and 4 for fourteen readers.
Using PI-RADSv2.1 descriptors, the assignment of T2WI categories was the same as with PI-RADSv2 since the descriptors are identical. Fifteen readers assigned a DWI category of 2 (‘Linear/wedge-shaped hypointense on ADC and/or linear/wedge-shaped hyperintense on high b value DWI’), four readers a DWI category of 3 (‘Focal (discrete and different from the background) hypointense on ADC and/or focal hyperintense on high b value DWI; may be markedly hypointense on ADC or markedly hyperintense on high b value DWI but not both’) and two readers a DWI category of 4 (‘Focal markedly hypointense on ADC and markedly hyperintense on high b value DWI < 1.5 cm on axial images’). Sixteen readers judged the lesion as positive at DCE imaging (‘Focal, AND earlier than or contemporaneously with enhancement of adjacent tissues, AND corresponds to suspicious findings on T2W and/or DWI’) and five as negative (‘No early or contemporaneous enhancement, OR diffuse multifocal enhancement NOT corresponding to a focal finding on T2W and/or DWI, OR focal enhancement corresponding to a lesion demonstrating features of BPH on T2W, including features of extruded BPH in the PZ’). The final PI-RADSv2.1 score was 2 for sixteen readers, 3 for one reader and 4 for four readers. Systematic and targeted biopsy showed normal prostate tissue, with mild inflammation in the left base. Fifty-six months later, the patient had a PSA level of 6 ng/mL and had not undergone another prostate biopsy.

Per-patient analysis showed concordant results (Additional file 1: VII).

Follow-up

Of the 96 patients without csPCa at initial biopsy, 7 with an ISUP 1 cancer received immediate radical treatment. During a median follow-up of 51 months (IQR, 45–55), 7 of the 88 remaining patients were diagnosed with an ISUP 2 cancer and none with an ISUP ≥ 3 cancer.

Discussion

To specifically evaluate the characterizing value of the PI-RADSv2/v2.1 descriptors, we asked the readers to score the exact same corpus of lesions. To be clinically meaningful, this corpus had to include lesions with a large range of degrees of suspicion. Therefore, we selected consecutive patients who underwent MRI and biopsy at our institution in 2015–2016. At that time, our biopsy policy required targeting all focal lesions, even those with a low degree of suspicion. Biopsy operators could omit systematic biopsy in PZ sextants that had targeted biopsy, which allowed targeting several lesions without unreasonably increasing the number of cores taken. Hence, 92.5% (147/159) of the study patients underwent targeted biopsy while the csPCa prevalence was only 39.6% and 33% at patient and lesion level, respectively. Furthermore, in accordance with the recommendations of the time [29], MRI was not used to select patients for biopsy but only to indicate the lesions to target, which limited selection bias.

This set of predefined lesions was first used to assess inter-reader agreement on lesion size and location. Agreement on size was excellent (CCC ≥ 0.80). The overall agreement on lesion location (PZ, TZ or CZ) was moderate-to-good (κ = 0.60–0.73). Only 59% (142/240) of the predefined lesions were localized in the same zone by all readers. This is problematic since PZ and TZ lesions are scored differently, using different dominant sequences. Additionally, CZ lesions are also assessed differently, at least using PI-RADSv2.1 descriptors [10]. Thus, any variability in lesion location can have major consequences on the final scoring agreement. Variability in lesion location can be explained by two main factors. First, due to the lack of well-defined anatomical landmarks between CZ and PZ, the number of lesions localized in CZ was highly variable from one reader to another. Second, partial volume effects in some locations (e.g. anterior horn of the PZ, extreme apex) made it difficult to distinguish between PZ lesions and TZ nodules protruding into the PZ. 3D T2w acquisitions with multiplanar reformations might facilitate lesion location by reducing partial volume effects. Unfortunately, in this study, readers only had access to 2D T2w axial and sagittal imaging.

Like others [30], we found that experienced seniors performed significantly better, mostly because they assigned lower scores to non-csPCa lesions. However, the impact of experience on inter-reader agreement was small and agreement remained moderate at best, even for experienced seniors. This is discordant with another study in which inter-reader agreement was substantial and better between dedicated uro-radiologists than between non-dedicated radiologists. However, in that study, all radiologists were from the same institution, which may have reduced interpretation variability, particularly among dedicated radiologists [15]. Taken together, our results suggest that, despite continuous efforts of standardization and clarification, most PI-RADS descriptors remain subjective. Distinguishing ‘marked’ from ‘non-marked’ abnormalities, ‘encapsulated’ from ‘mostly encapsulated’ nodules, or ‘focal’ from ‘non-focal’ enhancement is subjective but has a major effect on the final score. Interestingly, for PI-RADSv2.1 and PI-RADSv2, and for all groups of readers, κ values tended to be higher for T2-weighted and diffusion-weighted categories than for DCE categories. Although this finding should be interpreted with care since not all pulse sequences have the same number of categories, it may suggest that visually distinguishing positive from negative cases is difficult at DCE, especially when enhancement differs only subtly from the background.

Several solutions for improving MRI reproducibility can be suggested. Mentoring through systematic double reading with an experienced reader could probably accelerate the training of beginners, but this is made difficult by the heavy workload of radiologists [31]. Using quantitative thresholds for apparent diffusion coefficient or DCE-derived parameters may also improve prostate MRI accuracy and inter-reader agreement [16, 32,33,34], but there is still progress to be made on the reproducibility of MRI biomarkers [35,36,37,38]. Finally, assistance by artificial intelligence algorithms may facilitate prostate MRI reading in the future; however, conflicting results have been recently published on this matter [39,40,41,42,43,44,45].

Our sample size was not designed to statistically compare PI-RADSv2.1 and PI-RADSv2 performances, because the difference was expected to be small. A meaningful comparison would have required an unrealistic number of patients. Yet, the strict application of PI-RADSv2.1 descriptors in predefined lesions tended to yield lower scores in non-csPCa lesions as compared to PI-RADSv2 descriptors. This was mainly observed in PZ lesions for which the PI-RADSv2.1 clarifications on Dw imaging categories 2, 3 and 4 seem to have favoured better characterization. However, this effect was too small and too heterogeneous across readers to induce a substantial difference between the AUCs of the two scores. Additionally, PI-RADSv2.1 clarifications did not improve inter-reader agreement.

After assessing the predefined lesions, readers were allowed to describe additional suspicious lesions. This was designed to evaluate whether the new PI-RADSv2.1 upgrading rules in TZ increased the number of suspicious lesions as compared to PI-RADSv2. In accordance with other studies [12,13,14, 18], we found that such upgradings were rare. As a result, the per-lobe analysis, which included predefined and additional lesions, showed results similar to those of the per-lesion analysis: experienced seniors outperformed the two other groups of readers, and, in all groups of readers, PI-RADSv2.1 showed a trend toward improved specificity as compared to PI-RADSv2. Of note, the number of additional lesions was highly variable across readers, with juniors tending to describe more lesions than seniors.

In this study, experienced readers were a priori defined as having more than 5 years of experience. A recent European consensus suggested that a minimum of 1000 cases should be read to become an expert [31]. All our experienced seniors fulfilled that condition, and our results are in line with those of the European consensus.

Readers assessed PI-RADSv2 and PI-RADSv2.1 during the same session. This may have resulted in underestimating the differences between the scores. However, independent scoring is illusory; most readers were familiar with the PI-RADSv2 descriptors and would have kept them in mind when using the new PI-RADSv2.1 criteria. In addition, assigning the scores in two different sessions introduces intra-reader variability, which may be substantial [46, 47]. Because reading the cases needed approximately 15–20 h, we were also afraid that the second reading would be biased by fatigue and the gradual lack of involvement of the readers. Thus, we chose to ask the readers to concentrate, during the same reading session, on the assessment of each pulse sequence category by following as closely as possible the written PI-RADSv2 and PI-RADSv2.1 descriptors without minding the overall score that was calculated automatically.

Our study has limitations. Firstly, because we indicated the predefined lesions to the readers, the AUCs obtained herein do not fully assess the diagnostic performance of the PI-RADS score in clinical routine. The detection phase, which is also a source of interpretation variability, was outside the scope of this study. However, many other studies have already assessed the overall performance of the PI-RADS score [23,24,25]. Instead, we wanted to specifically evaluate whether the PI-RADS descriptors were specific enough to induce reproducible characterization of the same lesion across multiple readers. This allowed the evaluation of factors of variability (size, location, PI-RADS categories of each pulse sequence) that, to our knowledge, had not been studied before. Secondly, prostate biopsy, used as reference standard, may have missed some csPCas. However, the small proportion of aggressive cancers detected during follow-up suggests that the sensitivity of our biopsy technique was good. Thirdly, we included only biopsy-naïve patients. Our results may not be valid for other populations.

In conclusion, when assessing the same set of MRI lesions using PI-RADSv2.1 and PI-RADSv2 descriptors, experienced seniors performed significantly better in characterizing csPCa than the other groups of readers. PI-RADSv2.1 descriptors tended to be more specific than PI-RADSv2 descriptors, but did not improve inter-reader agreement.