The learning curve in bladder MRI using VI-RADS assessment score during an interactive dedicated training program

Objective The purpose of the study was to evaluate the effect of an interactive training program on the learning curve of radiology residents for bladder MRI interpretation using the VI-RADS score. Methods Three radiology residents with minimal experience in bladder MRI served as readers. They blindly evaluated 200 studies divided into 4 subsets of 50 cases over a 3-month period. After 2 months, the first subset was reassessed, resulting in a total of 250 evaluations. An interactive training program was provided and included educational lessons and case-based practice. The learning curve was constructed by plotting mean agreement as the ratio of correct evaluations per batch. Inter-reader agreement and diagnostic performance analysis were performed with kappa statistics and ROC analysis. Results As for the VI-RADS scoring agreement, the kappa differences between pre-training and post-training evaluation of the same group of cases were 0.555 to 0.852 for reader 1, 0.522 to 0.695 for reader 2, and 0.481 to 0.794 for reader 3. Using VI-RADS ≥ 3 as cut-off for muscle invasion, sensitivity ranged from 84 to 89% and specificity from 91 to 94%, while the AUCs from 0.89 (95% CI:0.84, 0.94) to 0.90 (95% CI:0.86, 0.95). Mean evaluation time decreased from 5.21 ± 1.12 to 3.52 ± 0.69 min in subsets 1 and 5. Mean grade of confidence improved from 3.31 ± 0.93 to 4.21 ± 0.69, in subsets 1 and 5. Conclusion An interactive dedicated education program on bladder MRI and the VI-RADS score led to a significant increase in readers’ diagnostic performance over time, with a general improvement observed after 100–150 cases. Key Points • After the first educational lesson and 100 cases were interpreted, the concordance on VI-RADS scoring between the residents and the experienced radiologist was significantly higher. • An increase in the grade of confidence was experienced after 100 cases. • We found a decrease in the evaluation time after 150 cases. Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-08766-8.


Introduction
Bladder cancer (BCa) is the 10 th most common malignancy, with an approximately 550,000 new cases and 200,000 deaths worldwide [1], and has the highest lifetime economic burden per patient of all tumors, mainly due to hospital care-related costs [2].
In recent years, magnetic resonance imaging (MRI) has proven to be a reliable and accurate tool for BCa diagnosis and staging. In particular, the Vesical Imaging-Reporting and Data System (VI-RADS) score [3] was developed to provide a systematic and standardized approach in the acquisition, interpretation, and reporting of bladder MRI to differentiate muscle-invasive from non-muscle-invasive bladder cancer, aiming, among others, to reduce the heterogeneity in results between centers. Interestingly, beyond a high diagnostic performance, the VI-RADS assessment scoring showed a substantial inter-rater agreement among both experts and inexperienced radiologists [4][5][6][7][8][9][10]. As technology advances rapidly in radiology, medical schools and residency programs must adopt new methods of learning in order to implement substantial changes to the radiology curriculum delivery. Indeed, a strong emphasis should be placed upon reader education and experience in oncologic imaging. Up to now, numerous studies on MR imaging of the prostate showed variable diagnostic performance related to reader expertise [11]. Despite this, the optimal training curriculum for residents and fellows remains unclear. Learning curves for prostate MRI interpretation over time after dedicated reader training have not been extensively studied; nonetheless, an overall improvement in tumor detection accuracy was found after training [12][13][14][15][16][17][18][19][20][21][22][23]. To date, in contrast to prostate MRI, there have been no studies investigating the role of a learning program on the accuracy of bladder cancer staging using MRI and the VI-RADS score.
The objective of this study was to determine the learning curve of radiology residents in interpreting bladder MRI using the VI-RADS score during an interactive dedicated training program.

Patient population and study design
This observational study was approved by our Institutional Review Board and the Ethical Committee. All patients were prospectively enrolled and were notified of the investigational nature of this study and gave their written informed consent. The study was conformed to the guidelines for good clinical practice in agreement with the ethical principles set forth in the latest version of the Declaration of Helsinki. Institutional data from 200 consecutive patients who underwent bladder MR imaging between January 2018 and July 2021 were analyzed. Patients who received pre-operative systemic treatment for muscle-invasive bladder cancer (MIBC) before imaging were excluded. Study design flowchart is presented in Fig. 1.

MR imaging technique
All patients underwent the same MRI protocol using a 3-T scanner (General Electric Healthcare, Discovery 750); a small proportion of exams (3.5%) was acquired with a 1.5-T scanner (Siemens, Avanto) due to incompatibilities with medical devices (e.g., pacemakers) at 3-T magnetic field. The acquisition protocol, as per VI-RADS guidelines [3], included morphological multiplanar T2-weighted imaging (T2WI), on three planes (axial, coronal, and sagittal), according to lesion location; diffusion-weighted imaging (DWI) acquired in the axial plane with multiple b values (b = 100-800-2000) that were used to generate the apparent diffusion coefficient (ADC) map; dynamic contrast-enhanced (DCE) images were acquired in the axial plane with fat-suppression (3D T1 gradient echo), before and after gadolinium-based contrast media injection, with a temporal resolution of 8 s. An intramuscular antispasmodic agent was administered, if necessary, to reduce bladder wall motion artifacts, and patients were instructed to drink 500-1000 mL of water 30 min before the examination to obtain adequate bladder distension.

Readers, imaging interpretation, and reference standard
Three radiology residents from two different institutions, at the first (M.L.P., A.D.) and fourth year (M.C.S.) of training, with a general understanding of body MRI but minimal experience in bladder MRI interpretation (less than 20 cases in total), served as readers. They independently evaluated 200 consecutive studies in four equal subsets (batches) of 50 cases over a 3-month period. After a 2-month memory extinction period, the first subset was re-evaluated, resulting in a total of 250 assessments. All examinations were anonymized, and the readers were blinded to clinical information including patients presenting signs and symptoms, previous imaging studies, history of cystoscopy, and tumor biopsy.
Image analysis was performed using the VI-RADS assessment score, by evaluating first the morphological T2W images, then the DWI/ADC map sequence, and finally the DCE images. Whenever more than one lesion was identified, the overall VI-RADS score provided corresponded to either the lesion with the highest VI-RADS category or, in cases where the scores were equal, the lesion with the largest size. Additionally, the evaluation time, the grade of confidence, and the image quality were recorded. Evaluation time corresponded to the period spent in interpreting the exam and the thinking process to formulate the score rounded to the nearest minute. Grade of confidence was documented according to a 5-point Likert scale (Supplementary Table 1). Also, readers were asked to rate image quality on a 3-point scale according to T2WI, DWI, and DCE diagnostic quality standards based on an institutional-specific algorithm (Supplementary Table 2). Interpretation by an expert urogenital radiologist with 10 years of bladder MRI experience (V.P.) and a cumulative reading experience of 500 bladder MRIs using VI-RADS score, was considered the reference standard (RS).

Educational interventions in the dedicated training program
The dedicated training program had a duration of 3 months. Prior to the reading of the first subset of cases, the readers were given an overview lecture on bladder MRI, followed by a review and interpretation of five cases from the Picture Archive and Communication System (PACS). Additional studying material was provided to the readers and included recent and insightful articles on these topics. During the period between batches 1 and 2, a more experienced resident provided an educational lecture on the VI-RADS assessment score, with a focus on the assessment of each scoring category. An interactive session involving revision of cases, that the readers found challenging in the first groups and/or had discordant scoring, was provided between batches 2 and 3 by a resident with four years of experience in urogenital MRI. Finally, a bladder MRI expert provided an advanced bladder cancer imaging presentation between the last two batches, with radiologic and pathologic correlation. Questions were encouraged during every lesson to aid readers' learning. After recording the case scoring for each batch, the readers were shown the reference standard reports, in order to review the cases incorrectly evaluated and further improve their learning process.

Statistical analysis
Descriptive statistics were used to summarize overall, perbatch and per-reader VI-RADS score category assignment, grade of confidence (GC), evaluation time (ET) and image quality (IQ) using number and percentages (VI-RADS score), and the mean and standard deviation (GC, ET, and IQ). By plotting mean agreement as a ratio of correct evaluations per batch, a learning curve was constructed for the three readers. Inter-reader agreement analysis was performed with Cohen's kappa statistics, for each VI-RADS score by each of the three readers pairs. The diagnostic performance of each reader was assessed by means of receiver operating characteristic curve analysis. The overall and per-batch AUC values were obtained based upon single VI-RADS scoring and a conversion of VI-RADS to above or below a cut-off of 3 (cases were split into < 3 vs ≥ 3) for the evaluation of muscle-invasiveness. Overall and per-batch sensitivity and specificity were calculated to assess the performance of each reader per batch. p < 0.05 was considered to indicate a significant difference for all hypothesis tests. Analyses were performed using SPSS, version 27 (IBM).

Learning curve
The learning curve over time on VI-RADS scoring, for the three readers, is presented in Fig. 2. The mean ratio of concordant VI-RADS scoring between readers and the reference standard improved steeply from 65% in batch 1 to 82% in batch 2. In subsequent batches, the case evaluation showed slight improvement, up to 87% seen in batch 5. The same increasing trend was noted regarding the kappa coefficient for the VI-RADS agreement between readers and the reference standard ( Table 2). As such, for subsets 1 and 4 respectively, the k was 0.555 and 0.739 for reader 1, 0.522 and 0.712 for reader 2, and 0.481 and 0.737 for reader 3. The k coefficient differences between pre-training evaluation (subset 1) and post-training evaluation of the same group of patients (subset 5) were respectively 0.555 to 0.852 for reader 1, 0.522 to 0.695 for reader 2, and 0.481 to 0.794 for reader 3.

Diagnostic accuracy
The performance of each reader in scoring single VI-RADS assessment categories, for the detection of bladder cancer muscle invasiveness (using a VI-RADS ≥ 3 as cut-off), and for the identification of absence of lesions was measured based on overall evaluations (Table 3) and on a per-batch basis (Supplementary Tables 3-9). Using VI-RADS ≥ 3 as cut-off, the sensitivity ranged from 84 to 89% and the specificity from 91 to 94%, across the three readers. The AUCs ranged from 0.89 (95% CI: 0.84, 0.94) to 0.90 (95% CI: 0.86, 0.95). Fig. 3 shows the receiver operating characteristic curves for the three readers on the detection of muscleinvasive bladder cancer.

Evaluation time
The mean reader evaluation time decreased as subsequent batches were assessed from 5.21 ± 1.12 min in subset 1 to 3.52 ± 0.69 min in subset 5 (Table 4 and Fig. 4). A statistically significant reduction was found in mean evaluation time between subsets 1 and 4/5, between subsets 1 and 2, and between subsets 2 and 3 (p < 0.001), not between subsets 3 and 4 (p = 0.32), nor 4 and 5 (p = 0.45). For reader 1, no significant change in evaluation time of any consecutive subsets was shown (p ≥ 0.106). For reader 2, a  statistically significant decrease between subsets 1 and 2 and subsets 2 and 3 (p < 0.001), as well as between subsets 3 and 4 (p = 0.03) was found, while no difference was noted between subsets 4 and 5 (p = 0.656). For reader 3, a statistically significant reduction between subsets 1 and 2 and subsets 2 and 3 (p < 0.001) was found, despite having no differences between subsets 3 and 4 (p = 0.258), nor 4 and 5 (p = 0.93).

Grade of confidence
Mean grade of confidence improved as subsequent batches were assessed from 3.31 ± 0.93 in subset 1 to 4.21 ± 0.69 in subset 5 (Table 5 and Fig. 5). A statistically significant increase was found in mean grade of confidence between subsets 1 and 4/5 and subsets 1 and 2 (p < 0.001). Mean grade of confidence was not different between the remaining consecutive subsets (p ≥ 0.294). For reader 1, no significant difference was found between any subsets (p ≥ 0.58). For readers 2 and 3, a significant increase in grade of confidence was found between subsets 1 and 2 (p < 0.001; p = 0,044; respectively), with no differences between other consecutive subsets (p ≥ 0.216; p ≥ 0,438; respectively).

Image quality
When the image quality was minimal (IQ1), the overall VI-RADS score agreement between readers and the reference standard was moderate (k = 0.503 for reader 1; k = 0.508 for reader 2; k = 0.603 for reader 3; p < 0.001). When the image quality was scored as optimal (IQ3), the overall VI-RADS score agreement was substantial, with (k = 0.739 for reader 1; k = 0.726 for reader 2; k = 0.713 for reader 3; p < 0.001).
The results on k statistics and image quality are shown in Table 6. Summary on overall quality assessment scoring is provided in Supplementary Table 10. Figure 6 illustrates an example of case that was incorrectly classified during batch 1 and correctly scored during batch 5.

Discussion
Different studies have shown that readers' education and training are key factors in oncologic imaging and for training future radiologists [11]. Despite this, no evidence exists on the effect of an interactive learning program on the reader performance in bladder MRI and the VI-RADS score. The purpose of this study was to assess the learning curve and to determine how three radiology residents performed when interpreting bladder MRI using VI-RADS assessment scoring as part of a dedicated interactive training program, in which 200 cases of bladder MRI were divided into four sets. In between the  subsets, frontal lessons, dedicated case-review, and tutoring sessions were provided, followed by a final re-assessment of the first subset of cases (batch 5).
We observed a significant increase in concordance between the VI-RADS scoring of the residents, compared to the experienced radiologist, after the first two batches of training (100 cases) showing a steep improvement from 65 to 82% followed by a plateau; reader 2 experience a drop in improvement in batch 3, probably due to the high number of cases in agreement with the RS in batch 2; by revising batch 2 cases for reader 2 we noticed that the number of cases scored as low quality was very low (n = 5), which might explain such high agreement ratio. The steep improvement in the bladder MRI outcomes was likely linked to the educational intervention that focused on providing general information on VI-RADS assessment scoring.
In the agreement analysis, this trend was also observed when looking at k coefficient that improved from a mean of 0.519 in batch 1 to 0.801 in batch 5. These results might suggest that providing an overview on the VI-RADS criteria combined with a sample number of cases (100-150) might lead to acceptable results, highlighting the strength of this reporting and data system for a standardized approach to bladder MRI interpretation. This is in line with the ESUR/ESUI consensus statement establishing 100 supervised cases as the minimum number of prostate MRI before independent reporting can be performed for clinically significant prostate cancer detection [24].
Despite the slower and lower trends in subsequent subsets, the learning curve of the residents continued to rise, illustrating the need for more advanced and prolonged training. Differing from our experience, Rosenkrantz et al did not report a significant improvement in interpreting prostate MRI using PI-RADS v. 2.0 in the group of readers receiving continual feedback [12]. A more recent study found that online courses significantly improved the sensitivity in detecting prostate cancer on MRI using PI-RADS score [25].
To what regards evaluation time, a particularly relevant topic for today's heavily loaded radiology departments, the two residents at the first year of training demonstrated a significant decrease in mean interpretation time after the first 150 cases (mean overall ET: 5.21 min in subset 1 to 3.52 min in subset 5; p < 0.001). A similar outcome was observed in another study where authors found that the mean reader evaluation time decreased significantly from 95.2-99.0 s in subsets 1-2 to 66.1-65.8 s in subsets 3-4 (p < 0.001), when   [12]. However in our study, as previously mentioned, this trend was only observed for the firstyear residents, indicating that general exposure to MR imaging may lead to shorter assessment periods, regardless of previous exposure to bladder MRI. As such, the fourth-year resident did not show differing mean timeframes during the training program, which suggests no association between specific bladder MRI training and reduced evaluation time.
Considering the grade of confidence of the readers, our results led to the same conclusions of Garcia-Reyes et al who found significant improvements between pre-and posteducation evaluations of prostate MRIs (3.75 to 4.22 on a scale from 1 to 5) [14]. In our study, mean overall confidence ranged from 3.31 in subset 1 to 4.21 in subset 5 (5-point scale). Confidence in reporting, specifically in assessing the likelihood of tumor invasion of the muscularis propria, is of utmost importance as it can dramatically change and guide patients' management. We point out that throughout the study, for both the per-subset and the overall results, the percentage of VI-RADS score 3 (equivocal cases) was lower than or equal to 11%, which may differentiate this system from other -RADS [26] in which the number of assigned indeterminate cases is higher, having a strong clinical impact in bladder cancer imaging.
Even though it is not directly related to the effects of a training program, image quality clearly influences the diagnostic performance of radiologists. This was confirmed in our study in which we found that the agreement between the residents and the experienced radiologist was higher when the images were perceived as high-quality exams. This might have influenced the outcome of the learning process as the mean overall k coefficient increased from 0.547 (low IQ) to 0.726 (high IQ). We hypothesize that expertise might be proven advantageous particularly in poor quality exams, but this thesis warrants further research. These findings have several implications regarding trainee education in bladder MRI interpretation. This study demonstrated the risk of interpreting bladder MRIs without prior experience or training, which is why we do not recommend reading exams without basic knowledge of MRIs and specifically of bladder MRI. Efforts should be made to guarantee that every radiologist reading bladder MRI has an acceptable number of interpreted exams (100 cases according to our results) along with a proper understanding of the VI-RADS criteria. Radiology curricula could be improved by including training on bladder MRI, aiding to a standardized management using the VI-RADS criteria, and leading to a more value-based service to patients with bladder cancer.
The following limitations are acknowledged: first, we did not investigate the pathology corroboration of the results; second, some heterogeneity was observed in-between batches as we included subsequent patients and did not evenly distribute incorrectly scored as a VI-RADS 3 and 4 by the inexperienced readers during the first interpretation batch, probably due to the non-optimal quality acquisition; however, MRI was correctly scored with an overall VI-RADS 2 in batch 5, given the higher reader experience. T stage after TURBT identified HG-T1 urothelial carcinoma. T2WI, T2-weighted imaging; VI-RADS, Vesical Imaging-Reporting and Data System; DWI, diffusion-weighted imaging; ADC, apparent diffusion coefficient; DCE, dynamic contrast-enhanced; NMIBC, non-muscle-invasive bladder cancer; TURBT, trans-urethral resection of bladder tumor; HG, highgrade a homogenous number of each VI-RADS score throughout the groups of patients; third, the MR images were acquired with a highly performing 3 Tesla MRI scan, in a tertiary referral center, which might negatively impact the reproducibility of our findings; fourth, our three subjects received training, and no control group was formed, which might be a source of bias. Finally, the applicability of the proposed training program in trainee daily routine is partly currently limited due to the COVID-19 pandemics. However, most of the activities included in the program could be easily organized on an online learning platform.
In conclusion, an interactive dedicated reader education program on bladder MRI and the VI-RADS score was associated with a significant increase in readers' diagnostic performance over time. A general improvement was observed after 100-150 cases, which might be proposed as a cut-off to reach learning programs. These findings may represent a useful experience to improve and shape future fellowship programs and radiology curricula.
Funding Open access funding provided by Università degli Studi di Roma La Sapienza within the CRUI-CARE Agreement. The authors state that this work has not received any funding.

Declarations
Ethics approval Institutional Review Board approval was obtained.
Informed consent Written informed consent was obtained from all subjects (patients) in this study.

Conflict of interest
The authors of this manuscript declare no relationships with any companies, whose products or services may be related to the subject matter of the article.
Guarantor The scientific guarantor of this publication is Valeria Panebianco.
Statistics and biometry No complex statistical methods were necessary for this paper.

• observational • performed at one institution
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.