Introduction

Contrast-enhanced computed tomography (CT) is considered the imaging standard for the evaluation of renal cysts. Since its introduction, the Bosniak classification for cystic renal masses has found widespread acceptance. This is due to its simple structure, with a low number of diagnostic categories, each of them associated with a suggestion for clinical management [1, 2] (Table 1). Bosniak category I and category II lesions are simple and minimally complex cysts and require no further work-up. A Bosniak category III lesion is an indeterminate complex cyst with an increased probability of malignancy ranging from 31 % to 100 %. For these cysts, the usual management is surgery or, in selected cases, radiological follow-up [2]. Bosniak IV cysts have clearly malignant features and surgical therapy is recommended. In order to decrease unnecessary surgical interventions, a fifth category, Bosniak IIF, has been introduced. This category is a modification of the initial Bosniak classification and describes a group of minimally complex cystic lesions, separate from Bosniak II and III, for which short-term (3–6 months) imaging surveillance is recommended [3] (Table 1).

Table 1 The Bosniak classification for evaluation of renal cysts [2]

While in cancer diagnosis a maximum sensitivity is always desirable, false-positive findings may cause serious problems and side effects, especially in vulnerable organs like the kidneys. Evidence-based clinical decision-making requires an assessment of the accumulated empirical evidence. We noted a discrepancy between the broad application of the Bosniak classification in clinical practice and the lack of a systematic review and quantitative data synthesis demonstrating strengths and weaknesses of this clinical decision rule. While the Bosniak classification is clinically established, how accurate a positive or negative result is and whether it may be reproduced using different equipment or readers remains unknown. Through a systematic review and meta-analysis, we aimed to address the rate of malignancy in different Bosniak categories, the Bosniak classification’s diagnostic accuracy and factors that influence malignancy rates and diagnostic performance.

Materials and methods

Search strategy

Two readers independently performed a systematic search of the PubMed and Scopus databases, including articles listed from 1 January 1986 to 18 January 2016. The predefined search term ‘Bosniak’ was used. Titles and abstracts of the search results were screened and the full text of eligible studies was retrieved. Only original, peer-reviewed research articles that investigated the rate of renal cyst malignancy in adult human subjects imaged by CT and classified according to the Bosniak classification were eligible for this study. Additional backward snowballing was performed by scanning the references of retrieved articles for further studies [4].

Study selection

Both reviewers independently screened all identified records for eligibility. A third arbitrator resolved any disagreement. If the title and abstract did not provide sufficient information, the full text was retrieved. Included were articles fulfilling the following conditions: (1) a reference standard had to be established either by histopathology workup or imaging follow-up; (2) eligible studies had to be published in English and (3) eligible studies had to include at least 15 patients. Study quality was assessed using the QUADAS 2 tool. The selection process by which the included studies were derived for data extraction is shown in Fig. 1.

Fig. 1 Flowchart showing the study selection process

Data extraction

Two readers (one board-certified radiologist and one board-certified urologist with >5 years' experience in renal imaging) performed raw data extraction. Afterwards, the data were checked for discrepancies, which were resolved by consensus. A third reader checked all data extracted by the initial readers and corrected any remaining discrepancies in consensus with them. The following parameters were collected and entered into a spreadsheet: author name; publication year; study design (retrospective vs. prospective); number of patients and lesions; reference standard (histopathology obtained by surgery or image-guided biopsy, or follow-up, including duration of follow-up); number of readers, reader experience and inter-reader agreement; and the technical parameters of CT. Numbers and final diagnosis (benign or malignant) were extracted for all Bosniak categories (I, II, IIF, III and IV). Studies were further classified either as reporting the prevalence of malignancy in certain Bosniak subgroups only or as diagnostic studies (including lesions of Bosniak category I, II or IIF as well as lesions of category III or IV). Both readers applied QUADAS 2 items to assess study quality and likelihood of bias [5]. Disagreements were resolved by consensus; where consensus could not be reached, a third reader acted as arbitrator.

Data synthesis and analysis

Analyses were performed using the software programs Open Meta-Analyst for Mac OS Yosemite 10.10 (http://www.cebm.brown.edu/open_meta) and StataSE 12 (StataCorp, College Station, TX, USA). Raw extracted data from eligible articles were used to construct forest plots of the rate of malignancy in Bosniak categories I–IV; in addition, meta-regression was performed in order to identify a possible influence on the prevalence of malignancy by the factors listed above. In case of positive findings, the forest plot was grouped according to the respective variable identified by meta-regression. For the assessment of heterogeneity, I² statistics were calculated and interpreted in accordance with the proposal of Higgins and Thompson as showing low (I² around 25 %), medium (I² around 50 %) or high (I² around 75 %) heterogeneity [6].
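As an illustration of the quantities named above, the following minimal Python sketch pools per-study malignancy proportions with a DerSimonian-Laird random-effects model and derives I² from Cochran's Q. It is not the Open Meta-Analyst or Stata implementation used for the analysis; the function name `pool_proportions`, the use of untransformed proportions, the Wald-type interval and the handling of zero cells are assumptions made for this example.

```python
import math

def pool_proportions(events, totals):
    """DerSimonian-Laird random-effects pooling of raw proportions, with I^2 in percent."""
    # Per-study proportion and binomial variance. When a study has 0 or n
    # events, 0.5 is added (assumed continuity correction) so that the
    # variance never becomes zero.
    props, variances = [], []
    for e, n in zip(events, totals):
        if e == 0 or e == n:
            e, n = e + 0.5, n + 1.0
        p = e / n
        props.append(p)
        variances.append(p * (1.0 - p) / n)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * pi for wi, pi in zip(w, props)) / sum(w)
    q = sum(wi * (pi - fixed) ** 2 for wi, pi in zip(w, props))
    df = len(props) - 1

    # I^2 = max(0, (Q - df) / Q), expressed in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled estimate and 95 % Wald interval.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * pi for wi, pi in zip(w_re, props)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2
```

Called with hypothetical counts, for example `pool_proportions([1, 30], [50, 50])`, the function returns the pooled proportion, its 95 % confidence interval and I²; two studies with such different malignancy rates yield an I² well above the 75 % threshold for high heterogeneity.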

The diagnostic accuracy of the Bosniak classification for the differentiation between benign and malignant renal cysts was calculated by tabulating results into positive (Bosniak III and IV) versus negative (Bosniak I, II and IIF) diagnostic test results. The reference standard for malignant and benign diagnoses was defined as the final diagnosis confirmed by histopathology and/or follow-up. For calculation of sensitivity and specificity, a diagnostic random-effects model, using the method of DerSimonian and Laird, was used. For calculation purposes, a correction factor of 0.5 was added to zero findings. A summary receiver operating characteristic (ROC) curve was constructed by using a bivariate (maximum likelihood) model. Again, data heterogeneity was assessed by I² statistics.
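The tabulation described above can be sketched in a few lines of Python. The example collapses per-category lesion counts into a 2×2 table and computes sensitivity and specificity for a single study. The function names, the dictionary layout and the convention of adding 0.5 to all four cells whenever any cell is zero are assumptions for illustration; the text only states that a correction factor of 0.5 was added to zero findings.

```python
def two_by_two(malignant, benign):
    """Collapse per-category counts into (TP, FP, FN, TN).

    `malignant` and `benign` map Bosniak category labels to lesion counts,
    with the final diagnosis taken from the reference standard.
    """
    positive = ("III", "IV")        # test-positive categories
    negative = ("I", "II", "IIF")   # test-negative categories
    tp = sum(malignant.get(c, 0) for c in positive)
    fn = sum(malignant.get(c, 0) for c in negative)
    fp = sum(benign.get(c, 0) for c in positive)
    tn = sum(benign.get(c, 0) for c in negative)
    return tp, fp, fn, tn

def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity and specificity; 0.5 is added to every cell if any cell is zero."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical single-study counts, not taken from any included study.
malignant = {"IIF": 1, "III": 10, "IV": 19}
benign = {"I": 5, "II": 10, "IIF": 25, "III": 10}
tp, fp, fn, tn = two_by_two(malignant, benign)
sens, spec = sensitivity_specificity(tp, fp, fn, tn)
```

Pooling these per-study values across studies, and fitting the bivariate model for the summary ROC curve, requires dedicated meta-analysis software and is not shown here.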

Meta-regression was applied to investigate the possible influence of the following variables on sensitivity and specificity: sample size of the respective study; reference standard (histopathology only vs. histopathology and/or follow-up); publication before or after the introduction of the Bosniak IIF category; composition of the benign group (Bosniak IIF only vs. also Bosniak I and/or II); and technical factors, namely slice thickness (grouped as ≤5, ≤10 or not given), number of detector rows (grouped as up to 4, 16, 64 or higher, or not given) and whether any technical information was given at all. P-values of <0.05 were interpreted as indicating a significant result.

Finally, publication bias was assessed by construction of funnel plots and Deeks' test for funnel plot asymmetry.
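For readers unfamiliar with this test, the sketch below outlines its usual formulation: the log diagnostic odds ratio of each study is regressed on the inverse square root of the effective sample size (ESS), weighted by ESS, and a slope different from zero suggests funnel plot asymmetry. This is a simplified illustration and not the Stata routine used for the analysis; the function name and the zero-cell correction are assumptions, and the P-value (from a t distribution with the returned degrees of freedom) is left to a statistics package.

```python
import math

def deeks_asymmetry(studies):
    """Weighted regression of ln(DOR) on 1/sqrt(ESS).

    `studies` is a list of (tp, fp, fn, tn) tuples.
    Returns (slope, standard error of the slope, degrees of freedom).
    """
    if len(studies) < 3:
        raise ValueError("at least three studies are needed")
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in studies:
        # Assumed correction: add 0.5 to all cells if any cell is zero,
        # so that the log odds ratio is always defined.
        if 0 in (tp, fp, fn, tn):
            tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
        n_dis, n_non = tp + fn, fp + tn
        ess = 4.0 * n_dis * n_non / (n_dis + n_non)  # effective sample size
        xs.append(1.0 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))   # ln diagnostic odds ratio
        ws.append(ess)

    # Weighted least squares for a straight line, weights = ESS.
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Standard error of the slope from the weighted residual sum of squares.
    df = len(studies) - 2
    rss = sum(w * (y - intercept - slope * x) ** 2 for w, x, y in zip(ws, xs, ys))
    se = math.sqrt(rss / df / sxx)
    return slope, se, df
```

A positive slope indicates that smaller studies (larger 1/sqrt(ESS)) tend to report higher diagnostic odds ratios, which is the pattern expected under publication bias.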

Results

Study characteristics, bias

Overall, 35 eligible studies were selected (Fig. 1, Tables 2 and 3). In our meta-analysis, a total of 2,557 patients with 2,578 lesions (862 malignant lesions, 33.4 %) were included. QUADAS 2 assessment (Fig. 2) revealed a mixed risk of bias regarding patient selection: a number of studies used histopathology as the only reference standard, patient recruitment was non-consecutive or insufficient details regarding patient recruitment were given. In addition, benign lesions regularly contained only Bosniak IIF or Bosniak II and IIF cysts but no Bosniak I lesions [7–32]. No further risk of bias concerns were raised and all included studies were deemed applicable to answer the research question (Fig. 2). The study designs were described as prospective in one study [19] and retrospective in 33 studies [7–17, 20–41]. In one study [18], the retrospective or prospective character of the study could not be determined. Patient recruitment was consecutive in seven studies [13, 18, 19, 34, 39, 41]. Seven reports described non-consecutive [7, 8, 14, 24, 26, 36, 40] case-control patient recruitment. In another 21 studies, the consecutive or non-consecutive nature of patient recruitment was not clearly stated [9, 10, 12, 15–17, 20, 21, 23, 25, 27–33, 35–38]. Histopathology as a reference standard was used in 13 studies [7, 12, 15–17, 22, 28, 29, 31, 32, 35, 37, 41], follow-up and histopathology in another 21 studies [8–11, 13, 14, 18–21, 23–27, 30, 33, 34, 36, 38, 39], and, in one study only, follow-up was used as the reference standard [40]. Twenty-five of 35 (71.4 %) eligible studies provided technical information on computed tomography [7, 9–11, 13, 14, 16–19, 21, 23–27, 29, 31–33, 35, 36, 38, 39, 41] (Table 2). However, this information was incomplete in the majority of the investigated studies and almost all studies investigated their patients on several devices with varying protocols (Table 2). The number of observers reading CT images (range 1–3 readers) was provided in 21 studies [7–11, 13–15, 17, 19, 22–25, 27, 30, 32, 34, 36–38]. Observer experience (range 2–52 years’ experience) in CT was given in ten studies only [9, 10, 13, 14, 17, 19, 22, 23, 25, 32] (Table 2). Inter-observer variability based on kappa analysis (kappa range 0.571–1) was provided in five studies [13, 19, 23, 30, 34]. Eight studies were carried out before the introduction of the Bosniak IIF category [12, 28, 29, 33–35, 37, 38] and the remaining 27 studies after the introduction of Bosniak IIF [7–11, 13–20, 22–27, 30–32, 36, 39–41].

Table 2 Patient numbers, length of follow-up, CT equipment and reader experience in the included studies
Table 3 Reference standard and key diagnostic parameters extracted from the investigated studies
Fig. 2 QUADAS 2 assessment results

Rate of malignancy in Bosniak categories

The rate of malignancy increased from Bosniak I to IV (Fig. 3 and Supplemental Material Fig. A1–A3). Pooled estimates were 3.2 % (95 % CI 0–6.8) in 89 Bosniak I, 6 % (95 % CI 2.7–9.3) in 261 Bosniak II, 6.7 % (95 % CI 5–8.4) in 818 Bosniak IIF, 55.1 % (95 % CI 45.7–64.5) in 887 Bosniak III and 91 % (95 % CI 87.7–94.2) in 449 Bosniak IV lesions. Malignancy rates did not differ between Bosniak I, II and IIF (P-values I vs. II: 0.309, II vs. IIF: 0.690, I vs. IIF: 0.199); they were higher in Bosniak III than in IIF (P < 0.0001) but lower than in Bosniak IV (P < 0.0001).

Fig. 3 Forest plot of pooled malignancy rates (random effects model) in Bosniak categories

Two Bosniak I cysts were malignant: one an RCC upgraded by ultrasound [35] and one an incidental focal area (0.6 cm) of papillary RCC within a larger cyst [41]. Six studies provided details on benign Bosniak IV lesions: these were either smaller than 2 cm [17], haemorrhagic cysts [20, 35], cystic nephroma [15, 34, 39] or oncocytoma [15, 35], or simple cysts [35].

Meta-regression identified a higher rate of malignancy in Bosniak IIF lesions in studies that used histopathology as the only reference standard (16.6 %, 95 % CI 7.7–25.4), compared to studies that also accepted follow-up examinations as a reference standard (6.3 %, 95 % CI 4.6–8.0). Year of publication was associated with a trend towards higher malignancy rates (P = 0.05). No further influencing factors on the rate of malignancy were identified in any Bosniak category (P > 0.05, respectively). Between-study heterogeneity was low in Bosniak I (I² = 5 %) and IIF (I² = 0 %), medium in Bosniak II (I² = 32 %) and Bosniak IV (I² = 36 %), and high in Bosniak III (I² = 89 %) cysts.

Diagnostic performance of the Bosniak classification

Twenty-six studies provided information about the diagnostic performance of the Bosniak classification by including benign and malignant lesions classified as benign (Bosniak < III) or malignant (Bosniak ≥ III) by imaging [7–13, 17–20, 22, 24, 25, 28–31, 33–39, 41].

The area under the summary ROC (sROC) curve (bivariate model) was calculated as 92 % (95 % CI 89–94; Fig. 4). Overall pooled sensitivity and specificity were 93 % (95 % CI 89–95) and 67 % (95 % CI 59–76). Between-study heterogeneity was high (I² for sensitivity: 68.5 %, I² for specificity: 90.9 %).

Fig. 4 Summary receiver operating characteristic (ROC) curves based on bivariate (maximum likelihood) models for 26 diagnostic studies (left), a subgroup of 12 diagnostic studies with both histopathology and follow-up (FU) as standards of reference (SOR) (middle), and nine diagnostic studies using only histopathology as the SOR (right). Note a significantly (P < 0.001) higher area under the ROC curve (AUC) based on a higher (P < 0.001) specificity in studies with both histopathology and FU as the SOR, compared to histopathology only. The summary statistics in the middle most accurately reflect the clinical application of the Bosniak classification

A subgroup analysis in diagnostic studies that included histopathology only as the standard of reference, and non-selected Bosniak categories [7, 12, 17, 22, 28, 29, 35, 37, 41], revealed a lower sROC AUC of 0.86 (95 % CI 0.83–0.89). A higher AUC of 0.94 (95 % CI 0.91–0.96) was found in 12 diagnostic studies that included histopathology and follow-up as the standards of reference [9, 11, 13, 18–20, 25, 33, 34, 36, 38, 39]. This group reflects the clinical setting most accurately, and a pooled (bivariate model) sensitivity of 97 % (95 % CI 86–99, I² = 70.7 %) and a specificity of 74 % (95 % CI 64–82, I² = 77.2 %) were calculated.

Meta-regression (random effects model) identified a lower sensitivity in studies that included only Bosniak IIF lesions as benign (meta-regression coefficient -0.76, 95 % CI -1.39 to -0.13, P = 0.018). Further, meta-regression demonstrated a significantly higher specificity in studies that used histopathology and follow-up as the reference standard as compared to studies with histopathology as the only reference standard (meta-regression coefficient 0.92, 95 % CI 0.24–1.59, P = 0.008). Technical factors including slice thickness, detector rows or whether there was any technical information given at all did not show a significant influence on either sensitivity or specificity (P > 0.05, respectively). In addition, year of publication was not associated with these diagnostic performance indices (P > 0.05, respectively). No evidence of publication bias was found (Deeks' test P = 0.61; Supplemental Material Fig. B).

Discussion

Our results demonstrate an increasing malignancy rate from Bosniak I to IV categories. Between-study heterogeneity ranged from low to high, with the highest value observed in Bosniak III lesions. Bosniak III cysts are defined as complex cysts and must be differentiated from minimally complex cysts (IIF) that can be managed with only follow-up. We did not identify any explanatory variable for the observed heterogeneity in Bosniak III lesions. However, two factors very likely contributing to this heterogeneity, namely, reader experience and spatial resolution, were insufficiently reported in the majority of included studies. Bosniak IIF cysts were more likely to be malignant if the study considered only histopathology as the standard of reference. As the rate of true-negative findings, and, subsequently, the malignancy rate, depends on whether clinically benign findings that are not subject to histopathological sampling are considered, a study design-related selection bias did appear to be present. The low to medium heterogeneity in Bosniak I, II and IV categories that was accompanied by a low (Bosniak I, II) or high (Bosniak IV) malignancy rate strongly suggests that the limitations of the Bosniak classification lie in a less-than-optimal grading of lesion complexity using the Bosniak IIF and III categories. This is underlined by the fact that the introduction of the Bosniak IIF category did not significantly affect the overall diagnostic performance of the Bosniak classification. In addition, the low but not very low malignancy rates of Bosniak II and IIF categories did not differ. These findings seem to suggest that Bosniak II lesions should be followed up in a similar way to IIF lesions. Again, selection bias might lead to an overestimation of malignancy rates in these lesions.

Overall, the Bosniak classification showed a sensitivity of 89.6 % and a specificity of 65.1 %. Meta-regression identified a lower sensitivity in studies that included only Bosniak IIF lesions as benign. This finding was attributed to the higher prevalence of malignancy in Bosniak IIF compared to the Bosniak II and I categories. Consequently, the rate of false-negative findings was higher in this study design than in the case of a non-selected inclusion of all non-surgical Bosniak categories (I, II and IIF). In addition, study design-related specificity was lower in studies that used surgical verification only as the standard of reference. A higher rate of true-negative findings is to be expected when follow-up is used as the reference standard, as true-negative findings without subsequent surgery are not considered in a study that includes histopathologically verified lesions only. In the latter case, specificity is expected to be lower. Consequently, we identified the best diagnostic performance for the Bosniak classification system in those studies most representative of the clinical setting: non-selected lesion inclusion and considering follow-up examination results in addition to histopathological work-up. Here, sensitivity was very high; conversely, the negative likelihood ratio was very low. This leads us to conclude that a negative Bosniak finding (Bosniak category < III) will sufficiently exclude malignancy. However, pooled specificity and positive likelihood ratios were only moderate. As false-positive findings regularly result in unnecessary treatment, or at least invasive diagnostic procedures, further research is needed to improve risk stratification and evidence-based clinical practice guidelines, especially for the management of Bosniak IIF and III findings. Accurate risk stratification would be a prerequisite for the adequate use of active surveillance strategies. However, our systematic review did not provide the data to resolve this issue.
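The relationship between sensitivity, specificity and the likelihood ratios discussed above can be made explicit with a short worked example. The Python sketch below uses the pooled sensitivity (97 %) and specificity (74 %) of the subgroup with histopathology and follow-up as reference standards. The resulting likelihood ratios are simple point calculations from these two pooled values and are not the model-based estimates of the meta-analysis; using the overall proportion of malignant lesions (33.4 %) as the pre-test probability is an assumption made purely for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a dichotomous test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability into a post-test probability via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Pooled values reported for the histopathology-plus-follow-up subgroup.
lr_pos, lr_neg = likelihood_ratios(0.97, 0.74)

# Illustrative pre-test probability: overall share of malignant lesions.
p_after_negative = post_test_probability(0.334, lr_neg)
p_after_positive = post_test_probability(0.334, lr_pos)
```

Under these assumptions, the positive likelihood ratio is about 3.7 and the negative likelihood ratio about 0.04, so a negative finding lowers the probability of malignancy from roughly one in three to about 2 %, whereas a positive finding raises it only to about 65 %.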

The diagnosis of indeterminate cystic renal lesions may be improved by using additional imaging methods, such as contrast-enhanced ultrasound and magnetic resonance imaging (MRI). Contrast-enhanced ultrasound (CEUS) improves diagnosis by detecting fine enhancing septa and tumour vascularity in complex cysts [13, 42–44]. Similar diagnostic improvements can be obtained with MRI, which provides high soft-tissue contrast for the evaluation of septa and solid contrast-enhancing lesion parts. Israel et al. found a similar malignancy rate when comparing CT and MRI in 69 renal lesions. MRI had a tendency to upgrade the lesions: in 18 of 20 malignant lesions, CT and MRI agreed completely with regard to the Bosniak categorization, while MRI upgraded two CT Bosniak III lesions to Bosniak IV [36]. Chen et al. compared CEUS and MRI of complex cystic renal masses and found a higher sensitivity (97.2 % vs. 80.6 %) and accuracy (84.5 % vs. 78.9 %) for CEUS, but a lower specificity (71.4 % vs. 77.1 %) than for MRI [45]. These additional diagnostic tools have shown promising results with regard to lesion characterization. While CEUS is a relatively simple examination, MRI is considered rather time-consuming and expensive. In addition, minimally invasive percutaneous biopsies have a potential role in the management of renal cysts by separating surgical from non-surgical lesions, and the value of this technique is currently under investigation. A detailed discussion of this topic is beyond the scope of this study, but there is a recent and comprehensive systematic review of new modalities for the diagnosis of complex renal cysts published by Ellimoottil and co-workers [46].

Some limitations of our systematic review and meta-analysis warrant discussion. A majority of the studies included in our work provided insufficient data about technical and reading conditions. As most studies recruited their cases over a longer period of time, scanners and protocols were not kept constant. Extractable technical parameters and year of publication (assuming that year of publication and equipment are associated) did not show a significant influence on the diagnostic performance of the Bosniak classification according to our additional meta-regression analysis. Therefore, our analysis did not show a diagnostic impact of improved CT technology using the Bosniak classification for diagnosis of cystic renal lesions. However, as demonstrated in the results, technical influences on reader performance remain a research gap in this field. Further, observer experience and inter-observer variation were largely unexplored. As a consequence, there is an additional research gap regarding the rate of lesions with inconclusive or equivocal findings and the subsequent inability to determine a definitive Bosniak classification. In addition, the majority of studies were retrospective. Although we were able to identify several study design-related influences on malignancy rates and diagnostic parameters, a large amount of between-study heterogeneity remains unexplained. Again, these limitations should be seen as research gaps, highlighting where further research is necessary.

In conclusion, our meta-analysis provides quantitative summaries of malignancy rates in Bosniak categories. Strong heterogeneity in Bosniak IIF and III subgroups indicates the need for further research for improved clinical management of complex renal cysts. Considering studies most appropriately reflecting clinical practice, the Bosniak classification can accurately rule out malignancy, but its specificity remains moderate. The Bosniak classification is an accurate tool with which to stratify the risk of malignancy in renal cystic lesions and is seemingly robust along various protocols and CT scanner generations. Research gaps with regard to the clinical application of the Bosniak classification include a lack of data about reader experience and inter-reader variability, and the diagnostic influence of technical CT parameters.