The diagnostic and economic success of a lung cancer screening program will depend on accurate and reproducible differentiation between high-risk nodules, requiring more intense work-up, and low-risk nodules. Therefore, the ACR developed the Lung-RADS Assessment Categories to support radiologists in their decision-making by standardising lung cancer screening CT reporting and management recommendations . The Lung-RADS categories in their current format are based on visual nodule type classification, manual nodule size and documentation of growth. For both reading tasks substantial interobserver variability has been reported previously [9,10,11,12,13]. The goal of our study was therefore to quantify interobserver variability for Lung-RADS categorisation of low-dose screening CTs and to assess its impact on test performance and subject management. To ensure adequate representation of all Lung-RADS categories we used an enriched study group that included baseline and follow-up CTs.
We found an overall substantial pairwise inter-reader agreement of Lung-RADS categorisation of screening CT scans, which underlines the value of this categorical system in harmonising interpretation and management of screening CTs. Agreement was slightly higher for baseline scans than for follow-up scans (kappa 0.71 versus 0.63). This finding might be explained by the fact that for follow-up the complexity of visual assessment including comparison and determination of nodule growth is higher than for baseline alone.
Variability of Lung-RADS categorisation may refer to the same risk-dominant nodule or to assignment of different nodules as risk-dominant. The first was less common (26%) and most importantly had only very rarely a substantial impact on subject management (0.4%). The latter was seen much more frequently (74%) and led to a management difference (≥ 9 months difference in follow-up time) in 8% of all reading pairs. Interestingly, not only measurement and classification differences were responsible for these discrepancies but apparently also differences in nodule perception given the fact that in the majority of different nodule categorisation the readers selectively annotated their risk-dominant nodule of choice. We did not ask the observers to annotate all nodules detected but left it to their discretion which nodules would be measured. While in the NLST trial, annotation of all nodules larger than 4 mm was requested, no recommendation is made in Lung-RADS concerning this issue. Therefore it remains open to what extent differences in detection or characterisation contributed to the reader variability. Similarly, Pinsky et al. reported earlier that nodule detection and documentation substantially varied between screening radiologists in the NLST trial . Another factor that may have contributed to disagreement is the small separation between Lung-RADS categories. For example, the difference between a Lung-RADS 2 solid nodule and a Lung-RADS 4A solid nodule is only 2.1 mm (5.4 mm is Lung-RADS 2, 7.5 mm is Lung-RADS 4A). This “closeness” of the Lung-RADs categories may also explain why despite relatively frequent disagreements, only a small proportion had effects on patient management.
The primary objective of this observer study was to quantify variability of Lung-RADS categorisation without special focus on actual malignancies. Subjects were randomly included in the study group to ensure a balanced distribution of all Lung-RADS categories, and consequently the number of scans with malignancies diagnosed in the year of the scan was limited (n = 13). The actual histology of these malignancies is unknown to us. Nevertheless though the number of discrepancies with substantial management impact seems not negligible, it has to be underlined that the number of discrepancies was low in these 13 CTs. With the exceptions of four readings, the malignant nodules were categorised as Lung-RADS 4A or 4B, resulting in intensive further diagnostic work-up, and for only two readings the discrepancy resulted in a potential delay of more than 9 months.
All observers read 24 cases prior to reading the study data set in order to become familiar with the Lung-RADS definitions. They also had a printout of the original Lung-RADS assessment rules available during their reading. Nevertheless, wrong assignment of the Lung-RADS category criteria occurred in 6% of all readings and in 8.5 cases on average per observer. Since the main goal of our study was to investigate inter-reader variability as a result of different nodule interpretation, we adjusted incorrect Lung-RADS categories to the observer’s own nodule annotations before data analysis. However, wrong assignment of the Lung-RADS category may turn out to be a problem in practice as well. Computerised tools that automatically assign the correct Lung-RADS category of a scan once the pertinent data of one or more nodules have been entered may therefore prove useful .
Our study has some limitations. Reader experience plays an important role in observer studies. To capture a realistic estimation of the extent of observer variability and its impact on patient management, we included a broad range of observers with and without experience reading actual screening CTs. All readers, however, were well trained in thoracic CT and skilled in interpreting nodules, thus representing radiologists potentially involved in screening in the future. Parts of the observer variability, especially with respect to identification of the risk-dominant nodule, might still be related to lack of experience, suggesting that dedicated training is important, as also articulated by the ACR . Interestingly, no significant differences in agreement were observed between residents and radiologists.
Other limitations are related to the study design. We used an enriched cohort consisting of 160 cases categorised as Lung-RADS category 1/2, 3, 4A and 4B on the basis of our algorithm. We chose this approach to be able to draw meaningful conclusions over the whole spectrum of nodules. This, however, means that our results need to be interpreted in the light of the enriched study group and cannot simply be extrapolated to an unselected screening cohort.
Secondly, the Lung-RADS category 4X was not considered in this study. This category gives radiologists the opportunity to upgrade a Lung-RADS category 3 or 4A nodule to category 4X on the basis of suspicious morphological findings and resulting in intensified possibly invasive diagnostic work-up. In addition to quantitative measures it adds subjective assessment of nodule morphology which we aimed to exclude from our analysis.
Thirdly, no reference standard was available for this data set, since Lung-RADS was not used in the original NLST annotations. As our study focuses on the effect of interobserver variability and its impact on management no reference standard was required.
Fourthly, since we did not ask our readers to annotate all identifiable nodules, we were not able to investigate whether the assignment of different nodules as risk-dominant was caused by an error in detection or an error in characterisation. In future studies, this should be taken into account.
Lastly, we defined a difference in follow-up of at least 9 months as a substantial impact on patient management. However, whether observer variations would have an impact on tumour stage and eventually patient outcome remains open.
In summary, the Lung-RADS Assessment Categories achieved substantial interobserver agreement. Disagreement was mainly caused by assigning a different risk-dominant nodule. In our enriched cohort disagreement led to different follow-up interval of more than 9 months in 8% of all reading pairs with little effect on the diagnosis of the malignancies within this series. The use of (semi-)automatic detection, segmentation and classification tools would likely reduce disagreement amongst readers, but the availability of these tools in clinical practice is still low and they require careful standardisation.