Computer-aided detection of pulmonary nodules: a comparative study using the public LIDC/IDRI database
To benchmark the performance of state-of-the-art computer-aided detection (CAD) of pulmonary nodules using the largest publicly available annotated CT database (LIDC/IDRI), and to show that CAD finds lesions not identified by the LIDC’s four-fold double reading process.
The LIDC/IDRI database contains 888 thoracic CT scans with a section thickness of 2.5 mm or lower. We report performance of two commercial and one academic CAD system. The influence of presence of contrast, section thickness, and reconstruction kernel on CAD performance was assessed. Four radiologists independently analyzed the false positive CAD marks of the best CAD system.
The updated commercial CAD system showed the best performance with a sensitivity of 82 % at an average of 3.1 false positive detections per scan. Forty-five false positive CAD marks were scored as nodules by all four radiologists in our study.
On the largest publicly available reference database for lung nodule detection in chest CT, the updated commercial CAD system locates the vast majority of pulmonary nodules at a low false positive rate. Potential for CAD is substantiated by the fact that it identifies pulmonary nodules that were not marked during the extensive four-fold LIDC annotation process.
• CAD systems should be validated on public, heterogeneous databases.
• The LIDC/IDRI database is an excellent database for benchmarking nodule CAD.
• CAD can identify the majority of pulmonary nodules at a low false positive rate.
• CAD can identify nodules missed by an extensive two-phase annotation process.
Keywords: Computer-assisted diagnosis · Image interpretation, computer-assisted · Lung cancer · Solitary pulmonary nodule · Lung
Abbreviations: LIDC, Lung Image Database Consortium; IDRI, Image Database Resource Initiative
The last two decades have seen substantial research into computer-aided detection (CAD) of pulmonary nodules in thoracic computed tomography (CT) scans [1, 2]. Although many academic and several commercial CAD algorithms have been developed, CAD for lung nodules is still not commonly used in daily clinical practice. Possible explanations are a lack of reimbursement and technical impediments to integration into PACS systems, but also low sensitivity and high false positive rates. The recent positive results of the NLST lung cancer screening trial and the subsequent developments towards implementation of lung cancer screening in the United States [4, 5] have renewed interest in CAD for pulmonary nodules. If lung cancer screening is implemented on a large scale, the burden on radiologists will be substantial, and CAD could play an important role in reducing reading time and thereby improving cost-effectiveness [6, 7].
Following the general demand for open and reproducible science, public databases have been established to facilitate objective measures of CAD performance and to move CAD development to the next level [8, 9, 10]. In 2011, the complete LIDC/IDRI (Lung Image Database Consortium / Image Database Resource Initiative) database was released. This dataset provides by far the largest public resource to assess the performance of algorithms for the detection of pulmonary nodules in thoracic CT scans. A large effort has gone into the collection of annotations on these cases, but CAD was not used to assist the readers.
In this paper, we apply two commercial and one state-of-the-art academic nodule detection system to the LIDC/IDRI database with the aim of setting a first benchmark performance on the full database. To our knowledge, this is the first paper that reports the performance of CAD systems on the full LIDC/IDRI database. We performed an extensive analysis of the performance of the applied CAD systems and make our evaluation publicly available so that other CAD developers can compare with this benchmark. Furthermore, we hypothesize that CAD can find lesions that were not detected in the extensive LIDC annotation process, which consisted of a blinded and unblinded review by four radiologists. To investigate the latter, we evaluated the false positives of the best CAD system using a reading protocol similar to that used in LIDC.
Materials and methods
Manufacturer and scanner model distribution of the 888 CT scans in our dataset (predominantly GE MEDICAL SYSTEMS scanners, including the LightSpeed Pro 16)
Section thickness distribution of the 888 CT scans in our dataset
Distribution of the reconstruction kernels used for the 888 CT scans in our dataset, by manufacturer and reconstruction kernel: GE MEDICAL SYSTEMS (BONE, LUNG, STANDARD), Philips (B, C, D), SIEMENS (B20s, B30f, B31f, B45f, B50f, B70f), TOSHIBA (FC03, FC10)
LIDC/IDRI image annotation
The LIDC/IDRI employed a two-phase image annotation process. In the first phase (the blinded phase), four radiologists independently reviewed all cases. In the second phase (the unblinded phase), all annotations of the other three radiologists were made available, and each radiologist independently reviewed their own marks along with the anonymized marks of their colleagues. Findings were annotated and categorized as nodule≥3 mm, nodule<3 mm, or non-nodule. Non-nodule marks were used to indicate abnormalities in the scan that were not considered a nodule. Using this two-phase process, the LIDC investigators aimed to identify all lung nodules as completely as possible, without forcing consensus among the readers. More details about the annotation process can be found in the original LIDC/IDRI publication. An XML file with the annotations is publicly available for every case.
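The per-case XML files can be read with any standard XML parser. As a rough sketch of how the annotation marks might be tallied per reading session, assuming simplified placeholder element names (not the exact LIDC/IDRI XML schema):

```python
# Sketch: tally marks from a simplified annotation file.
# NOTE: the element names below are illustrative placeholders,
# not the real LIDC/IDRI XML schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<annotations>
  <readingSession>
    <nodule id="N1" diameter="5.2"/>
    <nonNodule id="NN1"/>
  </readingSession>
  <readingSession>
    <nodule id="N1" diameter="4.8"/>
  </readingSession>
</annotations>
"""

def count_marks(xml_text):
    """Return (#reading sessions, #nodule marks, #non-nodule marks)."""
    root = ET.fromstring(xml_text)
    sessions = root.findall("readingSession")
    nodules = root.findall(".//nodule")
    non_nodules = root.findall(".//nonNodule")
    return len(sessions), len(nodules), len(non_nodules)
```

With one reading session per radiologist, counting how many sessions mark the same location yields the agreement levels (1/4 through 4/4) used throughout this study.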
Nodule selection and purpose
In this study, we included all annotations available in the XML files for the 888 scans. The focus of this study was on the nodule≥3 mm group. As a result of the LIDC/IDRI image annotation process, each nodule≥3 mm had been annotated by one, two, three, or four radiologists. In total, the data set of this study included 777 locations that were marked as nodule≥3 mm by all four radiologists. The 777 nodule≥3 mm annotations marked by all four radiologists can be categorized by size as follows: 22 nodules <4 mm, 228 nodules 4–6 mm, 199 nodules 6–8 mm, and 328 nodules >8 mm. The number of nodules per scan ranged from 1 to 8.
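The grouping by reader agreement and size described above amounts to a filter plus a binning step. A minimal sketch with made-up annotation tuples (the boundary handling of the size bins is an assumption, since the paper does not state it):

```python
from collections import Counter

def size_bin(diameter_mm):
    """Assign a diameter to the size categories used in the text.
    Boundary assignment (4 and 6 mm into the lower bin, 8 mm into
    6-8 mm) is an assumption for illustration."""
    if diameter_mm < 4:
        return "<4 mm"
    if diameter_mm <= 6:
        return "4-6 mm"
    if diameter_mm <= 8:
        return "6-8 mm"
    return ">8 mm"

# Hypothetical annotations: (id, #readers marking nodule>=3 mm, diameter in mm).
annotations = [
    ("a", 4, 3.5), ("b", 4, 5.0), ("c", 3, 7.1), ("d", 4, 9.4),
]

# Keep only locations marked by all four readers, then tally size bins.
four_reader = [a for a in annotations if a[1] == 4]
bins = Counter(size_bin(d) for _, _, d in four_reader)
```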
The purpose of this study was twofold. First, we aimed to assess the performance of three state-of-the-art nodule CAD systems. Secondly, we performed an observer experiment to investigate whether CAD can find additional lesions, missed during the extensive LIDC annotation process.
Three CAD systems were used: a commercial CAD system, Visia (MeVis Medical Solutions AG, Bremen, Germany); a commercial prototype CAD system, Herakles (MeVis Medical Solutions AG, Bremen, Germany); and an academic nodule CAD system, ISICAD (Utrecht Medical Center, Utrecht, the Netherlands). ISICAD was the leading academic CAD system in the ANODE09 nodule detection challenge. For all three CAD systems, a list of candidate marks per CT scan was obtained. Each CAD candidate is described by a 3D location. Additionally, Herakles and ISICAD provide a CAD score per candidate. The CAD score is the output of the internal classification scheme of the CAD system and is a measure of the likelihood that a candidate is a pulmonary nodule. An internal threshold on the CAD scores determines which candidates are active CAD marks and, hence, will be shown to the user, and which candidates are not shown. Since different thresholds can be applied to the CAD score, a CAD system can have multiple operating points. A low threshold generates more CAD marks, typically increasing sensitivity at the cost of more false positive detections. A high threshold generates fewer false positives but may reduce the sensitivity of a CAD system. For all three CAD systems, one fixed operating point is internally set, which we will refer to as the system operating point.
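The trade-off controlled by the score threshold can be sketched as follows, using synthetic candidate lists (not output of any of the three systems):

```python
def operating_point(candidates, threshold, n_nodules, n_scans):
    """Sensitivity and FP/scan when candidates scoring >= threshold are shown.

    candidates: (score, is_true_nodule) tuples pooled over all scans.
    n_nodules:  total reference nodules; n_scans: number of scans.
    """
    shown = [c for c in candidates if c[0] >= threshold]
    tp = sum(1 for _, hit in shown if hit)
    fp = sum(1 for _, hit in shown if not hit)
    return tp / n_nodules, fp / n_scans

# Synthetic example: 4 reference nodules in 2 scans, 6 candidate marks.
cands = [(0.9, True), (0.8, False), (0.7, True),
         (0.6, True), (0.4, False), (0.2, False)]
low = operating_point(cands, 0.3, n_nodules=4, n_scans=2)    # lenient threshold
high = operating_point(cands, 0.75, n_nodules=4, n_scans=2)  # strict threshold
```

Lowering the threshold from 0.75 to 0.3 in this toy example raises sensitivity from 1/4 to 3/4 while doubling the false positives per scan, mirroring the trade-off described above.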
The performance of the CAD systems was analyzed on the set of 777 nodules annotated by 4/4 radiologists as nodule≥3 mm. We employed free-response operating characteristic (FROC) analysis, where detection sensitivity is plotted against the average number of false positive detections per scan. Confidence intervals were estimated using bootstrapping with 5,000 iterations. CAD marks on locations that had been annotated as nodule≥3 mm by three or fewer radiologists, as nodule<3 mm, or as non-nodule were counted as false positives. For Visia, no CAD scores were available for the CAD candidates. Consequently, only one operating point, and not a full FROC curve, could be generated for Visia.
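The bootstrap confidence intervals can be sketched as resampling scans with replacement and recomputing sensitivity on each resample; this is a minimal percentile-bootstrap sketch, not the exact FROC software used in the study:

```python
import random

def bootstrap_ci(per_scan_hits, per_scan_nodules, iters=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for sensitivity, resampling scans with replacement.

    per_scan_hits[i] / per_scan_nodules[i]: detected vs. total reference
    nodules on scan i.
    """
    rng = random.Random(seed)
    n = len(per_scan_hits)
    stats = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample scan indices
        hits = sum(per_scan_hits[i] for i in idx)
        total = sum(per_scan_nodules[i] for i in idx)
        stats.append(hits / total if total else 0.0)
    stats.sort()
    return stats[int(alpha / 2 * iters)], stats[int((1 - alpha / 2) * iters) - 1]

lo, hi = bootstrap_ci([1, 2, 0, 3], [1, 2, 1, 3], iters=2000)
```

Resampling at the scan level (rather than the nodule level) preserves the within-scan correlation of detections, which is why it is the standard choice for FROC confidence intervals.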
To gain more insight into which types of nodules were missed by CAD, we examined the characteristics of the false negatives, as scored by the LIDC readers for all nodule≥3 mm findings. We defined subsolid nodules as nodules for which the majority of the radiologists gave a texture score smaller than 5 (1=ground-glass/non-solid, 3=part-solid, 5=solid). Subtle nodules were defined as nodules for which the majority of the radiologists gave a subtlety score smaller than or equal to 3 (1=extremely subtle, 5=obvious).
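These majority-vote definitions translate directly into two small predicates (a sketch; the score lists are hypothetical):

```python
def is_subsolid(texture_scores):
    """True if a majority of readers gave texture < 5
    (1=ground-glass/non-solid, 3=part-solid, 5=solid)."""
    return sum(1 for s in texture_scores if s < 5) > len(texture_scores) / 2

def is_subtle(subtlety_scores):
    """True if a majority of readers gave subtlety <= 3
    (1=extremely subtle, 5=obvious)."""
    return sum(1 for s in subtlety_scores if s <= 3) > len(subtlety_scores) / 2
```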
To assess the robustness of the CAD algorithms, we also evaluated the CAD results on different subsets of the data. The LIDC/IDRI data set is a heterogeneous set of CT scans, and CAD algorithms could conceivably exhibit different performance on different types of data. We analyzed the following factors: (1) presence of contrast material, i.e., non-contrast versus contrast-enhanced scans; (2) section thickness, i.e., cases with section thickness <2 mm versus section thickness ≥2 mm; and (3) reconstruction kernel, i.e., soft/standard versus enhancing/overenhancing kernels.
In order to evaluate whether CAD can find lesions missed during the extensive annotation process of the LIDC/IDRI database, we considered the CAD marks of the best CAD algorithm that were counted as false positives at its system operating point. Two conditions were differentiated. In the first condition, the location of the CAD mark had in fact been marked in the LIDC annotation process, but not by all four readers as nodule≥3 mm, as required for being counted as a true positive; these CAD marks can be subdivided according to the LIDC readings. The second condition comprised those CAD marks with no corresponding LIDC marks at all. The latter CAD marks were independently inspected by four chest radiologists, since these are potentially nodules overlooked by all four LIDC readers. Thus, we mimicked the original LIDC annotation process as though CAD had been included as another independent reader in the first phase of the image annotation process. CAD marks were categorized as nodule≥3 mm, nodule<3 mm, non-nodule, or false positive. Electronic measurement tools were available to measure size. To reduce the workload for the radiologists, a research scientist (5 years of experience in nodule CAD research) first removed the marks that were obviously not a nodule. CAD marks scored as nodule≥3 mm by all four radiologists in our study were independently evaluated by an experienced radiologist, who scored subtlety, location, type, and attachment to other structures. Subtlety was scored on a five-point scale (1=extremely subtle, 5=obvious).
Comparative CAD performance
The performance of the three CAD systems on the different subsets is depicted in Fig. 2. This figure shows that the performance of ISICAD and Visia was influenced by the different data sources. ISICAD showed the largest performance difference between soft/standard and enhancing/overenhancing reconstruction kernels. Herakles showed the most stable and robust performance across all data sources and consistently outperformed the other two CAD systems.
Overview of the categories in which the false positives of Herakles at the system operating point can be divided. In this analysis, we first check for corresponding nodule≥3 mm annotations, then we check for corresponding nodule<3 mm annotations, and finally we check for corresponding non-nodule annotations. This means that in the top row where three out of four radiologists annotated the location as nodule≥3 mm, the fourth radiologist may have marked the location as nodule<3 mm, non-nodule, or did not mark it at all. In the nodule<3 mm category, all false positives whose location was marked as nodule<3 mm by at least one radiologist were placed (and, hence, no radiologist marked it as nodule≥3 mm). The non-nodule category contains all false positives whose location was marked as non-nodule by at least one radiologist (and, hence, no radiologist marked the location as nodule≥3 mm or nodule<3 mm). False positives for which no corresponding annotation was found were assigned to the last category
Nodule≥3 mm - 3/4
Nodule≥3 mm - 2/4
Nodule≥3 mm - 1/4
No corresponding annotation
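The precedence rule described in the caption above can be made explicit as a small categorization function (a sketch; the reader counts are hypothetical inputs):

```python
def categorize_false_positive(n_ge3, n_lt3, n_non):
    """Assign a false-positive CAD mark to a category, checking in order:
    nodule>=3 mm marks first, then nodule<3 mm, then non-nodule.

    n_ge3: readers (0-3) marking the location nodule>=3 mm
           (4/4 would have made the mark a true positive).
    n_lt3: readers marking it nodule<3 mm.
    n_non: readers marking it non-nodule.
    """
    if n_ge3 > 0:
        return f"nodule>=3 mm - {n_ge3}/4"
    if n_lt3 > 0:
        return "nodule<3 mm"
    if n_non > 0:
        return "non-nodule"
    return "no corresponding annotation"
```

Because the checks are ordered, a mark seen as nodule≥3 mm by even one reader never falls into the nodule<3 mm or non-nodule categories, matching the caption's description.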
Observer study results
Results of the observer experiment. The distribution of the scores of all observers is tabulated
Though clear definitions are available for what constitutes a pulmonary nodule (Fleischner Glossary), a number of publications have demonstrated the lack of observer agreement on what indeed represents a pulmonary nodule [15, 16, 17]. Not surprisingly, this effect is larger for small lesions. This lack of an absolute standard of truth makes benchmarking of CAD systems very difficult. Therefore, we decided to use the largest publicly available database of annotated pulmonary nodules on CT. An elaborate double reading process involving four radiologists had been undertaken to define various levels of evidence for the presence of nodules, avoiding the need for a consensus statement. In our study, we used the extensive annotation information of the LIDC/IDRI database to benchmark the performance of state-of-the-art nodule CAD systems. To our knowledge, this is the first study that uses the full LIDC/IDRI database and accepts the fact that there is no absolute standard of truth for the presence of pulmonary nodules in the absence of pathological correlation.
Our study showed substantial performance differences between the three CAD systems, with the commercial prototype Herakles demonstrating the best performance. At its system operating point, Herakles detected 82 % of all nodule≥3 mm findings marked by all four LIDC readers at an average of 3.1 false positives per scan. If marks on the other LIDC annotations were ignored in the analysis, a sensitivity of 83 % at an average of only 1.0 false positives was reached.
The best CAD system still misses a subset of the nodules (18 % of the 777 nodules). We observed that a substantial part of the missed nodules (30 %) were subsolid nodules, which are rarer and have a markedly different appearance from solid nodules. Therefore, integrating a dedicated subsolid nodule detection scheme into a complete CAD solution for pulmonary nodules may prove helpful to improve overall CAD performance.
Both Visia and ISICAD showed substantial performance differences on different subsets of the data, whereas Herakles achieved a more robust performance. The performance of ISICAD dropped substantially on data with enhancing or overenhancing reconstruction kernels. This may be attributed to the fact that ISICAD was developed and trained exclusively with data from the Dutch-Belgian NELSON lung cancer screening trial, which consists of homogeneous thin-slice data reconstructed with a soft/standard reconstruction kernel. This indicates that although ISICAD was the leading CAD system in the ANODE09 challenge, which used only data obtained from the NELSON trial, its performance drops when applied to data from other sources. Therefore, the heterogeneity of a reference database is an important aspect of a reliable CAD evaluation and an advantage of the LIDC/IDRI database.
Although a blinded and unblinded review of all images had been performed by the LIDC investigators, we showed that CAD can find lesions missed by the original LIDC readers. We found 45 nodules that were accepted as nodule≥3 mm by all four radiologists involved in our observer study. Previous studies have already shown that CAD can find lesions missed by multiple readers [18, 20]. One possible reason why the LIDC readers missed nodules may be that they only inspected transverse sections. Characteristic features of the 45 nodules not included in the LIDC/IDRI database but seen by CAD were subtle conspicuity, small size (<6 mm), and attachment to pleura or vasculature.
Since an extensive evaluation on a large reference database is essential to move CAD to the next level, we have published our results on a public website (http://luna.grand-challenge.org/) which allows other CAD researchers to upload results of their CAD systems for which the same FROC curves as presented in Figs. 1 and 2 will be computed and published on the website. The annotation files of the reference standard and the extra annotations by the human readers in our observer study are available for download. By making the extra annotations available to other researchers, this study contributes to an improved reference standard for the LIDC/IDRI database, and we hope future CAD studies will use the improved reference standard.
We primarily evaluated the performance of CAD on nodules for which all four radiologists agreed that it was a nodule≥3 mm. Previous publications have also focused on the nodules detected by three, two, or one out of four radiologists [21, 22]. For using CAD in a screening setting, a high sensitivity, even at the expense of specificity, is desirable to find all potentially cancerous nodules. High false positive rates, on the other hand, increase the workload for radiologists and potentially increase unnecessary follow-up. We, therefore, report the sensitivity using the highest level of evidence (four out of four readers) and considered the lower levels of agreement for quantifying the false positive rates. For future CAD reference databases, a large database of CT images including follow-up CT and histopathological correlation would be helpful to remove subjectivity from the reference standard, and to verify whether CAD detects the clinically relevant nodules.
In conclusion, we found that, on the largest publicly available database of annotated chest CT scans for lung nodule detection, Herakles detects the vast majority of pulmonary nodules at a low false positive rate. The results show that the new prototype outperforms the other two CAD systems and is robust to different acquisition factors, such as presence of contrast, section thickness, and reconstruction kernel. Our observer experiment showed that Herakles was able to detect pulmonary nodules that had been missed by the extensive LIDC annotation process. Given the growing interest in and need for CAD in the context of screening, it can be expected that new CAD algorithms will be presented in the near future. Our results are publicly available, and other CAD researchers may compare the performance of their CAD algorithms to the results reported here, utilizing the LIDC/IDRI database for benchmarking of available CAD systems.
The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study.
The scientific guarantor of this publication is Bram van Ginneken. The authors of this manuscript declare relationships with the following companies: MeVis Medical Solutions AG, Bremen, Germany
This study has received funding by a research grant from MeVis Medical Solutions AG, Bremen, Germany and by a research grant from the Netherlands Organisation for Scientific Research (NWO), project number 639.023.207. No complex statistical methods were necessary for this paper. Institutional Review Board approval was obtained. Written informed consent was waived by the Institutional Review Board.
Not applicable, since no animals were involved in this study. Some study subjects or cohorts have been reported in previous studies involving the LIDC/IDRI database. The following publication describes the complete LIDC/IDRI database:
Armato SG, McLennan G, Bidaut L, et al. (2011) The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys 38: 915–931
Methodology: retrospective, experimental, multicenter study.
Open Access This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.