Validation of deep learning-based computer-aided detection software use for interpretation of pulmonary abnormalities on chest radiographs and examination of factors that influence readers’ performance and final diagnosis

Purpose To evaluate the performance of a deep learning-based computer-aided detection (CAD) software for detecting pulmonary nodules, masses, and consolidation on chest radiographs (CRs) and to examine the effect of readers’ experience and data characteristics on the sensitivity and final diagnosis. Materials and methods The CRs of 453 patients were retrospectively selected from two institutions. Among these CRs, 60 images with abnormal findings (pulmonary nodules, masses, and consolidation) and 140 without abnormal findings were randomly selected for sequential observer-performance testing. In the test, 12 readers (three radiologists, three pulmonologists, three non-pulmonology physicians, and three junior residents) interpreted 200 images with and without CAD, and the findings were compared. Weighted alternative free-response receiver operating characteristic (wAFROC) figure of merit (FOM) was used to analyze observer performance. The lesions that readers initially missed but CAD detected were stratified by anatomic location and degree of subtlety, and the adoption rate was calculated. Fisher’s exact test was used for comparison. Results The mean wAFROC FOM score of the 12 readers significantly improved from 0.746 to 0.810 with software assistance (P = 0.007). In the reader group with < 6 years of experience, the mean FOM score significantly improved from 0.680 to 0.779 (P = 0.011), while that in the reader group with ≥ 6 years of experience increased from 0.811 to 0.841 (P = 0.12). The sensitivity of the CAD software and the adoption rate were significantly lower for lesions with subtlety level 2 or 3 (obscure) than for lesions with subtlety level 4 or 5 (distinct) (50% vs. 93%, P < 0.001; and 55% vs. 74%, P = 0.04, respectively).
Conclusion CAD software use improved doctors’ performance in detecting nodules/masses and consolidation on CRs, particularly for non-expert doctors, by preventing doctors from missing distinct lesions rather than helping them to detect obscure lesions.


Introduction
Chest radiography is a commonly used medical imaging and diagnostic technique for initial screening of patients due to its low cost and easy accessibility [1]. It plays an important role in detecting lung diseases such as lung cancer and tuberculosis [2,3].
However, accurate interpretation of chest radiographs (CRs) can occasionally be challenging for doctors. Approximately 90% of missed lung cancer cases involve CR assessments [4], and the miss rate of lung cancers on CRs is reportedly 19-22% [5,6]. The characteristics of abnormal lesions, such as size, conspicuity, and location, influence the detection accuracy [4,7]. Reader proficiency is another important factor. While expert observers establish specific scanning patterns for radiographs, non-expert observers generally search the radiograph without a fixed order, and this can cause them to overlook obscure abnormal findings [4]. In Japan, doctors who do not specialize in pulmonology often read CRs in daily practice. Approximately 50% of the doctors who interpret CRs for patient screening are not lung disease experts, such as non-pulmonology physicians [8], and this may reduce the abnormality detection rate on CRs. Thus, there is a demand for detection tools for non-experts.
In recent years, computer-aided detection (CAD) systems that use deep learning algorithms have been developed [9,10]. Some studies have shown that observer performance for detection of abnormal thoracic lesions with CAD is significantly better than without CAD [11,12]. These studies used algorithms developed by researchers from scratch, but few studies have used software products developed by vendors. Some CAD software packages for CRs are already commercially available, and their diagnostic performance has been reported (e.g., EIRL X-ray Lung nodule, Lpixel, Tokyo, Japan). However, few studies have analyzed the data characteristics that affect the improvement of readers' performance and final diagnosis with CAD.
This study aimed to compare doctors' performance in interpreting CRs with and without CAD. We also examined the effect of readers' experience and data characteristics on detection of abnormal pulmonary lesions with CAD. Lung nodules, masses, and consolidation on CRs were targeted because the CAD software used in this study was designed to detect these lesions.

Materials and methods
This retrospective multicenter study was approved by the institutional review board, and anonymized data were shared through a data-sharing agreement between the institutions. The requirement for written informed consent was waived because the data were collected retrospectively. This study was supported by Konica Minolta.

Data collection
Anonymized CRs (posteroanterior view) of 453 patients were retrospectively selected from two institutions in Japan. Only patients who were over the age of 19 years were included. Institution A, a university hospital, supplied the data of 238 patients who presented for physical examination between January 2012 and December 2019. Institution B, a health screening center, supplied the data of 215 patients who had routine health screening between January 2016 and December 2018. One CR image was acquired from each patient; thus, 453 images were collected. The images were acquired using the Aero DR system (Konica Minolta, Tokyo, Japan) or FUJIFILM DR PRELIO U (Fujifilm, Tokyo, Japan) with a tube voltage of 120-130 kVp and a tube current-time product of 1-8 mAs. The exclusion criterion was poor image quality; no CR was excluded on this basis. Pulmonary nodules/masses and consolidation on the CRs were considered abnormal findings, while extrapulmonary abnormal findings, such as cardiomegaly and rib fractures, were considered normal for the purpose of this study. Nodules and masses were defined as focal lung opacities with a smooth border, measuring ≤ 3 cm and > 3 cm in diameter, respectively. Consolidation was defined as lung opacities other than nodules and masses. A total of 194 images (43%) included abnormal findings, while 259 (57%) were normal. For the observer-performance test, 60 images with and 140 without abnormal findings were randomly selected. Two board-certified radiologists (with 6 and 14 years of experience) reviewed the images and recorded the area, lesion type (nodule, mass, or consolidation), and the degree of subtlety of each lesion by mutual agreement. The degree of subtlety was measured on a five-point scale as follows: level 1, extremely subtle (detection is extremely difficult); level 2, very subtle (detection is very difficult); level 3, subtle (detection is difficult); level 4, relatively obvious (detection is relatively easy); and level 5, obvious (detection is easy) [13].

Ground truth
All images were independently reviewed by three board-certified radiologists (with 14, 25, and 33 years of experience) to establish the ground truth. The radiologists confirmed the presence of abnormal findings on the images and marked the lesion locations. The common areas annotated by at least two of the three radiologists with an intersection over union (IoU) greater than a specific threshold for each finding were adopted as abnormal lesions. The thresholds for nodule/mass and consolidation were set at 0.5 and 0.0, respectively.
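The consensus rule above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each annotation is an axis-aligned bounding box (the annotation shape is not stated here) and that the "common area" is the intersection of two agreeing marks; the names `iou` and `consensus_regions` are hypothetical and not part of the study's pipeline.

```python
from itertools import combinations

# Assumed annotation format: (x_min, y_min, x_max, y_max) in pixels.
Box = tuple


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consensus_regions(marks_by_reader, threshold):
    """Return the overlap of every pair of marks from two different
    readers whose IoU is strictly greater than the threshold."""
    regions = []
    for marks_1, marks_2 in combinations(marks_by_reader, 2):
        for a in marks_1:
            for b in marks_2:
                if iou(a, b) > threshold:
                    regions.append((max(a[0], b[0]), max(a[1], b[1]),
                                    min(a[2], b[2]), min(a[3], b[3])))
    return regions


# Three readers, one mark each; the third mark overlaps nothing.
marks = [[(0, 0, 10, 10)], [(0, 0, 10, 5)], [(20, 20, 30, 30)]]
print(consensus_regions(marks, 0.5))  # nodule/mass threshold
print(consensus_regions(marks, 0.0))  # consolidation threshold
```

With a threshold of 0.0 and a strict "greater than" comparison, any overlap between two readers' marks counts as agreement, which suits consolidation, where lesion borders are less well defined than for nodules and masses.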

Software
CAD-Chest X-ray (Konica Minolta, Inc. and Enlitic, Inc.) software was used for this study. This software is currently commercially available in Japan (Approval No. 30300BZX00271000). This software was designed as a second-reader type CAD. It automatically detects pulmonary nodules/masses and consolidation, and marks the areas of the lesions.

Detection performance test of the CAD software
First, a performance test of the CAD software alone (standalone test) was conducted. All 453 images in the dataset were interpreted with the software alone. Then, an observer-performance test was performed to assess whether the software would improve doctors' performance. The test had a sequential-only design and was done in accordance with the US Food and Drug Administration guideline [14]. The CRs of the selected 200 patients were interpreted by 12 doctors, including three radiologists, three pulmonologists, three non-pulmonology physicians, and three junior residents with various years of experience (2-12 years). The readers were blinded to the clinical information of the patients, and the radiologists who defined the reference standard did not participate in the performance test. The test consisted of two sessions. In the first session, the readers were asked to determine whether each CR showed any nodule, mass, or consolidation. If any of these was present, the readers then marked the center of the lesion. All procedures were performed without CAD software. The readers were also asked to input a confidence score with a continuous value between zero and one for each annotation. In the second session, the readers were asked to re-evaluate every CR with the assistance of the CAD software and to modify their original decisions and confidence scores.

Statistical analyses
The sensitivity and specificity of the CAD software for detecting pulmonary nodules, masses, and consolidation were analyzed in the standalone test. Per-lesion sensitivity and per-patient specificity were calculated. In the observer-performance test, the detection performances of the readers with and without CAD were compared. Jackknife alternative free-response receiver operating characteristic (JAFROC) analyses were performed using R statistical software version 4.0.2 (R Project for Statistical Computing, Vienna, Austria) and RJafroc version 2.0.1. Both the readers and CRs were treated as random effects. The weighted alternative free-response receiver operating characteristic (wAFROC) figure of merit (FOM) score was used as the performance measure for the analyses. The weights were equally divided by the number of lesions. Statistical significance was evaluated using the Dorfman-Berbaum-Metz method [15]. The results were stratified according to the specialty and years of experience of the readers. The mean FOM scores with and without CAD in each group were compared. For the analysis of the sensitivity of the CAD software and the adoption rate of the lesions that readers initially missed but CAD detected, the lesions were grouped by the anatomic location and the degree of subtlety. Fisher's exact test was used for the analyses. Statistical significance was set at P < 0.05.
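To make the performance measure concrete, the following is a minimal sketch of a wAFROC-style FOM for a single reader. It assumes the commonly used definition in which the highest-rated false-positive mark on each non-diseased case is compared with every lesion rating on each diseased case, lesion weights are equal within a case, and unmarked cases or lesions receive a rating of negative infinity. The function names and the toy ratings are hypothetical; this is not the RJafroc implementation and it omits the Dorfman-Berbaum-Metz significance testing.

```python
import math


def kernel(fp_rating: float, lesion_rating: float) -> float:
    """1 if the lesion outranks the false positive, 0.5 on a tie, else 0."""
    if lesion_rating > fp_rating:
        return 1.0
    if lesion_rating == fp_rating:
        return 0.5
    return 0.0


def wafroc_fom(normal_cases, abnormal_cases):
    """normal_cases: one list of false-positive ratings per non-diseased case.
    abnormal_cases: one list per diseased case with one entry per lesion
    (the rating of the mark that hit it, or None if the lesion was missed)."""
    unmarked = -math.inf
    total = 0.0
    for fp_marks in normal_cases:
        highest_fp = max(fp_marks, default=unmarked)
        for lesions in abnormal_cases:
            weight = 1.0 / len(lesions)  # equal weights within a case
            for rating in lesions:
                z = unmarked if rating is None else rating
                total += weight * kernel(highest_fp, z)
    return total / (len(normal_cases) * len(abnormal_cases))


# Toy data: two normal cases (one unmarked, one with a false positive at 0.3)
# and two abnormal cases (one lesion found at 0.9; two lesions, one found
# at 0.2 and one missed).
print(wafroc_fom([[], [0.3]], [[0.9], [0.2, None]]))
```

Because the confidence scores in this study are continuous values between zero and one, they can be used directly as ratings in such a calculation.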

Results
Among the 453 images used for the standalone test, 194 showed abnormal findings. The abnormal findings included nodules/masses in 101 (52%) images, consolidation in 91 (47%) images, and both in 2 (1%) images. Sixty images with abnormal findings were selected for the observer-performance test. The abnormal findings included 36 (53%) nodules/masses and 29 (47%) consolidation. Three images showed multiple nodules/masses and two images showed multiple areas of consolidation. The demographic features of each dataset are shown in Table 1.
Comparison of detection performances with and without CAD software

Analyses of factors influencing sensitivity and final diagnosis with CAD assistance
Non-experts and experts detected 66% (256/390) and 78% (305/390), respectively, of all abnormal lesions without using CAD. Table 3 shows the sensitivity of the CAD software and human readers, stratified by anatomic location and degree of subtlety of the lesion. Experts had higher sensitivity than non-expert doctors in all groups. The sensitivity of the CAD software was significantly lower for obscure lesions (subtlety level 2 or 3) than for distinct lesions (subtlety level 4 or 5) (50% vs. 93%; P < 0.001). Figures 2 and 3 show true-positive examples of pulmonary nodules and consolidation, which some readers missed without using CAD but acknowledged after using the software. Table 4 describes the accepted lesions, stratified by their characteristics. Of the initially missed lesions, 67% (55/82) and 66% (23/35) were corrected with CAD in the non-expert and expert groups, respectively. When stratified by the degree of subtlety, the adoption rate of obscure lesions was significantly lower than that of distinct lesions [55% (24/44) vs. 74% (54/73); P = 0.04]. The CAD software did not detect 13 lesions, and 12 of these were detected by at least one human reader.
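The adoption-rate comparison can be checked with a short sketch of a two-sided Fisher's exact test on the reported counts (24 of 44 obscure lesions adopted versus 54 of 73 distinct lesions). The implementation below uses only the Python standard library and the usual convention of summing all tables that are no more probable than the observed one; the function name is hypothetical, and the study itself does not state which software or convention was used for this test.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table at least as extreme as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-7))


# Rows: obscure, distinct. Columns: adopted, not adopted.
p_value = fisher_exact_two_sided(24, 44 - 24, 54, 73 - 54)
print(f"P = {p_value:.3f}")
```

The printed value can be compared with the reported P = 0.04; small differences may arise from the two-sided convention used.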

Discussion
This study compared doctors' performances in interpreting CRs with and without using CAD. The CAD software achieved about 80% sensitivity for detecting pulmonary nodules, masses, and consolidation on CRs in the standalone test. The observer-performance test showed that using the CAD software significantly increased the wAFROC FOM scores for these lesions. Several studies have demonstrated that the assistance of deep learning-based algorithms yields higher detection performance than that achieved by human readers alone [11,12]. The results of this study are consistent with previous findings. Hwang et al. showed a significant improvement in the area under the curve for lesion-wise localization in various reader groups (from 0.781-0.907 to 0.873-0.938) [11]. Choi et al. reported that the assistance of a deep learning-based algorithm improved the FOM score from 0.843 to 0.911 [12]. However, these studies used algorithms developed from scratch for academic purposes. This study used a CAD software package to demonstrate the utility of CAD.
Fig. 2 Examples of true-positive cases of pulmonary nodules. Four readers missed the lesion but corrected their decisions using the computer-aided detection (CAD) software. a The radiograph has a nodule at the right upper field, which is overlapped by the mediastinum. The yellow circle shows the ground truth. b The white circle shows the output of the CAD software. The software correctly detected the lesion.
Fig. 3 Example of true-positive cases of pulmonary consolidation. Five readers missed the lesion but corrected their decisions using computer-aided detection (CAD) software. a The radiograph shows consolidation in the right lower field. The yellow circle shows the ground truth. b The white circle shows the output of the CAD software. The software correctly detected the lesion.
Table 4 Adoption rate of abnormal lesions that were initially missed by readers but detected by computer-aided detection software
The FOM scores reported in previous studies were higher than those recorded in this study (0.746 to 0.810). However, the mean increments in FOM scores with CAD in the previous studies were 0.057 and 0.068, respectively, which were similar to that obtained in our study (0.064). Thus, while the lower FOM score in this study may be attributable to the difficulty of the dataset used, the degree of contribution of the software used is comparable to that of previous studies.
The increment in FOM scores by CAD was higher for non-pulmonology physicians and junior residents than for pulmonologists and radiologists. It was also higher for doctors with < 6 years of experience than for doctors with ≥ 6 years of experience. Thus, this study shows that CAD software is more useful for non-expert doctors than for expert doctors. In support of this finding, previous studies have also reported that CAD software was more beneficial for non-expert readers than expert readers [11,12,16].
In contrast, few studies have analyzed factors that influence readers' performance and final diagnosis with the use of CAD. In our study, the CAD software was significantly less sensitive in detecting obscure lesions than in detecting distinct lesions. The adoption rate of obscure lesions detected by the CAD software was also significantly lower than that of distinct lesions. These results revealed that detection of obscure lesions contributed less to the improvement of readers' performance with CAD than detection of distinct lesions. Furthermore, the adoption rates of CAD software detections for initially missed lesions were approximately the same for non-experts and experts (67% and 66%, respectively). Therefore, this study showed that CAD software was more effective for non-experts than experts because non-experts missed more distinct lesions than experts, and those lesions were detected by the CAD software.
The use of deep learning-based detection algorithms as second readers has already been described [11,12,17]. Such software packages can be adopted as second readers in daily practice, such as during medical checkups. The software automatically marks the regions where abnormal findings are suspected; thus, even non-expert doctors can recognize the lesions. Approximately 50% of doctors in Japan who read CRs for screening are not experienced readers [8]. Additionally, visual and mental fatigue caused by heavy workloads can increase the chances of perceptual errors [18]. The use of CAD software in institutions can therefore help to reduce misdiagnoses caused by these factors. However, the disadvantage of second-reader CAD is that it takes longer to read images with CAD than without CAD because of the necessity of two reading passes [19]. Therefore, some studies have highlighted the potential of using CAD software as a concurrent reader for CRs [11,12]. On the other hand, the use of CAD software as a concurrent reader is associated with the risk that human readers may not pay attention to lesions that the software fails to detect. In our study, 20% of the abnormal findings were not detected by the CAD software, and 92% of those lesions were detected by at least one reader. Using this software as a concurrent reader may lead to missing these lesions. To the best of our knowledge, no study has validated the effect of deep learning-based algorithms on CRs as concurrent readers. Therefore, further study will be required to determine which reader type is more suitable in routine clinical practice.
This study has several limitations. First, validation was performed using small, designed datasets. In our study, 30% of the images in the observer-performance test showed pulmonary nodules/masses and consolidation. By contrast, one study reported that only 8% of the CRs taken for mandated health examinations showed any abnormal finding [20]. Thus, the prevalence of abnormal findings in this study was higher than what is usually seen in routine practice. This may affect the adoption rate of CAD software detection. Second, CT was not used for ground truth labeling. Although the images were reviewed by three board-certified radiologists, some lesions might have been missed. Last, this study was conducted in accordance with the US Food and Drug Administration guideline, not the Japanese guidelines. Therefore, the performance could not be accurately compared with other products marketed in Japan.
In summary, this sequential evaluation study showed that the CAD software improved doctors' performance in detecting nodules/masses and consolidation on CRs, particularly for non-expert doctors, by preventing doctors from missing distinct lesions rather than by helping them to detect obscure lesions. In clinical practice, this software may prevent doctors from missing incidental lung abnormalities, such as lung cancers, because of inexperience and carelessness. Further prospective studies using multicenter data are required to validate the contribution of CAD software packages to clinical practice.
Misa Nagasaka and Ryo Takeshita received lecture fees from Konica Minolta. Yoshitake Yamada and Minoru Yamada have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.