1 Introduction

Lung cancer is one of the most lethal diseases worldwide [1]. It can be diagnosed early among high-risk individuals through screening with low-dose computed tomography (LDCT). Compared with traditional chest radiography screening, LDCT screening has reduced lung cancer mortality by \(20\%\) over seven years through early diagnosis [3]. The widespread use of CT has generated an enormous amount of CT data. However, it is challenging for radiologists to accurately localize every pulmonary nodule across all CT slices.

Over the past two decades, researchers have developed many Computer-Aided Diagnosis (CAD) systems for automatic detection of lung nodules [2, 3]. Although the performance of these systems has improved significantly over time, they are still far from practical application. These CAD systems are often built on hand-crafted features derived from low-level properties such as size, shape, texture and intensity. Such features cannot cope with the large variations among nodules, and thus may fail to distinguish nodules from ambiguous regions. Fortunately, deep learning makes it possible to learn representative features that capture the various appearances of pulmonary nodules, and has shown promising gains in detection accuracy. Over the past two years, many deep learning based systems have been proposed and have delivered exciting results [3,4,5,6]. However, detecting small pulmonary nodules in CT slices remains challenging. Detecting small or even tiny nodules plays an important role in early diagnosis and treatment, which can effectively lower the risk of lung cancer before it progresses to a later stage. This inspires us to develop a highly sensitive CAD system.

Conventional pulmonary nodule detection systems usually consist of two stages: proposing nodule candidates and removing false positives. In this paper, we carefully design Convolutional Neural Network (CNN) structures to address the challenge of detecting small pulmonary nodules at each stage. First, to detect all nodules without missing the small ones, we design the candidate detection network (see Fig. 1-Stage 1) by exploiting the properties of Feature Pyramid Networks (FPN) [7]. This detection network covers almost all nodules, missing only very few (16/1186 \(\approx \) 1.3%). Then, we propose an effective Conditional 3D-NMS to remove redundant candidates. Finally, we propose a novel Attention 3D-CNN model (see Fig. 1-Stage 2) that allows the model to focus on the most relevant regions when reducing false positives. We will demonstrate that the proposed pulmonary nodule detection architecture achieves a sensitivity as high as \(95.8\%\) with only two false positives per scan on the LUNA16 dataset, which is very competitive with current state-of-the-art methods.

Fig. 1. The framework of the proposed pulmonary nodule detection system.

Fig. 2. Feature pyramid networks in our candidate detection architecture.

2 Methods

As shown in Fig. 1, the proposed framework consists of two main parts: (1) the detection of nodule candidates by using FPN and (2) false positive reduction with the Attention 3D-CNN.

2.1 High Sensitivity Candidate Detection with FPN

Candidate detection is a crucial stage for pulmonary nodule detection. The purpose of this step is to recall all possible nodules while keeping the number of candidates reasonable. In principle, sliding windows of various scales can cover all possible nodules. In practice, however, this is infeasible because the excessive number of candidates poses great challenges for subsequent operations. In contrast, the Region Proposal Network (RPN) [8] strikes a better balance between computational cost and the number of recalled candidates. By using the RPN, Ding et al. [5] achieved a higher sensitivity with even fewer candidates than traditional CAD systems. However, small nodules are still hard to detect with the original RPN. To this end, we propose an FPN-based detection architecture, described in detail below.

Like [5], we take three consecutive CT slices as input and resize them to \(3\times 1024\times 1024\) before feeding them into the network. To detect candidates, we design the FPN-based architecture illustrated in Fig. 2. The C1\(\sim \)C5 layers in the proposed network correspond to the VGG16 network [9], with one new \(2\times 2\) pooling layer with stride 2 added after Conv5_3 of the original VGG16 network. In most cases, the C5 feature map is already effective for object detection in natural images. However, since nodules in CT are usually very small (3\(\sim \)30 mm in LUNA16 [3]), the features of small nodules become too weak or even disappear after passing through several pooling layers. We therefore take P3\(\sim \)P5 as the final features instead of C5. A \(1\times 1\) convolutional layer is attached to C5 to produce the coarsest-resolution map P5 with 256 channels. Since P5 has the lowest resolution, we search for larger nodule proposals of size \(64 \times 64\) on it. Then, we upsample P5 by a factor of two and add it to the channel-reduced C4 to obtain a middle-resolution P4 layer. In a similar way, we obtain the P3 layer at the highest resolution. Both P3 and P4 contain higher-resolution information (which comes from C3 and C4) than the P5 layer, which helps in detecting smaller nodules. Thus, 32 \(\times \) 32 and 16 \(\times \) 16 nodule proposals are found at the P4 and P3 layers respectively. With these feature maps at different resolutions, we use RPN-Net to get nodule proposals and then classify each of them as a nodule candidate or not.
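The top-down merging described above can be illustrated with a minimal sketch on toy single-channel maps. It is not the authors' implementation: nearest-neighbour upsampling, the nested-list representation, and the function names `upsample2x`, `add_maps` and `top_down` are assumptions made here for illustration, and the lateral maps are assumed to have already passed through the \(1\times 1\) channel-reducing convolution.

```python
# Toy sketch of the FPN top-down pathway (P5 -> P4 -> P3).
# Assumptions (not stated in the paper): nearest-neighbour upsampling,
# single-channel maps as nested lists, lateral maps already channel-reduced.

def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def add_maps(a, b):
    """Element-wise sum of two maps that must share the same shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down(c3, c4, c5):
    """Build P3, P4, P5 from channel-reduced C3, C4, C5 maps."""
    p5 = [list(row) for row in c5]            # coarsest level
    p4 = add_maps(upsample2x(p5), c4)         # upsampled P5 + lateral C4
    p3 = add_maps(upsample2x(p4), c3)         # upsampled P4 + lateral C3
    return p3, p4, p5

# Proposal size searched at each pyramid level, as stated in the text.
PROPOSAL_SIZE = {"P3": 16, "P4": 32, "P5": 64}

c5 = [[1.0]]
c4 = [[0.5, 0.0], [0.0, 0.5]]
c3 = [[0.0] * 4 for _ in range(4)]
p3, p4, p5 = top_down(c3, c4, c5)
print(p4)  # [[1.5, 1.0], [1.0, 1.5]]
```

Each finer level receives the coarse semantic signal from the level above plus its own higher-resolution lateral features, which is why the smallest proposals are assigned to P3.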

2.2 Conditional 3D-NMS for Redundant Candidate Removal

Most pulmonary nodule candidates generated by the FPN may be detected repeatedly because nodules often span multiple consecutive slices. To reduce the unnecessary computational burden, we propose a simple yet effective Conditional 3D-NMS method to remove redundant candidates. The basic idea is to divide the candidates of the same CT scan into sets of highly overlapping candidates based on their positions and radii. We then choose the candidate with the highest mean pixel value as the final candidate of each set. We use the mean pixel value as the condition because it corresponds to the Hounsfield unit (HU) value, which reflects the nature (pulmonary nodule or lung parenchyma) of the region. The algorithm is summarized in Algorithm 1.
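The grouping-and-selection idea can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the overlap test (centre distance no larger than the larger of the two radii), the greedy seed-based grouping, and the names `Candidate`, `overlaps` and `conditional_3d_nms` are all assumptions introduced here.

```python
import math
from typing import List, NamedTuple

class Candidate(NamedTuple):
    x: float
    y: float
    z: float
    radius: float
    mean_value: float  # mean pixel value inside the candidate region

def overlaps(a: Candidate, b: Candidate) -> bool:
    """Assumed overlap test: centres closer than the larger radius."""
    dist = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
    return dist <= max(a.radius, b.radius)

def conditional_3d_nms(candidates: List[Candidate]) -> List[Candidate]:
    """Keep one candidate per overlapping set: the one with the highest mean value."""
    remaining = list(candidates)
    kept = []
    while remaining:
        seed = remaining[0]
        # All candidates that overlap the seed form one redundant set.
        group = [c for c in remaining if overlaps(seed, c)]
        remaining = [c for c in remaining if not overlaps(seed, c)]
        # The condition: select by mean pixel value, not by detection score.
        kept.append(max(group, key=lambda c: c.mean_value))
    return kept

# Three detections of one nodule on neighbouring slices, plus a distant one.
# The coordinates and values are made up for illustration.
cands = [
    Candidate(10, 10, 5, 4, -300.0),
    Candidate(11, 10, 6, 4, -50.0),
    Candidate(10, 11, 7, 4, -120.0),
    Candidate(80, 40, 20, 3, -200.0),
]
kept = conditional_3d_nms(cands)
print(len(kept))  # 2
```

Selecting by mean pixel value rather than by detection score is what makes the suppression "conditional" in the sense described in the text.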

Algorithm 1. Conditional 3D-NMS.

Fig. 3. The network structure of the Attention 3D-CNN.

2.3 Attention 3D-CNN for False Positive Reduction

The proposed candidate detection method can recall almost all of the nodules. Nevertheless, a large number of false positives exist among those candidates since it is difficult to distinguish true nodules from highly similar false positives without using three-dimensional spatial contextual information. Some researchers take advantage of 3D CNN for false positive reduction [5, 6, 10]. By contrast, to use spatial information more effectively, we propose a novel Attention 3D-CNN architecture for false positive reduction.

As shown in Fig. 1 (Stage 2), the Attention 3D-CNN has two components. Branch A is the attention subnet (U-net structure [11]), which produces a 3D mask that is supposed to have a high response near the nodule. Branch B is the 3D CNN classification network. We apply the resultant 3D mask to the source patch before it is fed into the classification network. This allows the network to focus on the lesion area while ignoring the noisy, irrelevant background. The detailed architecture of the Attention 3D-CNN is presented in Fig. 3, where the convolutional layers are followed by batch normalization and ReLU activation.

Branch A produces a mask that follows a Gaussian distribution centered on the nodule. The ground truth used to train it is calculated as follows:

$$\begin{aligned} V = \frac{K}{\sqrt[3]{2\pi }\,\sigma ^3}\cdot \exp \left( -\frac{(x-\overline{x})^2+(y-\overline{y})^2+(z-\overline{z})^2}{2\sigma ^2}\right) , \qquad 3\sigma =1.5r \end{aligned}$$
(1)

where \((\overline{x},\overline{y},\overline{z})\) is the nodule centroid, \((x,y,z)\) is a voxel of the mask in the CT scan, V is the mask value, r is the nodule radius, and K is a constant term. We adopt \(3\sigma =1.5r\) instead of r because we want the model to take context information into account to better recognize nodules. Finally, Branch B outputs a classification probability that decides whether the current candidate is a nodule or not. We use a multi-task loss L to jointly train the network:

$$\begin{aligned} {\begin{matrix} L = \lambda L_{mask}+L_{cls} \end{matrix}} \end{aligned}$$
(2)

where the mask loss \(L_{mask}\) is the mean squared error (MSE) between the ground truth mask and the predicted mask, the classification loss \(L_{cls}\) is the focal loss [7], which is more effective than the cross-entropy loss at classifying hard examples on an imbalanced set, and \(\lambda \) is a hyperparameter that balances the two losses.
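Eqs. (1) and (2) can be sketched together in a few lines. The mask value follows Eq. (1) as printed, with \(K=1000\) taken from the implementation details. Everything else is an assumption made for illustration: the function names, the flattened-list mask representation, the focal loss parameters `gamma=2.0` and `alpha=0.25` (common defaults, not values reported in this paper), and `lam=1.0` for \(\lambda \), which the paper sets by cross-validation.

```python
import math

def gaussian_mask_value(voxel, centroid, radius, k=1000.0):
    """Ground-truth mask value at one voxel, following Eq. (1) as printed."""
    sigma = 1.5 * radius / 3.0                      # from 3*sigma = 1.5*r
    d2 = sum((v - c) ** 2 for v, c in zip(voxel, centroid))
    norm = (2.0 * math.pi) ** (1.0 / 3.0) * sigma ** 3
    return k / norm * math.exp(-d2 / (2.0 * sigma ** 2))

def mse(pred, target):
    """Mean squared error over two flattened masks of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def focal_loss(p, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction; gamma and alpha are assumed values."""
    p_t = p if label == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def multitask_loss(pred_mask, true_mask, p_nodule, label, lam=1.0):
    """Eq. (2): L = lambda * L_mask + L_cls."""
    return lam * mse(pred_mask, true_mask) + focal_loss(p_nodule, label)

# A nodule of radius 4 centred at (5, 5, 5); sigma is then 2.
peak = gaussian_mask_value((5, 5, 5), (5, 5, 5), 4.0)
off = gaussian_mask_value((7, 5, 5), (5, 5, 5), 4.0)   # one sigma from the centre
print(off / peak)  # exp(-0.5), about 0.607
```

The factor \((1-p_t)^{\gamma }\) shrinks the loss of confidently correct predictions, so the many easy negatives among the candidates contribute little and training concentrates on the hard examples.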

Table 1. Detection performance comparison (Sensitivity vs Candidates/scan)

3 Experiments

We evaluate the proposed framework on the LUNA16 [3] dataset. It contains 888 CT scans whose pulmonary nodules have been annotated by four experienced radiologists. The LUNA16 dataset is divided into ten subsets for ten-fold cross validation. The performance of a detection algorithm is evaluated by its sensitivity and its average number of false positives per scan (FPs/scan). The overall score (CPM score) is defined as the average sensitivity at seven predefined false positive rates: 1/8, 1/4, 1/2, 1, 2, 4 and 8 FPs per scan.
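The CPM definition above amounts to a plain average, sketched below. The function name `cpm_score` and the sensitivity values in the example are made up for illustration and are not results from this paper.

```python
# The seven operating points (FPs per scan) that define the CPM score.
CPM_FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)

def cpm_score(sensitivity_at):
    """Average the sensitivities measured at the seven predefined FP rates."""
    missing = [r for r in CPM_FP_RATES if r not in sensitivity_at]
    if missing:
        raise KeyError(f"missing sensitivities for FP rates: {missing}")
    return sum(sensitivity_at[r] for r in CPM_FP_RATES) / len(CPM_FP_RATES)

# Illustrative sensitivities only (not taken from the paper).
example = {0.125: 0.60, 0.25: 0.70, 0.5: 0.80, 1: 0.90, 2: 0.95, 4: 0.97, 8: 0.98}
score = cpm_score(example)
print(round(score, 3))  # 0.843
```

Because every operating point carries equal weight, a system can lead in the 1 to 4 FPs/scan range and still have a lower CPM than one that does better at 1/8 or 1/4 FPs/scan.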

3.1 Implementation Details

In the candidate detection stage, we normalize the values of the CT scans (Hounsfield units between −1000 and 400) to the range (0, 1). In the false positive reduction stage, we crop \(36\times 36\times 36\)-voxel patches around the detected candidates that remain after Conditional 3D-NMS. Data augmentation is then used for training the Attention 3D-CNN network: we flip the patches along the coronal, sagittal and axial dimensions and crop \(32\times 32\times 32\) patches as the input to the network. The constant term K used in generating the Gaussian masks is set to 1000, and the hyperparameter \(\lambda \) is set based on the cross-validation result.
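A minimal sketch of this preprocessing follows. The HU window and the patch sizes come from the text. Clipping values outside the window, drawing the crop offset at random, the axis order (z, y, x), and the function names are assumptions made here.

```python
import random

HU_MIN, HU_MAX = -1000.0, 400.0   # intensity window stated in the text

def normalize_hu(hu):
    """Clip a Hounsfield value to the window and rescale it to [0, 1]."""
    clipped = min(max(hu, HU_MIN), HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)

def random_crop_origin(patch_size=36, crop_size=32, rng=random):
    """Pick a random (z, y, x) offset for cropping; random choice is an assumption."""
    max_offset = patch_size - crop_size
    return tuple(rng.randint(0, max_offset) for _ in range(3))

def crop_and_flip(patch, origin, crop_size=32, flip_axes=(False, False, False)):
    """Crop a cube from a nested-list volume and optionally flip each axis."""
    z0, y0, x0 = origin
    out = [[row[x0:x0 + crop_size] for row in plane[y0:y0 + crop_size]]
           for plane in patch[z0:z0 + crop_size]]
    if flip_axes[0]:
        out = out[::-1]                                        # flip along z
    if flip_axes[1]:
        out = [plane[::-1] for plane in out]                   # flip along y
    if flip_axes[2]:
        out = [[row[::-1] for row in plane] for plane in out]  # flip along x
    return out

print(normalize_hu(-300))  # 0.5

# Tiny 3x3x3 volume standing in for a 36x36x36 patch.
toy = [[[z * 9 + y * 3 + x for x in range(3)] for y in range(3)] for z in range(3)]
print(crop_and_flip(toy, (1, 1, 1), crop_size=2))
# [[[13, 14], [16, 17]], [[22, 23], [25, 26]]]
```

Cropping \(32^3\) from \(36^3\) leaves up to four voxels of translation along each axis, which together with the flips provides the augmentation variety.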

3.2 Ablation Study and Results

To check the contribution of each step in the proposed framework, we perform an ablation study. As shown in Table 1, our candidate detection network (Cand-Det) can achieve a sensitivity of 98.7% with an average of 179.6 detected candidate nodules per scan. Compared with the ‘FUSION’ result that combines five traditional CAD systems (ISICAD\(\sim \)ETROCAD), the proposed candidate detection method can achieve higher sensitivity with fewer candidates.
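As a quick consistency check, the 98.7% candidate-detection sensitivity agrees with the missed-nodule count of 16 out of 1186 reported in the introduction. The variable names below are introduced only for this check.

```python
TOTAL_NODULES = 1186   # annotated nodules, from the introduction
MISSED_NODULES = 16    # nodules missed by the candidate detector

miss_rate = MISSED_NODULES / TOTAL_NODULES
sensitivity = 1.0 - miss_rate

print(f"miss rate:   {miss_rate:.1%}")    # 1.3%
print(f"sensitivity: {sensitivity:.1%}")  # 98.7%
```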

After candidate detection, the Conditional 3D-NMS (Cand-Det-CNMS) method is adopted to remove redundant candidates. The table shows that the Conditional 3D-NMS removes redundancy more effectively than the normal 3D-NMS. Finally, we use the Attention 3D-CNN to remove a large number of false positives and reach a sensitivity of 95.8% with only 6.6 detected nodules per scan. By comparison, if we remove Branch A from the Attention 3D-CNN network, it becomes a typical 3D CNN, and the result drops to 94.6% sensitivity with 8.5 detections per scan. This shows that adding the attention branch helps improve the sensitivity as well as reduce false detections. The FROC curve for each step is plotted in Fig. 4. We can see that the performance of the proposed framework improves as these steps are combined.

Fig. 4. FROC performance at each stage of the proposed method.

To further analyze the performance of the proposed framework, we compare it with state-of-the-art methods [5, 6, 10] using the CPM score. Table 2 shows that the CPM score of the proposed system is 87.8%, which is better than the methods proposed in [6, 10] and only slightly lower than FRCN+3DCNN. However, in clinical practice, radiologists are more concerned with the sensitivity when the FPs per scan vary from 1 to 4 [5, 6], where the proposed architecture achieves the best performance.

Table 2. Comparison with state-of-the-art methods on the LUNA16 dataset [3] (CPM score). We achieve the best results when the FPs per scan vary from 1 to 4, the range of most interest in clinical practice.

4 Conclusion

In this work, we propose an architecture for the detection of pulmonary nodules. The architecture first produces nodule candidates with high sensitivity using an FPN-based detection network. Then, a simple and effective Conditional 3D-NMS method removes the redundant candidates. Finally, a novel Attention 3D-CNN network reduces the large number of false positives. Experiments show that our architecture achieves high sensitivity with few candidates. The architecture can also be extended to other similar object detection tasks in CT scans.