
1 Introduction

Lung cancer is the most common cause of cancer-related death in men. Low-dose lung computed tomography (CT) screening provides an effective means of early diagnosis and can sharply reduce lung cancer mortality. To be truly useful, advanced computer-aided diagnosis (CAD) systems must achieve high sensitivity while maintaining a low false positive rate. Recent advances in deep learning provide new opportunities to design more effective CAD systems that help doctors catch lung cancer in its early stages.

The emergence of large-scale datasets such as LUNA16 [12] has helped accelerate research in nodule detection. Typically, nodule detection consists of two stages: nodule proposal generation and false positive reduction. Traditional approaches generally rely on hand-designed features such as morphological features, voxel clustering, and pixel thresholding [6, 9]. More recently, deep convolutional architectures have been employed to generate candidate bounding boxes. Setio et al. proposed a multi-view convolutional network for false positive nodule reduction [11]. Several works employed 3D convolutional networks to handle the 3D nature of CT scans. A 3D fully convolutional network (FCN) was proposed to generate region candidates, with a deep convolutional network using weighted sampling in the false positive reduction stage [3, 8, 13, 14]. CASED proposed curriculum adaptive sampling for 3D U-net training in nodule detection [7, 10]. Ding et al. used Faster R-CNN to generate candidate nodules, followed by 3D convolutional networks to remove false positives [2]. Because of the strong performance of Faster R-CNN [14], Faster R-CNN with a U-net-like encoder-decoder scheme was proposed for nodule detection [14].

Fig. 1. Illustration of the DeepEM framework. Faster R-CNN is employed for nodule proposal generation. A half-Gaussian model and logistic regression are employed for the central slice and lobe location respectively. In the E-step, we utilize all the observations, CT slices and weak labels, to infer the latent variable (nodule proposals) by maximum a posteriori (MAP) or sampling. In the M-step, we employ the estimated proposals to update the parameters of the Faster R-CNN and the logistic regression.

A prerequisite for training deep learning models is an abundance of labeled data. However, labels are especially difficult to obtain in the medical image analysis domain. There are multiple contributing factors: (a) labeling medical data typically requires specially trained doctors; (b) marking lesion boundaries can be hard even for experts because of the low signal-to-noise ratio in many medical images; and (c) for CT and magnetic resonance imaging (MRI), annotators need to label entire 3D volumes, which is costly and time-consuming. Due to these limitations, CT medical image datasets are usually small, which can lead to over-fitting on the training set and, by extension, poor generalization performance on test sets [16].

By contrast, medical institutions hold large amounts of weakly labeled medical images. In these databases, each medical image is typically associated with an electronic medical report (EMR). Although these reports may not contain explicit detection bounding boxes or segmentation ground truth, they often include the diagnosis, rough locations, and summary descriptions of lesions if any exist. We hypothesize that these extra sources of weakly labeled data can be used to enhance the performance of an existing detector and improve its generalization capability.

There have been previous attempts to utilize weakly supervised labels to train machine learning models. Deep multi-instance learning was proposed for lesion localization and whole mammogram classification [15]. Different pooling strategies were proposed for weakly supervised localization and segmentation respectively [1, 4]. Self-transfer learning co-optimized classification and localization networks for weakly supervised lesion localization [5]. Different from these works, we treat the nodule proposal as a latent variable and propose DeepEM, a deep 3D convolutional network with Expectation-Maximization optimization, to mine the large source of weakly supervised labels in EMRs, as illustrated in Fig. 1. Specifically, we infer the posterior probabilities of the proposed nodules being true nodules, and use these posterior probabilities to train nodule detection models.

2 DeepEM for Weakly Supervised Detection

Notation. We denote by \(\varvec{I} \in \mathbb {R}^{h \times w \times s}\) the CT image, where h, w, and s are the image height, width, and number of slices respectively. The nodule bounding boxes for \(\varvec{I}\) are denoted as \({\varvec{H}}=\{\varvec{H}_1, \varvec{H}_2, \dots , \varvec{H}_M\}\), where \(\varvec{H}_m = \{x_m, y_m, z_m, d_m\}\): \((x_m, y_m, z_m)\) is the center of the nodule proposal, \(d_m\) is its diameter, and M is the number of nodules in the image \(\varvec{I}\). In the weakly supervised scenario, the nodule proposal \(\varvec{H}\) is a latent variable, and each image \(\varvec{I}\) is associated with a weak label \({\varvec{X}}=\{\varvec{X}_1, \varvec{X}_2, \dots , \varvec{X}_M\}\), where \(\varvec{X}_m=\{{loc}_m, z_m\}\): \({loc}_m \in \{1,2,3,4,5,6\}\) is the location (right upper lobe, right middle lobe, right lower lobe, left upper lobe, lingula, left lower lobe) of nodule \(\varvec{H}_m\) in the lung, and \(z_m\) is the central slice of the nodule.
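For concreteness, the notation above maps directly onto simple data structures. The sketch below is illustrative only; the class and field names are ours, not from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NoduleProposal:
    """H_m = (x_m, y_m, z_m, d_m): proposal center and diameter (in voxels)."""
    x: float
    y: float
    z: float
    d: float

@dataclass
class WeakLabel:
    """X_m = (loc_m, z_m): lobe index in {1, ..., 6} and central slice from the EMR."""
    loc: int
    z: float

@dataclass
class WeaklyLabeledScan:
    """A CT volume of shape (h, w, s) with one weak label per reported nodule."""
    image_path: str
    weak_labels: List[WeakLabel]
```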

For fully supervised detection, the objective function is to maximize the log-likelihood function for observed nodule ground truth \(\varvec{H}\) given image \(\varvec{I}\) as

$$\begin{aligned} \mathcal {L}(\varvec{\theta }) = \log P(\varvec{H} \cup \varvec{\bar{H}} | \varvec{I}; \varvec{\theta }) = \frac{1}{M}\sum _{m=1}^{M} \log P(\varvec{H}_m | \varvec{I}; \varvec{\theta })+ \frac{1}{N}\sum _{n=1}^{N} \log P(\varvec{\bar{H}}_n | \varvec{I}; \varvec{\theta }), \end{aligned}$$
(1)

where \(\varvec{\bar{H}}=\{\varvec{\bar{H}}_1, \varvec{\bar{H}}_2, \dots , \varvec{\bar{H}}_N\}\) are hard negative nodule proposals [14] and \(\varvec{\theta }\) denotes the weights of the deep 3D ConvNet. We employ Faster R-CNN with a 3D Res18 network for fully supervised detection because of its superior performance.

For weakly supervised detection, the nodule proposal \(\varvec{H}\) can be considered a latent variable. Under this framework, the image \(\varvec{I}\) and the weak label \(\varvec{X}=\{({loc}_1, z_1), ({loc}_2, z_2), \dots , ({loc}_M, z_M)\}\) are the observations. The joint distribution is

$$\begin{aligned} \begin{aligned} P(\varvec{I}, \varvec{H}, \varvec{X}; \varvec{\theta })&= P(\varvec{I}) \prod _{m=1}^{M} \big ( P(\varvec{H}_m|\varvec{I}; \varvec{\theta }) P(\varvec{X}_m | \varvec{H}_m; \varvec{\theta }) \big ) \\&= P(\varvec{I}) \prod _{m=1}^{M} \big ( P(\varvec{H}_m|\varvec{I}; \varvec{\theta }) P({loc}_m | \varvec{H}_m; \varvec{\theta }) P({z}_m | \varvec{H}_m; \varvec{\theta }) \big ). \end{aligned} \end{aligned}$$
(2)

To model \(P({z}_m | \varvec{H}_m; \varvec{\theta })\), we propose a half-Gaussian distribution based on the nodule size distribution, because \(z_m\) is correct as long as it lies within the nodule area (we denote the center slice of \(\varvec{H}_m\) as \({z}_{{\varvec{H}}_m}\), and the scale \(\sigma \) can be estimated empirically from existing data), as shown in Fig. 2(a). For lung lobe prediction \(P({loc}_m | \varvec{H}_m; \varvec{\theta })\), a logistic regression model is used based on the relative position of the nodule center \(({x}_{{\varvec{H}}_m}, {y}_{{\varvec{H}}_m}, {z}_{{\varvec{H}}_m})\) after lung segmentation. That is,

$$\begin{aligned} \begin{aligned} P(z_m, {loc}_m | \varvec{H}_m ; \varvec{\theta }) = \frac{2}{\sqrt{2 \pi {\sigma }^2}} \exp \big ( -\frac{|z_m - {z}_{\varvec{H}_m} |^2}{2 {\sigma }^2} \big ) \frac{\exp (\varvec{f}(\varvec{H}_m) \varvec{\theta }_{{loc}_m})}{\sum _{{{loc}_m}=1}^{6}\exp (\varvec{f}(\varvec{H}_m) \varvec{\theta }_{{loc}_m})}, \end{aligned} \end{aligned}$$
(3)

where \(\varvec{\theta }_{{loc}_m}\) are the weights associated with lobe location \({loc}_m\) in the logistic regression, the feature is \(\varvec{f}(\varvec{H}_m) = (\frac{{x}_{{\varvec{H}}_m}}{{x}_{\varvec{I}}}, \frac{{y}_{{\varvec{H}}_m}}{{y}_{\varvec{I}}}, \frac{{z}_{{\varvec{H}}_m}}{{z}_{\varvec{I}}})\), and \(({x}_{\varvec{I}}, {y}_{\varvec{I}}, {z}_{\varvec{I}})\) is the total size of image \(\varvec{I}\) after lung segmentation. In our experiments, we found that the logistic regression converges quickly and is stable.
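As a sketch of Eq. 3, the function below multiplies the half-Gaussian term for the central slice by the softmax term for the lobe location. The scale \(\sigma\) and the logistic-regression weights are assumed to be given, and the function name and signature are our own.

```python
import numpy as np

def weak_label_likelihood(proposal, z_m, loc_m, image_size, sigma, theta_loc):
    """P(z_m, loc_m | H_m) as in Eq. 3 (sketch).

    proposal   : (x, y, z, d) center and diameter of H_m
    z_m, loc_m : weak label (central slice, lobe index in 1..6)
    image_size : (x_I, y_I, z_I) of the lung-segmented volume
    sigma      : half-Gaussian scale, estimated empirically
    theta_loc  : (3, 6) logistic-regression weights, one column per lobe
    """
    x, y, z, _ = proposal
    # Half-Gaussian over the distance between the weak central slice and the proposal center slice.
    p_z = (2.0 / np.sqrt(2.0 * np.pi * sigma ** 2)) * np.exp(-(z_m - z) ** 2 / (2.0 * sigma ** 2))
    # Logistic regression on the relative position of the proposal center.
    f = np.array([x / image_size[0], y / image_size[1], z / image_size[2]])
    logits = f @ theta_loc                       # shape (6,)
    p_loc = np.exp(logits - logits.max())
    p_loc = p_loc / p_loc.sum()
    return p_z * p_loc[loc_m - 1]                # loc_m is 1-indexed
```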

Expectation-maximization (EM) is a commonly used approach to maximizing the log-likelihood when the model contains latent variables. We employ the EM algorithm to optimize the deep weakly supervised detection model in Eq. 2. The expected complete-data log-likelihood, given the previously estimated parameters \({\varvec{\theta }}^{\prime }\) of the deep 3D Faster R-CNN, is

$$\begin{aligned} \begin{aligned} Q(\varvec{\theta }; \varvec{\theta ^{\prime }}) =&\frac{1}{M}\sum _{m=1}^{M} \mathbb {E}_{P(\varvec{H}_m | \varvec{I}, z_m, {loc}_m; {\varvec{\theta }}^{\prime })} \big [ \log P(\varvec{H}_m|\varvec{I}; \varvec{\theta }) \\ {}&+ \log P(z_m, {loc}_m|{\varvec{H}}_m; \varvec{\theta }) \big ] + \mathbb {E}_{Q(\varvec{\bar{H}}_n | \varvec{z})}\big [ \log P(\varvec{\bar{H}}_n | \varvec{I}; \varvec{\theta }) \big ], \end{aligned} \end{aligned}$$
(4)

where \(\varvec{z} = \{z_1, z_2, \dots , z_M\}\). In the implementation, we only keep hard negative proposals far away from the weak annotation \(\varvec{z}\) to simplify \(Q(\varvec{\bar{H}}_n | \varvec{z})\). The posterior distribution of the latent variable \(\varvec{H}_m\) can be calculated as

$$\begin{aligned} \begin{aligned} P(\varvec{H}_m | \varvec{I}, z_m, {loc}_m; \varvec{{\theta }^{\prime }})&\propto P(\varvec{H}_m | \varvec{I}; \varvec{{\theta }^{\prime }}) P(z_m, {loc}_m | \varvec{H}_m; \varvec{{\theta }^{\prime }}). \end{aligned} \end{aligned}$$
(5)

Because Faster R-CNN yields a large number of proposals, we first apply a hard threshold (-3 before the sigmoid function) to remove proposals with low confidence, and then apply non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold of 0.1. We then employ two schemes to approximately infer the latent variable \(\varvec{H}_m\): maximum a posteriori (MAP) or sampling.
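A minimal sketch of this filtering step, assuming proposals are stored as (x, y, z, d) arrays with pre-sigmoid scores and that the 3D IoU is computed on the axis-aligned cubes implied by the diameters (the paper does not specify the exact IoU computation):

```python
import numpy as np

def cube_iou(a, b):
    """3D IoU of the axis-aligned cubes centered at (x, y, z) with side d."""
    lo = np.maximum(a[:3] - a[3] / 2, b[:3] - b[3] / 2)
    hi = np.minimum(a[:3] + a[3] / 2, b[:3] + b[3] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union

def filter_proposals(proposals, scores, score_thresh=-3.0, iou_thresh=0.1):
    """Drop low-confidence proposals (pre-sigmoid score below -3), then greedy NMS at IoU 0.1."""
    keep_mask = scores > score_thresh
    proposals, scores = proposals[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)                  # highest score first
    kept = []
    for i in order:
        if all(cube_iou(proposals[i], proposals[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return proposals[kept], scores[kept]
```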

DeepEM with MAP. We use only the proposal with the maximal posterior probability to calculate the expectation:

$$\begin{aligned} \hat{\varvec{H}}_m = {\arg \max }_{\varvec{H}_m} P(\varvec{H}_m | \varvec{I}; \varvec{{\theta }^{\prime }}) P(z_m, {loc}_m | \varvec{H}_m; \varvec{{\theta }^{\prime }}) \end{aligned}$$
(6)
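The MAP step of Eq. 6 then reduces to an argmax over the filtered proposals, as in the sketch below; detector_probs and weak_label_probs are assumed to have been computed already, e.g. with the likelihood sketch above.

```python
import numpy as np

def map_proposal(proposals, detector_probs, weak_label_probs):
    """Pick the proposal maximizing P(H_m | I; theta') * P(z_m, loc_m | H_m; theta'), as in Eq. 6."""
    posterior = np.asarray(detector_probs) * np.asarray(weak_label_probs)
    return proposals[int(np.argmax(posterior))]
```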

DeepEM with Sampling. We approximate the distribution by sampling \(\hat{M}\) proposals \(\hat{\varvec{H}}_m\) according to the normalized Eq. 5. The expected log-likelihood function in Eq. 4 becomes

$$\begin{aligned} \begin{aligned} Q(\varvec{\theta }; \varvec{\theta ^{\prime }}) =&\frac{1}{M \hat{M}}\sum _{m=1}^{M} \sum _{\hat{\varvec{H}}_m}^{\hat{M}} \big ( \log P(\hat{\varvec{H}}_m|\varvec{I}; \varvec{\theta }) + \log P(z_m, {loc}_m|{\hat{\varvec{H}}}_m; \varvec{\theta }) \big ) \\ {}&+ \mathbb {E}_{Q(\varvec{\bar{H}}_n | \varvec{z})}\big [ \log P(\varvec{\bar{H}}_n | \varvec{I}; \varvec{\theta }) \big ]. \end{aligned} \end{aligned}$$
(7)
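Sampling from the normalized posterior of Eq. 5 can be sketched as categorical sampling over the filtered proposals; the helper below is illustrative and draws two proposals per weak label by default, as in the experiments.

```python
import numpy as np

def sample_proposals(proposals, detector_probs, weak_label_probs, num_samples=2, rng=None):
    """Draw M-hat proposals from the normalized posterior P(H_m | I, z_m, loc_m; theta') (Eq. 5)."""
    rng = rng if rng is not None else np.random.default_rng()
    posterior = np.asarray(detector_probs) * np.asarray(weak_label_probs)
    posterior = posterior / posterior.sum()      # normalize Eq. 5 over the kept proposals
    idx = rng.choice(len(proposals), size=num_samples, p=posterior)
    return [proposals[i] for i in idx]
```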

After obtaining the expected complete-data log-likelihood in Eq. 4, we can update the parameters \(\varvec{\theta }\) by

$$\begin{aligned} \hat{\varvec{\theta }} = \arg \max Q(\varvec{\theta } ; {\varvec{\theta }}^{\prime }). \end{aligned}$$
(8)

The M-step in Eq. 8 can be carried out by the stochastic gradient descent commonly used for deep network optimization, as for Eq. 1. The entire procedure is outlined in Algorithm 1.

Algorithm 1
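Since the pseudocode listing is given as a figure in the paper, the outline below is only a hypothetical reconstruction of the EM alternation described in this section, reusing the helper sketches above; propose_fn and update_fn are stand-ins for the Faster R-CNN proposal and SGD update steps, not an actual API.

```python
import numpy as np

def deepem_train(propose_fn, update_fn, weak_scans, sigma, theta_loc,
                 num_epochs=1, mode="sampling"):
    """Hypothetical outline of DeepEM training on weakly labeled scans (not Algorithm 1 verbatim).

    propose_fn(image) -> (proposals, pre_sigmoid_scores)   # Faster R-CNN stand-in
    update_fn(image, pseudo_labels)                        # one SGD step (M-step)
    weak_scans : items exposing .image, .image_size, and .weak_labels (loc, z)
    """
    for _ in range(num_epochs):
        for scan in weak_scans:
            # E-step: generate and filter proposals with the current parameters theta'.
            proposals, scores = propose_fn(scan.image)
            proposals, scores = filter_proposals(proposals, scores)   # threshold + NMS, see above
            det_probs = 1.0 / (1.0 + np.exp(-scores))                 # P(H_m | I; theta')
            for weak in scan.weak_labels:
                weak_probs = [weak_label_likelihood(p, weak.z, weak.loc,
                                                    scan.image_size, sigma, theta_loc)
                              for p in proposals]
                if mode == "map":
                    pseudo = [map_proposal(proposals, det_probs, weak_probs)]     # Eq. 6
                else:
                    pseudo = sample_proposals(proposals, det_probs, weak_probs)   # Eq. 7
                # M-step: treat the inferred proposals as targets for an SGD update of the
                # detector (the update of the lobe logistic regression is omitted here).
                update_fn(scan.image, pseudo)
```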

3 Experiments

We used three datasets: the LUNA16 dataset [12] for fully supervised nodule detection, the NCI NLST dataset for weakly supervised detection, and the Tianchi Lung Nodule Detection dataset as a holdout set used only for testing. LUNA16 is the largest publicly available dataset for pulmonary nodule detection [12]. It excludes CTs with slice thickness greater than 3 mm, inconsistent slice spacing, or missing slices, and consists of 888 low-dose lung CTs with an explicit patient-level 10-fold cross-validation split. The NLST dataset consists of hundreds of thousands of lung CT images associated with electronic medical records (EMR). In this work, we focus on nodule detection from the image modality and only use the central slice and nodule location from the EMR as weak supervision. As part of data cleansing, we remove negative CTs, CTs with slice thickness greater than 3 mm, and nodules with diameter less than 3 mm. After data cleaning, 17,602 CTs with 30,951 weak annotations remain. Because of the large number of weakly supervised CTs, in each epoch we randomly sample \(\frac{1}{16}\) of the CT images for weakly supervised training. The Tianchi dataset contains 600 training and 200 validation low-dose lung CTs for nodule detection. The annotations are location centroids and diameters of the pulmonary nodules, and nodules with diameter less than 3 mm are excluded, consistent with LUNA16.

Parameter Estimation in \(P({z}_m | \varvec{H}_m; \varvec{\theta })\). If the current \(z_m\) is within the nodule, it is a true positive proposal. We can therefore model \(|z_m-z_{{\varvec{H}}_m}|\) with a half-Gaussian distribution, shown as the red dashed line in Fig. 2(a). The parameters of the half-Gaussian are estimated empirically from the LUNA16 data. Because LUNA16 removes nodules with diameter less than 3 mm, we use a truncated half-Gaussian over \(\max (|z_m-z_{{\varvec{H}}_m}|-\mu , 0)\) to model the central slice \(z_m\), where \(\mu \) is the mean of the underlying Gaussian, set to the minimal nodule radius of 1.63.
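The paper does not spell out the estimator for \(\sigma\); one plausible choice, shown below, is the half-normal maximum-likelihood estimate on the truncated distances, assuming the annotated LUNA16 nodule radii are available in slice units.

```python
import numpy as np

def estimate_half_gaussian_sigma(nodule_radii_slices, mu=1.63):
    """One plausible empirical estimate of sigma for the truncated half-Gaussian (our assumption).

    nodule_radii_slices : LUNA16 nodule radii expressed in slices
    mu                  : minimal nodule radius (1.63 slices), used as the truncation offset
    Uses the half-normal MLE sigma^2 = mean(x^2) on the truncated distances x = max(r - mu, 0).
    """
    x = np.maximum(np.asarray(nodule_radii_slices, dtype=float) - mu, 0.0)
    return float(np.sqrt(np.mean(x ** 2)))
```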

Fig. 2. (a) Empirical estimation of the half-Gaussian model for \(P({z}_m | \varvec{H}_m; \varvec{\theta })\) on LUNA16. (b) FROC (%) comparison among Faster R-CNN [14], DeepEM with MAP, and DeepEM with Sampling on LUNA16.

Performance Comparisons on LUNA16. We conduct 10-fold cross-validation on LUNA16 to validate the effectiveness of DeepEM. The baseline is Faster R-CNN with a 3D Res18 network, denoted as Faster R-CNN [14]. We then employ it to model \(P(\varvec{H}_m | \varvec{I}; \varvec{{\theta }^{\prime }})\) in the weakly supervised detection scenario. Two inference schemes for \({\varvec{H}}_m\) are used in DeepEM, denoted as DeepEM (MAP) and DeepEM (Sampling). In the proposal inference of DeepEM with Sampling, we sample two proposals for each weak label because the average number of nodules per CT on LUNA16 is 1.78. The evaluation metric, the free-response receiver operating characteristic (FROC), is the average recall rate at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan, which is the official evaluation metric for LUNA16 and Tianchi [12].
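For reference, the FROC score is simply the mean sensitivity over these seven operating points. The sketch below computes it from detections that have already been matched to ground-truth nodules; the matching step itself (e.g. a hit criterion based on nodule radius) is omitted.

```python
import numpy as np

def froc_score(confidences, is_true_positive, num_nodules, num_scans,
               fp_rates=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity at the given false-positive-per-scan rates (sketch).

    confidences      : detection scores pooled over all scans
    is_true_positive : boolean per detection, after matching to ground truth
    num_nodules      : total number of ground-truth nodules
    num_scans        : number of scans evaluated
    """
    order = np.argsort(-np.asarray(confidences))               # highest confidence first
    tp = np.cumsum(np.asarray(is_true_positive, dtype=float)[order])
    fp = np.cumsum(~np.asarray(is_true_positive, dtype=bool)[order])
    sensitivities = []
    for rate in fp_rates:
        allowed_fp = rate * num_scans
        idx = np.searchsorted(fp, allowed_fp, side="right") - 1  # last index with fp <= allowed_fp
        sensitivities.append(tp[idx] / num_nodules if idx >= 0 else 0.0)
    return float(np.mean(sensitivities))
```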

From Fig. 2(b), DeepEM with MAP improves FROC by about 1.3% over Faster R-CNN, and DeepEM with Sampling improves FROC by about 1.5% on average on LUNA16 when incorporating weakly labeled data from NLST. We hypothesize that DeepEM with Sampling outperforms DeepEM with MAP because MAP inference is greedy and can get stuck in a local minimum, whereas sampling allows DeepEM to escape such local minima during optimization.

Performance Comparisons on the Holdout Test Set from Tianchi. We employed a holdout test set from Tianchi to validate each model from the 10-fold cross-validation on LUNA16. The results are summarized in Table 1. DeepEM, by utilizing weakly supervised data, improves FROC by 3.9% on average over Faster R-CNN. The improvement on holdout test data validates DeepEM as an effective model for exploiting the potentially large amount of weak data in electronic medical records (EMR), which requires no further costly annotation by expert doctors and can be easily obtained from hospital associations (Fig. 3).

Table 1. FROC (%) comparisons among Faster R-CNN with 3D ResNet18 [14], DeepEM with MAP, DeepEM with Sampling on Tianchi.
Fig. 3. Detection visual comparison among Faster R-CNN [14], DeepEM with MAP, and DeepEM with Sampling on nodules randomly sampled from Tianchi. DeepEM provides more accurate detection (central slice, center, and diameter) than Faster R-CNN.

Visualizations. We compare Faster R-CNN with the proposed DeepEM visually in Fig. 3, on nodules randomly chosen from Tianchi. DeepEM yields more accurate nodule centers and tighter diameter estimates, which demonstrates that DeepEM improves the existing detector by exploiting weakly supervised data.

4 Conclusion

In this paper, we have focused on the problem of detecting pulmonary nodules from lung CT images, which has previously been formulated as a supervised learning problem requiring a large amount of training data with precisely labeled nodule locations and sizes. Here we propose a new framework, called DeepEM, for pulmonary nodule detection that takes advantage of abundantly available weakly labeled data extracted from EMRs. We treat each nodule proposal as a latent variable and infer the posterior probabilities of proposed nodules being true ones, conditioned on images and weak labels. The posterior probabilities are then fed to the nodule detection module for training. We use an EM algorithm to train the entire model end-to-end. Two schemes, maximum a posteriori (MAP) and sampling, are used to infer the proposals. Extensive experimental results demonstrate the effectiveness of DeepEM in improving current state-of-the-art nodule detection systems by utilizing readily available weakly supervised detection data. Although our method is built on the specific application of pulmonary nodule detection, the framework itself is fairly general and can be readily applied to other medical image deep learning applications to take advantage of weakly labeled data.