1 Introduction

Accurate and fast detection of anatomical structures is a fundamental step for comprehensive medical image analysis [1, 2, 5]. In particular for automatic support of clinical image reading, where the field-of-view of the acquired CT scan is typically unknown, accurately detecting the visible landmarks and recognizing which structures are absent pose significant challenges. Addressing these challenges is essential to enable artificial intelligence to support and increase the efficiency of the clinical workflow from admission through diagnosis, clinical care and patient follow-up. In this context, state-of-the-art deep learning solutions based on hypothesis scanning [1] or end-to-end learning [5] typically threshold the detection confidence to handle cases of incomplete data, a heuristic that is suboptimal in terms of accuracy.

In this work we present a solution for robust anatomical landmark detection and recognition of missing structures using the capabilities of deep reinforcement learning (DRL) [4]. Inspired by the method introduced in [2], we choose to learn the process of finding an anatomical structure and use it as a natural mechanism to recognize its absence by signaling the divergence of search trajectories outside the image space. To increase the system robustness and avoid suboptimal local convergence, we propose to use scale-space theory [3] to enable the system to hierarchically exploit the image information. In addition, we ensure the spatial coherence of the detected anatomical structures using a robust statistical shape-model fitted with M-estimator sample consensus [7]. Based on the robust detections, we infer the vertical range of the body captured in the 3D-CT scan.

2 Background and Motivation

2.1 Challenges of 3D Landmark Detection in Incomplete Data

Deep scanning-based systems represent the main category of recent solutions [1]. Here the problem is reformulated as a patch-wise classification between positive \(\varvec{h} \in \mathbf {H}_+\) and negative hypotheses \(\varvec{h} \in \mathbf {H}_-\), sampled as volumetric boxes of image intensities. Alternatively, end-to-end deep learning systems based on fully convolutional architectures approach the problem by learning a direct mapping \(f(\mathbf {I}) = \mathbf {M}\) between the original image \(\mathbf {I}\) and a coded map \(\mathbf {M}\) highlighting the locations of anatomical landmarks [5]. However, for datasets of thousands of large-range 3D-CT scans at high spatial resolution (e.g. 2 mm or less), training such systems becomes infeasible due to the excessive memory requirements and the high computational complexity. In particular for incomplete data, all these systems share a common limitation: they rely on suboptimal and inaccurate heuristics such as probability thresholding to recognize whether an anatomical landmark is visible in the field-of-view of the 3D scan.

2.2 Learning to Search Using Deep Reinforcement Learning

A different perspective on the general problem of landmark detection in 3D data is presented in [2]. The task is reformulated as an intrinsic behavior learning problem which asks how to find a structure given image evidence \(\mathbf {I}:\mathbb {Z}^3 \rightarrow \mathbb {R}\). To model the system dynamics and enable the navigation in image space, a Markov Decision Process (MDP) [6] \(\mathcal {M} := \left( \mathcal {S}, \mathcal {A}, \mathcal {T}, \mathcal {R}, \gamma \right) \) is defined, where: \(\mathcal {S}\) represents a finite set of states over time with \(s_t \in \mathcal {S}\) being the state of the agent at time t, i.e., a constrained axis-aligned box of image intensities centered at position \(\varvec{p}_t\) in image space; \(\mathcal {A}\) represents a finite set of actions allowing the agent to navigate voxel-wise within the environment (\(\pm 1\) voxels in each direction); \(\mathcal {T}:\mathcal {S}\times \mathcal {A}\times \mathcal {S}\rightarrow [0,1]\) is a stochastic transition function, where \(\mathcal {T}_{s,a}^{s'}\) describes the probability of arriving in state \(s'\) after performing action a in state s; \(\mathcal {R}:\mathcal {S}\times \mathcal {A}\times \mathcal {S}\rightarrow \mathbb {R}\) is a scalar reward function to drive the behavior of the agent, where \(\mathcal {R}_{s,a}^{s'} = \Vert \varvec{p}_t - \varvec{p}_{GT}\Vert _2^2 - \Vert \varvec{p}_{t+1} - \varvec{p}_{GT}\Vert _2^2\) denotes the expected distance-based reward for transitioning from state s to state \(s'\), i.e., \(\varvec{p}_t\) to \(\varvec{p}_{t+1}\) while seeking the ground-truth position \(\varvec{p}_{GT}\) of the landmark; and \(\gamma \in (0, 1)\) is the discount-factor controlling future versus immediate rewards [2].
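As an illustration of the distance-based reward, the following minimal sketch evaluates \(\mathcal {R}_{s,a}^{s'}\) for the six voxel-wise actions. The names `ACTIONS`, `step` and `reward` are hypothetical helpers introduced here for illustration, and the positions are made-up values, not data from the experiments.

```python
# Six voxel-wise actions: a step of +1 or -1 voxel along each image axis (x, y, z).
ACTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def squared_distance(p, q):
    """Squared Euclidean distance between two voxel positions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def step(position, action):
    """Move the agent by one voxel according to the chosen action."""
    return tuple(c + d for c, d in zip(position, action))


def reward(p_t, p_next, p_gt):
    """Reward = decrease in squared distance to the ground-truth landmark."""
    return squared_distance(p_t, p_gt) - squared_distance(p_next, p_gt)


# Illustrative values only: agent three voxels away from the landmark along x.
p_gt = (10, 10, 10)
p_t = (7, 10, 10)
rewards = {a: reward(p_t, step(p_t, a), p_gt) for a in ACTIONS}
best_action = max(rewards, key=rewards.get)
print(best_action, rewards[best_action])  # (1, 0, 0) 5
```

Moving towards the landmark yields a positive reward, moving away a negative one, so a policy that maximizes the accumulated reward is driven towards \(\varvec{p}_{GT}\).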

In this context the optimal action-value function \(Q^*:\mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}\) is defined, which quantifies the maximum expected future reward of an optimal navigation policy \(\pi ^*\) starting in s with action a: \(Q^*(s,a) = \max _{\pi }\mathbb {E}\left[ R_t|s_t = s, a_t = a, \pi \right] \). A recursive formulation of this function based on the dynamic state-graph defines the so-called Bellman criterion [6]: \(Q^*(s,a) = \mathbb {E}_{s'}\left( r + \gamma \max _{a'}Q^*(s',a')\right) \). Using a deep neural network with parameters \(\theta \) to approximate this complex non-linear function \(Q^*(s,a) \approx Q(s,a;\theta )\), one can learn optimal trajectories in image-space that converge to the sought anatomical structures with maximum reward [2, 4, 8].

Learning the navigation policy replaces the need for exhaustive and suboptimal search strategies [1, 5]. More importantly, this formalism can elegantly address the question of missing structures with trajectories that leave the image space, a natural ability of the system in contrast to explicit post-processing heuristics. However, in the context of large incomplete volumetric 3D scans this approach suffers from several inherent limitations. The first is related to the spatial coverage of the acquired state descriptor \(s \in \mathcal {S}\). Acquiring limited local information improves the sampling efficiency at the risk of convergence to local optima. Conversely, extracting a very large context to represent the state poses significant computational challenges in 3D space. The approach therefore cannot properly exploit the image information at different scales. Secondly, the system fails to exploit the spatial distribution of the landmarks to further increase robustness.

3 Proposed Method

To address these limitations, we propose to use scale-space theory [3] and robust statistical shape modeling for multi-scale spatially-coherent landmark detection.

3.1 A Discrete Scale-Space Model

In general, the continuous scale-space of a 3D image signal \(\mathbf {I}:\mathbb {Z}^3 \rightarrow \mathbb {R}\) is defined as: \(L(x; t) = \sum _{\xi \in \mathbb {Z}^3} T(\xi ; t)\,\mathbf {I}(x - \xi )\), where \(t \in \mathbb {R_+}\) denotes the continuous scale-level, \(x \in \mathbb {Z}^3\), \(L(x; 0) = \mathbf {I}(x)\) and T defines a one-parameter kernel-family. The main properties of such a scale-space representation are the non-enhancement of local extrema and implicitly the causality of structure across scales [3]. These properties are essential for the robustness of a search process, starting from coarse to fine scale. We propose to use a discrete approximation of the continuous space L while best preserving these properties. We define this discrete space as:

$$\begin{aligned} L_d(t) = \varPsi _\rho (\sigma (t-1) *L_d(t-1)), \end{aligned}$$
(1)

where \(L_d(0) = \mathbf {I}\), \(t \in \mathbb {N}_0\) denotes the discrete scale-level, \(\sigma \) represents a scale-dependent smoothing function and \(\varPsi _\rho \) denotes a signal operator reducing the spatial resolution with factor \(\rho \) using down-sampling [3].
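The recursion in Eq. (1) can be sketched as a smooth-then-subsample pyramid. The sketch below is only one possible instance: the text does not specify the smoothing function \(\sigma \), so a separable binomial kernel [1, 2, 1]/4 with replicated borders is assumed here, and the function names are hypothetical.

```python
import numpy as np


def smooth_axis(volume, axis):
    """Apply the 1D binomial kernel [1, 2, 1] / 4 along one axis (borders replicated)."""
    pad = [(1, 1) if a == axis else (0, 0) for a in range(volume.ndim)]
    padded = np.pad(volume, pad, mode="edge")
    n = volume.shape[axis]
    lo = np.take(padded, np.arange(0, n), axis=axis)
    mid = np.take(padded, np.arange(1, n + 1), axis=axis)
    hi = np.take(padded, np.arange(2, n + 2), axis=axis)
    return 0.25 * lo + 0.5 * mid + 0.25 * hi


def smooth(volume):
    """Assumed stand-in for the smoothing function sigma: separable 3D binomial filter."""
    for axis in range(volume.ndim):
        volume = smooth_axis(volume, axis)
    return volume


def downsample(volume, rho=2):
    """Operator Psi_rho: keep every rho-th voxel along each axis."""
    return volume[::rho, ::rho, ::rho]


def build_scale_space(image, num_levels, rho=2):
    """Return [L_d(0), ..., L_d(num_levels - 1)] with L_d(0) equal to the input image."""
    levels = [np.asarray(image, dtype=np.float64)]
    for _ in range(1, num_levels):
        levels.append(downsample(smooth(levels[-1]), rho))
    return levels


# Illustrative input: a constant volume, so smoothing must leave the values unchanged.
image = np.ones((32, 32, 16))
pyramid = build_scale_space(image, num_levels=4)
shapes = [level.shape for level in pyramid]
print(shapes)  # [(32, 32, 16), (16, 16, 8), (8, 8, 4), (4, 4, 2)]
```

With \(\rho = 2\) each level halves the resolution along every axis, which is why a fixed-size state covers a correspondingly larger physical region on coarser levels.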

3.2 Learning Multi-scale Search Strategies

Assuming w.l.o.g. a discrete scale-space of M scale-levels with \(\rho = 2\), we propose a navigation model across scales — starting from the coarsest to the finest scale-level. For this we redefine the optimal action-value function \(Q^*\) by conditioning the state-representation s and model parameters \(\theta \) on the scale-space \(L_d\) and the current scale \(t \in [0,\cdots ,M-1]\): \(Q^*(s,a \mid L_d, t) \approx Q(s,a;\theta _t \mid L_d, t)\). This results in M independent navigation sub-models \(\mathbf {\Theta } = \left[ \theta _0, \theta _1,\cdots ,\theta _{M-1}\right] \), one for each scale-level. Each model is trained on each individual scale-level as proposed in [2], i.e., by optimizing the Bellman criterion on each level \(t < M\):

$$\begin{aligned} \hat{\theta }^{(i)}_t = \mathop {\text {arg min}}\limits _{\theta ^{(i)}_t}\mathbb {E}_{s,a,r,s'}\left[ \left( y - Q\left( s,a;\theta ^{(i)}_t\mid L_d, t\right) \right) ^2\right] , \end{aligned}$$
(2)

with \(i \in \mathbb {N}_0\) denoting the training iteration. The reference estimate y is determined using the update-delay [4] technique: \(y = r + \gamma \max _{a'} Q\left( s',a';\bar{\theta }^{(i)}_t\mid L_d, t\right) \), where \(\bar{\theta }^{(i)}_t := \theta ^{(j)}_t\) represents a copy of the model parameters from a previous training step \(j < i\). This significantly increases the training stability [2].
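To make the role of the delayed parameter copy concrete, the sketch below computes the reference estimate y and the squared error of Eq. (2) for a single transition. A toy linear Q-function stands in for the deep network, and all names and numbers are illustrative assumptions rather than the training code used in the experiments.

```python
import copy

GAMMA = 0.9  # discount factor, as reported in the experiments section


def q_values(state, theta):
    """Toy linear Q-function (stand-in for the CNN): one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, state)) for weights in theta]


def td_target(reward, next_state, theta_frozen):
    """Reference estimate y, evaluated with the frozen (delayed) parameter copy."""
    return reward + GAMMA * max(q_values(next_state, theta_frozen))


def squared_td_error(state, action, reward, next_state, theta, theta_frozen):
    """Single-sample version of the objective in Eq. (2)."""
    y = td_target(reward, next_state, theta_frozen)
    return (y - q_values(state, theta)[action]) ** 2


# Two actions, two-dimensional state: illustrative numbers only.
theta = [[0.5, 0.0], [0.0, 1.0]]
theta_frozen = copy.deepcopy(theta)  # copy taken at an earlier training step j < i
theta[1][1] = 2.0                    # online parameters keep changing; the copy does not

y = td_target(reward=1.0, next_state=[2.0, 3.0], theta_frozen=theta_frozen)
loss = squared_td_error([1.0, 1.0], 1, 1.0, [2.0, 3.0], theta, theta_frozen)
print(round(y, 6), round(loss, 6))  # 3.7 2.89
```

Because y depends only on the frozen copy, updates to the online parameters do not immediately shift the regression target, which is the intuition behind the improved stability.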

Fig. 1. Visualization of the complete system pipeline.

The detection workflow is defined as follows: the search starts in the image center at the coarsest scale-level \(M-1\). Upon convergence the scale-level is changed to \(M-2\) and the search is continued from the convergence-point found at \(M-1\). The same process is repeated on the following scales until convergence on the finest scale. We empirically observed that optimal trajectories converge on minimal (oscillatory) cycles. As such, we define the convergence-point as the center of gravity of this cycle. The search-model \(Q(\cdot ,\cdot ;\theta _{M-1}\mid L_d, M-1)\) is trained for global convergence while the models on any of the following scales \(t < M - 1\) are trained in a constrained range around the ground-truth. This range is robustly estimated from the accuracy upper-bound on the previous scale \(t + 1\). Note that the spatial coverage of a fixed-size state \(s \in \mathcal {S}\) increases exponentially with the scale. This multi-scale navigation model allows the system to effectively exploit the image information and increase the robustness of the search (see Fig. 1).
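The workflow above can be sketched as a loop over scale-levels with cycle detection. In this sketch the trained models are replaced by a hypothetical deterministic toy policy that steps towards a known target, positions are voxel indices that are multiplied by \(\rho = 2\) when moving to the next finer level (an assumed coordinate convention), and the start position and target are made-up values.

```python
RHO = 2  # resolution factor between neighbouring scale-levels


def run_until_cycle(policy, level, start, max_steps=1500):
    """Follow the policy until a position repeats; return the cycle's center of gravity."""
    trajectory = [start]
    first_visit = {start: 0}
    position = start
    for _ in range(max_steps):
        action = policy(level, position)
        position = tuple(c + d for c, d in zip(position, action))
        if position in first_visit:
            cycle = trajectory[first_visit[position]:]
            return tuple(sum(axis) / len(cycle) for axis in zip(*cycle))
        first_visit[position] = len(trajectory)
        trajectory.append(position)
    return tuple(float(c) for c in position)  # no cycle found within the step budget


def coarse_to_fine(policy, num_levels, start):
    """Search from the coarsest level (num_levels - 1) down to the finest level 0."""
    position = start
    for level in range(num_levels - 1, -1, -1):
        converged = run_until_cycle(policy, level, position)
        if level > 0:  # map the convergence point to the next finer voxel grid
            position = tuple(int(round(c * RHO)) for c in converged)
    return converged


def make_toy_policy(target_finest):
    """Hypothetical stand-in for argmax_a Q(s, a): step along the axis with largest error."""
    def policy(level, position):
        target = tuple(c // RHO ** level for c in target_finest)
        diffs = [t - p for t, p in zip(target, position)]
        axis = max(range(3), key=lambda i: abs(diffs[i]))
        direction = 1 if diffs[axis] >= 0 else -1
        return tuple(direction if i == axis else 0 for i in range(3))
    return policy


policy = make_toy_policy(target_finest=(40, 24, 16))
result = coarse_to_fine(policy, num_levels=4, start=(8, 8, 4))
print(result)  # (40.5, 24.0, 16.0)
```

The toy policy oscillates between the target voxel and its neighbour, so the reported convergence-point is the mean of that two-voxel cycle, within one voxel of the target.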

Missing Landmarks: We propose to explicitly train the global search model \(\theta _{M - 1}\) for missing landmarks to further improve the accuracy for such cases. Assuming the CT-scans are cut only horizontally, the system is trained to consistently reward trajectories that leave the image space through the correct volume border. This requires, for each missing landmark, a ground-truth annotation indicating whether it lies above or below the field of view.
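At test time, the exit border of a trajectory can be read off directly from the agent position. The following minimal sketch assumes that the z index grows towards the head and uses hypothetical label strings; neither convention is stated in the text.

```python
def classify_exit(position, volume_shape):
    """Classify an agent position relative to the volume (z assumed to grow towards the head)."""
    x, y, z = position
    nx, ny, nz = volume_shape
    if z < 0:
        return "missing_below"   # trajectory left through the bottom border
    if z >= nz:
        return "missing_above"   # trajectory left through the top border
    if 0 <= x < nx and 0 <= y < ny:
        return "inside"
    return "lateral_exit"        # not expected when scans are cut only horizontally


shape = (64, 64, 32)
print(classify_exit((10, 10, -1), shape))  # missing_below
print(classify_exit((10, 10, 32), shape))  # missing_above
print(classify_exit((10, 10, 5), shape))   # inside
```

A lateral exit is kept as a separate label because, under the horizontal-cut assumption, it would indicate a diverging trajectory rather than a missing landmark.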

3.3 Robust Spatially-Coherent Landmark Detection

To ensure the robust recognition of missing anatomical structures and outliers we propose to model the spatial distribution of the considered anatomical landmarks using robust statistical shape modeling. This step constrains the output of the global search model \(\theta _{M-1}\) (see the complete pipeline visualized in Fig. 1). Assuming a complete set of N anatomical landmarks, we normalize the distribution of these points on all complete training images to zero mean and unit variance. In this space, we model the distribution of each individual landmark \(i \in [0,\cdots ,N-1]\) as a multi-variate normal distribution \(\varvec{p}_i \sim \mathcal {N}(\varvec{\mu }_i, \varvec{\varSigma }_i)\), where \(\varvec{\mu }_i\) and \(\varvec{\varSigma }_i\) are estimated using maximum likelihood. This defines a mean shape-model for the landmark set, defined as \(\varvec{\mu } = \left[ \varvec{\mu }_0,\cdots , \varvec{\mu }_{N-1}\right] ^\top \). Given an unseen configuration of detected points at scale \(M-1\) as \(\tilde{\varvec{P}} = [\tilde{\varvec{p}}_0,\cdots ,\tilde{\varvec{p}}_{N-1}]^\top \), one can approximate \(\tilde{\varvec{P}}\) with a translated and isotropic-scaled version of the mean model using linear least squares as: \(\hat{\varvec{\omega }} = {\text {arg min}}_{\varvec{\omega } = [\varvec{t}, s]} \Vert \tilde{\varvec{P}} - \varvec{t} - s\varvec{\mu }\Vert _2^2 \). However, for the case of incomplete data the cardinality satisfies \(|\varvec{\tilde{P}}|\le N\). In addition, outliers can corrupt the data. To enable the robust fitting of the shape-model, we propose to use M-estimator sample consensus [7]. Based on random 3-point samples from the set of all triples \(I_3(\tilde{\varvec{P}})\) one can obtain the mean-model fit \(\hat{\varvec{\omega }} = [\varvec{t}, s]\). The target is to optimize the following cost function based on the redescending M-estimator [7] and implicitly maximize the cardinality of the consensus set \(\hat{\mathcal {S}}\):

$$\begin{aligned} \hat{\mathcal {S}} \leftarrow \mathop {\text {arg min}}\limits _{S \in I_3(\tilde{\varvec{P}})} \sum _{i=0}^{|\tilde{\varvec{P}}|} \min \left[ \frac{1}{Z_i} \left( \phi (\tilde{\varvec{p}}_i) - \varvec{\mu }_i\right) ^\top \varvec{\varSigma }_i^{-1} \left( \phi (\tilde{\varvec{p}}_i) - \varvec{\mu }_i\right) ,\,1\right] , \end{aligned}$$
(3)

where \(\phi (\varvec{x}) = \frac{\varvec{x}}{s} - \varvec{t}\) is a projector to normalized shape-space with the estimated fit \(\hat{\varvec{\omega }} = [\varvec{t}, s]\) on set S. The normalization coefficient \(Z_i \in \mathbb {R}_+\) defines an oriented ellipsoid, which determines the outlier-rejection criterion. We use the \(\chi _3^2\)-distribution to select \(Z_i\) such that less than 0.5% of the inlier-points are incorrectly rejected.
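The robust fit of Eq. (3) can be sketched as follows. Several choices here are assumptions made for illustration: all 3-point subsets are enumerated instead of sampled randomly (feasible for a handful of landmarks), detections are mapped to shape-space as \((\varvec{x} - \varvec{t})/s\), i.e., the inverse of the least-squares model \(\varvec{t} + s\varvec{\mu }\), which may differ from the parameterization of \(\phi \) used above, \(Z_i\) is set to the 0.995 quantile of the \(\chi _3^2\)-distribution (about 12.838) for every landmark, and the mean shape, covariances and detections are synthetic.

```python
import itertools
import numpy as np

CHI2_3_Q995 = 12.838  # 0.995 quantile of the chi-square distribution with 3 degrees of freedom


def fit_translation_scale(points, mean_shape):
    """Least-squares fit of points ~ t + s * mean_shape (translation t, isotropic scale s)."""
    p_bar, m_bar = points.mean(axis=0), mean_shape.mean(axis=0)
    pc, mc = points - p_bar, mean_shape - m_bar
    s = float((pc * mc).sum() / (mc * mc).sum())
    return p_bar - s * m_bar, s


def msac_cost(points, idx, t, s, mu, cov_inv, z=CHI2_3_Q995):
    """Truncated Mahalanobis cost of Eq. (3) and the corresponding inlier mask."""
    residuals = (points - t) / s - mu[idx]
    d2 = np.einsum("ij,ijk,ik->i", residuals, cov_inv[idx], residuals)
    per_point = np.minimum(d2 / z, 1.0)
    return per_point.sum(), per_point < 1.0


def msac_fit(detected, mu, cov_inv):
    """detected: dict {landmark index: 3D position} holding only the landmarks that were found."""
    idx = np.array(sorted(detected))
    points = np.array([detected[i] for i in idx], dtype=float)
    best = None
    for triple in itertools.combinations(range(len(idx)), 3):
        sel = list(triple)
        t, s = fit_translation_scale(points[sel], mu[idx[sel]])
        if not s > 0:  # skip degenerate or mirrored fits
            continue
        cost, inliers = msac_cost(points, idx, t, s, mu, cov_inv)
        if best is None or cost < best[0]:
            best = (cost, t, s, idx[inliers])
    return best


# Synthetic example: 8 landmarks in shape-space, 2 missing, 1 detection corrupted.
mu = np.array([[-1.0, -1.0, -1.0], [1.0, -1.0, -0.8], [-0.5, 0.2, -0.2], [0.5, 0.3, 0.0],
               [-0.2, 0.0, 0.6], [0.2, 0.1, 0.7], [0.0, -0.2, 0.9], [0.1, 0.0, 1.0]])
cov_inv = np.tile(100.0 * np.eye(3), (8, 1, 1))  # isotropic covariance 0.01 * I per landmark
t_true, s_true = np.array([100.0, 50.0, 200.0]), 80.0
detected = {i: t_true + s_true * mu[i] for i in range(6)}  # landmarks 6 and 7 are missing
detected[2] = detected[2] + np.array([0.0, 0.0, 60.0])     # landmark 2 is an outlier

cost, t_hat, s_hat, inlier_ids = msac_fit(detected, mu, cov_inv)
print(inlier_ids, round(s_hat, 3))
```

In this synthetic case the corrupted detection is rejected and the remaining five detections form the consensus set, from which the translation and scale are recovered.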

Detect Outliers and Reset: Enforcing spatial coherency not only corrects for diverging trajectories by re-initializing the search, but also significantly reduces the false-negative rate by correcting for border cases. These are landmarks very close to the border (< 2 cm), falsely labeled as missing at scale \(M-1\).

Scan-range estimation: The robust fitting of the shape-model also enables the estimation of the body-region captured in the scan. We propose to model this as a continuous range along the normalized z-axis, to ensure consistency among different patients. For a set of defined landmarks \(\varvec{P}\) in normalized shape-space, the point \(\varvec{p}_{min} = \min _{\varvec{p}_i \in \varvec{P}} [\varvec{p}_i^z]\) determines the \(0\%\)-point, while the point \(\varvec{p}_{max} = \max _{\varvec{p}_i \in \varvec{P}} [\varvec{p}_i^z]\) determines the \(100\%\)-point. Assume for a given set of landmarks \(\tilde{\varvec{P}}\) that the fitted robust subset is represented by \(\hat{\varvec{P}} \subseteq \tilde{\varvec{P}}\). Using our definition of range, the span of the point-set \(\hat{\varvec{P}}\) can be determined between 0%–100% in normalized shape-space. This also allows the linear extrapolation of the body-range outside the z-span of the point-set \(\hat{\varvec{P}}\) (more details follow in Sect. 4).
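The range definition reduces to a linear map along the normalized z-axis. The sketch below assumes that the scan borders are mapped to shape-space as \((z - t_z)/s\) with the fitted translation and scale, and uses made-up landmark levels and fit parameters; the function names are hypothetical.

```python
def to_percent(z_shape, z_zero, z_hundred):
    """Linear map from normalized z to body-range percent (extrapolates outside 0..100)."""
    return 100.0 * (z_shape - z_zero) / (z_hundred - z_zero)


def scan_range_percent(scan_z_bottom, scan_z_top, t_z, s, mean_shape_z):
    """Body-range covered by a scan, given the fitted translation t_z and scale s."""
    z_zero, z_hundred = min(mean_shape_z), max(mean_shape_z)
    bottom = to_percent((scan_z_bottom - t_z) / s, z_zero, z_hundred)
    top = to_percent((scan_z_top - t_z) / s, z_zero, z_hundred)
    return bottom, top


# Illustrative values: lowest landmark at z = -1.0 (0%), highest at z = 1.0 (100%).
mean_shape_z = [-1.0, -0.8, 0.0, 1.0]
bottom, top = scan_range_percent(152.0, 296.0, t_z=200.0, s=80.0, mean_shape_z=mean_shape_z)
print(round(bottom, 1), round(top, 1))  # 20.0 110.0
```

Values below 0% or above 100% arise naturally from the same linear map and correspond to the extrapolated body-range beyond the outermost landmarks.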

4 Experiments

Dataset: For evaluation we used a dataset of 2305 3D-CT volumes from over 850 patients. We determined a random split into 1887 training volumes and 418 test volumes, ensuring that all scans from each patient are either in the training or the test-set. We selected a set of 8 anatomical landmarks with annotations from medical experts (see Fig. 2). Each volume was annotated with the location of all visible landmarks. To allow the fitting of the shape-model, we selected scans with at least 4 annotations (this is not a limitation since our target for future work is to cover more than 100 landmarks). This resulted in a 70%–30% split of the annotations for each landmark into training and testing. Over the entire dataset the ratio of visible to missing landmarks was approximately as follows: 80%–20% for kidneys, 60%–40% for hip-bones, and 50%–50% for the rest. We refer to false-positive (FP) and false-negative (FN) rates to measure the accuracy in detecting landmarks or recognizing their absence from the scan.

Fig. 2. The anatomical landmarks used for evaluation. These are the front corner of the left (LHB) and right hip bones (RHB), the center of the left (LK) and right kidneys (RK), the bifurcation of the left common carotid artery (LCCA), brachiocephalic artery (BA) and left subclavian artery (LSA), and the bronchial bifurcation (BB).

System Training: A scale-space of 4 scales was defined at isotropic resolutions of 2 – 4 – 8 and 16 mm. For the kidney center the fine resolution was set to 4 mm, given the higher variability of the annotations. For each scale and landmark the network structure was the same: conv-layer (40 kernels: 5 \(\times \) 5 \(\times \) 5, ReLU), pooling (2 \(\times \) 2 \(\times \) 2), conv-layer (58 kernels: 3 \(\times \) 3 \(\times \) 3), pooling (2 \(\times \) 2 \(\times \) 2) and three fully-connected layers (512 \(\times \) 256 \(\times \) 6 units, ReLU). The compact model-size under 8 MB per scale-level enables efficient loading and evaluation. Also the meta-parameters for training were shared across scales and landmarks: training-iterations (750), episode-length (1500), replay-memory size (\(10^5\)), learning rate (\(0.25\,\times \,10^{-2}\)), batch-size (128) and discount-factor \(\gamma = 0.9\). The dimensionality of the state was also fixed across scales to 25 \(\times \) 25 \(\times \) 25 voxels. Recall that on all scales except \(M-1\) the training is performed in a constrained image range around the ground-truth \(\varvec{p}_{GT} \pm \varvec{r}\). Depending on scale and landmark: \(\varvec{r} \in [-12, 12]^3\) voxels. The training time for one landmark averages to 4 h on an Nvidia Titan X GPU. We train all models in a 16-GPU cluster in 2.5 h.

Robust Multi-scale Navigation: Given trained multi-scale models for each of the 8 landmarks: \(\mathbf {\Theta }_0,\cdots ,\mathbf {\Theta }_7\), the search starts on the coarsest scale in the center of the scan. Let \(\tilde{\varvec{P}}\) be the output of the navigation sub-models on the coarsest scale. Robust shape-model fitting was performed on \(\tilde{\varvec{P}}\) to eliminate outliers and correct for misaligned landmarks, yielding a robust set \(\hat{\varvec{P}}\). This reduced the FP and FN rates from around 2% to under 0.5%. Applying the training range \(\varvec{r}\) to bound the navigation on the following scales \([M-2,\cdots ,0]\), we empirically observed that the shape-constraint was preserved while the FP- and FN-rates were reduced to zero.

Table 1. Comparison with state-of-the-art deep learning [1]. Accuracy is in mm.

Result Comparison: In contrast to our method, the reference solution proposed in [1] uses a cascade of sparse deep neural networks to scan the complete image space. Missing structures are detected using a fixed cross-validated threshold on the hypothesis-probability. The operating point was selected to maintain a FP-rate of less than \(1.5\%\). Table 1 shows the obtained results. Our method significantly outperforms [1] in recognizing the presence/absence of structures (see FP and FN rates). In terms of accuracy, the improvement is statistically significant (paired t-Test p-value \(<10^{-4}\)), averaging 20–30% on all landmarks except the kidneys. The increased apparent performance on the kidney center of the method [1] might be explained by the high FN-rate as well as the robust candidate aggregation [1], which accounts for the high variability of the annotations. Please note: A comparison with the method [2] is not possible on this large volumetric data. Training the detector only on the finest scale, as proposed in [2], is only possible within a limited range around the ground-truth (e.g. \(\pm 15\) cm). This highlights the importance of using a scale-space model and robust shape M-estimation, which enable training in large-range incomplete 3D data.

Runtime: Learning the multi-scale search trajectory leads to real-time detection. With an average speed of 35–40 ms per landmark, our method is 15–20 times faster than MSDL [1] which achieved an average speed of around 0.8 s.

Body-region Estimation: We defined a continuous range-model based on the landmark set with the LHB corner at \(0\%\) and the LCCA bifurcation at \(100\%\). The levels of the remaining landmarks were determined in normalized shape-space using linear interpolation. Using the robust detections \(\hat{\varvec{P}}\) as reference range, we extrapolated the body-range above the LCCA or below the hip bones. Qualitative evaluation shows that the scan in Fig. 1 extends from 21.3% to 109.0%.

5 Conclusion

In conclusion, this paper presents an effective approach for multi-scale spatially coherent landmark detection for incomplete 3D-CT data. Learning multi-scale search trajectories and enforcing spatial constraints ensure high robustness and reduce the false-positive and false-negative rates to zero, significantly outperforming a state-of-the-art deep learning approach. Finally we demonstrate that the detected landmarks can be used to robustly estimate the body-range.

Disclaimer: This feature is based on research, and is not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.