1 Introduction

Modern radiation therapy treatment planning relies on imaging modalities such as CT for tumor localization. For throat cancer, an additional kind of medical imaging, endoscopy, is also acquired at treatment planning time. Endoscopic videos provide direct optical visualization of the pharyngeal surface and capture information, such as a tumor’s texture and superficial (mucosal) spread, that is not available on CT due to CT’s relatively low contrast and resolution. However, the use of endoscopy for treatment planning is significantly limited by the fact that (1) the 2D frames of the endoscopic video do not explicitly provide 3D spatial information, such as the tumor’s 3D location; (2) reviewing the video is time-consuming; and (3) the optical views do not capture the full geometric conformation of the throat.

In this paper, we introduce a pipeline for reconstructing a 3D textured surface model of the throat, which we call an endoscopogram, from 2D video frames. The model provides (1) more complete 3D pharyngeal geometry; (2) efficient visualization; and (3) the opportunity to register endoscopy data with the CT, thereby enabling transfer of the tumor contours and texture into the CT space.

State-of-the-art monocular endoscopic reconstruction techniques have been applied in applications like colonoscopy inspection [1], laparoscopic surgery [2], and orthopedic surgery [3]. However, most existing methods cannot simultaneously deal with the following three challenges: (1) non-Lambertian surfaces; (2) non-rigid deformation of tissues across frames; and (3) poorly known shape or motion priors. Our proposed pipeline deals with these problems using (1) a Shape-from-Motion-and-Shading (SfMS) method [4] incorporating a new reflectance model for generating single-frame-based partial reconstructions; and (2) a novel geometry fusion algorithm for non-rigid fusion of multiple partial reconstructions. Since our pipeline does not assume any prior knowledge of the environment, motion, or shape, it can be readily generalized to other endoscopic applications beyond our nasopharyngoscopy reconstruction problem.

In this paper we focus on the geometry fusion step mentioned above. The challenge here is that all individual reconstructions are only partially overlapping due to the constantly changing camera viewpoint, may have missing data (holes) due to camera occlusion, and may be slightly deformed since the tissue may have deformed between 2D frame acquisitions. Our main contribution in this paper is the design of a novel groupwise surface registration algorithm that can deal with these limitations. An additional contribution is an outlier geometry trimming algorithm based on robust regression. We generate endoscopograms and validate our registration algorithm with data from synthetic CT surface deformations and endoscopic video of a rigid phantom and real patients.

2 Endoscopogram Reconstruction Pipeline

The input to our system (Fig. 1) is a video sequence of hundreds of consecutive frames \(\{\mathcal {F}_i | i=1...N\}\). The output is an endoscopogram, which is a textured 3D surface model derived from the input frames. We first generate for each frame \(\mathcal {F}_i\) a reconstruction \(\mathcal {R}_i\) by the SfMS method. We then fuse multiple single-frame reconstructions \(\{\mathcal {R}_i\}\) into a single geometry \(\mathcal {R}\). Finally, we texture \(\mathcal {R}\) by pulling color from the original frames \(\{\mathcal {F}_i\}\). We will focus on the geometry fusion step in Sect. 3 and briefly introduce the other techniques in the rest of this section.

Fig. 1.

The endoscopogram reconstruction pipeline.

Shape from Motion and Shading (SfMS). Our novel reconstruction method [4] has been shown to be effective for single-camera reconstruction of live endoscopy data. The method leverages sparse geometry information obtained by Structure-from-Motion (SfM), Shape-from-Shading (SfS) estimation, and a novel reflectance model to characterize non-Lambertian surfaces. In brief, it iteratively estimates the reflectance model parameters and an SfS reconstruction surface for each individual frame under sparse SfM constraints derived within a sliding time window. One drawback of this method is that large tissue deformation and lighting changes across frames can induce inconsistent individual SfS reconstructions. Nevertheless, our experiments show that this kind of error can be well compensated in the subsequent geometry fusion step. In the end, for each frame \(\mathcal {F}_i\), a reconstruction \(\mathcal {R}_i\) is produced as a triangle mesh and transformed into the world space using the camera position parameters estimated by SfM. Mesh faces that are nearly tangent to the camera viewing ray are removed because they correspond to occluded regions. As a result, the reconstructions \(\{\mathcal {R}_i\}\) have missing patches, differ in topology, and overlap each other only partially.

Texture Mapping. The goal of texture mapping is to assign a color to each vertex \(v^k\) (superscripts refer to vertex index) in the fused geometry \(\mathcal {R}\), which is estimated by the geometry fusion (Sect. 3) of all the registered individual frame surfaces \(\{\mathcal {R}'_i\}\). Our idea is to find a corresponding point of \(v^k\) in a registered surface \(\mathcal {R}'_i\) and to trace back its color in the corresponding frame \(\mathcal {F}_i\). Since \(v^k\) might have correspondences in multiple registered surfaces, we formulate this procedure as a labeling problem and optimize a Markov Random Field (MRF) energy function. In general, the objective function prefers pulling color from non-boundary nearby points in \(\{\mathcal {R}'_i\}\), while encouraging regional label consistency.
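As a rough illustration only (not the authors' exact MRF formulation), the labeling step can be sketched with iterated conditional modes (ICM): each fused vertex picks a source surface given a unary cost favoring nearby, non-boundary correspondences and a pairwise term encouraging neighboring vertices to agree. The boundary penalty of 10 and the smoothness weight are hypothetical choices.

```python
import numpy as np

def assign_texture_labels(dists, is_boundary, neighbors, smooth=1.0, n_iters=10):
    """ICM sketch of the MRF labeling: each fused vertex k picks a source
    surface i to pull color from.  dists[k, i] is the distance from vertex k
    to its correspondence on registered surface i, is_boundary[k, i] flags
    correspondences near a surface boundary, and neighbors[k] lists the mesh
    neighbors of vertex k.  Penalty and smoothness weights are hypothetical."""
    P, N = dists.shape
    unary = dists + 10.0 * is_boundary        # prefer nearby, non-boundary points
    labels = np.argmin(unary, axis=1)         # unary-only initialization
    for _ in range(n_iters):
        for k in range(P):
            cost = unary[k].copy()
            for j in neighbors[k]:            # encourage regional label consistency
                cost += smooth * (np.arange(N) != labels[j])
            labels[k] = int(np.argmin(cost))
    return labels
```

With strong enough smoothing, an isolated vertex that marginally prefers a different surface gets pulled to the label of its neighbors, which is the intended regional-consistency behavior.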

3 Geometry Fusion

This section presents the main methodological contributions of this paper: a novel groupwise surface registration algorithm based on N-body interaction, and an outlier-geometry trimming algorithm based on robust regression.

Related Work. Given the set of partial reconstructions \(\{\mathcal {R}_i\}\), our goal is to non-rigidly deform them into a consistent geometric configuration, thus compensating for tissue deformation and minimizing reconstruction inconsistency among different frames. Current groupwise surface registration methods often rely on having, or iteratively estimating, the mean geometry (template) [5]. However, in our situation, the topology changes and partially overlapping data render initial template geometry estimation almost impossible. Large missing patches also pose serious challenges to the currents metric [6] for surface comparison. Template-free methods have been studied for images [7], but it has not been shown that such methods generalize to surfaces. The joint spectral graph framework [8] can match a group of surfaces without estimating the mean, but such methods do not explicitly compute deformation fields for geometry fusion.

Zhao et al. [9] proposed a pairwise surface registration algorithm, Thin Shell Demons, that can handle topology change and missing data. We have extended this algorithm into our groupwise situation.

Thin Shell Demons. Thin Shell Demons is a physics-motivated method that uses geometric virtual forces and a thin shell model to estimate surface deformation. The so-called forces \(\{f\}\) between two surfaces \(\{\mathcal {R}_1,\mathcal {R}_2\}\) are vectors connecting automatically selected corresponding vertex pairs, i.e. \(\{f(v^k) = u^k-v^k \mid v^k \in \mathcal {R}_1, u^k \in \mathcal {R}_2\}\) (with some abuse of notation, we use k here to index correspondences). The algorithm regards the surfaces as elastic thin shells and produces a non-parametric deformation vector field \(\phi :\mathcal {R}_1\rightarrow \mathcal {R}_2\) by iteratively minimizing the energy function \(E(\phi )=\sum _{k=1}^{M} c(v^k)(\phi (v^k)-f(v^k))^2 +E_{shell}(\phi ).\) The first term penalizes inconsistency between the deformation vector and the force vector applied at a point, weighted by a confidence score c. The second term minimizes the thin shell deformation energy, defined as the integral of local bending and membrane energies:

$$\begin{aligned} E_{shell}(\phi ) = \int _\mathcal {R} \lambda _1 W(\sigma _{mem}(p)) + \lambda _2 W(\sigma _{bend}(p)) \, dp, \end{aligned}$$
(1)
$$\begin{aligned} W(\sigma ) = \frac{Y}{1-\tau ^2} \left( (1-\tau )\, \mathrm {tr}(\sigma ^2) + \tau \, \mathrm {tr}(\sigma )^2 \right) , \end{aligned}$$
(2)

where Y and \(\tau \) are the Young’s modulus and Poisson’s ratio of the shell. \(\sigma _{mem}\) is the tangential Cauchy-Green strain tensor characterizing local stretching. The bending strain tensor \(\sigma _{bend}\) characterizes local curvature change and is computed as the shape operator change.
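Equation (2) is straightforward to evaluate for a 2\(\times\)2 strain tensor; a minimal sketch, using the values \(Y=2\), \(\tau =0.05\) reported in Sect. 4 as defaults:

```python
import numpy as np

def elastic_energy_density(sigma, Y=2.0, tau=0.05):
    """Energy density W(sigma) of Eq. (2) for a 2x2 strain tensor sigma
    (membrane or bending), with Young's modulus Y and Poisson's ratio tau.
    Defaults follow the parameter settings reported in Sect. 4."""
    tr_sq = np.trace(sigma @ sigma)           # tr(sigma^2)
    sq_tr = np.trace(sigma) ** 2              # tr(sigma)^2
    return Y / (1.0 - tau**2) * ((1.0 - tau) * tr_sq + tau * sq_tr)
```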

3.1 N-Body Surface Registration

Our main observation is that the virtual force interaction is still valid among N partial shells even without the mean geometry. Thus, we propose a groupwise deformation scenario as an analog of the N-body problem: N surfaces deform under the influence of their mutual forces. This groupwise attraction bypasses the need for a target mean and still deforms all surfaces into a single geometric configuration. The deformation of each surface is independent and fully determined by the overall forces exerted on it; with the physical thin shell model, its deformation is topology-preserving and unaffected by its incompleteness. With this in mind, we must define (1) the mutual forces among N partial surfaces and (2) an evolution strategy for deforming the N surfaces.

Mutual Forces. In order to derive mutual forces, correspondences must be reliably computed among the N partial surfaces. It has been shown that by using the geometric descriptor proposed in [10], a set of correspondences can be effectively computed between partial surfaces. Additionally, in our application, each surface \(\mathcal {R}_i\) has an underlying texture image \(\mathcal {F}_i\). Thus, we also compute texture correspondences between two frames using standard computer vision techniques. To improve matching accuracy, we compute inlier SIFT correspondences only between frame pairs that are at most T seconds apart. Finally, these SIFT matchings can be directly transformed to 3D vertex correspondences via the SfMS reconstruction procedure.

In the end, any given vertex \(v^k_i \in \mathcal {R}_i\) will have \(M^k_i\) corresponding vertices in other surfaces \(\{\mathcal {R}_j | j \ne i \}\), given as vectors \(\{f^\beta (v^k_i) = u^\beta - v^k_i, \beta = 1...M^k_i\}\), where \(u^\beta \) is the \(\beta ^{th}\) correspondence of \(v^k_i\) in some other surface. These correspondences are associated with confidence scores \(\{c^\beta (v^k_i)\}\) defined by

$$\begin{aligned} c^\beta (v^k_i) = {\left\{ \begin{array}{ll} \delta (u^\beta ,v^k_i) &{} \text { if } \langle u^\beta ,v^k_i \rangle \text { is a geometric correspondence, }\\ \bar{c}&{} \text { if } \langle u^\beta ,v^k_i \rangle \text { is a texture correspondence, }\\ \end{array}\right. } \end{aligned}$$
(3)

where \(\delta \) is the geometric feature distance defined in [10]. Since we only consider inlier SIFT matchings using RANSAC, the confidence score for texture correspondences is a constant \(\bar{c}\). We then define the overall force exerted on \(v^k_i\) as the weighted average: \(\bar{f}(v^k_i) = \sum _{\beta =1}^{M^k_i} c^\beta (v^k_i) f^\beta (v^k_i) / \sum _{\beta =1}^{M^k_i} c^\beta (v^k_i).\)
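The confidence-weighted average force is simple to compute in practice; a minimal numpy sketch (the array layout is our assumption):

```python
import numpy as np

def mean_force(v, correspondences, confidences):
    """Confidence-weighted average force exerted on a vertex v (Sect. 3.1).
    correspondences: (M, 3) array of corresponding points u^beta on other
    surfaces; confidences: (M,) scores c^beta from Eq. (3)."""
    v = np.asarray(v, dtype=float)
    u = np.asarray(correspondences, dtype=float)
    c = np.asarray(confidences, dtype=float)
    forces = u - v                             # f^beta(v) = u^beta - v
    return (c[:, None] * forces).sum(axis=0) / c.sum()
```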

Deformation Strategy. With mutual forces defined, we can solve for the group deformation fields \(\{\phi _i\}\) by optimizing independently for each surface

$$\begin{aligned} E(\phi _i)=\sum _{k=1}^{M_i} c(v_i^k)(\phi _i(v_i^k)-\bar{f}(v_i^k))^2 +E_{shell}(\phi _i), \end{aligned}$$
(4)

where \(M_i\) is the number of vertices that have forces applied. A groupwise deformation scenario is then to evolve the N surfaces by iteratively estimating the mutual forces \(\{f\}\) and solving for the deformations \(\{\phi _i\}\). However, a potential hazard is that without a common target template, the N surfaces could oscillate, especially in the early stage when the force magnitudes are large and tend to make the deformation overshoot. We observe that the thin shell energy regularization weights \(\lambda _1,\lambda _2\) control the deformation flexibility; thus, to avoid oscillation, we design the strategy shown in Algorithm 1.

Algorithm 1.
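Algorithm 1 itself is not reproduced in this excerpt. Purely as a hypothetical illustration of the alternating evolution described above, the following sketch assumes the regularization weight starts large (a stiff shell that damps early, large forces) and is gradually relaxed; the actual schedule is whatever Algorithm 1 specifies. `compute_forces` and `deform` stand in for the mutual-force estimation and the minimization of Eq. (4).

```python
def groupwise_register(surfaces, compute_forces, deform, n_iters=10,
                       lam_init=10.0, decay=0.7):
    """Hypothetical sketch of the N-body evolution (NOT the authors' exact
    Algorithm 1): alternate force estimation and thin-shell deformation,
    starting stiff and relaxing the regularization to avoid oscillation.
    compute_forces(surfaces) -> per-surface force fields;
    deform(surface, forces, lam) -> deformed surface minimizing Eq. (4)."""
    lam = lam_init
    for _ in range(n_iters):
        forces = compute_forces(surfaces)      # mutual forces among all N surfaces
        surfaces = [deform(s, f, lam) for s, f in zip(surfaces, forces)]
        lam = max(lam * decay, 1.0)            # gradually allow more flexibility
    return surfaces
```

With stub 1D "surfaces" and a force pulling each toward the group mean, the iteration contracts all surfaces to a common configuration without any explicit template, which is the point of the N-body scenario.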

3.2 Outlier Geometry Trimming

The final step of geometry fusion is to estimate a single geometry \(\mathcal {R}\) from the registered surfaces \(\{\mathcal {R}'_i\}\) [11]. However, this fusion step can be seriously harmed by outlier geometry created by SfMS. Outlier geometries are local surface parts that are incorrectly estimated by SfMS under bad lighting conditions (insufficient lighting, saturation, or specularity) and that differ drastically from all other surfaces (Fig. 2a). These sub-surfaces do not correspond to any part of the other surfaces and are therefore carried over by the deformation process into \(\{\mathcal {R}'_i\}\).

Our observation is that outlier geometry changes a local surface’s topology (branching) and violates basic differential-geometry properties. The local surface around a point of a smooth 2-manifold can be approximately represented by a quadratic Monge patch \(h:U\rightarrow \mathbb {R} ^3\), where U is a 2D open set in the tangent plane and h is a quadratic height function. Our idea is that if we robustly fit a local quadratic surface at a branching location, the surface points on the wrong branch of the outlier geometry will be flagged as outliers (Fig. 2b).

Fig. 2.

(a) 5 overlaying registered surfaces, one of which (pink) has a piece of outlier geometry (circled) that does not correspond to anything else. (b) Robust quadratic fitting (red grid) to normalized \(\mathcal {N}(v^k)\). The outlier scores are indicated by the color. (c) Color-coded \(\mathcal {W}\) on \(\mathcal {L}\). (d) Fused surface after outlier geometry removal.

We define the 3D point cloud \(\mathcal {L}=\{v^1,...v^P\}\) of P points as the ensemble of all vertices in \(\{\mathcal {R}'_i\}\), \(\mathcal {N}(v^k)\) as the set of points in the neighborhood of \(v^k\) and \(\mathcal {W}\) as the set of outlier scores of \(\mathcal {L}\). For a given \(v^k\), we transform \(\mathcal {N}(v^k)\) by taking \(v^k\) as the center of origin and the normal direction of \(v^k\) as the z-axis. Then, we use Iteratively Reweighted Least Squares to fit a quadratic polynomial to the normalized \(\mathcal {N}(v^k)\) (Fig. 2b). The method produces outlier scores for each of the points in \(\mathcal {N}(v^k)\), which are then accumulated into \(\mathcal {W}\) (Fig. 2c). We repeat this robust regression process for all \(v^k\) in \(\mathcal {L}\). Finally, we remove the outlier branches by thresholding the accumulated scores \(\mathcal {W}\), and the remaining largest point cloud is used to produce the final single geometry \(\mathcal {R}\) [11] (Fig. 2d).
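A minimal sketch of the per-vertex robust fit, assuming the neighborhood \(\mathcal {N}(v^k)\) has already been translated so \(v^k\) is at the origin and rotated so its normal is the z-axis. The Huber-style IRLS weight function, its 1.345 tuning constant, and the MAD scale estimate are our assumptions; the paper does not specify the IRLS weighting.

```python
import numpy as np

def irls_quadratic_outlier_scores(pts, n_iters=20, eps=1e-6):
    """Robustly fit a quadratic Monge patch z = h(x, y) to a normalized
    neighborhood (rows of pts are (x, y, z), query vertex at the origin,
    normal along z) via IRLS with Huber-style weights (our assumption).
    Returns per-point outlier scores in [0, 1] (1 - final IRLS weight)."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    A = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    w = np.ones(len(pts))
    for _ in range(n_iters):
        sw = np.sqrt(w)[:, None]
        coef, *_ = np.linalg.lstsq(A * sw, z * sw[:, 0], rcond=None)  # weighted LS
        r = z - A @ coef                                   # residuals
        s = 1.4826 * np.median(np.abs(r)) + eps            # robust (MAD) scale
        w = np.minimum(1.0, 1.345 * s / (np.abs(r) + eps)) # Huber-style weights
    return 1.0 - w
```

On a neighborhood sampled from z = x\(^2\) with one point lifted far off the patch, the lifted point receives a score near 1 while the on-patch points score near 0, which is the branching behavior Fig. 2b illustrates.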

4 Results

We validate our groupwise registration algorithm by generating and evaluating endoscopograms from synthetic data, phantom data, and real patient endoscopic videos. We selected algorithm parameters by tuning on a test patient’s data (separate from the datasets presented here). We set the thin shell elastic parameters \(Y=2,\tau =0.05\), the energy weighting parameters \(\lambda _1=\lambda _2=1, \sigma = 0.95\), the frame interval \(T=0.5s\), and the texture confidence score \(\bar{c} = 1\).

Synthetic Data. We applied synthetic deformations to 6 patients’ head-and-neck CT surfaces. Each surface has 3500 vertices and a 2–3 cm cross-sectional diameter, covering from the pharynx down to the vocal cords. We created deformations typically seen in real data, such as stretching of the pharyngeal wall and bending of the epiglottis. For each patient we generated 20 partial surfaces by taking depth maps from different camera positions in the CT space. Only geometric correspondences were used in this test. We measured the registration error as the average Euclidean distance over all pairs of corresponding vertices after registration (Fig. 3). Our method significantly reduced the error and performed better than a spectral-graph-based method [10], which is another potential framework for matching partial surfaces without estimating the mean.
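The error metric above can be sketched as follows (the (K, 2, 3) array layout holding the two registered positions of each correspondence is our assumption):

```python
import numpy as np

def registration_error(corr_pairs):
    """Average Euclidean distance over corresponding vertex pairs, as used
    in the synthetic-data evaluation.  corr_pairs: (K, 2, 3) array where
    corr_pairs[k] holds the two registered positions of correspondence k."""
    d = np.linalg.norm(corr_pairs[:, 0] - corr_pairs[:, 1], axis=1)
    return d.mean()
```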

Fig. 3.

Left to right: error plot of synthetic data for 6 patients; a phantom endoscopic video frame; the fused geometry with color-coded deviation (in millimeters) from the ground truth CT.

Phantom Data. To test our method on real-world data in a controlled environment, we 3D-printed a static phantom model (Fig. 3) from one patient’s CT data and then collected endoscopic video and high-resolution CT of the model. We produced SfMS reconstructions for 600 frames of the video, from which 20 reconstructions were uniformly selected for geometry fusion (using more surfaces did not further increase accuracy but did increase computation time). The SfMS results were downsampled to \(\sim \)2500 vertices and rigidly aligned to the CT space. Since the phantom is rigid, the registration here serves to unify inconsistent SfMS estimates. No outlier geometry trimming was performed in this test. We define a vertex’s deviation as its distance to the nearest point on the CT surface. The average deviation over all vertices is 1.24 mm for the raw reconstructions and 0.94 mm for the fused geometry, which shows that the registration helps filter out inaccurate SfMS geometry estimates. Figure 3 shows that the fused geometry resembles the ground-truth CT surface except in the more distal region, where less data was available in the video.
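The deviation metric can be sketched as below (brute-force nearest-neighbor search for clarity; a k-d tree would be used at realistic mesh sizes, and the array layout is our assumption):

```python
import numpy as np

def mean_deviation(recon_pts, ct_pts):
    """Mean deviation of a reconstruction from the ground-truth CT surface:
    each reconstructed vertex's distance to its nearest CT point, averaged.
    recon_pts: (P, 3) array; ct_pts: (Q, 3) array."""
    diffs = recon_pts[:, None, :] - ct_pts[None, :, :]   # (P, Q, 3) pairwise
    d = np.linalg.norm(diffs, axis=2).min(axis=1)        # nearest-point distance
    return d.mean()
```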

Patient Data. We produced endoscopograms for 8 video sequences (300 frames per sequence) extracted from 4 patient endoscopies. Outlier geometry trimming was used since lighting conditions were often poor. We computed the overlap distance (OD) defined in [12], which measures the average surface deviation between all pairs of overlapping regions. The average OD of the 8 cases is \(1.6 \pm 0.13\) mm before registration, \(0.58 \pm 0.05\) mm after registration, and \(0.24 \pm 0.09\) mm after outlier geometry trimming. Figure 4 shows one of the cases.

Fig. 4.

OD plot on the point cloud of 20 surfaces. Left to right: before registration, after registration, after outlier geometry trimming, the final endoscopogram.

5 Conclusion

We have described a pipeline for producing an endoscopogram from a video sequence. We proposed a novel groupwise surface registration algorithm and an outlier-geometry trimming algorithm. We have demonstrated via synthetic and phantom tests that the N-body scenario is robust for registering partially overlapping surfaces with missing data. Finally, we produced endoscopograms for real patient endoscopic videos. A current limitation is that the video sequence can be at most 3–4 s long for robust SfM estimation. Future work includes fusing multiple endoscopograms from different video sequences.