Abstract
Endoscopy enables high-resolution visualization of tissue texture and is a critical step in many clinical workflows, including diagnosis and radiation therapy treatment planning for cancers in the nasopharynx. However, an endoscopic video does not provide explicit 3D spatial information, making it difficult to use in tumor localization, and it is inefficient to review. We introduce a pipeline for automatically reconstructing a textured 3D surface model, which we call an endoscopogram, from multiple 2D endoscopic video frames. Our pipeline first reconstructs a partial 3D surface model for each individual input 2D frame. In the next step (which is the focus of this paper), we generate a single high-quality 3D surface model using a groupwise registration approach that fuses multiple, partially overlapping, incomplete, and deformed surface models together. We generate endoscopograms from synthetic, phantom, and patient data and show that our registration approach can account for tissue deformations and reconstruction inconsistency across endoscopic video frames.
1 Introduction
Modern radiation therapy treatment planning relies on imaging modalities like CT for tumor localization. For throat cancer, an additional imaging procedure, endoscopy, is also performed at treatment planning time. Endoscopic videos provide direct optical visualization of the pharyngeal surface and reveal information, such as a tumor’s texture and superficial (mucosal) spread, that is not available on CT due to CT’s relatively low contrast and resolution. However, the use of endoscopy for treatment planning is significantly limited by the fact that (1) the 2D frames from the endoscopic video do not explicitly provide 3D spatial information, such as the tumor’s 3D location; (2) reviewing the video is time-consuming; and (3) the optical views do not provide the full geometric conformation of the throat.
In this paper, we introduce a pipeline for reconstructing a 3D textured surface model of the throat, which we call an endoscopogram, from 2D video frames. The model provides (1) more complete 3D pharyngeal geometry; (2) efficient visualization; and (3) the opportunity to register endoscopy data with the CT, thereby enabling transfer of the tumor contours and texture into the CT space.
State-of-the-art monocular endoscopic reconstruction techniques have been applied to colonoscopy inspection [1], laparoscopic surgery [2], and orthopedic surgery [3]. However, most existing methods cannot simultaneously deal with the following three challenges: (1) non-Lambertian surfaces; (2) non-rigid deformation of tissues across frames; and (3) poorly known shape or motion priors. Our proposed pipeline deals with these problems using (1) a Shape-from-Motion-and-Shading (SfMS) method [4] incorporating a new reflectance model for generating single-frame-based partial reconstructions; and (2) a novel geometry fusion algorithm for non-rigid fusion of multiple partial reconstructions. Since our pipeline does not assume any prior knowledge about the environment, motion, or shape, it can be readily generalized to other endoscopic applications in addition to our nasopharyngoscopy reconstruction problem.
In this paper we focus on the geometry fusion step mentioned above. The challenge here is that all individual reconstructions are only partially overlapping due to the constantly changing camera viewpoint, may have missing data (holes) due to camera occlusion, and may be slightly deformed since the tissue may have deformed between 2D frame acquisitions. Our main contribution in this paper is the design of a novel groupwise surface registration algorithm that can deal with these limitations. An additional contribution is an outlier geometry trimming algorithm based on robust regression. We generate endoscopograms and validate our registration algorithm with data from synthetic CT surface deformations and endoscopic video of a rigid phantom and real patients.
2 Endoscopogram Reconstruction Pipeline
The input to our system (Fig. 1) is a video sequence of hundreds of consecutive frames \(\{\mathcal {F}_i | i=1...N\}\). The output is an endoscopogram, which is a textured 3D surface model derived from the input frames. We first generate for each frame \(\mathcal {F}_i\) a reconstruction \(\mathcal {R}_i\) by the SfMS method. We then fuse multiple single-frame reconstructions \(\{\mathcal {R}_i\}\) into a single geometry \(\mathcal {R}\). Finally, we texture \(\mathcal {R}\) by pulling color from the original frames \(\{\mathcal {F}_i\}\). We will focus on the geometry fusion step in Sect. 3 and briefly introduce the other techniques in the rest of this section.
Shape from Motion and Shading (SfMS). Our novel reconstruction method [4] has been shown to be efficient in single-camera reconstruction of live endoscopy data. The method leverages sparse geometry information obtained by Structure-from-Motion (SfM), Shape-from-Shading (SfS) estimation, and a novel reflectance model to characterize non-Lambertian surfaces. In summary, it iteratively estimates the reflectance model parameters and an SfS reconstruction surface for each individual frame under sparse SfM constraints derived within a sliding time window. One drawback of this method is that large tissue deformation and lighting changes across frames can induce inconsistent individual SfS reconstructions. Nevertheless, our experiments show that this kind of error can be well compensated in the subsequent geometry fusion step. In the end, for each frame \(\mathcal {F}_i\), a reconstruction \(\mathcal {R}_i\) is produced as a triangle mesh and transformed into the world space using the camera position parameters estimated from SfM. Mesh faces that are nearly tangent to the camera viewing ray are removed because they correspond to occluded regions. As a result, the reconstructions \(\{\mathcal {R}_i\}\) have missing patches and different topology and are only partially overlapping with each other.
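The removal of faces that are nearly tangent to the viewing ray can be illustrated with a minimal sketch. The function name `filter_grazing_faces`, the use of the face centroid to define the viewing ray, and the threshold `min_abs_cos` are all assumptions made for illustration; the text does not state how the tangency test is implemented.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def filter_grazing_faces(vertices, faces, camera_center, min_abs_cos=0.2):
    """Keep triangles whose normal is not nearly perpendicular to the viewing ray.

    vertices: list of (x, y, z); faces: list of (i, j, k) vertex indices.
    min_abs_cos is a hypothetical threshold on |cos| of the angle between
    the face normal and the ray from the camera center to the face centroid.
    """
    kept = []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        normal = cross(sub(b, a), sub(c, a))
        centroid = tuple((a[d] + b[d] + c[d]) / 3.0 for d in range(3))
        ray = sub(centroid, camera_center)
        denom = norm(normal) * norm(ray)
        if denom == 0.0:
            continue  # degenerate triangle or centroid at the camera center
        if abs(dot(normal, ray)) / denom >= min_abs_cos:
            kept.append((i, j, k))
    return kept
```

A face seen head-on has a normal parallel to the ray (|cos| near 1) and is kept, while a face seen at a grazing angle (|cos| near 0) is dropped.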
Texture Mapping. The goal of texture mapping is to assign a color to each vertex \(v^k\) (superscripts refer to vertex index) in the fused geometry \(\mathcal {R}\), which is estimated by the geometry fusion (Sect. 3) of all the registered individual frame surfaces \(\{\mathcal {R}'_i\}\). Our idea is to find a corresponding point of \(v^k\) in a registered surface \(\mathcal {R}'_i\) and to trace back its color in the corresponding frame \(\mathcal {F}_i\). Since \(v^k\) might have correspondences in multiple registered surfaces, we formulate this procedure as a labeling problem and optimize a Markov Random Field (MRF) energy function. In general, the objective function prefers pulling color from non-boundary nearby points in \(\{\mathcal {R}'_i\}\), while encouraging regional label consistency.
3 Geometry Fusion
This section presents the main methodological contributions of this paper: a novel groupwise surface registration algorithm based on N-body interaction, and an outlier-geometry trimming algorithm based on robust regression.
Related Work. Given the set of partial reconstructions \(\{\mathcal {R}_i\}\), our goal is to non-rigidly deform them into a consistent geometric configuration, thus compensating for tissue deformation and minimizing reconstruction inconsistency among different frames. Current groupwise surface registration methods often rely on having or iteratively estimating the mean geometry (template) [5]. However, in our situation, the topology changes and partially overlapping data render initial template geometry estimation almost impossible. Large missing patches also pose serious challenges to the currents metric [6] for surface comparison. Template-free methods have been studied for images [7], but it has not been shown that such methods can be generalized to surfaces. The joint spectral graph framework [8] can match a group of surfaces without estimating the mean, but such methods do not explicitly compute deformation fields for geometry fusion.
Zhao et al. [9] proposed a pairwise surface registration algorithm, Thin Shell Demons, that can handle topology change and missing data. We extend this algorithm to the groupwise setting.
Thin Shell Demons. Thin Shell Demons is a physics-motivated method that uses geometric virtual forces and a thin shell model to estimate surface deformation. The so-called forces \(\{f\}\) between two surfaces \(\{\mathcal {R}_1,\mathcal {R}_2\}\) are vectors connecting automatically selected corresponding vertex pairs, i.e. \(\{f(v^k) = u^k-v^k \mid v^k \in \mathcal {R}_1, u^k \in \mathcal {R}_2\}\) (with some abuse of notation, we use k here to index correspondences). The algorithm regards the surfaces as elastic thin shells and produces a non-parametric deformation vector field \(\phi :\mathcal {R}_1\rightarrow \mathcal {R}_2\) by iteratively minimizing the energy function \(E(\phi )=\sum _{k=1}^M c(v^k)(\phi (v^k)-f(v^k))^2 +E_{shell}(\phi ).\) The first part penalizes inconsistency between the deformation vector and the force vector applied on a point and uses a confidence score c to weight the penalization. The second part minimizes the thin shell deformation energy, which is defined as the integral of local bending and membrane energy:
where Y and \(\tau \) are the Young’s modulus and Poisson’s ratio of the shell. \(\sigma _{mem}\) is the tangential Cauchy-Green strain tensor characterizing local stretching. The bending strain tensor \(\sigma _{bend}\) characterizes local curvature change and is computed as the shape operator change.
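For orientation, one common way to write such a thin shell energy in terms of the quantities named above is sketched here. The weights \(\lambda _1,\lambda _2\) are taken to be the regularization weights referred to in Sect. 3.1, and the strain energy density W is the standard isotropic linear-elastic form; both choices are assumptions and may differ from the exact expression used by the authors.

```latex
E_{shell}(\phi) = \lambda_1 \int_{\mathcal{R}} W\big(\sigma_{mem}(p)\big)\,dp
                + \lambda_2 \int_{\mathcal{R}} W\big(\sigma_{bend}(p)\big)\,dp,
\qquad
W(\sigma) = \frac{Y}{1-\tau^2}\Big((1-\tau)\,\mathrm{tr}(\sigma^2) + \tau\,\mathrm{tr}(\sigma)^2\Big)
```

Under this form, increasing \(\lambda _1\) stiffens the shell against stretching and increasing \(\lambda _2\) stiffens it against bending.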
3.1 N-Body Surface Registration
Our main observation is that the virtual force interaction is still valid among N partial shells even without the mean geometry. Thus, we propose a groupwise deformation scenario as an analog to the N-body problem: N surfaces are deformed under the influence of their mutual forces. This groupwise attraction can bypass the need for a target mean and still deform all surfaces into a single geometric configuration. The deformation of a single surface is independent and fully determined by the overall forces exerted on it. With the physical thin shell model, its deformation can be topology-preserving and not influenced by its incompleteness. With this notion in mind, we now have to define (1) mutual forces among N partial surfaces; (2) an evolution strategy to deform the N surfaces.
Mutual Forces. In order to derive mutual forces, correspondences must be reliably computed among the N partial surfaces. It has been shown that by using the geometric descriptor proposed in [10], a set of correspondences can be effectively computed between partial surfaces. Additionally, in our application, each surface \(\mathcal {R}_i\) has an underlying texture image \(\mathcal {F}_i\). Thus, we also compute texture correspondences between two frames by using standard computer vision techniques. To improve matching accuracy, we compute inlier SIFT correspondences only between frame pairs that are at most T seconds apart. Finally, these SIFT matches can be directly transformed to 3D vertex correspondences via the SfMS reconstruction procedure.
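The restriction to frame pairs at most T seconds apart can be sketched as a simple pair enumeration. The helper name `frame_pairs_within` and the representation of frames by a sorted list of timestamps in seconds are assumptions for illustration; the SIFT matching and RANSAC filtering themselves are not shown.

```python
def frame_pairs_within(timestamps, max_gap):
    """Return index pairs (i, j), i < j, whose timestamps differ by at most max_gap.

    Assumes timestamps are given in seconds and sorted in increasing order,
    so the inner loop can stop at the first frame that is too far away.
    """
    pairs = []
    for i in range(len(timestamps)):
        for j in range(i + 1, len(timestamps)):
            if timestamps[j] - timestamps[i] > max_gap:
                break
            pairs.append((i, j))
    return pairs
```

Only the returned pairs would then be passed to texture matching, which keeps the number of image comparisons roughly linear in the number of frames for a fixed T.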
In the end, any given vertex \(v^k_i \in \mathcal {R}_i\) will have \(M^k_i\) corresponding vertices in other surfaces \(\{\mathcal {R}_j | j \ne i \}\), given as vectors \(\{f^\beta (v^k_i) = u^\beta - v^k_i, \beta = 1...M^k_i\}\), where \(u^\beta \) is the \(\beta ^{th}\) correspondence of \(v^k_i\) in some other surface. These correspondences are associated with confidence scores \(\{c^\beta (v^k_i)\}\) defined by
where \(\delta \) is the geometric feature distance defined in [10]. Since we only consider inlier SIFT matches selected by RANSAC, the confidence score for texture correspondences is a constant \(\bar{c}\). We then define the overall force exerted on \(v^k_i\) as the weighted average: \(\bar{f}(v^k_i) = \sum _{\beta =1}^{M^k_i} c^\beta (v^k_i) f^\beta (v^k_i) / \sum _{\beta =1}^{M^k_i} c^\beta (v^k_i).\)
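The weighted-average force can be written out directly. In the sketch below, `overall_force` implements the formula for \(\bar{f}(v^k_i)\) as stated; `geometric_confidence` is a hypothetical placeholder that maps a feature distance to a score with a Gaussian-style falloff, and is an assumption rather than the definition used in the paper.

```python
import math

def geometric_confidence(feature_distance):
    # Hypothetical mapping from feature distance to confidence (assumption):
    # identical features score 1, dissimilar features decay toward 0.
    return math.exp(-feature_distance ** 2)

def overall_force(v, correspondences):
    """Confidence-weighted average of the force vectors u - v.

    v: (x, y, z) vertex; correspondences: list of (u, c) with u a matched
    3D point on another surface and c its confidence score.
    Returns None when the vertex has no usable correspondence.
    """
    total = sum(c for _, c in correspondences)
    if total == 0:
        return None
    return tuple(
        sum(c * (u[d] - v[d]) for u, c in correspondences) / total
        for d in range(3)
    )
```

For example, a vertex at the origin matched to (1, 0, 0) with confidence 1 and to (0, 2, 0) with confidence 3 receives the force (0.25, 1.5, 0.0), so the more trusted match dominates.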
Deformation Strategy. With mutual forces defined, we can solve for the group deformation fields \(\{\phi _i\}\) by optimizing independently for each surface
where \(M_i\) is the number of vertices that have forces applied. Then, a groupwise deformation scenario is to evolve the N surfaces by iteratively estimating the mutual forces \(\{f\}\) and solving for the deformations \(\{\phi _i\}\). However, a potential hazard of our algorithm is that without a common target template, the N surfaces could oscillate, especially in the early stage when the force magnitudes are large and tend to overshoot the deformation. To this end, we observe that the thin shell energy regularization weights \(\lambda _1,\lambda _2\) control the deformation flexibility. Thus, to avoid oscillation, we design the strategy shown in Algorithm 1.
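By analogy with the pairwise energy above, each surface presumably minimizes \(E(\phi _i)=\sum _{k=1}^{M_i} c(v^k_i)(\phi _i(v^k_i)-\bar{f}(v^k_i))^2 + E_{shell}(\phi _i)\). One plausible reading of the anti-oscillation strategy is to start with large regularization weights and relax them by a constant factor per iteration. The toy sketch below illustrates that idea only: it replaces the thin shell energy with a simple penalty \(\lambda |\phi |^2\) on the displacement, assumes vertex k corresponds across all surfaces with unit confidence, and uses a decay factor named `sigma` on the assumption that the \(\sigma = 0.95\) listed with the weights in Sect. 4 plays this role.

```python
import math

def mean_forces(surfaces):
    """Mean force on each vertex from the matching vertex on every other surface."""
    n = len(surfaces)
    forces = []
    for i, surf in enumerate(surfaces):
        f_i = []
        for k, v in enumerate(surf):
            f = [0.0, 0.0, 0.0]
            for j, other in enumerate(surfaces):
                if j != i:
                    for d in range(3):
                        f[d] += (other[k][d] - v[d]) / (n - 1)
            f_i.append(f)
        forces.append(f_i)
    return forces

def register_group(surfaces, lam=4.0, sigma=0.95, iterations=30):
    """Deform all surfaces simultaneously under mutual forces with decaying stiffness."""
    surfaces = [[list(v) for v in s] for s in surfaces]
    for _ in range(iterations):
        forces = mean_forces(surfaces)  # forces are computed before any surface moves
        step = 1.0 / (1.0 + lam)        # minimizer of |phi - f|^2 + lam * |phi|^2
        for surf, f_i in zip(surfaces, forces):
            for v, f in zip(surf, f_i):
                for d in range(3):
                    v[d] += step * f[d]
        lam *= sigma                    # relax the regularization
    return surfaces

def spread(surfaces):
    """Largest distance between corresponding vertices of any two surfaces."""
    worst = 0.0
    for i in range(len(surfaces)):
        for j in range(i + 1, len(surfaces)):
            for a, b in zip(surfaces[i], surfaces[j]):
                worst = max(worst, math.dist(a, b))
    return worst
```

In this toy setting, two surfaces with no regularization (`lam=0.0`, `sigma=1.0`) simply swap positions at every iteration and never agree, whereas a stiff start shortens each step and lets the group contract toward a common configuration.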
3.2 Outlier Geometry Trimming
The final step of geometry fusion is to estimate a single geometry \(\mathcal {R}\) from the registered surfaces \(\{\mathcal {R}'_i\}\) [11]. However, this fusion step can be seriously harmed by the outlier geometry created by SfMS. Outlier geometries are local surface parts that are wrongfully estimated by SfMS under bad lighting conditions (insufficient lighting, saturation, or specularity) and are drastically different from all other surfaces (Fig. 2a). The sub-surfaces do not correspond to any part in other surfaces and thereby are carried over by the deformation process to \(\{\mathcal {R}'_i\}\).
Our observation is that outlier geometry changes a local surface’s topology (branching) and violates many differential geometry properties. We know that the local surface around a point in a smooth 2-manifold can be approximately represented by a quadratic Monge patch \(h:U\rightarrow \mathbb {R} ^3\), where U defines a 2D open set in the tangent plane, and h is a quadratic height function. Our idea is that if we robustly fit a local quadratic surface at a branching place, the surface points on the wrong branch of outlier geometry will be counted as outliers (Fig. 2b).
We define the 3D point cloud \(\mathcal {L}=\{v^1,...v^P\}\) of P points as the ensemble of all vertices in \(\{\mathcal {R}'_i\}\), \(\mathcal {N}(v^k)\) as the set of points in the neighborhood of \(v^k\) and \(\mathcal {W}\) as the set of outlier scores of \(\mathcal {L}\). For a given \(v^k\), we transform \(\mathcal {N}(v^k)\) by taking \(v^k\) as the origin and the normal direction of \(v^k\) as the z-axis. Then, we use Iteratively Reweighted Least Squares to fit a quadratic polynomial to the normalized \(\mathcal {N}(v^k)\) (Fig. 2b). The method produces outlier scores for each of the points in \(\mathcal {N}(v^k)\), which are then accumulated into \(\mathcal {W}\) (Fig. 2c). We repeat this robust regression process for all \(v^k\) in \(\mathcal {L}\). Finally, we remove the outlier branches by thresholding the accumulated scores \(\mathcal {W}\), and the remaining largest point cloud is used to produce the final single geometry \(\mathcal {R}\) [11] (Fig. 2d).
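The robust local fit for a single neighborhood can be sketched as follows. The sketch assumes the neighbors have already been transformed into the local frame (vertex at the origin, normal along z) and fits the pure quadratic height function \(h(x,y)=ax^2+bxy+cy^2\). The Tukey bisquare weight, the MAD-based scale, and the definition of the outlier score as one minus the final weight are illustrative choices, not details given in the text.

```python
import statistics

def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def irls_quadratic_scores(points, iterations=10, c=4.685):
    """Fit z = a*x^2 + b*x*y + c*y^2 by IRLS; return (coeffs, outlier scores in [0, 1]).

    points: neighbors (x, y, z) already expressed in the local frame.
    """
    basis = [(x * x, x * y, y * y) for x, y, _ in points]
    heights = [z for _, _, z in points]
    weights = [1.0] * len(points)
    coeffs = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        # Weighted normal equations (A^T W A) q = A^T W z.
        ata = [[sum(w * t[r] * t[s] for w, t in zip(weights, basis))
                for s in range(3)] for r in range(3)]
        atz = [sum(w * t[r] * z for w, t, z in zip(weights, basis, heights))
               for r in range(3)]
        coeffs = solve(ata, atz)
        residuals = [z - sum(q * t_d for q, t_d in zip(coeffs, t))
                     for t, z in zip(basis, heights)]
        # Robust scale from the median absolute residual (small offset avoids 0).
        scale = 1.4826 * statistics.median(abs(r) for r in residuals) + 1e-12
        # Tukey bisquare weights: 0 for residuals beyond c * scale.
        weights = [max(0.0, 1.0 - (r / (c * scale)) ** 2) ** 2 for r in residuals]
    return coeffs, [1.0 - w for w in weights]
```

Running this for every vertex and summing the returned scores per point would give an accumulated score set comparable to \(\mathcal {W}\), which can then be thresholded.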
4 Results
We validate our groupwise registration algorithm by generating and evaluating endoscopograms from synthetic data, phantom data, and real patient endoscopic videos. We selected algorithm parameters by tuning on a test patient’s data (separate from the datasets presented here). We set the thin shell elastic parameters \(Y=2,\tau =0.05\), the energy weighting parameters \(\lambda _1=\lambda _2=1, \sigma = 0.95\), the frame interval \(T=0.5s\), and the texture confidence score \(\bar{c} = 1\).
Synthetic Data. We produced synthetic deformations to 6 patients’ head-and-neck CT surfaces. Each surface has 3500 vertices and a 2–3 cm cross-sectional diameter, covering from the pharynx down to the vocal cords. We created deformations typically seen in real data, such as the stretching of the pharyngeal wall and the bending of the epiglottis. We generated for each patient 20 partial surfaces by taking depth maps from different camera positions in the CT space. Only geometric correspondences were used in this test. We measured the registration error as the average Euclidean distance of all pairs of corresponding vertices after registration (Fig. 3). Our method significantly reduced error and performed better than a spectral-graph-based method [10], which is another potential framework for matching partial surfaces without estimating the mean.
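The registration error used here is a plain average over corresponding vertex pairs, which can be sketched as below. The function name `registration_error` and the representation of correspondences as a list of point pairs are assumptions for illustration.

```python
import math

def registration_error(pairs):
    """Average Euclidean distance over corresponding vertex pairs.

    pairs: non-empty list of (p, q), where p and q are the 3D positions
    of two corresponding vertices after registration.
    """
    distances = [math.dist(p, q) for p, q in pairs]
    return sum(distances) / len(distances)
```

With synthetic data the correspondences are known by construction, so this value directly measures how far the registered surfaces are from agreeing.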
Phantom Data. To test our method on real-world data in a controlled environment, we 3D-printed a static phantom model (Fig. 3) from one patient’s CT data and then collected endoscopic video and high-resolution CT for the model. We produced SfMS reconstructions for 600 frames in the video, among which 20 reconstructions were uniformly selected for geometry fusion (using more surfaces for geometry fusion does not further increase accuracy and is slower to compute). The SfMS results were downsampled to \(\sim \)2500 vertices and rigidly aligned to the CT space. Since the phantom is rigid, the registration plays the role of unifying inconsistent SfMS estimation. No outlier geometry trimming was performed in this test. We define a vertex’s deviation as its distance to the nearest point in the CT surface. The average deviation of all vertices is 1.24 mm for the raw reconstructions and 0.94 mm for the fused geometry, which shows that the registration can help filter out inaccurate SfMS geometry estimation. Figure 3 shows that the fused geometry resembles the ground truth CT surface except in the farther part, where less data was available in the video.
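The deviation measure can be approximated with a brute-force nearest-point search. This sketch measures distance to the nearest CT vertex rather than to the nearest point on the continuous CT surface, which is a simplifying assumption that slightly overestimates the deviation; the name `average_deviation` is hypothetical.

```python
import math

def average_deviation(vertices, ct_points):
    """Mean distance from each vertex to its nearest CT point (brute force).

    vertices, ct_points: non-empty lists of (x, y, z) in the same space.
    """
    return sum(
        min(math.dist(v, p) for p in ct_points) for v in vertices
    ) / len(vertices)
```

For meshes of a few thousand vertices this quadratic search is still practical; a spatial index would be the usual replacement for larger data.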
Patient Data. We produced endoscopograms for 8 video sequences (300 frames per sequence) extracted from 4 patient endoscopies. Outlier geometry trimming was used since lighting conditions were often poor. We computed the overlap distance (OD) defined in [12], which measures the average surface deviation between all pairs of overlapping regions. The average OD of the 8 cases is \(1.6 \pm 0.13\) mm before registration, \(0.58 \pm 0.05\) mm after registration, and \(0.24 \pm 0.09\) mm after outlier geometry trimming. Figure 4 shows one of the cases.
5 Conclusion
We have described a pipeline for producing an endoscopogram from a video sequence. We proposed a novel groupwise surface registration algorithm and an outlier-geometry trimming algorithm. We have demonstrated via synthetic and phantom tests that the N-body scenario is robust for registering partially overlapping surfaces with missing data. Finally, we produced endoscopograms for real patient endoscopic videos. A current limitation is that the video sequence is at most 3–4 s long for robust SfM estimation. Future work involves fusing multiple endoscopograms from different video sequences.
References
1. Hong, D., Tavanapong, W., Wong, J., Oh, J., de Groen, P.C.: 3D reconstruction of virtual colon structures from colonoscopy images. Comput. Med. Imaging Graph. 38(1), 22–23 (2014)
2. Maier-Hein, L., Mountney, P., Bartoli, A., Elhawary, H., Elson, D., Groch, A., Kolb, A., Rodrigues, M., Sorger, J., Speidel, S., Stoyanov, D.: Optical techniques for 3D surface reconstruction in computer-assisted laparoscopic surgery. Med. Image Anal. 17(8), 974–996 (2013)
3. Wu, C., Narasimhan, S.G., Jaramaz, B.: A multi-image shape-from-shading framework for near-lighting perspective endoscopes. Int. J. Comput. Vis. 86(2), 211–228 (2010)
4. Price, T., Zhao, Q., Rosenman, J., Pizer, S., Frahm, J.M.: Shape from motion and shading in uncontrolled environments. Under submission, To appear. http://midag.cs.unc.edu/
5. Durrleman, S., Prastawa, M., Korenberg, J.R., Joshi, S., Trouvé, A., Gerig, G.: Topology preserving atlas construction from shape data without correspondence using sparse parameters. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012, Part III. LNCS, vol. 7512, pp. 223–230. Springer, Heidelberg (2012)
6. Durrleman, S., Pennec, X., Trouvé, A., Ayache, N.: Statistical models of sets of curves and surfaces based on currents. Med. Image Anal. 13(5), 793–808 (2009)
7. Balci, S.K., Golland, P., Shenton, M., Wells, W.M.: Free-form B-spline deformation model for groupwise registration. In: MICCAI, pp. 23–30 (2007)
8. Arslan, S., Parisot, S., Rueckert, D.: Joint spectral decomposition for the parcellation of the human cerebral cortex using resting-state fMRI. In: Ourselin, S., Alexander, D.C., Westin, C.-F., Cardoso, M.J. (eds.) IPMI 2015. LNCS, vol. 9123, pp. 85–97. Springer, Heidelberg (2015)
9. Zhao, Q., Price, J.T., Pizer, S., Niethammer, M., Alterovitz, R., Rosenman, J.: Surface registration in the presence of topology changes and missing patches. In: Medical Image Understanding and Analysis, pp. 8–13 (2015)
10. Zhao, Q., Pizer, S., Niethammer, M., Rosenman, J.: Geometric-feature-based spectral graph matching in pharyngeal surface registration. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014, Part I. LNCS, vol. 8673, pp. 259–266. Springer, Heidelberg (2014)
11. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: SIGGRAPH, pp. 303–312 (1996)
12. Huber, D.F., Hebert, M.: Fully automatic registration of multiple 3D data sets. Image Vis. Comput. 21(7), 637–650 (2003)
Acknowledgements
This work was supported by NIH grant R01 CA158925.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Zhao, Q., Price, T., Pizer, S., Niethammer, M., Alterovitz, R., Rosenman, J. (2016). The Endoscopogram: A 3D Model Reconstructed from Endoscopic Video Frames. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. MICCAI 2016. Lecture Notes in Computer Science(), vol 9900. Springer, Cham. https://doi.org/10.1007/978-3-319-46720-7_51
DOI: https://doi.org/10.1007/978-3-319-46720-7_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46719-1
Online ISBN: 978-3-319-46720-7
eBook Packages: Computer Science, Computer Science (R0)