Sparse-then-Dense Alignment based 3D Map Reconstruction Method for Endoscopic Capsule Robots

Since the development of capsule endoscopy technology, substantial progress has been made in converting passive capsule endoscopes into robotic, actively controllable capsule endoscopes that can be steered by the doctor. However, robotic capsule endoscopy still faces several challenges. In particular, the use of such devices to generate a precise and globally consistent three-dimensional (3D) map of the entire inner organ remains an unsolved problem. Such global 3D maps of inner organs would help doctors detect the location and size of diseased areas more accurately and intuitively, thus permitting more reliable diagnoses. The proposed 3D reconstruction system is built in a modular fashion, comprising preprocessing, frame stitching, and shading-based 3D reconstruction modules. We propose an efficient scheme to automatically select key frames out of the large quantity of raw endoscopic images. Together with a bundle fusion approach that aligns all selected key frames jointly in a globally consistent way, a significant improvement in mosaic and 3D map accuracy is achieved. To the best of our knowledge, this framework is the first complete pipeline for endoscopic-capsule-robot-based 3D map reconstruction containing all of the steps necessary for a reliable and accurate endoscopic 3D map. For the qualitative evaluations, a real pig stomach is employed. Moreover, for the first time in the literature, a detailed and comprehensive quantitative analysis of each proposed pipeline module is performed using a non-rigid esophagogastroduodenoscopy simulator, four different endoscopic cameras, a magnetically actuated soft capsule endoscope (MASCE), a sub-millimeter-precise optical motion tracker, and a fine-scale 3D optical scanner.


INTRODUCTION
Many diseases necessitate access to the internal anatomy of the patient for diagnosis and treatment. Since direct access to most anatomical regions of interest is traumatic, and sometimes impossible, endoscopic cameras have become a common method for viewing the anatomical structure. In particular, capsule endoscopy has emerged as a promising new technology for minimally invasive diagnosis and treatment of gastrointestinal (GI) tract disease. The low invasiveness and high potential of this technology have led to substantial investment in its development by both academic and industrial research groups, such that it may soon be feasible to produce a capsule endoscope with most of the functionality of current flexible endoscopes.
Although robotic capsule endoscopy has high potential, it continues to face many challenges. In particular, there is no broadly accepted method for generating a 3D map of the organ being investigated. This problem is made more severe by the fact that such a map may require a precise localization method for the endoscope, and such a method will itself require a map of the organ, a classic chicken-and-egg problem [1]. The repetitive texture, lack of distinctive features, and specular reflections characteristic of the GI tract exacerbate this difficulty, and the non-rigid deformities introduced by peristaltic motions further complicate the reconstruction task [2]. Finally, the small size of endoscope camera systems implies a number of limitations, such as restricted fields of view, low signal-to-noise ratio, and low frame rate, all of which degrade image quality [3]. These issues, to name a few, make accurate and precise localization and reconstruction a difficult problem and can render navigation and control counterintuitive [4], [5].
Despite these challenges, accurate and robust three-dimensional (3D) mapping of patient-specific anatomy remains a tantalizing goal. Such a map would provide doctors with a reliable measure of the size and location of a diseased area, thus allowing more intuitive and accurate diagnoses. In addition, should next-generation medical devices be actively controlled, a map would dramatically improve the doctor's control in diagnostic, prognostic, and biopsy-like operations. As such, considerable energy has been devoted to adapting computer vision techniques to the problem of in-vivo 3D reconstruction of tissue surface geometry.
Two primary approaches have been pursued as workarounds for the challenges mentioned previously. First, tomographic intra-operative imaging modalities, such as ultrasound (US), intra-operative computed tomography (CT), and interventional magnetic resonance imaging (iMRI), have been investigated for capturing detailed information of patient-specific tissue geometry [6]. However, the use of such devices in surgical and diagnostic operations poses significant technological challenges and costs, due to the need to acquire a high signal-to-noise ratio (SNR) in real time without impeding the doctor. Another proposal has been to equip endoscopes with alternative sensor systems in the hope of providing additional information; however, these alternative systems have other restrictions that limit their use within the body. This paper proposes a complete pipeline for 3D visual map reconstruction using only RGB camera images, with no additional sensor information. The pipeline is arranged in a modular form, and includes a preprocessing module for removal of specular reflections, vignetting, and radial lens distortion, an image-stitching module for registration of images, and a shape-from-shading (SfS) module for reconstruction of 3D structures. We provide qualitative and quantitative analyses of pose estimation and 3D mapping accuracy using a real pig stomach, an esophagogastroduodenoscopy simulator, four different endoscopic camera models, an optical motion tracker, and a 3D optical scanner. In sum, our method represents a substantial contribution towards a more general, therapeutically relevant, and extensive use of the information that capsule endoscopes can provide.

• stereoscopy
• shape-from-shading (SfS)
• structured light (SL)
• time-of-flight (ToF)

Structured light and time-of-flight methods require additional sensors, with a concomitant increase in cost and space; as such, they are not covered in this paper. Stereo-based methods use the parallax observed when viewing a scene from two distinct viewpoints to estimate the distance from the observer to the object under observation. Typically, such algorithms compute the disparity map in four stages [15]: cost computation, cost aggregation, disparity computation and optimization, and disparity refinement.
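As a concrete illustration of these stages, the sketch below computes a sum-of-absolute-differences (SAD) cost volume, aggregates it over a square window, and selects disparities winner-takes-all. The optimization and refinement stages are omitted, and the window sizes and disparity range are illustrative choices, not values from any cited work:

```python
import numpy as np

def box_sum(cost, win):
    """Aggregate a per-pixel cost over a (2*win+1) x (2*win+1) window."""
    out = np.zeros_like(cost)
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            out += np.roll(np.roll(cost, dy, axis=0), dx, axis=1)
    return out

def disparity_sad(left, right, max_disp=8, win=1):
    """Minimal block-matching stereo: SAD cost, box aggregation, WTA selection."""
    costs = np.empty((max_disp,) + left.shape)
    for d in range(max_disp):
        # cost computation: absolute difference at candidate disparity d
        shifted = np.roll(right, d, axis=1)
        # cost aggregation: sum the pixel-wise cost over a local window
        costs[d] = box_sum(np.abs(left.astype(np.float64) - shifted), win)
    # disparity computation: winner-takes-all over the aggregated cost volume
    return costs.argmin(axis=0)
```

Real pipelines follow this with sub-pixel refinement and consistency checks; this sketch only makes the stage decomposition concrete.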
With multiple algorithms reported each year, computational stereo depth perception has become a saturated field. The first work reporting stereoscopic depth reconstruction in endoscopic images was [7], which implemented a dense computational stereo algorithm. Later, Hager et al. developed a semi-global optimization [8], which was used to register the depth map acquired during surgery to pre-operative models [9]. Stoyanov et al. used local optimization to propagate disparity information around feature-matched seed points, an approach that has also been reported to perform well for endoscopic images and is able to ignore highlights, occlusions, and noisy regions.
Similar to stereo vision, another method that employs epipolar geometry and feature extraction is proposed in [16]. As in stereo vision, the workflow starts with camera calibration and relies mostly on SIFT feature extraction and description. The main algorithm then calculates the 3D spatial point locations using extrinsic parameters computed from matched features in consecutive frames. Although this system exploits the advantages of sparse 3D reconstruction, the strong dependency on feature extraction causes performance issues for endoscopic imagery. Despite the variety of algorithms and the simplicity of implementation, computational stereo techniques have several important flaws. To begin with, stereo reconstruction algorithms generally require two cameras, since triangulation needs a known baseline between viewpoints. Further, the accuracy of triangulation decreases with distance from the cameras due to the shrinking ratio of the baseline to the distance between the camera centers and the reconstructed points (the micro-baseline problem). Most endoscopic capsule robots mount only one camera, and in those that mount more, the diameter of the endoscope inherently bounds the baseline. As such, stereo techniques have yet to find wide application in endoscopy.
Due to the difficulty of obtaining stereo-compatible hardware, efforts have been made to adapt passive monocular three-dimensional reconstruction techniques to endoscopic images. These techniques have been a focus of computer vision research for decades, and have the distinct advantage of not requiring extra hardware beyond existing endoscopic devices. Two main methods have emerged as useful for endoscopic images: Shape-from-Motion (SfM) and Shape-from-Shading (SfS). SfS, which has been studied since the 1970s [17], has demonstrated some suitability for endoscopic image reconstruction. Its primary assumption is that there is a single light source in the scene whose intensity and pose relative to the camera are known; these assumptions are mostly fulfilled in endoscopy [12], [13], [14]. Further, the transfer function of the camera may be included in the algorithm to refine the estimates [18]. Additional assumptions are that the object reflects light according to the Lambertian rule and that the object surface has constant albedo. If these assumptions hold to a reasonable degree and the equation parameters are known, SfS can use the brightness of a pixel to estimate the angle between the camera's depth axis and the surface normal at that pixel. This has been demonstrated to be effective in recovering fine detail, although global shape recovery often has flaws. Therefore, many state-of-the-art works are based mainly on a combination of these two techniques. In [19], a complete pipeline for 3D reconstruction of endoscopic imagery using SfS and SfM techniques is presented. The pipeline starts with basic preprocessing steps and focuses on a 3D map reconstruction that is independent of light source position and illumination. Finally, the framework ends with frame-to-frame feature matching to resolve the scale ambiguity of a monocular camera.
This paper proposes interesting methods for the arduous reconstruction task; however, more refined preprocessing and, especially, less dependency on feature extraction and matching are still needed. In the recent work of [20], SfS and SfM are fused to achieve better 3D map accuracy: a sparse point cloud is obtained with SfM and densified with the help of SfS. For better SfS performance, they also propose a refined reflectance model. One notable idea based on SfS-SfM fusion is proposed in [21]. This methodology first reconstructs sparse 3D map points using SfM and iteratively refines the final reconstruction using SfS. The approach does not directly attack the impediments caused by ill-posed illumination and specular reflectance, although the proposed geometric fusion tries to eliminate such defects, and the strong reliance on feature-correspondence establishment remains unsolved. Attempts to solve the latter problem with template-matching techniques have had some success, but tend to be computationally too complex for real-time performance. In [22], only SfS is used for reconstruction, and 2D features are preferred for estimating the transformation. Similarly, [23] and [24] combine SfM and SfS for 3D reconstruction without any preprocessing and under a Lambertian surface assumption. In [25], machine learning algorithms are applied for 3D reconstruction.
Training is performed on an artificial dataset, while real endoscopy images are used as test data. Another state-of-the-art pipeline is proposed in [26], which presents a workflow combining an RGB camera with inertial measurement unit (IMU) sensors. Despite improved results, this hardware makes the overall pipeline more complex and costly. Moreover, IMU sensors occupy extra space, are not accurate enough, and interfere with magnetic actuation systems, which makes them unsuitable for next-generation actively controllable endoscopic capsule robots. The main common issue remaining for 3D reconstruction of endoscopic datasets is the visual complexity of these images. The challenges mentioned in the abstract and introduction cripple the performance of standard computer vision algorithms. In particular, a proposed method must be robust to view-dependent specular highlights, noise, peristaltic movements, and focus-dependent changes in calibration parameters. Unfortunately, a quantitative measure of algorithm robustness has not yet been suggested.

METHOD
This section presents the proposed framework in depth: the preprocessing steps, the stitching module, and the SfS module are discussed in detail.

Preprocessing
The proposed modular endoscopic 3D map reconstruction framework starts with a comprehensive preprocessing module which suppresses reflections caused by inner-organ fluids, eliminates radial distortion, and de-vignettes the images. Eliminating specular artifacts is a fundamental endoscopic preprocessing step, since the accumulated errors caused by such distortions degrade the accuracy of the final reconstructed 3D map. For the reflection detection task, we propose an original method which identifies specular regions by combining shape and appearance information (see Fig. ??). To extract the shape information of the reflection areas, the gradient map of the gray-scale input image is created and a morphological closing operation is applied to this map to fill the gaps inside reflection-distorted areas; for the closing operation, we used OpenCV's morphological closing (morphologyEx with MORPH_CLOSE). In parallel to this shape-based approach, an appearance-based method applies adaptive thresholding, determined by the mean µ_I and standard deviation σ_I of the gray-scale image I, to identify the specular regions.
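A minimal NumPy sketch of this two-cue detector follows. The gradient operator, the threshold weight k = 2, and the 3 x 3 closing window are illustrative assumptions rather than the paper's settings (the framework uses OpenCV's closing operation):

```python
import numpy as np

def dilate(mask, win):
    """Binary dilation with a square (win x win) window."""
    out = mask.copy()
    r = win // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def erode(mask, win):
    """Binary erosion, expressed as dilation of the complement."""
    return ~dilate(~mask, win)

def binary_close(mask, win):
    """Morphological closing (dilation then erosion) to fill small gaps."""
    return erode(dilate(mask, win), win)

def detect_specular(gray, k=2.0, win=3):
    """Combine appearance (intensity threshold) and shape (gradient + closing)
    cues with a logical AND, as described in the text."""
    g = gray.astype(np.float64)
    # appearance cue: pixels far above the image mean, scaled by the std
    appearance = g > g.mean() + k * g.std()
    # shape cue: strong gradient magnitude around saturated blobs
    gy, gx = np.gradient(g)
    grad = np.hypot(gx, gy)
    shape = binary_close(grad > grad.mean() + k * grad.std(), win)
    return appearance & shape
```

The AND of the two cues keeps only regions that are both bright and surrounded by sharp intensity transitions, which is what distinguishes specularities from merely bright tissue.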
Combining the appearance-based and shape-based reflection detections using an AND operation leads to robust reflection detection. Once specular reflection pixels are detected, the inpainting method proposed by [27] is applied to suppress the saturated pixels; it replaces each specularity with an intensity value derived from a combination of neighboring pixel values. Distortion parameters obtained by the chessboard calibration method were used to remove radial lens distortion, for which the Brown-Conrady undistortion technique was applied [28], [29]. Another common artifact of endoscopic images, called vignetting, refers to an inhomogeneous illumination distribution that darkens the image corners relative to the image center; it is primarily caused by camera lens imperfections and light source limitations. Photometric pose estimation methods in particular are very sensitive to vignetting artifacts, so a robust de-vignetting operation is required before proceeding to the pose estimation steps. Our framework applies the vignetting correction approach proposed by [30], which de-vignettes the image by enforcing the symmetry of the radial gradient from the center to the boundaries. An example input image and the vignetting-corrected output image are shown in Fig. 2. The effect of de-vignetting is demonstrated in Fig. 3, where the intensity levels of the de-vignetted image clearly show a more homogeneous pattern.
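The radial undistortion step can be sketched as a Brown-Conrady-style inverse mapping. The intrinsics (f, cx, cy), the two-coefficient model, and nearest-neighbour sampling are illustrative simplifications; in practice OpenCV's undistort with the full chessboard calibration output would be used:

```python
import numpy as np

def undistort_radial(img, k1, k2, cx, cy, f):
    """Remove radial distortion by sampling the distorted image at
    coordinates displaced by the r^2 polynomial (nearest-neighbour)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xn = (xs - cx) / f                    # normalized camera coordinates
    yn = (ys - cy) / f
    r2 = xn ** 2 + yn ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2  # Brown-Conrady radial term
    xd = xn * scale * f + cx              # distorted source coordinates
    yd = yn * scale * f + cy
    xi = np.clip(np.round(xd).astype(int), 0, w - 1)
    yi = np.clip(np.round(yd).astype(int), 0, h - 1)
    return img[yi, xi]
```

Inverse mapping (looping over output pixels and sampling the input) avoids the holes that forward mapping would leave in the corrected image.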

Evaluation of Modern Feature Descriptors
In this section, we evaluate the matching capability of the feature point descriptors SURF, SIFT, HOG, and ORB, and of two state-of-the-art optical flow (OF) methods, Farneback and Lucas-Kanade, regarding their matching accuracy on endoscopic images. We used the OpenCV library for all implementations in this section. The re-projection error was used as the accuracy criterion for the performance evaluations [31]. We evaluated the re-projection error over more than 500 endoscopic frame pairs carefully chosen from our real pig stomach dataset; the results are displayed in the corresponding table. The final steps of the key frame selection procedure are as follows:

4: Divide the cumulative optical flow value by the total pixel number for normalization.

5: If the normalized cumulative optical flow value is less than the predefined threshold τ = 20 pixels, go to the next frame. Otherwise, identify the frame as a key frame and go to the first step.

6: If fifteen frames fail to fulfill the key frame condition and the threshold of τ = 30 pixels could still not be exceeded, assign the frame with the highest normalized cumulative optical flow value among these fifteen frames as a key frame and go to the next step.
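The key frame selection steps above can be sketched as follows. The function consumes dense flow fields from any OF method (the framework uses Farneback), and a single threshold is used for both the key-frame test and the fallback, a simplification of the two threshold values quoted in the text:

```python
import numpy as np

def select_key_frames(flows, tau=20.0, max_skip=15):
    """Key-frame selection from dense optical-flow fields.
    flows: list of (H, W, 2) arrays of per-pixel flow between consecutive
    frames. tau is the normalized cumulative displacement threshold (pixels)."""
    keys = []
    cumulative = None
    skipped = []          # (frame index, normalized flow) since last key frame
    for i, flow in enumerate(flows):
        cumulative = flow if cumulative is None else cumulative + flow
        # normalized cumulative displacement: total magnitude / pixel count
        mag = np.hypot(cumulative[..., 0], cumulative[..., 1])
        norm = mag.sum() / mag.size
        if norm >= tau:
            keys.append(i)                      # enough motion: key frame
            cumulative, skipped = None, []
        else:
            skipped.append((i, norm))
            if len(skipped) >= max_skip:
                # fallback: take the frame with the largest accumulated motion
                best = max(skipped, key=lambda t: t[1])[0]
                keys.append(best)
                cumulative, skipped = None, []
    return keys
```

Accumulating flow (rather than thresholding per-pair flow) makes the selection robust to slow, steady capsule motion that never produces a large single-step displacement.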

Frame stitching
A state-of-the-art stitching pipeline contains several stages: feature detection, which detects features within the input images; feature matching, which matches features between input images; homography estimation, which estimates the extrinsic camera parameters between pairs of matched images; bundle adjustment, which solves for all camera parameters jointly; image warping, which warps the images onto a compositing surface; gain compensation, which normalizes the brightness of the images; and blending, which composites the warped images into the final mosaic. We denote by a_1, a_2, a_3, a_4, t_x, t_y the parameters of the affine transformation matrix A,

A = [ a_1  a_2  t_x ; a_3  a_4  t_y ].
We define a cost function measuring the pixel-intensity similarity between the image pair (see Eq. 4), which is minimized with respect to the affine transformation parameters.
Since the cost function has to ignore pixels lying outside the circular patches defined around inlier points, a weighting function w(x, y) is defined. The affine transformation matrix A is determined iteratively as the image transformation that minimizes e_MSE using Gauss-Newton optimization. The CUDA library was utilized to achieve better performance and reduce the run time through parallelism. With the cost function defined, an efficient search strategy for the global minimum has to be determined to prevent the Gauss-Newton optimization from converging to a false minimum. Of the six affine transformation parameters, the most significant are the two translations t_x and t_y in the x- and y-directions, because the endoscopic scanning procedure contains predominantly translational motion. Thus, the first step of our algorithm computes a rough estimate of t_x, t_y using the OF vectors u and v acquired by Farneback OF.
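The weighted cost and the translation-dominant first step can be sketched as below. For brevity, the sketch replaces the Gauss-Newton refinement with a discrete search around the flow-based initialization, so it is illustrative only; the search radius and the use of wrap-around shifting are our simplifications:

```python
import numpy as np

def weighted_mse(a, b, w):
    """e_MSE between two images, masked by the weight map w(x, y)."""
    return ((a - b) ** 2 * w).sum() / max(w.sum(), 1e-12)

def search_translation(src, dst, w, t_init, radius=3):
    """Search for (t_y, t_x) around a flow-based initialization by direct
    evaluation of the weighted cost; a stand-in for Gauss-Newton refinement."""
    best, best_e = t_init, np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ty, tx = t_init[0] + dy, t_init[1] + dx
            # candidate alignment of src against dst
            shifted = np.roll(np.roll(src, ty, axis=0), tx, axis=1)
            e = weighted_mse(shifted, dst, w)
            if e < best_e:
                best_e, best = e, (ty, tx)
    return best
```

Seeding the search (or, in the paper, the Gauss-Newton iteration) with the dominant flow translation keeps the optimizer inside the basin of the global minimum.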
Once a rough initialization is done, the second step is the estimation of all affine transformation parameters, using iterative Gauss-Newton optimization to minimize e_MSE. Illumination differences and misalignments remain visible at the stitching seams, so an image blending algorithm has to be applied to overcome these issues and create a high-quality final mosaic image. We used the multi-band blending method proposed by [32], which preserves the high-frequency information of the endoscopic images and suppresses low-frequency variations caused by irregular illumination. An overview of the blending algorithm is shown in Fig. 6; for details of this multi-band blending technique, the reader is referred to the original work of [32]. The steps in Algorithm 3 were applied for the mosaicking process.

Depth image creation
Once the final mosaic image is obtained, the next module creates its depth image using the SfS technique of Tsai and Shah [35]. The Tsai-Shah SfS method is based on the following assumptions:
• The object surface is Lambertian.
This first assumption is not obeyed by raw endoscopic images due to the specular reflections inside the organs. We addressed this problem through the reflection suppression technique previously described.

Algorithm 3 Coarse-to-fine image mosaicking
1: Execute the key frame selection module to identify a key frame.
(The multi-band blending technique of [32] was employed in our framework.)

Subsequently, the above assumptions allow the image intensities to be modeled by

I = ρ · cos θ,

where I is the intensity value, ρ is the albedo (the reflecting power of the surface), and θ is the angle between the surface normal N and the light source direction S. With this equation, the gray values of an image I are related only to the albedo and the angle θ. Using these assumptions, the equation above can be rewritten as

I = ρ · (N · S),

where (·) is the dot product, N is the unit normal vector of the surface, and S is the incidence direction of the source light. These may be expressed respectively as

N = (p(x, y), q(x, y), 1) / ((p(x, y))^2 + (q(x, y))^2 + 1)^(1/2),
S = (cos τ · sin σ, sin τ · sin σ, cos σ),

where τ and σ are the tilt and slant angles, respectively, and p and q are the x and y gradients of the surface Z:

p(x, y) = ∂Z(x, y)/∂x   (9)
q(x, y) = ∂Z(x, y)/∂y   (10)

The final function then takes the form

I(x, y) = ρ · (cos σ + p(x, y) · cos τ · sin σ + q(x, y) · sin τ · sin σ) / ((p(x, y))^2 + (q(x, y))^2 + 1)^(1/2) = R(p_x,y , q_x,y)   (11)

Solving this equation for p and q essentially corresponds to the general problem of SfS; the approximations and solutions for p and q give the reconstructed surface map Z. The necessary parameters are the tilt, slant, and albedo, which can be estimated as proposed in [31]. The unknown parameters of the 3D reconstruction are the horizontal and vertical gradients p and q of the surface Z. With discrete approximations, they can be written as

p(x, y) = Z(x, y) − Z(x − 1, y)   (12)
q(x, y) = Z(x, y) − Z(x, y − 1)   (13)

where Z(x, y) is the depth value of each pixel. From these approximations, the reflectance function R(p_x,y , q_x,y) can be expressed directly in terms of the depth map Z. Using equations 12 and 13, the reflectance equation may also be written as

0 = f(Z(x, y)) = I(x, y) − R(Z(x, y) − Z(x − 1, y), Z(x, y) − Z(x, y − 1))   (14)

Tsai and Shah propose a linear approximation of the function f through a first-order Taylor series expansion about the depth map Z^(n−1), where Z^(n−1) is the recovered depth map after n − 1 iterations:

f(Z(x, y)) ≈ f(Z^(n−1)(x, y)) + (Z(x, y) − Z^(n−1)(x, y)) · df(Z^(n−1)(x, y))/dZ(x, y).

Setting f(Z(x, y)) = 0 and solving for Z(x, y) yields the iterative update

Z^n(x, y) = Z^(n−1)(x, y) − f(Z^(n−1)(x, y)) / (df(Z^(n−1)(x, y))/dZ(x, y)),

so that the n-th depth map Z^n is calculated using the estimated slant, tilt, and albedo values. Resulting sample images for reflection removal and SfS are shown in Fig. 2.
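The iterative update can be sketched in a few lines of NumPy. This is a minimal illustration assuming unit albedo and known slant and tilt; the small-denominator guard is our numerical addition, not part of the original formulation:

```python
import numpy as np

def tsai_shah_sfs(image, slant, tilt, n_iter=10):
    """Iterative Tsai-Shah depth recovery from a single shaded image
    (unit albedo assumed; slant/tilt of the light source given)."""
    I = image.astype(np.float64)
    I = I / max(I.max(), 1e-12)          # normalize intensities to [0, 1]
    Z = np.zeros_like(I)                 # depth map, refined each iteration
    cs, ss = np.cos(slant), np.sin(slant)
    ct, st = np.cos(tilt), np.sin(tilt)
    for _ in range(n_iter):
        # discrete gradients p, q of the current depth estimate
        p = Z - np.roll(Z, 1, axis=1)    # Z(x, y) - Z(x-1, y)
        q = Z - np.roll(Z, 1, axis=0)    # Z(x, y) - Z(x, y-1)
        pq = 1.0 + p ** 2 + q ** 2
        # reflectance map R(p, q) under the Lambertian assumption
        num = cs + p * ct * ss + q * st * ss
        R = np.maximum(0.0, num / np.sqrt(pq))
        f = I - R                        # residual to drive to zero
        # df/dZ via dR/dp + dR/dq, using dp/dZ = dq/dZ = 1
        dR_dp = ct * ss / np.sqrt(pq) - p * num / pq ** 1.5
        dR_dq = st * ss / np.sqrt(pq) - q * num / pq ** 1.5
        df_dZ = -(dR_dp + dR_dq)
        den = np.where(np.abs(df_dZ) > 1e-6, df_dZ, 1e-6)  # guard tiny slopes
        Z = Z - f / den                  # first-order Newton update
    return Z
```

Each pixel is updated independently per iteration, which is what makes the scheme cheap enough to run over a full mosaic.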

Evaluation
We evaluate the performance of our system both quantitatively and qualitatively in terms of trajectory estimation and surface reconstruction. We also report the computational complexity of the proposed framework.

Dataset
We created our dataset using a real pig stomach and a non-rigid open GI-tract model, the LM-103 EGD (esophagogastroduodenoscopy) surgical simulator. We used the EGD surgical simulator for the quantitative analyses, and the real pig stomach for the qualitative evaluations. Synthetic stomach fluid was applied to the surface of the EGD simulator to imitate the mucosa layer of the inner tissue.
A robust endoscopic localization and mapping framework should preserve its performance and functionality across varying camera specifications. To ensure that our algorithm is not tuned to a specific camera model, a common problem we observed in methods proposed in the literature, we carefully selected four different commercially available endoscopic cameras varying in their specifications (resolution, pixel size, image quality, blurriness, etc.) for the video capture. A total of 17010 endoscopic frames were acquired by these four camera models, which were mounted on our robotic magnetically actuated soft capsule endoscope prototype (MASCE). The first sub-dataset, consisting of 4230 frames, was acquired with an Awaiba NanEye camera (see Table 2). The second sub-dataset, consisting of 4340 frames, was acquired with the Misumi V3506-2ES endoscopic camera, with the specifications shown in Table 3. The third sub-dataset, of 4320 frames, was obtained with the Misumi V5506-2ES endoscopic camera, with the specifications shown in Table 4. Finally, the fourth sub-dataset, of 4120 frames, was obtained with the Potensic mini camera, with the specifications shown in Table 5.
We scanned the open stomach simulator using the Artec Space Spider 3D scanner and used this scan as the ground truth for the 3D map reconstruction framework (see Fig. 1). Even though our focus and ultimate goal is an accurate and therapeutically relevant 3D map reconstruction, we also evaluated the pose estimation accuracy of the proposed framework quantitatively, since precise pose estimation is a prerequisite for accurate 3D mapping. Thus, an OptiTrack motion tracking system consisting of eight Prime 13 cameras and tracking software was utilized to obtain 6-DoF ground-truth localization data of the endoscopic capsule motion with sub-millimeter precision (see Fig. 1). The results in Fig. 9 indicate that our coarse-to-fine pose estimation is accurate, although pose estimation accuracy slightly decreases after reflection suppression. The reason for this decrease is that the saturated peak values contain pose information which is cut off during reflection suppression. Since such saturated pixels do not change drastically in appearance between small-baseline frame pairs (the illumination incidence angle on the surface alters very little), the existence of these reflection pixels might have resulted in a slightly better pose alignment. On the other hand, the radial undistortion and de-vignetting operations both increase the pose estimation accuracy of the framework, as expected.

Surface Reconstruction
We evaluated the surface reconstruction accuracy of our system on the same dataset used for the trajectory estimation evaluation.
We scanned the open non-rigid EGD (esophagogastroduodenoscopy) simulator to obtain the ground-truth 3D data using a highly accurate commercial 3D scanner (Artec 3D Space Spider). The final 3D map of the stomach model obtained by the proposed framework and the corresponding RMSE results (Table 8) prove that our system can achieve very high map accuracy. Even on more challenging trajectories such as trajectory 3, our system is still capable of providing an acceptable 3D map of the explored inner-organ tissue. The 3D reconstructed maps of the real pig stomach and the synthetic human stomach are shown in Fig. 10 for visual reference. To indicate the contribution of each preprocessing module to the map reconstruction accuracy, the map accuracy was tested after applying each preprocessing step and reconstructing the map. As shown in Table 9, each preprocessing operation has a certain influence on the RMSE results. One important observation is that even though pose accuracy increases in the presence of reflection points, these saturated pixels have a negative influence on the map accuracy, as expected. Therefore, disabling reflection suppression during pose estimation and enabling it for map reconstruction is the best option to follow.

Computational Performance
To analyze the computational performance of the proposed framework, we determined the average frame processing time across the trajectory sequences.
The test platform was a desktop PC with an Intel Xeon E5-1660 v3 CPU. The measured frame processing times imply that the pipeline must be accelerated using more effective parallel computing and GPU power to reach real-time performance.