1 Introduction

Many diseases necessitate access to the internal anatomy of the patient for diagnosis and treatment. Since direct access to most anatomic regions of interest is traumatic, and sometimes impossible, endoscopic cameras have become a common means of viewing internal anatomical structures. In particular, capsule endoscopy has emerged as a promising new technology for minimally invasive diagnosis and treatment of gastrointestinal (GI) tract diseases. The low invasiveness and high potential of this technology have led to substantial investment in its development by both academic and industrial research groups, such that it may soon be feasible to produce a robotic capsule endoscope with most of the functionality of current flexible endoscopes.

Although robotic capsule endoscopy holds great diagnostic and therapeutic potential, it continues to face many challenges. In particular, there is no broadly accepted approach for generating a comprehensive and therapeutically relevant 3D map of the organ being investigated. This problem is made more severe by the fact that such a map may require a precise localization method for the endoscope, while such a method will itself require a map of the organ, a classic chicken-and-egg problem [1]. The repetitive texture, lack of distinctive features, and specular reflections characteristic of the GI tract exacerbate this difficulty, and the non-rigid deformations introduced by peristaltic motions further complicate the reconstruction task [2]. Finally, the small size of endoscopic camera systems implies a number of limitations, such as restricted fields of view (FOV), low signal-to-noise ratio, and low frame rate, all of which degrade image quality [3]. These issues, to name a few, make accurate and precise localization and reconstruction a difficult problem and can render navigation and control counterintuitive [4].

Despite these challenges, accurate and robust three-dimensional (3D) mapping of patient-specific anatomy remains a difficult but valuable goal. Such a map would provide doctors with a reliable measure of the size and location of a diseased area, thus allowing more intuitive and accurate diagnoses. In addition, should next-generation medical devices be actively controlled, a map would dramatically improve the doctor's control in diagnostic, prognostic, and therapeutic operations [5]. As such, considerable energy has been devoted to adapting computer vision techniques to the problem of in vivo 3D reconstruction of tissue surface geometry.

Two primary approaches have been pursued as workarounds for the challenges mentioned previously. First, tomographic intra-operative imaging modalities, such as ultrasound (US), intra-operative computed tomography (CT), and interventional magnetic resonance imaging (iMRI), have been investigated for capturing detailed information of patient-specific tissue geometry [5]. However, surgical and diagnostic operations pose significant technological challenges and costs for the use of such devices, due to the need to acquire a high signal-to-noise ratio (SNR) without impediment to the doctor. Another proposal has been to equip endoscopes with alternative sensor systems in the hope of providing additional information; however, these alternative systems have other restrictions that limit their use within the body.

This paper proposes a complete pipeline for 3D visual map reconstruction using only RGB camera images, with no additional sensor information. The pipeline is arranged in a modular form and includes a preprocessing module for removal of specular reflections, vignetting, and radial lens distortions, a keyframe selection module, a pose estimation and image stitching module for registration of images, and a shape-from-shading (SfS) module for reconstruction of 3D structures. We provide both qualitative and quantitative analyses of pose estimation and 3D map reconstruction accuracy using a real porcine stomach, an esophagus gastroduodenoscopy simulator, four different endoscopic camera models, an optical motion tracker, and a 3D optical scanner. In sum, our method constitutes a substantial step toward a more general, therapeutically relevant, and extensive use of the information that capsule endoscopes can provide.

2 Literature survey

Several studies in the literature have discussed 3D map reconstruction for standard hand-held and passive capsule endoscopes [6,7,8,9,10,11,12,13]. These methods may be broken into four major classes:

  • stereoscopy

  • shape from shading (SfS)

  • structured light (SL)

  • time of flight (ToF)

Structured light and time-of-flight methods require additional sensors, with a concomitant increase in cost and space; as such, they are not covered in this paper. Stereo-based methods use the parallax observed when viewing a scene from two distinct viewpoints to obtain an estimate of the distance from observer to object under observation. Typically, such algorithms have four stages in computing the disparity map [14]: cost computation, cost aggregation, disparity computation and optimization, and disparity refinement.
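For concreteness, the following sketch illustrates the four-stage disparity computation outlined above using OpenCV's semi-global matcher; it is not part of the proposed pipeline, and the parameter values and the median-filter refinement step are placeholder assumptions for a generic rectified endoscopic stereo pair.

import cv2
import numpy as np

def disparity_map(left_gray, right_gray):
    # Inputs are rectified 8-bit gray-scale images of the same size.
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,      # search range, must be divisible by 16
        blockSize=7,
        P1=8 * 7 * 7,           # smoothness penalties of the semi-global cost
        P2=32 * 7 * 7,
        uniquenessRatio=10,
    )
    # Stages 1-3 (cost computation, aggregation, disparity optimization) are
    # handled inside compute(); the result is fixed-point (scaled by 16).
    disp = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    # Stage 4 (refinement): simple speckle suppression with a median filter;
    # real systems add left-right consistency checks or WLS filtering.
    return cv2.medianBlur(disp, 5)

# Depth then follows from depth = focal_length * baseline / disparity (disparity > 0).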

With multiple algorithms reported every year, computational stereo depth perception has become an intensively researched field. The first work reporting stereoscopic depth reconstruction in endoscopic images [6] implemented a dense computational stereo algorithm. Later, Hager et al. developed a semi-global optimization [7], which was used to register the depth map acquired during surgery to preoperative models [8]. Stoyanov et al. used local optimization to propagate disparity information around feature-matched seed points, an approach that has also been reported to perform well on endoscopic images and is able to handle highlights, occlusions, and noisy regions. Similar to stereo vision, another method that employs epipolar geometry and feature extraction is proposed in [15]. This workflow starts with camera calibration and relies on SIFT feature extraction and description. Finally, the main algorithm calculates the 3D spatial point locations using the extrinsic parameters, which are computed from matched features in consecutive frames. Although this system exploits the advantages of sparse 3D reconstruction, its strong dependency on feature extraction causes performance issues for endoscopic imaging. Despite the variety of algorithms and the simplicity of implementation, computational stereo techniques suffer from several important disadvantages. To begin with, stereo reconstruction algorithms generally require two cameras, since triangulation needs a known baseline between viewpoints. Further, the accuracy of triangulation decreases with distance from the cameras, as the baseline shrinks relative to the distance between the camera centers and the reconstructed points. Most endoscopic capsule robots have only one camera, and in those that have more, the diameter of the endoscope inherently bounds the baseline. As such, stereo techniques have yet to find wide application in endoscopy.

Due to the difficulty in obtaining stereo-compatible hardware, efforts have been made to adapt passive monocular three-dimensional reconstruction techniques to endoscopic images. These techniques have been a focus of computer vision research for decades and have the distinct advantage of requiring no hardware beyond the existing endoscopic devices. Two main methods have emerged as useful in the field of endoscopic imaging: shape from motion (SfM) and shape from shading (SfS). SfS, which has been studied since the 1970s [16], has demonstrated some suitability for endoscopic image reconstruction. Its primary assumption is that there is a single light source in the scene, whose intensity and pose relative to the camera are known. Both assumptions are mostly fulfilled in endoscopy [11,12,13]. Furthermore, the transfer function of the camera can be included in the algorithm to refine estimates further [17]. Additional assumptions are that the object reflects light according to the Lambertian model and that the object surface has a constant albedo. If these assumptions hold to a degree and the equation parameters are known, SfS can use the brightness of a pixel to estimate the angle between the camera's depth axis and the surface normal at that pixel. This has been demonstrated to be effective in recovering details, although global shape recovery often fails.

Fig. 1 Preprocessing pipeline: reflection removal, radial undistortion, de-vignetting

Both methods have been demonstrated to have disadvantages: SfS often fails in the presence of uncertain information, e.g., bleeding, reflections, noise artifacts, and occlusions; feature tracking-based SfM methods tend to fail in the presence of poorly textured areas and occlusions.

Therefore, many state-of-the-art works are based on the combination of these two techniques. In [18], a pipeline for 3D reconstruction of endoscopic imaging using SfS and SfM techniques is presented. The pipeline starts with basic preprocessing steps and focuses on a 3D map reconstruction that is independent of light source position and illumination. Finally, the framework ends with frame-to-frame feature matching to solve the scaling issue of monocular images. This paper proposes interesting methods for the difficult task of reconstruction; however, enhanced preprocessing and, in particular, less dependency on feature extraction and matching are still needed. In the recent work of [19], SfS and SfM are fused to reach a better 3D map accuracy: a sparse point cloud is obtained with SfM and densified by means of SfS, and a refined reflectance model is proposed to improve the SfS performance. One notable idea based on SfS and SfM fusion is proposed in [20]. This methodology first reconstructs a sparse 3D map using SfM and then iteratively refines the final reconstruction using SfS. The approach does not directly address the difficulties caused by ill-posed illumination and specular reflectance, although the proposed geometric fusion tries to mitigate such issues, and the strong reliance on establishing feature correspondences remains unsolved. Attempts to solve the latter problem with template-matching techniques have had some success, but tend to be computationally too complex for real-time performance. In [21], only SfS is used for reconstruction, and 2D features are preferred for estimating the transformation. Similarly, [22] and [23] combine SfM and SfS for 3D reconstruction without any preprocessing and with the Lambertian surface assumption. In [24], machine learning algorithms are applied to 3D reconstruction: training is performed on an artificial dataset, and real endoscopy images are used as test data. Another state-of-the-art pipeline is proposed in [25], which presents a workflow combining an RGB camera and inertial measurement units (IMUs). Despite the improved results, this hardware makes the overall flow more complex and costly. Moreover, IMU sensors occupy extra space, are not sufficiently accurate, and interfere with magnetic actuation systems, which makes them unsuitable for the next generation of actively controllable endoscopic capsule robots.

The main common issue remaining for 3D reconstruction of endoscopic datasets is the visual complexity of these images. The challenges mentioned in the abstract and introduction affect the performance of standard computer vision algorithms. In particular, any proposed method must be robust to view-dependent specular highlights, noise, peristaltic movements, and focus-dependent changes in calibration parameters. Unfortunately, a quantitative measure of algorithm robustness has not yet been suggested in the literature, despite its clear value for the evaluation of algorithmic dependability and precision. Moreover, all of the methods mentioned in this section were developed and evaluated on only one specific camera model, which makes it impossible to establish the robustness of a framework for different camera choices with limited specifications such as lower resolution and image quality.

Our paper proposes a full pipeline consisting of camera calibration, reflection detection and suppression, radial undistortion, de-vignetting, keyframe selection, pose estimation, frame stitching, and SfS to reconstruct a therapeutically relevant 3D map of the organ under observation. Both a synthetic and a real pig stomach are used for evaluation. Among other contributions, an extensive quantitative analysis is performed to demonstrate the influence of the pipeline modules on the accuracy and robustness of the estimated camera pose and the reconstructed 3D map. To our knowledge, this is the first such comprehensive quantitative analysis conducted for endoscopic image processing.

3 Method

This section presents the proposed framework in more depth. The preprocessing steps, keyframe selection, pose estimation, frame stitching, and SfS modules are discussed in detail.

3.1 Preprocessing

The proposed modular endoscopic 3D map reconstruction framework starts with a preprocessing module which performs intrinsic camera calibration, reflection detection and suppression, radial distortion correction, and de-vignetting. Specular reflections are a common problem causing inaccurate depth estimation and map reconstruction. Eliminating specular artifacts is therefore a fundamental endoscopic image preprocessing step to ensure Lambertian surface properties and increase the quality of the 3D map. On the other hand, specularities can deliver useful information for pose estimation, especially orientation information. For the reflection detection task, we propose an original method which determines the reflection regions by making use of geometric and photometric information. To determine the locations of the reflection areas, the gradient map of the input gray-scale image is created and a morphological closing operation (OpenCV's morphologyEx with the MORPH_CLOSE flag) is applied to fill the gaps inside reflection-distorted areas. In parallel, a photometric method applies adaptive thresholding determined by the mean and standard deviation of the gray-scale image I to identify the specular regions:

(1)

where \(\mu _I\) and \(\sigma _I\) are the mean and standard deviation of the intensity levels of the gray-scale image I. The pixel-wise combination of both detection strategies leads to a robust reflection detection approach. Once specular reflection pixels are detected, the inpainting method proposed by [26] is applied to suppress the saturated pixels by replacing the specularity by an intensity value derived from a combination of neighboring pixel values.
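The following is a minimal sketch of the described two-cue reflection handling, assuming an 8-bit gray-scale input. The gradient threshold, the threshold coefficient k, the kernel sizes, and the use of OpenCV's Telea inpainting as a stand-in for the method of [26] are illustrative assumptions rather than the exact parameters of our implementation.

import cv2
import numpy as np

def suppress_specular_reflections(gray, grad_thresh=40, k=2.0, kernel_size=5):
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    # Geometric cue: gradient map, then morphological closing to fill the
    # interior of reflection-distorted areas.
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, np.ones((3, 3), np.uint8))
    geometric = cv2.morphologyEx((grad > grad_thresh).astype(np.uint8) * 255,
                                 cv2.MORPH_CLOSE, kernel)
    # Photometric cue: adaptive threshold from the image mean and standard deviation.
    mu, sigma = gray.mean(), gray.std()
    photometric = (gray.astype(np.float32) > mu + k * sigma).astype(np.uint8) * 255
    # Pixel-wise combination of both detectors.
    mask = cv2.bitwise_and(geometric, photometric)
    # Replace the saturated pixels from neighboring intensities
    # (Telea inpainting as a stand-in for the inpainting method of [26]).
    return cv2.inpaint(gray, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)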

Fig. 2 Demonstration of the de-vignetting process

As a next step, the Brown–Conrady [27] undistortion technique is applied to handle the radial distortions. Vignetting, an inhomogeneous illumination distribution that darkens the image away from its center and is primarily caused by camera lens imperfections and light source limitations, is handled by a radial gradient symmetry enforcement-based method (Fig. 1). Our framework applies the vignetting correction approach proposed by [28], which de-vignettes the image by enforcing the symmetry of the radial gradient from the center to the boundaries. An example input image and its vignetting-corrected output can be seen in Fig. 1. De-vignetting is further illustrated in Fig. 2, where it is clearly observable that the intensity levels of the de-vignetted image follow a more homogeneous pattern.
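As a brief illustration of the undistortion step, the sketch below applies OpenCV's Brown–Conrady-style lens model with placeholder intrinsics; in the actual pipeline, the camera matrix and distortion coefficients come from the intrinsic calibration of each camera.

import cv2
import numpy as np

# Placeholder intrinsics and Brown-Conrady coefficients (k1, k2, p1, p2, k3).
K = np.array([[420.0, 0.0, 320.0],
              [0.0, 420.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.35, 0.12, 0.0, 0.0, 0.0])

def undistort(frame):
    # OpenCV's lens model follows the Brown-Conrady formulation.
    new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist, frame.shape[1::-1], alpha=0)
    return cv2.undistort(frame, K, dist, None, new_K)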

3.2 Keyframe selection

Endoscopic videos generally contain thousands of highly overlapping frames (more than 75% overlap) due to the slow movement of the endoscopic capsule during organ exploration. A subset of the most relevant keyframes has to be chosen automatically. The minimum number of keyframes required to cover the entire stomach surface with approximately 50% overlap between keyframes is around 300. Thus, roughly every tenth frame could be selected as a keyframe. However, since the motion of the endoscopic capsule robot is not constant during organ exploration, it is not good practice to blindly assign keyframes at a constant interval. We therefore developed an adaptive keyframe selection method based on Farneback optical flow (OF) estimation between frame pairs. Farneback OF was chosen due to its improved performance relative to other optical flow methods applied to our dataset. We add up the magnitudes of the optical flow vectors for each frame pair and normalize the sum by the total image resolution. If the normalized sum does not exceed a predefined threshold \(\tau = 30\) pixels, the overlap between the reference keyframe and the keyframe candidate is considered high (more than 70% overlap). In that case, the candidate frame is rejected and the algorithm moves on to the next frame; the loop continues until a new keyframe is found. The keyframe selection procedure and termination criteria are summarized in Algorithm 1:

Algorithm 1 Adaptive keyframe selection
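A minimal sketch of the adaptive selection loop of Algorithm 1 is given below, assuming 8-bit gray-scale frames of identical resolution; the Farneback parameters are illustrative defaults, while the threshold follows the value of \(\tau\) given above.

import cv2
import numpy as np

def select_keyframes(frames, tau=30.0):
    keyframes = [0]                       # the first frame is the reference
    ref = frames[0]
    for idx in range(1, len(frames)):
        flow = cv2.calcOpticalFlowFarneback(
            ref, frames[idx], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        # Sum of flow magnitudes normalized by the total image resolution.
        score = np.linalg.norm(flow, axis=2).sum() / flow[..., 0].size
        if score > tau:                   # low overlap: accept a new keyframe
            keyframes.append(idx)
            ref = frames[idx]
    return keyframes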

3.3 Keyframe stitching

A state-of-the-art image stitching pipeline contains several stages:

  • Feature detection, which detects features in input image pair.

  • Feature matching, which matches features between input images.

  • Homography estimation, which estimates extrinsic camera parameters between the image pairs.

  • Bundle adjustment, which is a postprocessing step to correct drifts in a global manner.

  • Image warping, which warps the images onto a compositing surface.

  • Gain compensation, which normalizes the brightness and contrast of all images.

  • Blending, which blends pixels along the stitch seam to reduce the visibility of seams.

Fig. 3 Image stitching flowchart

Fig. 4 Multi-band blending flowchart

Stitching algorithms fall broadly into two categories: direct alignment-based methods and feature-based methods. Direct alignment-based methods attempt to match every pixel between the frame pair using iterative optimization techniques. These methods have the benefit of using all the available data, which is advantageous for low-textured images such as endoscopic images. However, direct methods require a good initialization so that they do not converge to local minima, and they are very susceptible to varying brightness conditions. Feature-based methods, on the other hand, first find distinctive feature points such as corners and try to match them. These methods do not require an initialization, but features are not easy to detect in low-textured images, and detected features can be sensitive to illumination changes, scale changes caused by zooming in and out, and viewpoint changes. Our keyframe stitching technique makes use of both alignment strategies in a coarse-to-fine fashion, combining Farneback OF-based coarse alignment with patch-wise fine alignment. Farneback OF delivers the initial 2D motion estimation, whereas an SSD-based energy minimization applied to circular regions of interest with a radius of 15 pixels around each inlier point refines this estimation. The patch-wise fine alignment estimates the parameters of an affine transformation by minimizing an intensity difference-based energy cost function. The affine transformation maps an image \(I_1\) onto the reference image \(I_2\) (Eq. 2), where \(x'\), \(y'\) (denoted \(x_2\), \(y_2\) in Eq. 2) represent the transformed and x, y the original pixel coordinates, and \(a_1\), \(a_2\), \(a_3\), \(a_4\), \(t_x\), \(t_y\) the parameters of the affine transformation matrix A. We define a cost function measuring the pixel intensity similarity between the image pair (Eq. 4), which is minimized with respect to the affine transformation parameters.

$$\begin{aligned} \begin{pmatrix} x_2\\ y_2\\ 1 \end{pmatrix} =\begin{pmatrix} a_1 &{} a_2 &{} t_x \\ a_3 &{} a_4 &{} t_y \\ 0 &{} 0 &{} 1 \end{pmatrix}\cdot \begin{pmatrix} x_1\\ y_1\\ 1 \end{pmatrix} \end{aligned}$$
(2)

Since the cost function has to ignore the pixels lying outside the circular patches defined around inlier points, a weighting function \(\omega (x,y)\) is defined:

$$\begin{aligned} \omega (x,y)=\left\{ \begin{array}{ll} 0, &{} \text {if } (x-x_c)^{2}+(y-y_c)^{2}\ge r^{2}\\ 1, &{} \text {if } (x-x_c)^{2}+(y-y_c)^{2}< r^{2} \end{array}\right. \end{aligned}$$
(3)

where \(x_\mathrm{c}\) and \(y_\mathrm{c}\) are the coordinates of the inlier point and r is the radius of the circular image region around this inlier point. The resulting cost function is biased toward solutions with smaller overlap; it is therefore normalized by the overlapping area, resulting in the mean squared pixel error (MSE):

$$\begin{aligned} e_\mathrm{MSE}(A)=\frac{\sum _i \omega (x_i,y_i)\,\omega (x_i',y_i')\,(I_2(x_i',y_i')-I_1(x_i,y_i))^{2}}{\sum _i \omega (x_i,y_i)\,\omega (x_i',y_i')}. \end{aligned}$$
(4)

The affine transformation matrix A is determined iteratively as the image transformation that minimizes \(e_\mathrm{MSE}\) using Gauss–Newton optimization. The CUDA library was utilized to parallelize the Gauss–Newton optimization, improving performance and reducing execution time. The system architecture of the proposed frame stitching algorithm is shown in Fig. 3.
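The sketch below illustrates the patch-wise fine alignment of Eqs. 2–4 on gray-scale images; SciPy's least-squares solver stands in for the CUDA-accelerated Gauss–Newton optimization of the actual implementation, and the circular-mask radius follows the 15-pixel value given above.

import cv2
import numpy as np
from scipy.optimize import least_squares

def circular_mask(shape, points, radius=15):
    # w(x, y) of Eq. 3: 1 inside a circle of radius r around each inlier point.
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    mask = np.zeros(shape, bool)
    for xc, yc in points:
        mask |= (xx - xc) ** 2 + (yy - yc) ** 2 < radius ** 2
    return mask

def refine_affine(I1, I2, inliers, A0=np.eye(2, 3)):
    I1 = I1.astype(np.float32)
    I2 = I2.astype(np.float32)
    mask2 = circular_mask(I2.shape, inliers)

    def residuals(params):
        A = params.reshape(2, 3)                      # [a1 a2 tx; a3 a4 ty]
        warped = cv2.warpAffine(I1, A, I1.shape[1::-1])
        valid = cv2.warpAffine(np.ones_like(I1), A, I1.shape[1::-1]) > 0.5
        w = mask2 & valid                             # overlapping patch pixels
        diff = np.where(w, I2 - warped, 0.0)
        # Dividing by the overlap area turns the squared sum into e_MSE (Eq. 4).
        return diff.ravel() / np.sqrt(max(w.sum(), 1))

    result = least_squares(residuals, A0.ravel(), method="lm")
    return result.x.reshape(2, 3)                     # refined affine parameters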

Fig. 5 Demonstration of the keyframe stitching process for the non-rigid esophagus gastroduodenoscopy simulator (left) and real pig stomach (right)

The termination criteria of the Gauss–Newton optimization were defined by a threshold \(\tau =e^{-9}\): the optimization stops when \(e_\mathrm{MSE}\) drops below \(\tau \) or when the maximum number of iterations has been reached. Once the optimization has converged and the affine transformation parameters are estimated, bundle adjustment is performed to correct drift over all camera parameters jointly and to minimize the accumulated errors. In the next step, all keyframes \(I_i\) are transformed into the coordinate system of the anchor keyframe \(I_A\). In areas where several keyframes overlap, corresponding image pixels often do not have the same intensity due to illumination changes, scale changes, and intensity level variations. The multi-band blending method is applied to overcome these issues; an overview of the approach is shown in Fig. 4, and for further details the reader is referred to the original work of [29]. Algorithm 2 summarizes the steps of the keyframe stitching module. Results of the stitching process for the real pig stomach and the non-rigid simulator are shown in Fig. 5.

Algorithm 2 Keyframe stitching
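For illustration, the sketch below shows a simplified two-image multi-band (Laplacian pyramid) blend in the spirit of [29]; the full pipeline blends all warped keyframes. Single-channel images in the [0, 255] range with dimensions divisible by 2^levels are assumed.

import cv2
import numpy as np

def multiband_blend(img1, img2, mask, levels=4):
    # Gaussian pyramids of the blending mask (values in [0, 1]) and both images.
    gm = [mask.astype(np.float32)]
    g1, g2 = [img1.astype(np.float32)], [img2.astype(np.float32)]
    for _ in range(levels):
        gm.append(cv2.pyrDown(gm[-1]))
        g1.append(cv2.pyrDown(g1[-1]))
        g2.append(cv2.pyrDown(g2[-1]))

    def laplacian(g):
        # Band-pass (Laplacian) pyramid: difference of successive Gaussian levels.
        lap = [g[i] - cv2.pyrUp(g[i + 1], dstsize=g[i].shape[1::-1])
               for i in range(levels)]
        lap.append(g[-1])
        return lap

    l1, l2 = laplacian(g1), laplacian(g2)
    # Blend each band with the corresponding mask level, then collapse.
    bands = [m * a + (1.0 - m) * b for m, a, b in zip(gm, l1, l2)]
    out = bands[-1]
    for lvl in range(levels - 1, -1, -1):
        out = cv2.pyrUp(out, dstsize=bands[lvl].shape[1::-1]) + bands[lvl]
    return np.clip(out, 0, 255).astype(np.uint8)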

3.4 Deep learning and frame stitching

A major drawback of our frame stitching module is the need for extensive engineering and implementation effort. To overcome this, we investigated the applicability of deep learning techniques to endoscopic capsule robot pose estimation [2]. Deep learning (DL) has been drawing the attention of the machine learning research community over the last decade. Much of its success rests on models and technologies capable of achieving ground-breaking performance in a variety of traditional machine learning application fields, such as machine vision and natural language processing. Some of the DL flagship applications, like NLP and image processing, already have implications in medical fields, e.g., in extracting information from the images in patients' records to find anomalous patterns and detect diseases. With that motivation, we are working to extend DL technology to endoscopic capsule robot localization. The core idea of our DL-based method is the use of deep recurrent convolutional neural networks (RCNNs) for the pose estimation task, where convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are used for feature extraction and for inference of dynamics across frames, respectively [2]. Using this pretrained neural network, we are able to achieve pose estimation accuracies comparable to our sparse-then-dense pose alignment [2]. Thus, as a future step, we may integrate DL-based pose estimation into our frame stitching module to decrease the complexity of our stitching method and relax the extensive engineering and implementation effort required in this study. Since DL-based pose estimation is out of the scope of this paper, the reader is referred to the original paper [2] for further details.

3.5 Endo-VMFusenet and frame stitching

Even though the proposed sparse-then-dense alignment-based visual pose estimation achieves very promising results for endoscopic capsule robot localization, it fails in the case of very fast frame-to-frame motions. This is a common issue of any vision-based odometry algorithm: if the overlap between consecutive frames drops below a certain percentage, any vision-based pose estimation approach fails. Due to drifts of the endoscopic capsule robot, the overlap area between frame pairs can decrease drastically and even vanish in some cases. To overcome this issue, we developed a supervised sensor fusion approach based on an end-to-end trainable deep neural network consisting of multi-rate long short-term memories (LSTMs) for frequency adjustment between sensors and a core LSTM unit for fusion of the adjusted sensor information. Detailed evaluations indicate that our pretrained DL-based sensor fusion network detects when visual odometry fails and instantaneously falls back on magnetic localization until the visual odometry path recovers; the same applies if the magnetic sensor-based localization fails. Additionally, monocular cameras lack real depth information, so any measurements made with them are recoverable only up to a scale; this condition is known as scale ambiguity. Another contribution of our DL-based sensor fusion approach is accurate scale estimation using the absolute position information obtained by the magnetic localization system. In that way, doctors obtain a 3D map of exactly the same size as the explored inner organ, which not only helps in estimating the size of a diseased region, but also enables biopsy-like treatments or local drug delivery onto the diseased region. Since it is out of the scope of this paper, the reader is referred to our paper [4] for further details of the DL-based sensor fusion approach.

Fig. 6 Non-rigid esophagus gastroduodenoscopy simulator dataset overview for different endoscopic cameras

Fig. 7 Real pig stomach dataset overview for different endoscopic cameras

3.6 Depth image creation

Once the final mosaic image is obtained, the next module creates its depth image using the SfS technique of Tsai and Shah [32]. The Tsai–Shah SfS method is based on the following assumptions:

  • The object surface is Lambertian.

  • The light comes from a single-point light source.

  • The surface has no self-shaded areas.

The Lambertian surface assumption is not obeyed by raw endoscopic images due to the specular reflections inside the organs; we addressed this problem through the reflection suppression technique described previously. Under the above assumptions, the image intensities can be modeled by

$$\begin{aligned} I(x,y)=\rho (x,y,z)\cdot \cos \Theta _i, \end{aligned}$$
(5)

where \(\textit{I}\) is the intensity value, \(\rho \) is the albedo (reflecting power of the surface), and \(\Theta _i\) is the angle between the surface normal \(\textit{N}\) and the light source direction \(\textit{S}\). Under this model, the gray values of an image I are related only to the albedo and the angle \(\Theta _i\). Using these assumptions, the above equation can be rewritten as follows:

$$\begin{aligned} I(x,y)=\rho \cdot (N\cdot S), \end{aligned}$$
(6)

where \((\cdot )\) denotes the dot product, N is the unit normal vector of the surface, and S is the incidence direction of the source light. These may be expressed respectively as

$$\begin{aligned} N&= \frac{(-p(x,y),\,-q(x,y),\,1)}{(p^2+q^2+1)^{1/2}} \end{aligned}$$
(7)
$$\begin{aligned} S&= (\cos \tau \cdot \sin \sigma ,\ \sin \tau \cdot \sin \sigma ,\ \cos \sigma ) \end{aligned}$$
(8)

where \(\tau \) and \(\sigma \) are the tilt and slant angles, respectively, and p and q are the x and y gradients of the surface Z:

$$\begin{aligned} p(x,y)&= \frac{\partial Z(x,y)}{\partial x} \end{aligned}$$
(9)
$$\begin{aligned} q(x,y)&= \frac{\partial Z(x,y)}{\partial y}. \end{aligned}$$
(10)

The final function then takes the form

$$\begin{aligned} I(x,y) = \rho \cdot \frac{\cos \sigma +p(x,y)\cdot \cos \tau \cdot \sin \sigma +q(x,y)\cdot \sin \tau \cdot \sin \sigma }{\left( p(x,y)^2 +q(x,y)^2+1\right) ^{1/2}} = R(p_{x,y}, q_{x,y}). \end{aligned}$$
(11)

Solving this equation for p and q essentially corresponds to the general problem of SfS; the approximations and solutions for p and q yield the reconstructed surface map Z. The necessary parameters, tilt, slant, and albedo, can be estimated as proposed in [33]. The unknown parameters of the 3D reconstruction are the horizontal and vertical gradients p and q of the surface Z. With discrete approximations, they can be written as follows:

$$\begin{aligned} p(x,y)&= Z(x,y)-Z(x-1,y) \end{aligned}$$
(12)
$$\begin{aligned} q(x,y)&= Z(x,y)-Z(x,y-1), \end{aligned}$$
(13)

where Z(x,y) is the depth value of each pixel. From these approximations, the reflectance function \(R(p_{x,y}, q_{x,y})\) can be expressed as

$$\begin{aligned} R(Z(x,y)-Z(x-1,y),Z(x,y)-Z(x,y-1)). \end{aligned}$$
(14)

Using equations 12, 13, and 14, the reflectance equation may also be written as

$$\begin{aligned} f(Z(x,y),Z(x,y-1),Z(x-1,y),I(x,y)) = I(x,y)- R\big (Z(x,y)-Z(x-1,y),\, Z(x,y)-Z(x,y-1)\big )=0. \end{aligned}$$
(15)

Tsai and Shah propose a linear approximation of the function f using a first-order Taylor series expansion about the depth map \(Z^{n-1}\), where \(Z^{n-1}\) is the depth map recovered after \(n-1\) iterations. The resulting update equation is

$$\begin{aligned} Z^{n} (x,y)=Z^{n-1} (x,y) -\frac{f\big (Z^{n-1} (x,y)\big )}{\dfrac{\mathrm {d}f\big (Z^{n-1} (x,y)\big )}{\mathrm {d}Z(x,y)}}, \end{aligned}$$
(16)

where f is the function defined in Eq. 15, constrained by

$$\begin{aligned} \frac{\mathrm {d}f(Z^{n-1}(x,y))}{\mathrm {d}Z(x,y)}\,\big (1+i_x^2+i_y^2\big ) \end{aligned}$$
(17)

and

$$\begin{aligned} i_x&= \cos \tau \cdot \frac{\sin \sigma }{\cos \sigma } \end{aligned}$$
(18)
$$\begin{aligned} i_y&= \sin \tau \cdot \frac{\sin \sigma }{\cos \sigma }. \end{aligned}$$
(19)

The \(n\mathrm{th}\) depth map \(Z^n\) is calculated by using the estimated slant, tilt, and albedo values.
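A compact sketch of the resulting iteration is shown below, assuming the albedo, slant, and tilt have already been estimated as in [33]; for brevity, the derivative df/dZ is evaluated numerically instead of using a closed-form expression, and the simple border handling via np.roll is an illustrative shortcut rather than a description of our implementation.

import numpy as np

def reflectance(p, q, albedo, slant, tilt):
    # R(p, q) of Eq. 11.
    num = (np.cos(slant) + p * np.cos(tilt) * np.sin(slant)
           + q * np.sin(tilt) * np.sin(slant))
    return albedo * num / np.sqrt(p ** 2 + q ** 2 + 1.0)

def tsai_shah_sfs(I, albedo, slant, tilt, iterations=10, eps=1e-3):
    Z = np.zeros_like(I, dtype=np.float64)
    for _ in range(iterations):
        # Discrete surface gradients of Eqs. 12-13 (np.roll wraps at the border).
        p = Z - np.roll(Z, 1, axis=1)
        q = Z - np.roll(Z, 1, axis=0)
        f = I - reflectance(p, q, albedo, slant, tilt)          # Eq. 15
        # Numerical df/dZ: perturbing Z(x, y) perturbs both p and q by +eps.
        f_eps = I - reflectance(p + eps, q + eps, albedo, slant, tilt)
        df_dZ = (f_eps - f) / eps
        Z = Z - f / (df_dZ + 1e-8)                              # Eq. 16 update
    return Z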

4 Evaluation

We evaluate the performance of our system both quantitatively and qualitatively in terms of pose estimation and surface reconstruction. We also report the computational complexity of the proposed framework.

4.1 Dataset

We created our own dataset from a real pig stomach and from a non-rigid open GI tract model, the EGD (esophagus gastroduodenoscopy) surgical simulator LM-103 (Figs. 6, 7). The EGD surgical simulator was used for the quantitative analyses, and the real pig stomach for the qualitative evaluations. Synthetic stomach fluid was applied to the surface of the EGD simulator to imitate the mucosa layer of the inner tissue. To ensure that our algorithm is not tuned to a specific camera model, four different commercially available endoscopic cameras were employed for the video capture, varying in resolution, pixel size, depth of focus, and image quality. A total of 17010 endoscopic frames were acquired by these four camera models, which were mounted on our robotic magnetically actuated soft capsule endoscope prototype (MASCE) (Fig. 8, [34, 35]). The first sub-dataset, consisting of 4230 frames, was acquired with an Awaiba NanEye camera (Table 1). The second sub-dataset, consisting of 4340 frames, was acquired by the Misumi V3506-2ES endoscopic camera with the specifications shown in Table 2. The third sub-dataset of 4320 frames was obtained by the Misumi V5506-2ES endoscopic camera with the specifications shown in Table 3. Finally, the fourth sub-dataset of 4120 frames was obtained by the Potensic mini camera with the specifications shown in Table 4. We scanned the open stomach simulator using the Artec 3D Space Spider scanner and used this 3D scan as the ground truth for the 3D map reconstruction framework (Fig. 9). Even though our focus and ultimate goal is an accurate and therapeutically relevant 3D map reconstruction, we also evaluated the pose estimation accuracy of the proposed framework quantitatively, since precise pose estimation is a prerequisite for accurate 3D mapping. Thus, an Optitrack motion-tracking system consisting of eight Prime-13 cameras and tracking software was utilized to obtain 6-DoF ground-truth localization data of the endoscopic capsule motion with sub-millimeter precision (Fig. 9).

Fig. 8 Robotic magnetically actuated soft capsule endoscopes (MASCE) [34, 35]

Fig. 9 Schematics of the experimental setup for 3D visual map reconstruction: a real pig stomach, an esophagus gastroduodenoscopy simulator for surgical training, 3D image scanner, Optitrack system, endoscopic camera, and active robotic capsule endoscope

Table 1 Awaiba Naneye monocular endoscopic camera

4.2 Trajectory estimation

To evaluate the pose estimation performance, we tested our system on different trajectories of various difficulty levels. The absolute trajectory error (ATE) root-mean-square error (RMSE) metric is used for quantitative pose accuracy evaluation; it measures the root-mean-square of the Euclidean distances between the estimated endoscopic capsule robot poses and the ground-truth poses provided by the motion capture system. Table 5 shows the results of the trajectory estimation for six different trajectories. Trajectory 1 is an uncomplicated path with very slow, incremental translations and rotations. Trajectory 2 follows a comprehensive scan of the stomach with many local loop closures. Trajectory 3 contains an extensive scan of the stomach with more complicated local loop closures. Trajectory 4 consists of more challenging motions, including fast rotational and translational frame-to-frame movements. Trajectory 5 is the same as trajectory 4, but with synthetic noise added to evaluate the robustness of the system against noise effects. Before capturing trajectory 6, we added more synthetic stomach oil to the simulator tissue to create heavier reflection conditions; similar to trajectory 5, trajectory 6 consists of very loopy and complex motions. As seen in Table 5, the system performs very robustly and accurately in terms of trajectory tracking on all of the challenging datasets. Tracking accuracy decreases only for very fast frame-to-frame movements, motion blur, noise, or heavy specular reflections, which occur frequently in the last trajectories in particular.
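For reference, the ATE RMSE used above can be computed as in the following sketch, which rigidly aligns the estimated positions to the ground truth with the standard closed-form (Kabsch/Umeyama) solution before taking the RMSE; the alignment details here are illustrative assumptions, not a description of the motion capture software.

import numpy as np

def ate_rmse(estimated, ground_truth):
    # estimated, ground_truth: (N, 3) arrays of time-associated positions.
    mu_e, mu_g = estimated.mean(axis=0), ground_truth.mean(axis=0)
    E, G = estimated - mu_e, ground_truth - mu_g
    # Closed-form best-fit rotation (Kabsch) from estimated to ground truth.
    U, _, Vt = np.linalg.svd(E.T @ G)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = (R @ E.T).T + mu_g
    errors = np.linalg.norm(aligned - ground_truth, axis=1)
    return np.sqrt(np.mean(errors ** 2))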

Table 2 Misumi-V3506-2ES monocular camera
Table 3 Misumi-V5506-2ES monocular camera
Table 4 Potensic monocular mini camera
Table 5 Comparison of ATE RMSE for different trajectories and cameras

RMSE results for pose estimation before and after application of reflection suppression, de-vignetting, and radial undistortion were evaluated and compared to quantitatively analyze their effects on pose estimation accuracy. The results shown in Table 6 for the Misumi camera II indicate that reflection suppression leads to a decrease in pose estimation performance. This decrease might be related to the fact that the saturated peak values carry orientation information. Thus, for pose estimation, reflection suppression should be avoided. On the other hand, the radial undistortion and de-vignetting operations both increase the pose estimation accuracy of the framework, as expected.

Table 6 Comparison of ATE RMSE for MISUMI-II camera and different combinations of preprocessing operations
Table 7 Comparison of surface reconstruction accuracy results on the evaluated datasets
Fig. 10 Qualitative 3D reconstructed map results for different cameras (real pig stomach (left), synthetic human stomach (right))

Table 8 Comparison of ATE RMSE for different trajectories and combinations of preprocessing operations on the evaluated dataset

4.3 Surface reconstruction

We evaluated the surface reconstruction accuracy of our system on the same dataset that we used for trajectory estimation. We scanned the open non-rigid esophagogastroduodenoscopy (EGD) simulator with a highly accurate commercial 3D scanner (Artec 3D Space Spider) to obtain the ground-truth 3D data. The final 3D map of the stomach model obtained by the proposed framework and the ground-truth scan were aligned using the iterative closest point (ICP) algorithm. The absolute depth error (ADE) RMSE was used to evaluate the performance of the map reconstruction approach; it measures the root-mean-square of the Euclidean distances between the estimated depth values and the corresponding ground-truth depth values. The lowest RMSE of \(2.14\,\mathrm{cm}\) (Table 7) shows that our system can achieve very high map accuracies. Even on more challenging trajectories such as trajectory 3, our system is still capable of providing an acceptable 3D map of the explored inner organ tissue. Three-dimensional reconstructed maps of the real pig stomach and the synthetic human stomach are shown in Fig. 10 for visual reference.
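As an illustration of the surface error metric, the sketch below computes an RMSE of nearest-neighbor distances between the reconstructed point cloud and the ground-truth scan, assuming the two have already been aligned with ICP; using nearest-neighbor distances is a simplifying stand-in for the per-depth-value correspondence described above.

import numpy as np
from scipy.spatial import cKDTree

def surface_rmse(reconstruction, ground_truth_scan):
    # reconstruction: (N, 3) reconstructed points already ICP-aligned to the
    # (M, 3) ground-truth scan; error is the nearest-neighbor distance.
    distances, _ = cKDTree(ground_truth_scan).query(reconstruction)
    return np.sqrt(np.mean(distances ** 2))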

To evaluate the contribution of each preprocessing module to the map reconstruction accuracy, we tested the approach with a leave-one-out strategy, disabling one module at a time. As shown in Table 8, each preprocessing operation has a certain influence on the RMSE results. One important observation is that even though pose accuracy increases in the presence of reflection points, these saturated pixels have a negative influence on the map accuracy, as expected. Therefore, disabling reflection suppression during pose estimation and enabling it for map reconstruction is the best option.

4.4 Computational performance

To analyze the computational performance of the proposed framework, we determined the average frame pair processing time across the trajectory sequences. The test platform was a desktop PC with an Intel Xeon E5-1660 v3 CPU at 3.00 GHz with 8 cores, 32 GB of RAM, and an NVIDIA Quadro K1200 GPU with 4 GB of memory. Three-dimensional reconstruction took 80.54 s for 100 frames, 180.83 s for 200 frames, and 290.12 s for 300 frames. This indicates an average frame pair processing time of 919.15 ms, implying that our pipeline needs to be accelerated using more effective parallel computing and GPU power in order to reach real-time performance. To achieve this, we developed an RGB-Depth SLAM method, which is capable of capturing comprehensive and globally dense surfel-based maps of the inner organs in real time by using joint photometric–volumetric pose alignment, dense frame-to-model camera tracking, and frequent model refinement through non-rigid surface deformations [1]. The execution time of the RGB-Depth SLAM depends on the number of surfels in the map, with an overall average of 48 ms per frame scaling to a peak average of 53 ms, implying a worst-case processing frequency of 18 Hz. Even though RGB-Depth SLAM is much faster than our sparse-then-dense alignment-based 3D reconstruction method, the map quality decreases due to the use of surfel elements. Moreover, the joint photometric–volumetric pose alignment is prone to converging to local minima in low-textured areas. For further details of our RGB-Depth SLAM method, the reader is referred to our paper [1].

4.5 Conclusion

In this study, we proposed a therapeutically relevant and detailed 3D map reconstruction approach for endoscopic capsule robots consisting of preprocessing, keyframe selection, sparse-then-dense pose estimation, frame stitching, and shading-based 3D reconstruction. Detailed quantitative and qualitative evaluations show that the proposed system achieves sub-millimeter precision for both 3D map reconstruction and pose estimation. In the future, we aim to achieve real-time operation of the proposed framework so that it can also be used for active navigation of the robot during endoscopic operations. Moreover, we plan to incorporate a magnetic localization and scale estimation module into our method to develop even more robust endoscopic reconstruction tools.