Robotic-assisted minimally invasive surgery (RAMIS) has become an attractive alternative to traditional and laparoscopic surgery in recent years, since it offers diverse advantages to both surgeons and patients [1]. In particular, RAMIS has enabled complex procedures such as off-pump coronary artery bypass grafting (OPCABG) [2]. Because the heart is not arrested, this procedure avoids the complications associated with cardiopulmonary bypass (CPB). However, surgeons must then deal with a dynamic target, which compromises their dexterity and precision.

Fig. 1
figure 1

Overview of our proposed approach to estimate the cardiac motion in a RAMIS setup which is composed of three main parts

To compensate for the heart motion, several authors have proposed solutions based on mechanical stabilization (see, for example, [3, 4]), in which small devices are positioned over the heart surface to keep the region to be repaired steady. However, works such as the one presented in [5] reported that a significant residual motion (1.5–2.4 mm) remains after mechanical stabilization. This would require manual compensation by the surgeon, which is not feasible since the heart motion exceeds the human tracking bandwidth [6]. Moreover, mechanical stabilizers can only be positioned on a small region of the heart surface and can cause irreversible heart damage that affects the cardiac mechanics [7, 8].

To overcome these difficulties, the pioneering work of Nakamura [9] showed that motion cancelation is possible by tracking the heart dynamics and continuously synchronizing the robot with this motion. Several authors have followed this direction. An image-based motion tracking algorithm was proposed in [10] for retrieving the cardiac surface deformation using a stereo endoscopic system; however, the authors did not take into account the effect of occlusions on the performance and stability of the tracking algorithm. Later, Ortmaier et al. [11] presented a 2D affine matching algorithm using natural landmarks to estimate the heart motion. They dealt with occlusions by integrating a prediction scheme based on Takens' theorem that combines electrocardiogram and respiration pressure signals.

Richa et al. [12] proposed tracking the heart surface using a thin-plate spline (TPS) deformable model and included an illumination compensation solution. Another approach was presented in [13], in which the heart motion was retrieved using a stochastic physics-based tracking technique and occlusions were tackled using an extended Kalman filter (EKF). A 3D tracking approach based on a quasi-spherical triangle was introduced in [14], where the authors modeled the heart surface using a triangle-based model with a curving parameter. They handled occlusions with an algorithm based on the peak–valley characteristics of the motion signals.

In more recent work, the authors of [15] presented a scheme for tracking the heart motion using two recursive processes: the first represents the target region in a joint spatial-color space, while the second fits a thin-plate spline model to the heart shape around the region of interest. Yang [16] proposed a motion prediction scheme for tracking the heart motion during occlusion events based on a dual Kalman filter, in which a point of interest is modeled as a dual time-varying Fourier series.

Aim of this work

In this work, we propose a new approach for estimating the heart motion. The main contributions of our solution are:

  • A diffeomorphic variational framework that is able to deal with the inherent complex deformation of a beating heart while guaranteeing preservation of the anatomy using a topology preserving penalizer. Our framework maintains affine linear transformations by means of the curvature penalizer and incorporates a preprocessing stage for dealing with specular highlights.

  • A prediction stage, which is a key point of this paper since it differs from existing approaches to the problem at hand. We propose sliding the cardiac motion data to formulate a standard supervised learning problem, which is handled via a conditional restricted Boltzmann machine (CRBM).

Toward estimating the beating heart motion

In this section, we present our approach, which is composed of the three main parts illustrated in Fig. 1; in this work, we focus on the second and third parts.

Fig. 2
figure 2

3D diffeomorphic surface reconstruction from the projection of the lattice points defined in each stereo-pair image

Cardiac motion estimation

Specular highlights hinder the performance of vision-based solutions since they partially occlude the targeted surface, appear as additional features, generate discontinuities in the images and cause loss of texture or color information. In this work, we adapted our specular-free image solution, presented in [17], to the case of stereo-pair frames.

Assume a calibrated image sequence \(G=\{g_{s}\}_{s=0}^{S-1}\) composed of S stereo-pair frames, where \(g_{s}=\{f_{\mathrm{r}}^{s},f_{\mathrm{l}}^{s}\}\). Let \(f_{\mathrm{r}}^{s}\) and \(f_{\mathrm{l}}^{s}\) denote the right and left views of \(g_s\), defined on a bounded domain \(\Omega \subset {\mathbb {R}}^2\). To retrieve the heart motion, we start by defining a lattice on each stereo view according to the following definition:

Definition 1

A lattice, \({\mathfrak {L}}\), is a subgroup in a real vector space V of dimension d that has the form \({\mathbb {Z}}v_{1}+\cdots +{\mathbb {Z}}v_{d}\)

Consider \({\mathfrak {L}}_{\mathrm{l}}^{s},{\mathfrak {L}}_{\mathrm{r}}^{s} \subset {\mathbb {R}}^{2}\) as the lattices defined on the left and right views of \(g_{s}\). We recover the 3D heart surface by computing the projections of the corresponding points of \({\mathfrak {L}}_{\mathrm{l}}^{s}\) and \({\mathfrak {L}}_{\mathrm{r}}^{s}\), as illustrated in Fig. 2, which results in the three-dimensional lattice \({\mathfrak {L}}^{s}\subset {\mathbb {R}}^{3}\) with a set of lattice points \({\mathbf {B}}\). In this work, we represent the deformable heart surface by the tensor product of the b-splines \(\xi _{c}\). Assume a given position \(x \in {\mathbb {R}}^{d}\), a d-dimensional lattice point \(z:=y_{1}{\ldots }y_{d}\) and b-splines of degree n. Then the deformation can be represented as:

$$\begin{aligned} \begin{aligned}&\varphi (x;{\mathbf {B}})=\sum _{j_{1}=0}^{n}{\ldots }\sum _{j_{d}=0}^{n}{\mathbf {B}}_{j_{1},{\ldots },j_{d}}\prod _{k=1}^{d}\xi _{k,c}(x_{k})\\&\quad \text {for}\quad c=0,1,2,3 \end{aligned} \end{aligned}$$
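The tensor-product deformation above can be sketched numerically. The following is a minimal illustration of a cubic (n = 3) free-form deformation evaluated over a lattice of control points; the function names, the lattice spacing parameter and the cell-corner indexing convention are our own assumptions, not the paper's implementation.

```python
import numpy as np

def cubic_bspline_basis(u):
    """The four cubic B-spline basis functions xi_c(u), c = 0..3, for u in [0, 1)."""
    return np.array([
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    ])

def ffd_displacement(x, B, spacing):
    """Evaluate the tensor-product B-spline deformation phi(x; B) at a
    d-dimensional position x, given the lattice of control points B
    (shape (n1, ..., nd, d)) and the lattice spacing per axis."""
    d = len(x)
    # lattice cell containing x, and the local coordinate u in [0, 1)
    idx = np.floor(np.asarray(x) / spacing).astype(int)
    u = np.asarray(x) / spacing - idx
    basis = [cubic_bspline_basis(u[k]) for k in range(d)]
    disp = np.zeros(d)
    # sum over the 4^d neighbouring control points (j_1, ..., j_d),
    # indexed here from the cell corner
    for offsets in np.ndindex(*([4] * d)):
        w = 1.0
        for k in range(d):
            w *= basis[k][offsets[k]]
        disp += w * B[tuple(idx + np.array(offsets))]
    return disp
```

Since the basis functions form a partition of unity, a lattice of identical control points reproduces that value exactly, which is a convenient sanity check.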

After defining the deformation model, the changes on the heart surface’s deformation over time are computed by an energy functional that is composed of three terms: (i) a data term that allows measuring the discrepancy between the current \(f_{\mathrm{r}}\) and \(f_{\mathrm{l}}\), (ii) a regularization term that enforces a plausible transformation and (iii) a topology preservation term which ensures connectivity between the structures created within the lattice.

In particular, we represent the data term with the sum of squared differences, replacing the minimization of the residual error \(\sum _{i}{r_{i}^{2}}\) with \(\sum _{i}\rho ({r_{i}})\), where \(\rho \) is Tukey's M-estimator, which increases robustness against outliers. The second term is formulated using the curvature method, which has the advantage of penalizing oscillations while keeping affine linear transformations [18].
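As an illustration of the robust data term, the following sketch evaluates Tukey's biweight \(\rho\); the cutoff value c = 4.685 is the conventional tuning constant and is an assumption here, since the choice is not stated above.

```python
import numpy as np

def tukey_rho(r, c=4.685):
    """Tukey's biweight M-estimator rho(r). Residuals beyond the cutoff c
    contribute a constant cost, so gross outliers stop influencing the fit."""
    r = np.asarray(r, dtype=float)
    inside = np.abs(r) <= c
    rho = np.full(r.shape, c ** 2 / 6.0)      # saturated cost for outliers
    rho[inside] = (c ** 2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return rho
```

Unlike the squared residual, this cost saturates, which is what makes the data term robust to specular remnants and other outliers.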

Definition 2

A map \(f:X\rightarrow Y\) preserves topology if its inverse \(f^{-1}\) exists and both f and \(f^{-1}\) are smooth.

Following Definition 2, the third term is the topology preservation term that we first proposed in [19], extended here to 3D. This penalization term controls the Jacobian determinant in order to preserve the anatomical structure of organs. Unlike works in which topology preservation is not considered, such as [12, 14, 20, 21], in this work we demonstrate the relevance of preserving the heart's anatomical structure, especially during complex deformations. With these three terms, our energy functional is given by:

$$\begin{aligned} \begin{aligned}&{\hat{\mathbf {E}}}_{s}({\mathbf {B}})\\&\quad =\,\bigg (\frac{1}{m}\bigg )\underbrace{\int _{\Omega }\rho (f_{\mathrm{r}}^{s}(\varphi (x;{\mathbf {B}})+x)-f_{\mathrm{l}}^{s}(x)){\mathrm{d}x}}_{\text {data term}}\\&\qquad +\,\underbrace{\sum _{i=1}^{d}\int _{\Omega }(\Delta \varphi (x;{\mathbf {B}})_{i})^{2}{\mathrm{d}x}}_{\text {regularization term}} \underbrace{ + \int _{\Omega } \delta _{{\varphi }}(x;{\mathbf {B}}) {\mathrm{d}x} }_{\text {topology preservation term}} \end{aligned} \end{aligned}$$

where m is the number of pixels in the overlapped domain \(\Omega _{f_{\mathrm{r}},f_{\mathrm{l}}}\) and our term \(\delta _{{\varphi }}\) is defined as:

$$\begin{aligned} \delta _{\varphi }(x;{\mathbf {B}}):= \left\{ \begin{array}{ll} \dfrac{\frac{1}{2}\pi -\arctan (| J_{\varphi }(x;{\mathbf {B}})|)}{\pi } +\varphi \sqrt{\vert J_{\varphi }(x;{\mathbf {B}})\vert ^2} &{}\quad \text {if}\;\big |\, |J_{\varphi }(x;{\mathbf {B}})| -1\,\big | \ge \tau \\ 0 &{}\quad \text {otherwise} \\ \end{array}\right. \end{aligned}$$
Fig. 3
figure 3

(Top) Illustration of the RBM and CRBM architectures; (bottom left) how the reconstructed heart motion is used as input to the CRBM; (bottom right) the accumulated lattice points over time

where \(\varphi \in {\mathbb {R}}^+\) balances our penalization and \(\tau \in {\mathbb {R}}^+\) is the margin of acceptance for values close to one. While the main purpose of the first term is to guarantee the positivity of the Jacobian determinant, which translates into avoiding the creation of new structures in the defined lattice, the second term penalizes large values, which translates into the prevention of large expansions and contractions. An illustrative explanation can be found in Supplementary Material Fig. 1. To solve our energy functional described in Eq. 2, we use the Levenberg–Marquardt (LM) method, which combines the advantages of both the gradient descent and Gauss–Newton methods.
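A minimal sketch of the penalizer \(\delta_{\varphi}\) evaluated on precomputed Jacobian determinants may clarify its behavior. The weight is named phi_w here to avoid the clash with the deformation \(\varphi\) and matches the value used in Experiment 2 below; the margin tau is an illustrative assumption.

```python
import numpy as np

def topology_penalty(det_J, phi_w=3e-3, tau=0.03):
    """Per-point topology preservation penalty delta_phi: active only when
    the Jacobian determinant strays more than tau from 1. The arctan term
    punishes non-positive determinants (folding, i.e., creation of new
    structures); the term weighted by phi_w punishes large magnitudes
    (big expansions or contractions)."""
    det_J = np.asarray(det_J, dtype=float)
    active = np.abs(np.abs(det_J) - 1.0) >= tau
    pen = (0.5 * np.pi - np.arctan(det_J)) / np.pi + phi_w * np.abs(det_J)
    return np.where(active, pen, 0.0)
```

A determinant near 1 (a near-rigid local deformation) incurs no cost, while a negative determinant (a fold) is penalized heavily.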

Cardiac motion prediction

During a RAMIS procedure, a common challenge is the presence of partial occlusions, which compromise the tracking precision and can lead to algorithm failure. Studies in the cardiac motion estimation literature cope with this problem using algorithms from classic estimation theory, such as the EKF and the Auto-Regressive eXogenous (ARX) model. In this work, we go beyond those solutions and use tools drawn from machine learning as an alternative for predicting sequential data.

As in any supervised learning problem, a set of n training samples in the form of input–output pairs \(\{(x_{i},y_{i})\}_{i=1}^{n}\) is needed to find a function M that maps \(X\xrightarrow {M} Y\) and generalizes well to unseen inputs x. In a real clinical scenario, however, it is difficult to extract true observed values Y when estimating the cardiac motion. To mitigate the lack of a set Y and define a standard supervised learning problem, we slide [22] the given sequential data \(\{x_{i}\}_{i=1}^{n}\) by setting \(y_{i}={x}_{i+d}\), where d is the time step size known as the lag, which results in the input–output pairs \(\{(x_{i},y_{i})\}_{i=1}^{n-d}\). An example illustrating this process can be found in Supplementary Material Fig. 2.
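The sliding step above can be sketched in a few lines; the function name and list-based representation are illustrative.

```python
def slide_series(x, d=1):
    """Restructure a sequence x_1..x_n into supervised input-output pairs
    (x_i, x_{i+d}) by sliding the series by the lag d."""
    return [(x[i], x[i + d]) for i in range(len(x) - d)]
```

For a lag of d, a sequence of n samples yields n − d training pairs.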

Taking the previous restructured data, our goal is to predict the heart motion within the lattice domain not just to deal with occlusion events, but as a feedback information for improving the heart motion estimation.

Definition 3

A restricted Boltzmann machine (RBM) is a two-layer graphical model that learns a probability distribution over a given set of inputs. It is defined by an energy E, and the probability distribution of the visible and hidden units is given in terms of E as:

$$\begin{aligned}&E_{RBM}(v,h|W,b^{v},b^{h})= -( v^{\intercal } W^{vh}h+ v^{\intercal }b^{v}+ h^{\intercal }b^{h})\nonumber \\&= -\left( \sum _{i}\sum _{j}v_{i}W_{ij}h_{j}+\sum _{i}v_{i}b_{i}^{v}+\sum _{j}h_{j}b_{j}^{h}\right) \end{aligned}$$
$$\begin{aligned}&p(v,h)=\frac{1}{Z}exp({-E_{RBM}(v,h)}) \end{aligned}$$

where W is the weight matrix, h and v are the hidden and visible units, \(b^{v}\) and \(b^{h}\) are the unit biases, and Z is the normalization factor.
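The RBM energy and its unnormalized probability can be written directly from the definition above; this sketch omits the partition function Z, which is intractable for all but tiny models.

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Energy E_RBM(v, h) = -(v^T W h + v^T b_v + h^T b_h) of a restricted
    Boltzmann machine with visible units v and hidden units h."""
    return -(v @ W @ h + v @ b_v + h @ b_h)

def unnormalized_prob(v, h, W, b_v, b_h):
    """exp(-E); dividing by the partition function Z would give p(v, h)."""
    return np.exp(-rbm_energy(v, h, W, b_v, b_h))
```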

Although RBMs are powerful models, they are not able to capture temporal dependencies in the modeled data. To cope with this problem, an extension of RBMs called the conditional restricted Boltzmann machine (CRBM) [23] has recently attracted attention, particularly for motion capture [23, 24]. For illustration purposes, refer to the top part of Fig. 3.

Fig. 4
figure 4

(Left) Specularity elimination and inpainting results. (Right) Error and signal-to-noise ratio (SNR) plots

To improve the cardiac motion estimation within the lattice domain, we exploit the CRBM as a tool to, on the one hand, improve the heart motion estimation and, on the other, predict the motion during occlusion events. Let c be the vector (the conditional) that contains the past lattice-point motion at times \(t-1, t-2, {\ldots }, t-M\); see the illustration in the bottom part of Fig. 3. The joint probability function, given the hidden and visible layers, the conditional data and the M past elements, is expressed in terms of the energy \(E_\mathrm{CRBM}\) as:

$$\begin{aligned}&E_\mathrm{CRBM}(v_{t},h_{t}|c,W,{\mathcal {W}},b^{v},b^{h})= E_\mathrm{RBM}(v,h|W,b^{v},b^{h})\nonumber \\&-\sum _{m}\left( \sum _{k}\sum _{i}v_{ki,t-m}{\mathcal {W}}_{ki,t-m}v_{it} +\sum _{k}\sum _{j}v_{kj,t-m}{\mathcal {W}}_{kj,t-m}h_{j,t}\right) \end{aligned}$$
$$\begin{aligned}&p(v_{t},h_{t}|c,W,{\mathcal {W}},b^{v},b^{h})=\frac{1}{Z}e^{\big (-E_{CRBM}(v_{t},h_{t}|c,W,{\mathcal {W}},b^{v},b^{h})\big )} \end{aligned}$$

To train the CRBM, we used the well-known contrastive divergence algorithm [25]. Details about the architecture, such as the number of units, are given in the experimental results.
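As a hedged illustration of the training procedure, the following sketches one contrastive divergence (CD-1) update for a plain binary RBM. A full CRBM additionally computes dynamic biases from the conditional vector c (the past frames) through the autoregressive weights \({\mathcal {W}}\); that part is omitted here for brevity, and all names are our own, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=1e-2, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM: sample h from the data v0,
    reconstruct v, recompute h, and nudge the parameters toward the
    data statistics and away from the model's reconstruction."""
    p_h0 = sigmoid(v0 @ W + b_h)                       # hidden probs from data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
    p_v1 = sigmoid(h0 @ W.T + b_v)                     # reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)                     # hidden probs from recon
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)
    return W, b_v, b_h
```

In practice a momentum term (the paper uses 0.9) is added to each parameter update; it is left out of this sketch.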

Experimental results

Cardiac data description

We used both phantom and in vivo datasets [26] to evaluate our approach. The phantom dataset is a silicone heart undergoing cardiac motion. It is composed of 3389 stereo-pair images of size \(720\times 288\). We refer to this phantom dataset as Dataset I (see the bottom part of Fig. 4).

The in vivo data come from a robotic-assisted totally endoscopic coronary artery bypass surgery. It is composed of 1573 stereo-pair images of size \(720\times 288\). We refer to this sequence as Dataset II (see the top part of Fig. 4).

Fig. 5
figure 5

(From top to bottom) For each dataset: example input raw data frames, accumulated displacement of the reconstructed 3D heart at different time instances and visualization of the recovered region of interest. (Bottom left) The Jacobian Determinant results of our vision-based approach, with and without applying our topology preservation term, in two different cases: retrieval of complex deformation and under illumination variation. (Bottom right) The convergence results of our optimization process on the two datasets while using the topology preservation term

Fig. 6
figure 6

Motion of a point of interest over time used in the prediction stage

Results and discussion

In this section, we focus on evaluating the three parts that compose our approach through a set of numerical results and graphical and visual analyses.

Specular-free approach The evaluation of our specular-free approach is shown in Fig. 4. To offer a quantitative evaluation of our detection approach, we used a ground truth for each of the sequences. The results showed that the specular highlight regions were detected with \(\sim \) 99% accuracy in all datasets. Aside from this numerical evaluation, we also show detection and inpainting results on frames from each dataset in the left part of Fig. 4. From visual inspection, it is clear that our approach adapts well to diverse color variations. The right part of Fig. 4 shows visualizations of the inpainting results along with plots of the Sobolev energy minimization and the signal-to-noise ratio (SNR) improvement during the inpainting process.

Vision-based cardiac motion estimation We start evaluating our vision-based approach (see Eq. 2) by recovering the heart motion. In Fig. 5, we show the resulting 3D reconstruction of the heart surface using Datasets I and II. The top rows of Fig. 5 for both datasets show stereo-pair image samples with the region to be repaired pointed out. The middle rows show the accumulated displacement field over the complete image domain. As the images show, unlike Dataset I, which exhibits strong homogeneity of the surface, Dataset II presents strong visual texture, which provides more stable features when tracking the region of interest. The bottom rows of both datasets illustrate the 3D reconstruction of the region of interest (ROI), which is used as input to the next stage (the prediction stage). We only use information from the ROI since the surgeon's attention is focused on the zone to be repaired. The plots in the bottom rows show visually convincing 3D ROI results on both phantom and in vivo data.

For the quantitative analysis, we evaluated the global performance of our vision-based approach. The first question we pose is: How robust is our vision-based cardiac motion estimation approach? To answer it, we carried out two experiments:

  • Experiment 1: Without topology preservation by setting \(\delta _{{\varphi }}=0\) in Eq. 2.

  • Experiment 2: With our topology preservation term by setting \(\varphi =3\cdot 10^{-3}\) in Eq. 2.

After running both experiments, we found that the average range [min, max] of the Jacobian determinant for Exp. 1 was \([-\,2.5471, 3.0012]\) with an average residual error on the order of \(10^{-2}\), while for Exp. 2, the Jacobian exhibited stable values with an average range of [0.9715, 1.015], yielding an average minimum on the order of \(10^{-7}\). The significance of this minimum lies in the fact that a smaller final energy value indicates a better solution reached by the minimization. Samples of the Jacobian determinant over the region of interest are displayed at the bottom of Fig. 5.

These results, together with a nonparametric Wilcoxon test that revealed a statistically significant difference between the two experiments, lead us to conclude that our penalizer helps to obtain a better minimum and speeds up the convergence of the solution (see the bottom right side of Fig. 5).

Cardiac motion prediction In this subsection, we analyze the performance of our approach during partial occlusions. To do this, we first extracted the motion of a point of interest in the (x, y, z) directions from both datasets, as shown in Fig. 6. These data are used in the remainder of this section.

To offer a detailed analysis of our prediction scheme, we took two well-known predictors from classic estimation theory: NARX and the EKF. We use these two predictors to check whether a statistically significant difference exists between those schemes and the CRBM-based one over 200 frames.
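The error metric used throughout this comparison is the standard root mean square error; a short sketch (with hypothetical data, not the paper's measurements) is:

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error between a predicted and an estimated trajectory."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```

The per-direction RMSE values reported below are obtained by applying this metric separately to the x, y and z components of each predictor's output.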

Fig. 7
figure 7

Estimated vs predicted comparison in x, y, z directions and for two predictors from the body of the literature, NARX and EKF, and the one used in this work—CRBM over 200 future frames, and the corresponding RMSE

We begin by analyzing the NARX predictor; Fig. 7 (top left) shows the resulting prediction in the x, y and z directions. From visual inspection, it is clear that the prediction was acceptable in the x and y directions. However, in the z direction, the predicted values were far from the target. This is further supported by the root mean square error (RMSE) computed for all directions and plotted at the bottom of Fig. 7. The RMSE shows that NARX was able to predict the x and y directions within a maximum RMSE of 1.1 mm, while the z direction was far from being retrieved accurately, reaching a maximum of 1.7 mm with an average of 0.69 mm.

We also evaluated the performance of the EKF, which is probably the most widely used predictor. The results are reported in Fig. 7 (top middle). A visual inspection shows that the EKF outperformed the NARX predictor in all directions. This is also evidenced by the RMSE reported at the bottom of Fig. 7, which exhibits a concentration of error values below 0.2 mm. In particular, the maximum errors for x, y and z are 0.38, 0.43 and 0.27 mm, respectively, and the average RMSE is 0.1153 mm.

Finally, we evaluated the CRBM for predicting the cardiac motion. For the CRBM, we set the learning rate to \(10^{-2}\), the momentum to 0.9 and the number of hidden units to 350. The prediction results are shown in Fig. 7 (top right). A visual comparison shows that the values estimated by the CRBM are closest to the target values. This is supported by the RMSE, which reached a maximum of 0.12 mm over all directions with an average of 0.071 mm.

But is there a significant difference in prediction between NARX, the EKF and the CRBM? Results from the nonparametric Friedman test, \(\chi (3)=18.154\), \(p<0.001\), indicated a statistically significant difference. This leads us to conclude that the CRBM achieves a better prediction than NARX and the EKF. The same quantitative analysis was performed on the in vivo dataset, where the results also favored the CRBM (a detailed description can be found in the supplementary material text and Fig. 3).


Conclusions
In this work, we proposed recovering the 3D cardiac motion by means of a variational framework that guarantees the anatomical preservation of the heart. A key point of our solution is its robustness to partial occlusions, achieved with a generative model (a CRBM).

The results revealed a robust vision-based approach that reached an average minimum on the order of \(10^{-7}\), providing stable values of the Jacobian determinant. In terms of prediction, our CRBM-based approach reported the lowest average RMSE of 0.071 mm in comparison with NARX and the EKF. This is further supported by a statistical test that indicated a significant difference in estimation between the three predictors. Together with the RMSE, this demonstrates the potential of using a CRBM (deep learning) in RAMIS scenarios.

While we wanted to demonstrate the potential of combining a diffeomorphic variational framework with supervised learning techniques (particularly the CRBM), from a technical point of view, the aim of this work is to report an initial study as a proof of concept. Future work will include a more extensive evaluation to explore the clinical potential of our approach.