A ToF-Aided Approach to 3D Mesh-Based Reconstruction of Isometric Surfaces

  • S. Jafar Hosseini
  • Helder Araujo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9443)


In this paper, we investigate structure-from-motion (SfM) for surfaces that deform isometrically. Our SfM framework is intended for the simultaneous estimation of the 3D surface and the camera motion through a template-based approach founded on the combination of a ToF sensor and a conventional RGB camera. The objective is to take advantage of depth maps acquired by the ToF sensor so that a considerable enhancement can be achieved in the reconstruction of the non-rigid structure using the high-resolution images captured by the RGB camera. A triangular mesh is adopted to represent isometric surfaces. The depth of a sparse set of 3D feature points spread over the surface is obtained with the help of the ToF camera, thereby enabling the recovery of the depth of the mesh vertices using a multivariate linear system. Subsequently, a non-linear constraint is formed based on the projected length of each edge of the mesh. A second non-linear constraint is then used for minimizing re-projection errors. These constraints are finally incorporated into an optimization scheme to solve for structure and motion. Experimental results show that the proposed approach performs well even when only a low-resolution depth image is used.


Keywords: Structure from motion · Isometric surface · ToF camera · 3D reconstruction

1 Introduction

Structure-from-motion can be defined as the problem of simultaneous inference of the motion of a camera and the 3D geometry of the scene solely from a sequence of images. SfM has also been extended to the case of deformable objects. Non-rigid SfM is under-constrained, which means that the recovery of non-rigid 3D shape is an inherently ambiguous problem [23, 24]. Given a specific configuration of points on the image plane, different 3D non-rigid shapes and camera motions can be found that fit the measurements. To resolve this ambiguity, prior knowledge about the shape and motion should be used to constrain the solution. For example, Aanaes et al. [1] impose the prior knowledge that the reconstructed shape does not vary much from frame to frame, while Del Bue et al. [2] impose the constraint that some of the points on the object are rigid. The priors can be divided into two main categories: statistical priors and physical priors. For instance, the methods relying on the low-rank factorization paradigm [1, 2] can be classified as statistical approaches. Learning approaches such as [3, 21, 22] also belong to the statistical category. Physical constraints include spatial and temporal priors on the surface to be reconstructed [6, 7]. A physical prior of particular interest is the hypothesis of an inextensible (i.e. isometric) surface [8, 9, 10]. In this paper, we consider this type of surface. This hypothesis means that the length of the geodesics between any two points on the surface should not change over time, which makes sense for many types of material such as paper and some types of fabric.

3D reconstruction of non-rigid surfaces from images is an under-constrained problem, and many different kinds of priors have been introduced to restrict the space of possible shapes to a manageable size. The algorithms for the reconstruction of deformable surfaces can be classified according to the type of surface model (or representation) used. Point-wise methods reconstruct the 3D position of only a relatively small number of feature points, resulting in a sparse reconstruction of the 3D surface [9]. Physics-based models such as superquadrics [11], triangular meshes [10], Thin-Plate Splines (TPS) [9], or tensor product B-splines [18] have also been utilized in other algorithms. In TPS, the 3D surface is represented as a parametric 2D-3D map between the template image space and the 3D space. A parametric model is then fit to a sparse set of reconstructed 3D points in order to obtain a smooth surface, which is not actually used in the 3D reconstruction process. There has been increasing interest in learning techniques that build surface deformation models from training data. More recently, linear models have been learned for SfM applications [12, 13]. There have also been a number of attempts at performing 3D surface reconstruction without using a deformation model. One approach is to use lighting information in addition to texture clues to constrain the reconstruction process [14], which has only been demonstrated under very restrictive assumptions on lighting conditions and is therefore not generally applicable. A common assumption in deformable surface reconstruction is to consider that the surface is inextensible. In [9], the authors propose a dedicated algorithm that enforces the inextensibility constraints. However, the inextensibility constraint alone is not sufficient to reconstruct the surface. A different implementation is presented in [4, 10].
In these papers, a convex cost function combining the depth of the reconstructed points and the negative of the reprojection error is maximized while enforcing the inequality constraints arising from the surface inextensibility. The resulting formulation can be easily turned into an SOCP problem. A similar approach is explored in [8]. The approach of [9] is a point-wise method. The approaches of [4, 8, 10] use a triangular mesh as the surface model, and the inextensibility constraints are applied to the vertices of the mesh.

1.1 Model and Approach

In this work, we aim at the combined inference of the 3D surface and the camera motion while preserving the geodesics, by using an RGB camera aided by a ToF range sensor. Usually, RGB cameras have high image resolutions. With these cameras, one can use efficient algorithms to calculate the depth of the scene, recover object shape or reveal structure, but at a high computational cost. ToF cameras deliver a depth map of the scene in real-time but with insufficient resolution for some applications. So, a combination of a common camera and a ToF sensor can exploit the capabilities of both. We assume that the fields of view of both the RGB and the ToF cameras mostly overlap. The surface is represented as a triangular 3D mesh, and a set of correspondences between 3D feature points and 2D locations in an input image is available. In practice, they are obtained by matching SIFT features between the input image and a reference image in which the surface shape is known. The 2D points in the reference image correspond to 3D feature points on the mesh. The goal of the algorithm is to allow the 3D reconstruction of the surface mesh when matching is difficult and depth estimates are available for a limited number of points on the surface. Our approach performs SfM under the constraint that the deformation be isometric.

1.2 Outline of the Paper

This paper is organized as follows: to represent an isometric surface, a triangular mesh as well as a planar reference configuration is used. In Sect. 3, the matching between data from the range and the RGB cameras is described. Next, the estimation of the depth of the mesh vertices based on the depth of the feature points is described. The entire approach for the estimation of the 3D shape and motion is based on minimizing the sum of both the re-projection errors and the errors on the projected length of the mesh edges. Experimental results and quantitative evaluation are presented in the last section. We show that our approach is able to handle isometry indirectly without having to directly apply this constraint. In addition, it obviates the need for a dense set of 3D points lying on the surface by effective use of a ToF sensor.

2 Notation and Background

2.1 Notation

Matrices are represented as bold capital letters (\(\mathbf {A}\in \mathbb {R}^{n\times m}\), n rows and m columns). Vectors are represented as bold small letters (\(\mathbf {a}\in \mathbb {R}^{n}\), n elements). By default, a vector is considered a column. Small letters (a) represent one-dimensional elements. By default, the jth column vector of \(\mathbf {A}\) is specified as \(\mathbf {a}_{j}\). The jth element of a vector \(\mathbf {a}\) is written as \( a_{j}\). The element of \(\mathbf {A}\) in row i and column j is represented as \( A_{i,j}\). \(\mathbf {A}^{(1:2)}\) and \(\mathbf {a}^{(1:2)}\) indicate the first 2 rows of \(\mathbf {A}\) and \(\mathbf {a}\). \(\mathbf {A}^{(3)}\) and \(\mathbf {a}^{(3)}\) denote the third row of \(\mathbf {A}\) and \(\mathbf {a}\), respectively. Regular capital letters (\( A \)) indicate one-dimensional constants. We use \(\mathbb {R}\) after a vector or matrix to denote that it is represented up to a scale factor. There may be a few cases that deviate from this notation; however, the aim is to comply with it as closely as possible.

2.2 Barycentric Coordinates

In geometry, the barycentric coordinate system is a coordinate system in which the location of a point of a simplex (a triangle, tetrahedron, etc.) is specified as the center of mass, or barycenter, of masses placed at its vertices.
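As a concrete illustration, the barycentric coordinates of a 2D point inside a triangle can be computed by solving a small linear system. The sketch below is a hypothetical NumPy helper of our own, not code from the paper:

```python
import numpy as np

def barycentric_coords(p, v1, v2, v3):
    """Barycentric coordinates (a1, a2, a3) of 2D point p in triangle (v1, v2, v3),
    i.e. p = a1*v1 + a2*v2 + a3*v3 with a1 + a2 + a3 = 1."""
    T = np.column_stack([v1 - v3, v2 - v3])   # 2x2 matrix of edge vectors
    a1, a2 = np.linalg.solve(T, p - v3)       # solve for the first two coordinates
    return np.array([a1, a2, 1.0 - a1 - a2])
```

The centroid of a triangle, for instance, yields coordinates (1/3, 1/3, 1/3).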
Fig. 1.

RGB/ToF camera setup.

3 Combining Depth and RGB Images

3.1 Mapping Between Depth and RGB Images

The resolutions of the depth and the RGB images are different. A major issue that directly arises from the difference in resolution is that a pixel-to-pixel correspondence between the two images cannot be established even if the FOVs fully overlap. Therefore, the two images have to be registered so that the mapping between the pixels in the ToF image and in the RGB image can be established. The depth map provided by the ToF camera is sparse and affected by errors. Several methods can be used to improve the resolution of the depth images [15, 16, 17, 25] allowing the estimation of a dense depth image. We will use a simple approach based on linear interpolation.

To estimate depth for all the pixels of the RGB image, based on the depth provided by the ToF camera, a simple linear approach is used. We assume that the relative pose between both cameras, specified by the rotation matrix \(\mathbf {R}^{'}\) and translation vector \(\mathbf {t}^{'}\), has been estimated. We also assume that both cameras are internally calibrated, i.e., their intrinsic parameters are known. Let \(\mathbf {p}_{tof}\) and \(\mathbf {p}_{rgb}\) represent the 3D coordinates of a 3D point in the coordinate system of the ToF and the RGB cameras, respectively.

We use a pinhole camera model for both the RGB and the ToF cameras. Assume that the relative pose of the RGB camera and ToF sensor is fixed with a rotation \(\mathbf {R}^{'}\) and a translation \(\mathbf {t}^{'}\): \(\mathbf {p}_{rgb}=\mathbf {R}^{'}\,\mathbf {p}_{tof}+\mathbf {t}^{'}\) as shown in Fig. 1. The point cloud \(\mathbf {p}_{tof}\) is obtained directly from the calibrated ToF camera. Since the relative pose is known as well as the intrinsic parameters for both cameras, \(\mathbf {p}_{rgb}\) can be obtained from \(\mathbf {p}_{tof}\). To estimate depth for all points of the RGB image, a simple linear interpolation procedure is used. For each 2D point of the RGB image, we select the 4 closest neighbors whose depth was obtained from the depth image. Then, a bilinear interpolation is performed. Another possibility would be to select the 3 closest neighboring points (therefore, defining a triangle) and assume that the corresponding 3D points define a plane. An estimate for the depth of the point could then be obtained by intersecting its projecting ray with the 3D plane defined by the three 3D points.
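The second possibility (intersecting the projecting ray with the plane of the three closest points) can be sketched as follows. This is an illustrative implementation of ours under the pinhole model with focal length f, not the authors' code:

```python
import numpy as np

def depth_by_ray_plane(u, v, f, p1, p2, p3):
    """Estimate the depth of RGB pixel (u, v) by intersecting its viewing ray
    with the plane through three nearby 3D points p1, p2, p3 (expressed in the
    RGB camera frame).  The viewing ray is x = z * (u/f, v/f, 1)."""
    n = np.cross(p2 - p1, p3 - p1)            # normal of the plane through p1, p2, p3
    d = np.array([u / f, v / f, 1.0])         # ray direction at unit depth
    # plane: n . (x - p1) = 0, ray: x = z * d   =>   z = (n . p1) / (n . d)
    return np.dot(n, p1) / np.dot(n, d)
```

For a fronto-parallel plane at depth 2, any pixel recovers depth 2, as expected.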

3.2 Recovery of the Mesh Depth

Assume that a sparse set of 3D feature points \(\mathbf{p}^{ref}=\{\mathbf{p}_{1}^{ref},\ldots,\mathbf{p}_{N}^{ref}\}\) on a reference template with a known shape (usually a flat surface), and a set of 2D image points \(\mathbf{q}=\{\mathbf{q}_{1},\ldots,\mathbf{q}_{N}\}\) tracked on the RGB input image of the same surface but with a different and unknown deformation are given. As already stated, we represent the surface as a triangulated 3D mesh with \(n_{v}\) vertices \(\mathbf{v}_{i}\) (and \(n_{tr}\) triangles) concatenated in a vector \(\mathbf{s}=\left[\mathbf{v}_{1}^{T},\ldots,\mathbf{v}_{n_{v}}^{T}\right]^{T}\), and denote by \(\mathbf{s}^{ref}\) the reference mesh, and \(\mathbf{s}\) the mesh we seek to recover. Let \(\mathbf{p}_{i}\) be a feature point on the mesh \(\mathbf{s}\) corresponding to the point \(\mathbf{p}_{i}^{ref}\) in the reference configuration. We can express \(\mathbf{p}_{i}\) in terms of the barycentric coordinates of the triangle it belongs to:
$$\begin{aligned} \mathbf {p}_{i}=\sum _{j=1}^{3} a _{ij}\mathbf {v}_{j}^{[i]} \end{aligned}$$
where the \(a_{ij}\) are the barycentric coordinates and \(\mathbf {v}_{j}^{[i]}\) are the vertices of the triangle containing the point \(\mathbf {p}_{i}\). Since we are dealing with rigid triangles, these barycentric coordinates remain constant for each point and can be easily computed from the points \(\mathbf {p}_{i}^{ref}\) and the mesh \(\mathbf {s}^{ref}\). Let us denote by \(\mathbf{A}=\{\mathbf{a}_{1},\ldots,\mathbf{a}_{N}\}\) the set of barycentric coordinates associated to the 3D feature points, where \(\mathbf{a}_{i}=\left[a_{i1},\, a_{i2},\, a_{i3}\right]\). The rigidity of a triangle enforces that the sum of the relative depths around a closed triangle be zero. Denoting the depths of the vertices of a triangle by \(v_{z,1}\), \(v_{z,2}\) and \(v_{z,3}\), and the relative depths of its edges by \(rz_{1}=v_{z,1}-v_{z,2}\), \(rz_{2}=v_{z,2}-v_{z,3}\) and \(rz_{3}=v_{z,3}-v_{z,1}\), this closure condition reads \(rz_{1}+rz_{2}+rz_{3}=0\). Writing the above equations for every triangle of the mesh makes a total of \(n_{tr}+n_{e}\) (the number of triangles plus the number of edges) linear equations, which can be jointly expressed as \(\mathbf {M}_{1_{(n_{tr}+n_{e})\times (n_{v}+n_{e})}}\mathbf {x}_{1_{(n_{v}+n_{e})\times 1}}=0\). This homogeneous system of equations must be satisfied at each time instant (i.e. for any deformation). However, finding a unique solution is not possible. More specifically, \(\mathbf {M}_{1}\) is rank-deficient by \(n_{v}\), that is, it does not have \(n_{v}+n_{e}\) linearly independent columns \((\mathrm {rank}(\mathbf {M}_{1})=n_{e})\).
So, there will be an \(n_{v}\)-dimensional basis for the solution space of \(\mathbf {M}_{1}\mathbf {x}_{1}=0\). Any solution is a linear combination of basis vectors. In order to constrain the solution space and determine just one solution out of the infinite possibilities, in a way that this linear system matches only one particular deformation, it is necessary to add \(n_{v}\) independent equations. To add additional constraints, we augment this system with the z coordinate of a few properly distributed feature points, as follows: using the method described in the previous section, we can obtain an estimate for the depth of a feature point i, denoted by \(p_{z,i}\). From Eq. 1, we can derive \(p_{z,i}=a_{i1}v_{z,1}^{[i]}+a_{i2}v_{z,2}^{[i]}+a_{i3}v_{z,3}^{[i]}\). This non-homogeneous system of equations can be represented as \(\mathbf {M}_{2_{\left( N\times n_{v}\right) }}\mathbf {x}_{2_{\left( n_{v}\times 1\right) }}=\mathbf {p}_{z}\). It can be verified that \(\mathbf {x}_{1}=\left[ \begin{array}{c} \mathbf {rz}\\ \mathbf {x}_{2} \end{array}\right] \), where \(\mathbf {rz}\) is an \(n_{e}\)-vector of the relative depths of the edges. Writing the above equation for every feature point results in N linearly independent equations. Putting together both sets of equations just explained, we end up with \(n_{tot}=n_{tr}+n_{e}+N\) linear equations (\(\mathbf {M}\mathbf {x}_{1}=\left[ \begin{array}{c} \mathbf {0}\\ \mathbf {p}_{z} \end{array}\right] \)) where the only unknowns are the depths of the vertices and of the edges (i.e. \(n_{v}+n_{e}\) unknowns), which means that the resulting linear system is overdetermined. In fact, we obtain \(n_{e}+N\) independent equations out of the \(n_{tot}\) equations.
Yet, this is not enough to find the right single solution because there is still an infinitude of further solutions that minimize \(\left\| \mathbf {M}\mathbf {x}_{1}-\left[ \begin{array}{c} \mathbf {0}\\ \mathbf {p}_{z} \end{array}\right] \right\| \) in the least-squares sense. One possible approach, after the 3D coordinates are estimated, is to fit an initial surface to the data using polynomial interpolation, taking the xy-coordinates of the feature points on the reference configuration as input and their z-coordinates on the input deformation as output. Once the parameters of the interpolant have been found, we can obtain initial estimates of the depth of the vertices, with their xy-coordinates on the reference configuration as input. The interpolated depth has proved to be very close to the correct one. Then, we add an equality constraint for each vertex as \(\mathbf {I}_{n_{v}\times n_{v}}\mathbf {x}_{2}=\mathbf {v}_{z}^{'}\) (\(\mathbf {v}_{z}^{'}\) is the interpolated depth of the vertices). The new linear system \(\mathbf {M}_{new}\mathbf {x}_{1}=\mathbf {b}\) most likely has full column rank. So, the number of independent equations out of the \(n_{tot}+n_{v}\) equations would be \(n_{e}+n_{v}\). Since the number of independent equations is equal to the number of unknowns, there must be a unique solution, which can be computed via the normal equations. In principle, computing the least-squares estimate is recommended.
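For a single triangle, the stacked linear system (triangle closure, edge definitions, one feature depth, and the interpolated vertex depths) can be assembled and solved in the least-squares sense as sketched below. The mesh and numbers are a tiny illustrative instance of our own, not from the paper:

```python
import numpy as np

# Tiny instance: one triangle, vertices 0, 1, 2, edges (0,1), (1,2), (2,0).
# Unknowns x = [rz_1, rz_2, rz_3, v_0, v_1, v_2]: n_e relative + n_v vertex depths.
edges = [(0, 1), (1, 2), (2, 0)]
n_e, n_v = len(edges), 3

rows, rhs = [], []
# Triangle equation: the relative depths around the closed triangle sum to zero.
rows.append([1, 1, 1, 0, 0, 0]); rhs.append(0.0)
# Edge equations: rz_k - (v_a - v_b) = 0 couples relative and absolute depths.
for k, (a, b) in enumerate(edges):
    r = [0.0] * (n_e + n_v)
    r[k] = 1.0; r[n_e + a] = -1.0; r[n_e + b] = 1.0
    rows.append(r); rhs.append(0.0)
# Feature equation: ToF depth of a feature point as a barycentric mix of vertex depths.
bary, p_z = (0.2, 0.3, 0.5), 1.55
rows.append([0, 0, 0, *bary]); rhs.append(p_z)
# Vertex equality constraints from the interpolated depths v_z'.
vz_interp = [1.0, 2.0, 1.5]
for i in range(n_v):
    r = [0.0] * (n_e + n_v); r[n_e + i] = 1.0
    rows.append(r); rhs.append(vz_interp[i])

M, b = np.asarray(rows, float), np.asarray(rhs, float)
x, *_ = np.linalg.lstsq(M, b, rcond=None)   # least-squares estimate
vertex_depths = x[n_e:]
```

With consistent data, the least-squares solution reproduces the vertex depths exactly.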

4 Global Metric Estimation of Structure and Motion

Next we describe two non-linear constraints applied to the estimation problem. These two constraints are used to solve for SfM so that metric reconstruction of the shape is achieved and the motion matrices lie on the appropriate motion manifold. Furthermore, when there are too few correspondences without additional knowledge (as is the case here), shape recovery would not be effective. So, we need to limit the space of possible shapes by applying a deformation model. This model adequately fills in the missing information while being flexible enough to allow reconstruction of complex deformations [3]. We assume we can model the mesh deformation as a linear combination of a mean shape \(\mathbf {s}_{0}\) and \(n_{m}\) basis shapes (deformation modes) \(\mathbf {S}=\left[ \mathbf {s}_{1},...,\mathbf {s}_{n_{m}}\right] \):
$$\begin{aligned} \mathbf {s}=\mathbf {s}_{0}+\sum _{k=1}^{n_{m}}w_{k}\mathbf {s}_{k}=\mathbf {s}_{0}+\mathbf {S}\mathbf {w} \end{aligned}$$
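A deformation model of this form can be learned by applying PCA to a set of training meshes, as in the sketch below. The function names and data layout are our own assumptions, not the paper's code:

```python
import numpy as np

def learn_deformation_model(train_meshes, n_m):
    """Learn a linear deformation model by PCA.

    train_meshes: (n_samples, 3*n_v) array, one flattened mesh per row.
    Returns the mean shape s0 (3*n_v,) and the basis S (3*n_v, n_m)."""
    s0 = train_meshes.mean(axis=0)
    _, _, Vt = np.linalg.svd(train_meshes - s0, full_matrices=False)
    return s0, Vt[:n_m].T                    # columns are the deformation modes

def synthesize(s0, S, w):
    """Mesh s = s0 + S w for deformation coefficients w."""
    return s0 + S @ w
```

Projecting a training mesh onto the learned basis and re-synthesizing it recovers the mesh when the training set truly lies in a low-dimensional subspace.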

4.1 Constraint 1: Projected Length

Assume that the RGB camera motion relative to the world coordinate system is expressed as a rotation matrix \(\mathbf {R}\) and a translation vector \(\mathbf {t}\). A common approach to solve for the camera motion and surface structure is to minimize the image re-projection error, namely by bundle adjustment. The cost function being minimized is the geometric distance between the image points and the re-projected points. However, we are going to adapt bundle adjustment to our own problem rather than use it directly, as follows: the errors to be minimized will be the difference between the observed and the predicted projected lengths of an edge.

Orthographic Camera. Under orthographic projection, if we assume that the mesh vertices are registered with respect to the image centroid, we can drop the translation vector. The modified formulation of bundle adjustment can be specified as the following non-linear constraint:
$$\begin{aligned} e_{pl}=\sum _{i=1}^{n_{e}}\left( l_{i}-\left\| \mathbf {R}^{(1:2)}\left[ \mathbf {\mathrm { \mathbf {s} }}_{1}^{[i]}-\mathbf {\mathrm { \mathbf {s} }}_{2}^{[i]}\right] \right\| \right) ^{2} \end{aligned}$$
where the leftmost term \(l_{i}\) is the measured (observed) projected length of an edge (the computation of \(l_{i}\) is trivial with the help of the estimated mesh depth). \(n_{e}\) is the number of edges. \(\mathbf {s}_{1}^{[i]}\) and \(\mathbf {s}_{2}^{[i]}\) denote the two entries of the mesh vector that correspond to the end vertices of edge i. \(e_{pl}\) can also be expressed as a quadratic function.
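For the orthographic case, the projected-length error can be evaluated directly. The sketch below assumes the mesh is given as an (n_v, 3) vertex array and an edge list, which is our own convention for illustration:

```python
import numpy as np

def projected_length_error(R12, s, edges, l_obs):
    """e_pl = sum_i (l_i - || R^(1:2) (s_1^[i] - s_2^[i]) ||)^2 (orthographic case).

    R12: first two rows of the rotation, shape (2, 3).
    s: (n_v, 3) vertex array; edges: list of vertex-index pairs;
    l_obs: observed projected lengths, one per edge."""
    err = 0.0
    for (a, b), l in zip(edges, l_obs):
        err += (l - np.linalg.norm(R12 @ (s[a] - s[b]))) ** 2
    return err
```

With exact observations the error vanishes; perturbing any observed length makes it strictly positive.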
Perspective Camera. In this case, we formulate a non-linear constraint based on what we call “unnormalized projected length”, as:
$$\begin{aligned} e_{pl}=\sum _{i=1}^{n_{e}}\left( l_{i}-\left\| \mathbf {K}_{rgb}^{\circ }\left[ \mathbf {\mathbf {R}}|\mathbf {t}\right] \left[ \left[ \begin{array}{c} \mathbf {\mathrm { \mathbf {s} }}_{1}^{[i]}\\ 1 \end{array}\right] -\left[ \begin{array}{c} \mathbf {\mathrm { \mathbf {s} }}_{2}^{[i]}\\ 1 \end{array}\right] \right] \right\| \right) ^{2} \end{aligned}$$
where \(\mathbf {K}_{rgb}^{\circ }\) is a known calibration matrix equivalent to \(\left[ \begin{array}{ccc} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1 \end{array}\right] \). From the estimated mesh depth, \(l_{i}\) can be easily measured using simple mathematical manipulation. Since there is a subtraction in the above cost function, the translation vector \(\mathbf {t}\) can be removed. Also, note that the 2-norm is applied to the first 2 entries of a 3-vector to estimate the square of the unnormalized projected length. So, only the first 2 rows of the product \(\mathbf {K}_{rgb}^{\circ }\mathbf {R}\) are involved in the constraint:
$$\begin{aligned} e_{pl}=\sum _{i=1}^{n_{e}}\left( l_{i}-\left\| \mathbf {f}^{[i]}(\mathbf {R}^{(1:2)},w)\right\| \right) ^{2} \end{aligned}$$

4.2 Constraint 2: Reprojection Error

Several difficulties may affect the estimation of the depths namely:
  • Errors due to the depth interpolation;

  • Irregular distribution of the feature points over the object surface.

As a result of these factors, the depth estimates for the mesh vertices may be significantly inaccurate. In addition, there are also reprojection errors, that is, errors on the image positions of the 3D feature points. We should thus account for the reprojection error by adding a term to the function to be optimized. By combining Eqs. 1 and 2, we have:
$$\begin{aligned} \mathbf {p}_{i}=\sum _{j=1}^{3}a_{ij}\left( \mathbf {s}_{0j}^{[i]}+\mathbf {S}_{j}^{[i]}\mathbf {w}\right) \end{aligned}$$
where \(\mathbf {s}_{0j}^{[i]}\) and \(\mathbf {S}_{j}^{[i]}\) are the subvector of \(\mathbf {s}_{0}\) and the submatrix of \(\mathbf {S}\) (respectively), corresponding to the vertex j of the triangle in which the feature point i resides.

The term corresponding to the reprojection error can be obtained as indicated below.

Orthographic Camera
$$\begin{aligned} e_{re}=\sum _{i=1}^{N}\left\| \mathbf {q}_{i}-\mathbf {R}^{(1:2)}\mathbf {p}_{i}\right\| ^{2} \end{aligned}$$
Perspective Camera
$$\begin{aligned} e_{re}=\sum _{i=1}^{N}\left\| \mathbf {\lambda _{i}}\left[ \begin{array}{c} \mathbf {q}_{i}\\ 1 \end{array}\right] -\left[ \mathbf {K}_{rgb}^{\circ }\left[ \mathbf {R}|\mathbf {t}\right] \left[ \begin{array}{c} \mathbf {p}_{i}\\ 1 \end{array}\right] \right] \right\| ^{2} \end{aligned}$$
The projective depths \(\lambda _{i}\) can be determined using the estimated depths of the feature points in the RGB image. Consequently, errors in \(\lambda _{i}\) (induced by the first difficulty mentioned above) would introduce false search directions in the \(e_{re}\)-based minimization problem. Therefore, it is advantageous to reformulate the above equations so that \(\lambda _{i}\) is removed from them. To this end, we consider the equation below:
$$\begin{aligned} \lambda _{i}\left[ \begin{array}{c} \mathbf {q}_{i}\\ 1 \end{array}\right] =\mathbf {K}_{rgb}^{\circ }\left[ \sum _{j=1}^{3}a_{ij}\mathbf {R}.\mathbf {s}_{j}^{[i]}\right] +\mathbf {K}_{rgb}^{\circ }.\mathbf {t} \end{aligned}$$
After some simple algebraic manipulation, we obtain:
$$ \left[ \begin{array}{ccc} a_{i1}\mathbf {A}_{i}\,\,&a_{i2}\mathbf {A}_{i}\,\,&a_{i3}\mathbf {A}_{i}\end{array}\right] _{2\times 9}\left[ \begin{array}{c} \mathbf {R}.\mathbf {\mathrm { \mathbf {s} }}_{1}^{[i]}\\ \mathbf {R}.\mathbf {\mathrm { \mathbf {s} }}_{2}^{[i]}\\ \mathbf {R}.\mathbf {\mathrm { \mathbf {s} }}_{3}^{[i]} \end{array}\right] _{9\times 1}+\mathbf {A}_{i}.\mathbf {t}= $$
$$\begin{aligned} \left[ \begin{array}{c} e_{1}^{[i]}(\mathbf {R},\mathbf {w},\mathbf {t})\\ e_{2}^{[i]}(\mathbf {R},\mathbf {w},\mathbf {t}) \end{array}\right] _{2\times 1}=0\,\,\,\,\mathrm {where}\,\,\mathbf {A}_{i}=\mathbf {K}_{rgb}^{\circ (1:2)}-\mathbf {q}_{i}.\mathbf {K}_{rgb}^{\circ (3)} \end{aligned}$$
This equation provides 2 linear constraints: \(e_{1}^{[i]}(.)=0\) and \(e_{2}^{[i]}(.)=0\). Thus, the modified \(e_{re}\) takes a form free of \(\lambda _{i}\): \(e_{mre}=\sum _{i=1}^{N}\left( e_{1}^{[i]}(.)^{2}+e_{2}^{[i]}(.)^{2}\right) \), where \(e_{mre}\) denotes the modified \(e_{re}\). \(e_{pl}\) is a function of \(\mathbf {R}^{(1:2)}\) and \(\mathbf {w}\), whereas \(e_{mre}\) (or \(e_{re}\)) is a function of \(\mathbf {R}\), \(\mathbf {w}\) and \(\mathbf {t}\). In order to simplify \(e_{pl}\), we modify it by considering that: 1- the translation vector \(\mathbf {t}\) is fixed and the camera setup has only rotational movement relative to the world coordinate system; 2- by adding the following function to \(\mathbf {f}^{[i]}(\mathbf {R}^{(1:2)},\mathbf {w})\) in the first constraint, we are able to solve for the full matrix \(\mathbf {R}\):
$$\begin{aligned} f_{rz}^{[i]}(\mathbf {R}^{(3)},\mathbf {w})=\left( \mathbf {R}^{(3)}\left[ \mathbf {\mathrm { \mathbf {s} }}_{1}^{[i]}-\mathbf {\mathrm { \mathbf {s} }}_{2}^{[i]}\right] \right) \end{aligned}$$
$$\begin{aligned} e_{rz}=rz_{i}-f_{rz}^{[i]}(\mathbf {R}^{(3)},\mathbf {w}) \end{aligned}$$
where \(rz_{i}=v_{z,1}^{[i]}-v_{z,2}^{[i]}\). \(e_{rz}\) is the difference between the observed and the predicted relative depths of edge i. Combining \(\mathbf {f}^{[i]}(.)\) and \(f_{rz}^{[i]}(.)\) yields:
$$\begin{aligned} e_{mpl}=\sum _{i=1}^{n_{e}}\left( \sqrt{\left( l_{i}^{2}+rz_{i}^{2}\right) }-\left\| \left[ \begin{array}{c} \mathbf {f}^{[i]}(\mathbf {R}^{(1:2)},\mathbf {w})\\ f_{rz}^{[i]}(\mathbf {R}^{(3)},\mathbf {w}) \end{array}\right] \right\| \right) ^{2} \end{aligned}$$
where \(e_{mpl}\) represents a modified version of \(e_{pl}\). As a result, we brought \(e_{mpl}\) and \(e_{mre}\) into a common form where both are functions of \(\mathbf {R}\) and \(\mathbf {w}\).
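The \(\lambda\)-free reprojection constraint above, \(\mathbf{A}_{i}(\mathbf{R}\,\mathbf{p}_{i}+\mathbf{t})=0\) with \(\mathbf{A}_{i}=\mathbf{K}_{rgb}^{\circ (1:2)}-\mathbf{q}_{i}\mathbf{K}_{rgb}^{\circ (3)}\), can be evaluated per feature point as in this sketch (an illustrative helper of ours, not the authors' code):

```python
import numpy as np

def reproj_residual(q, bary, tri_verts, K, R, t):
    """Lambda-free reprojection residual (e1, e2) for one feature point:
    A_i (R p_i + t), with A_i = K^(1:2) - q_i K^(3) and p_i the barycentric
    combination of the vertices of its triangle."""
    A = K[:2] - np.outer(q, K[2])                      # 2x3 matrix A_i
    p = sum(a * v for a, v in zip(bary, tri_verts))    # feature point p_i
    return A @ (R @ p + t)
```

When q is the exact perspective projection of p_i, both residual components vanish.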
Fig. 2.

Representation of the approach via block diagram.

4.3 Objective Function

So far, we have derived two constraints expressed as two separate non-linear problems. However, we intend to integrate both constraints into one single objective function so that they are taken into account at once when estimating all the parameters. To do so, we minimize their weighted sum, in which the reprojection error term is assigned a weight m that accounts for its relative influence within the combined objective function. A block diagram of the overall structure of the approach is shown in Fig. 2. In our global optimization, we first consider a simplified formulation of the objective function by excluding the camera motion \(\left[ \mathbf {R}|\mathbf {t}\right] \). We include it back in the second case.

Estimation of Structure only. The constraints are simplified so that the only unknown parameter is the structure (we assume that the camera motion is set to \(\left[ \mathbf {I}|\mathbf {0}\right] \)).

Orthographic Camera: \(\mathrm {min_{\mathbf {w}}\,}e_{tot}=\left( e_{pl}+m.e_{re}\right) \)

Perspective Camera: \(\mathrm {min_{ \mathbf {w} }}\, e_{tot}=\left( e_{mpl}+m.e_{mre}\right) \)

Estimation of both Structure and Camera Motion. We consider now the full optimization by including the camera motion.

Orthographic Camera: \(\mathrm {min_{ \mathbf {R}^{(1:2)},\mathbf {w} }\,}e_{tot}=\left( e_{pl}+m.e_{re}\right) \)

Perspective Camera: \(\mathrm {min_{ \mathbf {R},\mathbf {w} }}\, e_{tot}=\left( e_{mpl}+m.e_{mre}\right) \)

The above optimization problems can be solved using a non-linear minimization algorithm such as Levenberg-Marquardt (LMA). The rotation estimates obtained from this optimization may not satisfy the orthonormality constraints, so the optimization algorithm must be fed with a good initialization. To provide initial estimates relatively close to the true ones, we proceed as follows: if initial guesses for \(R^{(1:2)}\) and R are not given, they can be initialized using well-known methods that attempt to solve for SfM through non-rigid factorization of \(\left\{ q_{ij}\right\} \) and \(\left\{ \lambda _{ij}q_{ij}\right\} \) from all frames, for instance, as in [13]. In these methods, the factorization is followed by a refinement step to upgrade the reconstruction to metric. The deformation coefficients \(\mathbf {w}_{k}\) are initialized to random small values. One possible way to further enforce the rotation constraints is to subsequently apply a Procrustes alignment [19, 20].
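As a toy illustration of the structure-only case under an orthographic camera, Levenberg-Marquardt can be run on the stacked edge-length and reprojection residuals with scipy.optimize.least_squares. The mesh, deformation mode, and observations below are synthetic examples of our own, not the paper's setup:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy mesh: 3 vertices (flattened), one deformation mode, orthographic camera.
s0 = np.array([0., 0., 0.,  1., 0., 0.,  0., 1., 0.])
S = np.array([0., 0., 0.,  0.2, 0., 1.,  0., 0.1, 0.5]).reshape(-1, 1)
edges = [(0, 1), (1, 2), (2, 0)]
feats = [((0, 1, 2), (1/3, 1/3, 1/3)), ((0, 1, 2), (0.5, 0.25, 0.25))]
R12 = np.array([[1., 0., 0.], [0., 1., 0.]])    # motion fixed to [I | 0]

def mesh(w):
    return (s0 + S @ w).reshape(-1, 3)

def residuals(w, l_obs, q_obs):
    v = mesh(w)
    r = [l - np.linalg.norm(R12 @ (v[a] - v[b]))          # projected-length terms
         for (a, b), l in zip(edges, l_obs)]
    for ((i, j, k), (a1, a2, a3)), q in zip(feats, q_obs):
        p = a1 * v[i] + a2 * v[j] + a3 * v[k]
        r.extend(q - R12 @ p)                             # reprojection terms
    return np.array(r)

# Simulate observations from a ground-truth deformation, then recover it by LM.
w_true = np.array([0.7])
v = mesh(w_true)
l_obs = [np.linalg.norm(R12 @ (v[a] - v[b])) for a, b in edges]
q_obs = [R12 @ (a1 * v[i] + a2 * v[j] + a3 * v[k])
         for (i, j, k), (a1, a2, a3) in feats]
res = least_squares(residuals, x0=np.zeros(1), args=(l_obs, q_obs), method='lm')
```

In this noise-free 1-parameter toy problem, LM recovers the ground-truth coefficient from a zero initialization.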

4.4 Additional Constraint

Non-linear optimization may converge to local minima. The probability of such an occurrence can be reduced by adding a new regularization term that requires the estimated depth data to be as close as possible to the measured data. So, we would have:
$$\begin{aligned} e_{z}=\sum _{i=1}^{n_{v}}\left( v_{z}^{[i]}-\left( \mathbf {R}^{(3)}\mathbf {\mathrm { \mathbf {s} }}^{[i]}+\mathbf {t}^{(3)}\right) \right) ^{2} \end{aligned}$$
where \(v_{z}^{[i]}\) is the depth of vertex i, already recovered, and \(\mathbf {s}^{[i]}\) is the 3D position corresponding to vertex i. Notice that this regularization is very dependent on the accuracy of \(v_{z}^{[i]}\).

5 Experiments

5.1 Synthetic Data

Next, we evaluate the methods described above using synthetic data. We synthesized a number of frames of a deforming circle-like paper (radius = 20 cm) approximated by a \(9 \times 9\) mesh such as the one shown in Fig. 2. The reason to use a circular mesh is that it is uniform and has a symmetric shape. Therefore, it has similar shapes (up to a rotation) for a number of different deformations, which, in fact, makes the reconstruction of the right deformation more complex. The inextensible meshes used for training were built using Blender, and PCA was then applied to estimate the deformation model. In order to generate the input data, we sample a sparse set of 3D feature points (\(N=32\)) well-distributed on the surface of a reference planar mesh. The experiments are repeated equally for both the orthographic and the perspective cameras. For the perspective case, the camera model is defined such that the focal length is \(f=500\) pixels. The model assumes that the surface is located 50 cm in front of the cameras (along the optical axis). The 3D feature points across the surface are then projected onto the 2D camera, and zero-mean Gaussian noise with 1-pixel standard deviation (Std) was added to these projections. The depth data of the feature points are also generated by adding zero-mean Gaussian noise with 0.1 cm Std. The results of the quantitative assessment represent an average obtained from five randomly selected deformations. By performing 50 trials for each deformation, each average value was acquired from 250 trials. Two of the estimated deformations and their corresponding ground-truth are qualitatively illustrated in Fig. 3.
Fig. 3.

Left: a \(9 \times 9\) template mesh with sparse feature points - radius = 20 cm. Right: metric coordinates in cm - overlap between the ground-truth shapes (blue) and the recovered ones (red) (Color figure online).

Table 1.

Preliminary results: reconstruction error and rotation accuracy of our approach for the orthographic and perspective cameras; the perspective rotation error is \(<1^{\circ }\).

Reconstruction Error. The accuracy of the method is reported in terms of two reconstruction error measures:

1- Point reconstruction error (PRE): the normalized Euclidean distance between the ground-truth (\(\hat{\mathbf {p}}_{i}\)) and estimated (\(\mathbf {p}_{i}\)) world points, \(PRE=\frac{1}{N}\sum _{i=1}^{N}\left[ \left\| \mathbf {p}_{i}-\hat{\mathbf {p}}_{i}\right\| ^{2}/\left\| \hat{\mathbf {p}}_{i}\right\| ^{2}\right] \).

2- Mesh reconstruction error (MRE): the normalized Euclidean distance between the ground-truth (\(\hat{\mathbf {v}}_{i}\)) and estimated (\(\mathbf {v}_{i}\)) mesh vertices, computed as \(MRE=\frac{1}{n_{v}}\sum _{i=1}^{n_{v}}\left[ \left\| \mathbf {v}_{i}-\hat{\mathbf {v}}_{i}\right\| ^{2}/\left\| \hat{\mathbf {v}}_{i}\right\| ^{2}\right] \). The reprojection error of the feature points can also be regarded as another measure of precision. The accuracy of the Stiefel rotation matrix is evaluated via the orthonormality constraint as \(RotationAccuracy=\left\| \mathbf {R}^{(2\times 3)}\mathbf {R}^{(2\times 3)T}-\mathbf {I}^{(2\times 2)}\right\| _{F}^{2}\). In the case of the perspective camera, we compare the axis-angle representations of the recovered and ground-truth rotations as \(RotationAccuracy=\left| angle-\hat{angle}\right| ^{2}\).
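The error metrics above are straightforward to implement; a minimal sketch follows (function names are our own, and the MRE is the same formula as the PRE applied to mesh vertices instead of world points):

```python
import numpy as np

def point_reconstruction_error(p_est, p_true):
    """PRE: mean over rows (points) of ||p - p_hat||^2 / ||p_hat||^2.
    The identical computation on mesh vertices gives the MRE."""
    p_est, p_true = np.asarray(p_est, float), np.asarray(p_true, float)
    num = np.sum((p_est - p_true) ** 2, axis=1)
    den = np.sum(p_true ** 2, axis=1)
    return float(np.mean(num / den))

def rotation_accuracy_orthographic(R2x3):
    """Squared Frobenius norm of R R^T - I for a 2x3 Stiefel matrix."""
    R = np.asarray(R2x3, float)
    return float(np.linalg.norm(R @ R.T - np.eye(2), ord='fro') ** 2)
```

A perfectly recovered shape yields zero for both metrics, and the first two rows of a rotation matrix yield a zero orthonormality residual.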

The quantitative output can be seen in Table 1. Our approach uses only a few feature points, but it takes advantage of the ToF sensor to obtain their depths.
Fig. 4.

Orthographic camera - left: average PRE and average MRE with respect to the increasing noise in image points. Right: average PRE and average MRE with respect to the increasing noise in depth data.

Fig. 5.

Perspective camera - left: average PRE and average MRE with respect to the increasing noise in image points. Right: average PRE and average MRE with respect to the increasing noise in depth data.

Length of the Edges. When a 3D surface is reconstructed in a truly inextensible way, the lengths of the recovered edges must equal those of the template edges. To measure to what extent the lengths are preserved along the deformation path, we define a metric for the discrepancy between the template and recovered lengths: \(IsometryExtent=\left( 1-\left( \frac{1}{n_{e}}\sum _{i=1}^{n_{e}}\left( \left| L_{i}-\hat{L_{i}}\right| /\hat{L_{i}}\right) \right) \right) \times 100\,\%\). This metric is 95.77 % for the proposed method, indicating that edge lengths are well preserved and that the isometry constraint is satisfied to a large degree.
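The \(IsometryExtent\) metric translates directly into code; a minimal sketch, with an interface of our own choosing:

```python
import numpy as np

def isometry_extent(L_rec, L_tpl):
    """IsometryExtent in percent: 100 * (1 - mean relative edge-length error),
    where L_rec are the recovered and L_tpl the template edge lengths."""
    L_rec = np.asarray(L_rec, dtype=float)
    L_tpl = np.asarray(L_tpl, dtype=float)
    return float((1.0 - np.mean(np.abs(L_rec - L_tpl) / L_tpl)) * 100.0)
```

A perfectly isometric reconstruction scores 100 %; a uniform 5 % stretch of every edge scores 95 %.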

The Impact of Noise. Different levels of noise, in the image points and in the depth data, were simulated to assess the robustness of the approach; the two types of noise were investigated separately. Figures 4 and 5 show the reconstruction errors for increasing levels of Gaussian noise in the feature image points, with the Std varied from 0 to 4 pixels in 1-pixel increments, and for increasing levels of Gaussian noise in the feature depths, in 0.1-cm increments of the Std. The latter follows from the observation that, since the depth variation of the surface itself is small, the deviations from the true depth of each 3D point are close together, varying at each trial according to a Gaussian distribution. From Figs. 4 and 5 we conclude that the white noise does not have a dramatic impact on the output: the performance remains stable and the algorithm continues to work efficiently in the presence of noise.
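The two noise sweeps can be organized as below. This is only a scaffolding sketch: `run_trial` stands in for the (hypothetical) synthesize-reconstruct-score routine, which is not shown here.

```python
import numpy as np

def noise_sweep(run_trial, px_stds=(0, 1, 2, 3, 4),
                depth_stds=(0.0, 0.1, 0.2, 0.3, 0.4), n_trials=50):
    """Average a per-trial error over n_trials at each noise level.
    Image noise is swept at the nominal 0.1-cm depth Std, and depth
    noise at the nominal 1-pixel image Std."""
    err_vs_px = [float(np.mean([run_trial(s, 0.1) for _ in range(n_trials)]))
                 for s in px_stds]
    err_vs_depth = [float(np.mean([run_trial(1.0, s) for _ in range(n_trials)]))
                    for s in depth_stds]
    return err_vs_px, err_vs_depth
```

The two returned lists correspond to the left and right panels of Figs. 4 and 5.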

5.2 Real Data

We also performed experiments with real data recorded using a camera setup comprising a ToF camera and an RGB camera. The cameras are configured so that the FOV of the ToF camera is contained within the FOV of the 2D camera, and the setup was calibrated both internally and externally. Bilinear interpolation was applied to estimate the depth of each 2D point track. We used a piece of cardboard to produce real inextensible deformations, and tracked and matched a few feature points against the reference template using the SIFT local feature descriptor. The same deformation model as in the synthetic experiments was employed. Some deformations and their recovered shapes are shown in Fig. 7. Although a quantitative assessment and benchmarking were not possible, the effectiveness of the approach is visible in the 3D reconstruction output.
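The bilinear interpolation of the low-resolution depth map at subpixel track positions can be sketched as follows (a generic implementation of standard bilinear interpolation, not the authors' code; it assumes interior subpixel positions so that all four neighbors exist):

```python
import numpy as np

def bilinear_depth(depth_map, u, v):
    """Bilinearly interpolate an (H x W) depth map at subpixel
    column u, row v, assuming (u, v) lies strictly inside the map."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    d00 = depth_map[v0, u0];     d01 = depth_map[v0, u0 + 1]
    d10 = depth_map[v0 + 1, u0]; d11 = depth_map[v0 + 1, u0 + 1]
    return ((1 - dv) * ((1 - du) * d00 + du * d01)
            + dv * ((1 - du) * d10 + du * d11))
```

In practice the 2D track coordinates must first be mapped into the ToF image via the external calibration before sampling the depth map.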

5.3 Comparative Evaluations Using Motion Capture Data

Rather than generating the training data synthetically with Blender, we take advantage of datasets recorded with a Vicon system, which captures real deformations accurately. Since the synthetically deformed meshes might not exactly match the real deformations, we rebuilt the deformation model from this real data and repeated the experiments. The template is now composed of equal triangles and covers a \(20\times 20\)-cm square area. As an example, the reconstructed surfaces in Fig. 7 look better than the ones in Fig. 6. We conclude that, when learned from real data, the deformation model is more robust to the deformations.

As a general rule, two methods can be compared only under identical conditions. To this end, we analyzed the state-of-the-art literature and selected the approach described in [3], which also uses a triangular mesh and can work with the same types of data sets as our approach. Thus, to show how real training data influences the 3D reconstruction, we performed a set of simulations as we did with the Blender data and compared the performance of our SfM framework to this approach, in which the authors use a second-order cone program (SOCP) for the 3D reconstruction of inextensible surfaces. Their approach is known to be very robust and efficient; a linear local deformation model integrates local patches into a global surface, and many feature points distributed throughout the surface are required. To account for noise in our approach, as before, Gaussian noise with a 1-pixel Std was added to the image points and Gaussian noise with a 0.1-cm Std to the depth data; the SOCP-based approach was evaluated without noise. We obtained the results for five deformations, with 50 trials each. From Table 2, it can be seen that the result of our approach is comparable to that of the SOCP-based method. The reconstruction errors are considerably lower than those in Table 1, which suggests that using good-quality real training data can significantly improve the results.
Fig. 6.

Real deformations; A \(20 \times 20\)-cm square was selected from the intermediate part of the cardboard and the corresponding circle was reconstructed.

Fig. 7.

The reconstructed shape of the corresponding squares in Fig. 6.

Table 2.

Comparison between the proposed approach and the SOCP-based one: reconstruction errors of our approach and of the SOCP-based approach.



6 Conclusions

In this paper, we have proposed an SfM framework combining a monocular camera and a ToF sensor to reconstruct surfaces that deform isometrically. The ToF camera provides the depths of a sparse set of feature points, from which the depths of the mesh vertices are recovered using a multivariate linear system. The key advantage of the RGB/ToF system is that it combines high-resolution RGB data with low-resolution depth information. Our approach to inextensible surface reconstruction is thus formulated as an optimization problem based on two non-linear constraints. Finally, we carried out a set of experiments showing that the approach produces good results. As a next objective, we will extend the approach to non-rigid surfaces that are not isometric, e.g., conformal surfaces.


  1. Aanaes, H., Kahl, F.: Estimation of deformable structure and motion. In: Workshop on Vision and Modelling of Dynamic Scenes, ECCV, Denmark (2002)
  2. Del Bue, A., Lladó, X., Agapito, L.: Non-rigid metric shape and motion recovery from uncalibrated images using priors. In: IEEE Conference on Computer Vision and Pattern Recognition, New York (2006)
  3. Salzmann, M., Hartley, R., Fua, P.: Convex optimization for deformable surface 3-D tracking. In: IEEE International Conference on Computer Vision (2007)
  4. Salzmann, M., Fua, P.: Reconstructing sharply folding surfaces: a convex formulation. In: IEEE Conference on Computer Vision and Pattern Recognition (2007)
  5. Salzmann, M., Urtasun, R., Fua, P.: Local deformation models for monocular 3D shape recovery. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1054–1061 (2008)
  6. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of applicable surfaces from single views. In: Pajdla, T., Matas, J.G. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 482–496. Springer, Heidelberg (2004)
  7. Prasad, M., Zisserman, A., Fitzgibbon, A.W.: Single view reconstruction of curved surfaces. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1345–1354 (2006)
  8. Shen, S., Shi, W., Liu, Y.: Monocular 3-D tracking of inextensible deformable surfaces under L2-norm. IEEE Trans. Image Process. 19, 512–521 (2010)
  9. Perriollat, M., Hartley, R., Bartoli, A.: Monocular template-based reconstruction of inextensible surfaces. Int. J. Comput. Vis. 95(2), 124–137 (2010)
  10. Salzmann, M., Moreno-Noguer, F., Lepetit, V., Fua, P.: Closed-form solution to non-rigid 3D surface registration. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 581–594. Springer, Heidelberg (2008)
  11. Metaxas, D., Terzopoulos, D.: Constrained deformable superquadrics and nonrigid motion tracking. PAMI 15, 580–591 (1993)
  12. Torresani, L., Hertzmann, A., Bregler, C.: Learning non-rigid 3D shape from 2D motion. In: NIPS, pp. 580–591 (2003)
  13. Lladó, X., Del Bue, A., Agapito, L.: Non-rigid 3D factorization for projective reconstruction. In: BMVC (2005)
  14. White, R., Forsyth, D.: Combining cues: shape from shading and texture. In: CVPR (2006)
  15. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In: Proceedings of NIPS (2005)
  16. Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range images. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, Minneapolis, pp. 1–8 (2007)
  17. Kim, H., Tai, Y.W., Brown, M.S., Kweon, I.S.: High quality depth map upsampling for 3D-TOF cameras. In: IEEE International Conference on Computer Vision (ICCV), Barcelona, pp. 1623–1630 (2011)
  18. Brunet, F., Hartley, R., Bartoli, A., Navab, N., Malgouyres, R.: Monocular template-based reconstruction of smooth and inextensible surfaces. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 52–66. Springer, Heidelberg (2011)
  19. Akhter, I., Sheikh, Y., Khan, S.: In defense of orthonormality constraints for nonrigid structure from motion. In: CVPR, pp. 1534–1541 (2009)
  20. Xiao, J., Chai, J., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: Pajdla, T., Matas, J.G. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 573–587. Springer, Heidelberg (2004)
  21. Zhou, H., Li, X., Sadka, A.H.: Nonrigid structure-from-motion from 2-D images using Markov chain Monte Carlo. IEEE Trans. Multimedia 14(1), 168–177 (2012)
  22. Srivastava, S., Saxena, A., Theobalt, C., Thrun, S.: Rapid interactive 3D reconstruction from a single image. In: VMV, pp. 9–28 (2009)
  23. Paladini, M., Del Bue, A., Stosic, M., Dodig, M., Xavier, J., Agapito, L.: Factorization for non-rigid and articulated structure using metric projections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2898–2905 (2009)
  24. Dai, Y., Li, H., He, M.: A simple prior-free method for non-rigid structure-from-motion factorization. In: CVPR, pp. 2018–2025 (2012)
  25. Kim, Y.M., Theobalt, C., Diebel, J., Kosecka, J., Miscusik, B., Thrun, S.: Multi-view image and ToF sensor fusion for dense 3D reconstruction. In: Computer Vision Workshops (ICCV Workshops), pp. 1542–1549 (2009)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal
