A ToF-Aided Approach to 3D Mesh-Based Reconstruction of Isometric Surfaces
Abstract
In this paper, we investigate structure-from-motion (SfM) for surfaces that deform isometrically. Our SfM framework estimates both the 3D surface and the camera motion simultaneously through a template-based approach built on the combination of a ToF sensor and a conventional RGB camera. The objective is to exploit the depth maps acquired by the ToF sensor so that the reconstruction of the non-rigid structure from the high-resolution images captured by the RGB camera is considerably improved. A triangular mesh is adopted to represent isometric surfaces. The depth of a sparse set of 3D feature points spread over the surface is obtained with the help of the ToF camera, enabling the recovery of the depth of the mesh vertices through a multivariate linear system. Subsequently, a nonlinear constraint is formed based on the projected length of each edge of the mesh. A second nonlinear constraint is then used to minimize reprojection errors. These constraints are finally incorporated into an optimization scheme that solves for structure and motion. Experimental results show that the proposed approach performs well even when only a low-resolution depth image is used.
Keywords
Structure from motion · Isometric surface · ToF camera · 3D reconstruction

1 Introduction
Structure-from-motion can be defined as the problem of simultaneously inferring the motion of a camera and the 3D geometry of the scene solely from a sequence of images. SfM has also been extended to the case of deformable objects. Non-rigid SfM is under-constrained, which means that the recovery of non-rigid 3D shape is an inherently ambiguous problem [23, 24]: given a specific configuration of points on the image plane, different 3D non-rigid shapes and camera motions can be found that fit the measurements. To resolve this ambiguity, prior knowledge on the shape and motion must be used to constrain the solution. For example, Aanaes et al. [1] impose the prior that the reconstructed shape does not vary much from frame to frame, while Del Bue et al. [2] impose the constraint that some of the points on the object are rigid. The priors can be divided into two main categories: statistical and physical. For instance, the methods relying on the low-rank factorization paradigm [1, 2] can be classified as statistical approaches, as can learning approaches such as [3, 21, 22]. Physical constraints include spatial and temporal priors on the surface to be reconstructed [6, 7]. A physical prior of particular interest is the hypothesis of an inextensible (i.e., isometric) surface [8, 9, 10], which is the type of surface considered in this paper. This hypothesis means that the length of the geodesic between any two points on the surface does not change over time, which holds for many types of material such as paper and some types of fabric.
3D reconstruction of non-rigid surfaces from images is an under-constrained problem, and many different kinds of priors have been introduced to restrict the space of possible shapes to a manageable size. Algorithms for the reconstruction of deformable surfaces can be classified by the type of surface model (or representation) used. Point-wise methods reconstruct only the 3D positions of a relatively small number of feature points, resulting in a sparse reconstruction of the 3D surface [9]. Physics-based models such as superquadrics [11], triangular meshes [10], Thin-Plate Splines (TPS) [9], or tensor-product B-splines [18] have also been utilized. In TPS, the 3D surface is represented as a parametric 2D-3D map between the template image space and the 3D space; a parametric model is then fit to a sparse set of reconstructed 3D points in order to obtain a smooth surface, which is not itself used in the 3D reconstruction process. There has been increasing interest in learning techniques that build surface deformation models from training data; more recently, linear models have been learned for SfM applications [12, 13]. There have also been attempts at performing 3D surface reconstruction without a deformation model. One approach uses lighting information in addition to texture cues to constrain the reconstruction [14], but it has only been demonstrated under very restrictive assumptions on lighting conditions and is therefore not generally applicable. A common assumption in deformable surface reconstruction is that the surface is inextensible. In [9], the authors propose a dedicated algorithm that enforces the inextensibility constraints; however, the inextensibility constraint alone is not sufficient to reconstruct the surface. Another sort of implementation is given by [4, 10].
In these papers, a convex cost function combining the depth of the reconstructed points and the negative of the reprojection error is maximized while enforcing the inequality constraints arising from surface inextensibility. The resulting formulation can easily be turned into a second-order cone programming (SOCP) problem. A similar approach is explored in [8]. The approach of [9] is point-wise, whereas the approaches of [4, 8, 10] use a triangular mesh as the surface model and apply the inextensibility constraints to the vertices of the mesh.
1.1 Model and Approach
In this work, we aim at the combined inference of the 3D surface and the camera motion, while preserving geodesics, using an RGB camera aided by a ToF range sensor. RGB cameras usually have high image resolution; with them, one can use efficient algorithms to calculate the depth of the scene, recover object shape, or reveal structure, but at a high computational cost. ToF cameras deliver a depth map of the scene in real time, but with insufficient resolution for some applications. Combining a conventional camera with a ToF sensor therefore exploits the capabilities of both. We assume that the fields of view of the RGB and ToF cameras mostly overlap. The surface is represented as a triangular 3D mesh, and a set of correspondences between 3D feature points and 2D locations in an input image is available. In practice, these are obtained by matching SIFT features between the input image and a reference image in which the surface shape is known; the 2D points in the reference image correspond to 3D feature points on the mesh. The goal of the algorithm is to allow 3D reconstruction of the surface mesh when matching is difficult and depth estimates are available for only a limited number of points on the surface. Our approach performs SfM under the constraint that the deformation is isometric.
1.2 Outline of the Paper
This paper is organized as follows. To represent an isometric surface, a triangular mesh together with a planar reference configuration is used. In Sect. 3, the matching between data from the range and RGB cameras is described, followed by the estimation of the depth of the mesh vertices from the depth of the feature points. The entire approach for estimating the 3D shape and motion is based on minimizing the sum of the reprojection errors and the errors on the projected lengths of the mesh edges. Experimental results and a quantitative evaluation are presented in the last section. We show that our approach handles isometry indirectly, without having to apply this constraint explicitly. In addition, it obviates the need for a dense set of 3D points lying on the surface through effective use of a ToF sensor.
2 Notation and Background
2.1 Notation
Matrices are represented as bold capital letters (\(\mathbf{A}\in \mathbb{R}^{n\times m}\), with n rows and m columns). Vectors are represented as bold lowercase letters (\(\mathbf{a}\in \mathbb{R}^{n}\), with n elements); by default, a vector is a column. Lowercase letters (a) represent scalars. The jth column of \(\mathbf{A}\) is written \(\mathbf{a}_{j}\), the jth element of a vector \(\mathbf{a}\) is written \(a_{j}\), and the element of \(\mathbf{A}\) in row i and column j is written \(A_{i,j}\). \(\mathbf{A}^{(1:2)}\) and \(\mathbf{a}^{(1:2)}\) denote the first two rows of \(\mathbf{A}\) and \(\mathbf{a}\), while \(\mathbf{A}^{(3)}\) and \(\mathbf{a}^{(3)}\) denote their third rows. Regular capital letters (\(A\)) denote scalar constants. A vector or matrix that is represented only up to a scale factor is marked accordingly. There may be a few exceptions to this notation; the aim, however, is to comply with it as closely as possible.
2.2 Barycentric Coordinates
3 Combining Depth and RGB Images
3.1 Mapping Between Depth and RGB Images
The resolutions of the depth and RGB images are different. A major issue arising directly from this difference is that a pixel-to-pixel correspondence between the two images cannot be established even if the FOVs fully overlap. Therefore, the two images have to be registered so that the mapping between pixels in the ToF image and pixels in the RGB image can be established. The depth map provided by the ToF camera is sparse and affected by errors. Several methods can be used to improve the resolution of the depth images [15, 16, 17, 25], allowing the estimation of a dense depth image. We use a simple approach based on linear interpolation.
We use a pinhole camera model for both the RGB and the ToF cameras and assume that both are internally calibrated, i.e., their intrinsic parameters are known. We also assume that the relative pose between the two cameras, specified by a rotation matrix \(\mathbf{R}^{'}\) and a translation vector \(\mathbf{t}^{'}\), has been estimated. Let \(\mathbf{p}_{tof}\) and \(\mathbf{p}_{rgb}\) denote the 3D coordinates of a point in the coordinate systems of the ToF and RGB cameras, respectively; then \(\mathbf{p}_{rgb}=\mathbf{R}^{'}\,\mathbf{p}_{tof}+\mathbf{t}^{'}\), as shown in Fig. 1. The point cloud \(\mathbf{p}_{tof}\) is obtained directly from the calibrated ToF camera, and since the relative pose and the intrinsic parameters of both cameras are known, \(\mathbf{p}_{rgb}\) can be computed from \(\mathbf{p}_{tof}\). To estimate depth for all pixels of the RGB image, a simple linear interpolation procedure is used: for each 2D point of the RGB image, we select the 4 closest neighbors whose depth was obtained from the depth image and perform a bilinear interpolation. Another possibility would be to select the 3 closest neighboring points (defining a triangle), assume that the corresponding 3D points define a plane, and estimate the depth of the point by intersecting its projection ray with that 3D plane.
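The mapping and interpolation described above can be sketched as follows. This is a minimal illustration assuming calibrated pinhole models; the names `tof_depth_to_rgb`, `R_rel`, `t_rel`, and `K_rgb` are illustrative, and SciPy's `griddata` (linear interpolation over a Delaunay triangulation) stands in for the paper's neighbor-based interpolation:

```python
import numpy as np
from scipy.interpolate import griddata

def tof_depth_to_rgb(p_tof, R_rel, t_rel, K_rgb, rgb_size):
    """Map a ToF point cloud into the RGB frame and interpolate a dense
    depth map. All parameter names are illustrative placeholders."""
    # Rigid transform into the RGB camera frame: p_rgb = R' p_tof + t'
    p_rgb = (R_rel @ p_tof.T).T + t_rel            # shape (N, 3)
    # Perspective projection with the RGB intrinsics
    uv = (K_rgb @ p_rgb.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # pixel coordinates
    depth = p_rgb[:, 2]                            # depth in the RGB frame
    # Dense depth by linear interpolation over the sparse ToF samples
    w, h = rgb_size
    gu, gv = np.meshgrid(np.arange(w), np.arange(h))
    dense = griddata(uv, depth, (gu, gv), method='linear')
    return dense                                   # (h, w); NaN outside hull
```

Pixels outside the convex hull of the projected ToF samples receive no depth estimate (NaN), which is consistent with the assumption that the ToF FOV is contained in the RGB FOV.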
3.2 Recovery of the Mesh Depth
4 Global Metric Estimation of Structure and Motion
4.1 Constraint 1: Projected Length
Assume that the RGB camera motion relative to the world coordinate system is expressed as a rotation matrix \(\mathbf{R}\) and a translation vector \(\mathbf{t}\). A common approach to solving for the camera motion and surface structure is to minimize the image reprojection error, namely by bundle adjustment, where the cost function is the geometric distance between the image points and the reprojected points. However, we adapt bundle adjustment to our own problem rather than use it directly: the errors to be minimized are the differences between the observed and the predicted projected lengths of the mesh edges.
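A projected-length residual of this kind can be sketched as below, under a perspective pinhole model; the function name and argument layout are illustrative, not the paper's implementation:

```python
import numpy as np

def projected_length_residuals(V, edges, R, t, K, obs_len):
    """Residuals between predicted and observed projected edge lengths.
    V: (n_v, 3) mesh vertices in world coordinates; edges: list of (i, j)
    vertex-index pairs; obs_len: observed 2D length of each edge in the
    image. A hypothetical sketch of the projected-length constraint."""
    Vc = (R @ V.T).T + t                 # vertices in the camera frame
    uv = (K @ Vc.T).T
    uv = uv[:, :2] / uv[:, 2:3]          # perspective projection
    i, j = np.array(edges).T
    pred = np.linalg.norm(uv[i] - uv[j], axis=1)   # predicted 2D lengths
    return pred - obs_len                # zero when lengths agree
```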
4.2 Constraint 2: Reprojection Error
The reprojection error term accounts for the following sources of error:

- Errors due to the depth interpolation;
- Irregular distribution of the feature points over the object surface.

The term corresponding to the reprojection error can be obtained as indicated below.
4.3 Objective Function
So far, we have derived two constraints expressed as two separate nonlinear problems. However, we intend to integrate both constraints into a single objective function so that they are taken into account at once when estimating all the parameters. To do so, we minimize their weighted sum, where the reprojection error term is assigned a weight m that accounts for its relative influence within the combined objective function. A block diagram of the overall structure of the approach is shown in Fig. 2. In our global optimization, we first consider a simplified formulation of the objective function that excludes the camera motion \(\left[\mathbf{R}\,|\,\mathbf{t}\right]\); we include it back in the second case.
Estimation of Structure Only. The constraints are simplified so that the only unknown parameter is the structure (the camera motion is fixed to \(\left[\mathbf{I}\,|\,\mathbf{0}\right]\)).

Orthographic camera: \(\min_{\mathbf{w}}\, e_{tot}=e_{pl}+m\, e_{re}\)

Perspective camera: \(\min_{\mathbf{w}}\, e_{tot}=e_{mpl}+m\, e_{mre}\)
Estimation of Both Structure and Camera Motion. We now consider the full optimization, including the camera motion.

Orthographic camera: \(\min_{\mathbf{R}^{(1:2)},\,\mathbf{w}}\, e_{tot}=e_{pl}+m\, e_{re}\)

Perspective camera: \(\min_{\mathbf{R},\,\mathbf{w}}\, e_{tot}=e_{mpl}+m\, e_{mre}\)
The above optimization problems can be solved using a nonlinear minimization algorithm such as Levenberg-Marquardt (LMA). The rotation estimates obtained from this optimization may not satisfy the orthonormality constraints, so the optimization algorithm must be fed with a good initialization. To provide initial estimates relatively close to the true ones, we proceed as follows: if initial guesses for \(\mathbf{R}^{(1:2)}\) and \(\mathbf{R}\) are not given, they can be initialized using well-known methods that solve for SfM through non-rigid factorization of \(\left\{ q_{ij}\right\}\) and \(\left\{ \lambda_{ij}q_{ij}\right\}\) over all frames, for instance as in [13]. In these methods, the factorization is followed by a refinement step that upgrades the reconstruction to metric. The deformation coefficients \(\mathbf{w}_{k}\) are initialized to small random values. One possible way to further enforce the rotation constraints is to subsequently apply a Procrustes alignment [19, 20].
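The structure-only minimization can be sketched with SciPy's Levenberg-Marquardt solver. This is a hypothetical illustration, not the paper's implementation: `model` stands in for a linear deformation model mapping coefficients \(\mathbf{w}\) to mesh vertices, the camera is fixed at \(\left[\mathbf{I}\,|\,\mathbf{0}\right]\), the reprojection term is evaluated at the vertices for simplicity, and the weight m is applied as \(\sqrt{m}\) on the residuals so that the squared cost matches \(e_{pl}+m\,e_{re}\):

```python
import numpy as np
from scipy.optimize import least_squares

def refine_structure(w0, model, edges, obs_len, obs_uv, K, m=1.0):
    """Minimize e_tot = e_pl + m * e_re over deformation coefficients w,
    with the camera fixed at [I | 0]. `model(w) -> (n_v, 3)` vertices is
    a placeholder for a linear deformation model."""
    def residuals(w):
        V = model(w)                            # mesh from coefficients
        uv = (K @ V.T).T
        uv = uv[:, :2] / uv[:, 2:3]             # projected vertices
        i, j = np.array(edges).T
        e_pl = np.linalg.norm(uv[i] - uv[j], axis=1) - obs_len
        e_re = (uv - obs_uv).ravel()            # reprojection residuals
        return np.concatenate([e_pl, np.sqrt(m) * e_re])
    return least_squares(residuals, w0, method='lm').x
```

The full problem additionally exposes \(\mathbf{R}\) (parametrized, e.g., by an axis-angle vector) in the parameter vector, with a Procrustes step afterwards to restore orthonormality, as noted above.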
4.4 Additional Constraint
5 Experiments
5.1 Synthetic Data
Preliminary results.

Reconstruction error           PRE      MRE      Rotation accuracy
Our approach (orthographic)    0.0608   0.0755   0.002
Our approach (perspective)     0.0603   0.0751   \(<1^{\circ }\)
Reconstruction Error. The accuracy of the method is reported in terms of reconstruction errors, computed with respect to the following two measures:
1. Point reconstruction error (PRE): the normalized Euclidean distance between the observed (\(\hat{\mathbf{p}}_{i}\)) and estimated (\(\mathbf{p}_{i}\)) world points, \(PRE=\frac{1}{N}\sum _{i=1}^{N}\left[ \left\| \mathbf{p}_{i}-\hat{\mathbf{p}}_{i}\right\| ^{2}/\left\| \hat{\mathbf{p}}_{i}\right\| ^{2}\right]\).
2. Mesh reconstruction error (MRE): the normalized Euclidean distance between the observed (\(\hat{\mathbf{v}}_{i}\)) and estimated (\(\mathbf{v}_{i}\)) mesh vertices, \(MRE=\frac{1}{n_{v}}\sum _{i=1}^{n_{v}}\left[ \left\| \mathbf{v}_{i}-\hat{\mathbf{v}}_{i}\right\| ^{2}/\left\| \hat{\mathbf{v}}_{i}\right\| ^{2}\right]\). The reprojection error of the feature points can also be regarded as a measure of precision. The accuracy of the Stiefel rotation matrix is evaluated against the orthonormality constraint as \(RotationAccuracy=\left\| \mathbf{R}^{(2\times 3)}\mathbf{R}^{(2\times 3)T}-\mathbf{I}^{(2\times 2)}\right\| _{F}^{2}\). In the case of the perspective camera, we compare the axis-angles of the recovered and ground-truth rotations as \(RotationAccuracy=\left| angle-\hat{angle}\right| ^{2}\).
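The two normalized error measures translate directly into code; a small sketch with hypothetical function names:

```python
import numpy as np

def pre(p_est, p_true):
    """Point reconstruction error: mean of ||p_i - p̂_i||² / ||p̂_i||²."""
    return np.mean(np.sum((p_est - p_true) ** 2, axis=1)
                   / np.sum(p_true ** 2, axis=1))

def mre(v_est, v_true):
    """Mesh reconstruction error: same normalized distance, averaged
    over the n_v mesh vertices instead of the N feature points."""
    return np.mean(np.sum((v_est - v_true) ** 2, axis=1)
                   / np.sum(v_true ** 2, axis=1))
```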
Length of the Edges. When a 3D surface is reconstructed in a truly inextensible way, the lengths of the recovered edges must equal those of the template edges. To quantify how well the lengths are preserved along the deformation path, we define a metric measuring the discrepancy between the template and recovered lengths: \(IsometryExtent=\left( 1-\frac{1}{n_{e}}\sum _{i=1}^{n_{e}}\left( \left| L_{i}-\hat{L}_{i}\right| /\hat{L}_{i}\right) \right) \times 100\,\%\). This metric reaches 95.77 % for the proposed method, indicating that edge lengths are largely preserved and that the isometry constraint is satisfied to a high degree.
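The isometry-extent metric is equally direct to compute; `isometry_extent` is an illustrative name:

```python
import numpy as np

def isometry_extent(L_rec, L_tmpl):
    """Percentage to which recovered edge lengths L_rec match the
    template lengths L_tmpl (100 % = perfectly isometric)."""
    L_rec = np.asarray(L_rec, dtype=float)
    L_tmpl = np.asarray(L_tmpl, dtype=float)
    rel = np.abs(L_rec - L_tmpl) / L_tmpl   # per-edge relative deviation
    return (1.0 - rel.mean()) * 100.0
```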
The Impact of Noise. Different levels of noise, in the image points and in the depth data, were simulated to demonstrate how robustly the approach reacts to noise; the two types of noise were investigated separately. Figure 4 shows the results for increasing levels of Gaussian noise in the features' image points, with the standard deviation varying from 0 to 4 pixels in 1-pixel increments. Figure 5 shows the reconstruction error for various levels of Gaussian noise in the depth of the feature points, with the standard deviation increased in 0.1 cm increments; since the depth variation of the surface itself is small, the deviations from the true depth of each 3D point remain close together, varying at each trial according to a Gaussian distribution. From Figs. 4 and 5, we conclude that white noise does not have a dramatic impact on the output: the performance remains stable and the algorithm continues to work efficiently in the presence of noise.
5.2 Real Data
We also performed experiments with real data recorded using a camera setup comprising a ToF camera and an RGB camera. The configuration was chosen so that the FOV of the ToF camera is part of the FOV of the 2D camera, and the setup was calibrated both internally and externally. Bilinear interpolation was applied to estimate the depth of each 2D point track. We used a piece of cardboard to produce truly inextensible deformations and tracked and matched a few feature points against the reference template using the SIFT local feature descriptor. The same deformation model as in the synthetic experiments was employed. Some deformations and their recovered shapes are shown in Fig. 7. Although a quantitative assessment and benchmarking were not possible, the effectiveness of the approach is visible in the 3D reconstruction output.
5.3 Comparative Evaluations Using Motion Capture Data
Rather than generating the training data synthetically using Blender, we take advantage of datasets recorded with a Vicon system, which captures real deformations accurately. Since synthetically deformed meshes might not exactly match the real deformations, we rebuilt the deformation model from this real data and repeated the experiments. The template configuration is now composed of equal triangles and covers a square-like area of \(20\times 20\) cm. As an example, the reconstructed surfaces in Fig. 7 look better than those in Fig. 6. Consequently, when learned from real data, the deformation model is more robust to the deformations.
Comparison between the proposed approach and the SOCP-based one.

Reconstruction error    PRE      MRE
Our approach            0.0120   0.0185
SOCP-based approach     0.0162   0.0217
6 Conclusions
In this paper, we have proposed a SfM framework combining a monocular RGB camera and a ToF sensor to reconstruct surfaces that deform isometrically. The ToF camera provides the depth of a sparse set of feature points, from which the depth of the mesh is recovered using a multivariate linear system. The key advantage of the RGB/ToF system is that it benefits from high-resolution RGB data in combination with low-resolution depth information. Our approach to inextensible surface reconstruction is formulated as an optimization problem based on two nonlinear constraints. Finally, we carried out a set of experiments showing that the approach produces good results. As a next objective, we plan to extend the approach to non-rigid surfaces that are not isometric, e.g., conformal surfaces.
References
1. Aanaes, H., Kahl, F.: Estimation of deformable structure and motion. In: Workshop on Vision and Modelling of Dynamic Scenes, ECCV, Denmark (2002)
2. Del Bue, A., Llado, X., Agapito, L.: Non-rigid metric shape and motion recovery from uncalibrated images using priors. In: IEEE Conference on Computer Vision and Pattern Recognition, New York (2006)
3. Salzmann, M., Hartley, R., Fua, P.: Convex optimization for deformable surface 3D tracking. In: IEEE International Conference on Computer Vision (2007)
4. Salzmann, M., Fua, P.: Reconstructing sharply folding surfaces: a convex formulation. In: IEEE Conference on Computer Vision and Pattern Recognition (2007)
5. Salzmann, M., Urtasun, R., Fua, P.: Local deformation models for monocular 3D shape recovery. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1054–1061 (2008)
6. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of applicable surfaces from single views. In: Pajdla, T., Matas, J.G. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 482–496. Springer, Heidelberg (2004)
7. Prasad, M., Zisserman, A., Fitzgibbon, A.W.: Single view reconstruction of curved surfaces. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1345–1354 (2006)
8. Shen, S., Shi, W., Liu, Y.: Monocular 3D tracking of inextensible deformable surfaces under L2-norm. IEEE Trans. Image Process. 19, 512–521 (2010)
9. Perriollat, M., Hartley, R., Bartoli, A.: Monocular template-based reconstruction of inextensible surfaces. Int. J. Comput. Vis. 95(2), 124–137 (2010)
10. Salzmann, M., Moreno-Noguer, F., Lepetit, V., Fua, P.: Closed-form solution to non-rigid 3D surface registration. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 581–594. Springer, Heidelberg (2008)
11. Metaxas, D., Terzopoulos, D.: Constrained deformable superquadrics and nonrigid motion tracking. PAMI 15, 580–591 (1993)
12. Torresani, L., Hertzmann, A., Bregler, C.: Learning non-rigid 3D shape from 2D motion. In: NIPS, pp. 580–591 (2003)
13. Llado, X., Del Bue, A., Agapito, L.: Non-rigid 3D factorization for projective reconstruction. In: BMVC (2005)
14. White, R., Forsyth, D.: Combining cues: shape from shading and texture. In: CVPR (2006)
15. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In: Proceedings of NIPS (2005)
16. Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range images. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, Minneapolis, pp. 1–8 (2007)
17. Kim, H., Tai, Y.W., Brown, M.S., Kweon, I.S.: High quality depth map upsampling for 3D-TOF cameras. In: IEEE International Conference on Computer Vision (ICCV), Barcelona, pp. 1623–1630 (2011)
18. Brunet, F., Hartley, R., Bartoli, A., Navab, N., Malgouyres, R.: Monocular template-based reconstruction of smooth and inextensible surfaces. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 52–66. Springer, Heidelberg (2011)
19. Akhter, I., Sheikh, Y., Khan, S.: In defense of orthonormality constraints for nonrigid structure from motion. In: CVPR, pp. 1534–1541 (2009)
20. Xiao, J., Chai, J., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: Pajdla, T., Matas, J.G. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 573–587. Springer, Heidelberg (2004)
21. Zhou, H., Li, X., Sadka, A.H.: Nonrigid structure-from-motion from 2D images using Markov chain Monte Carlo. IEEE Trans. Multimed. 14(1), 168–177 (2012)
22. Srivastava, S., Saxena, A., Theobalt, C., Thrun, S.: Rapid interactive 3D reconstruction from a single image. In: VMV, pp. 9–28 (2009)
23. Paladini, M., Del Bue, A., Stosic, M., Dodig, M., Xavier, J., Agapito, L.: Factorization for non-rigid and articulated structure using metric projections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2898–2905 (2009)
24. Dai, Y., Li, H., He, M.: A simple prior-free method for non-rigid structure-from-motion factorization. In: CVPR, pp. 2018–2025 (2012)
25. Kim, Y.M., Theobalt, C., Diebel, J., Kosecka, J., Miscusik, B., Thrun, S.: Multi-view image and ToF sensor fusion for dense 3D reconstruction. In: Computer Vision Workshops (ICCV Workshops), pp. 1542–1549 (2009)