A Benchmark and Evaluation of Non-Rigid Structure from Motion

Non-rigid structure from motion (NRSfM) is a long-standing and central problem in computer vision, and its solution is necessary for obtaining 3D information from multiple images when the scene is dynamic. A main issue hindering the further development of this important computer vision topic is the lack of high-quality data sets. We address this issue by presenting a data set created for this purpose, which is made publicly available and is considerably larger than the previous state of the art. To validate the applicability of this data set, and to provide an investigation into the state of the art of NRSfM, including potential directions forward, we present a benchmark and a rigorous evaluation using this data set. This benchmark evaluates 16 different methods with available code that reasonably span the state of the art in sparse NRSfM. This new public data set and evaluation protocol will provide benchmark tools for further development in this challenging field.

THE estimation of structure and motion from an image sequence, i.e. the structure from motion (SfM) or monocular simultaneous localization and mapping (SLAM) problem, is one of the central problems within computer vision. This problem has received a lot of attention, and truly impressive advances have been made over the last ten to twenty years. It plays a central role in robot navigation, self-driving cars, and 3D reconstruction of the environment, to mention a few applications. A central part of maturing regular SfM has been the availability of sizeable data sets with rigorous evaluations, e.g. [1], [2].
The regular SfM problem, however, primarily deals with rigid objects, which is somewhat at odds with the world we see around us. That is, trees sway, faces take on various expressions, and organic objects are generally non-rigid. The task of making this obvious and necessary extension of the SfM problem is referred to as the non-rigid structure from motion (NRSfM) problem, a problem that also has a central place in computer vision. The solution to this problem is, however, not as mature as for the regular SfM problem. A reason for this is the scarcity of high-quality data sets and accompanying evaluations. Such data and evaluations allow us to better understand the problem domain and better determine what works best and why. To address this issue, we here introduce a high-quality data set, with accompanying ground truth (or reference data, to be more precise), aimed at evaluating non-rigid structure from motion. To the best of our knowledge, this data set is significantly larger and more diverse than what has previously been available; cf. Section 3 for a comparison to previous evaluations of NRSfM. The presented data set better captures the variability of the problem, and lends greater statistical strength to the conclusions reached via it. Accompanying this data set, we have conducted an evaluation of 16 state-of-the-art methods, hereby validating the suitability of our data set and providing insight into the state of the art within NRSfM. This evaluation was part of the competition we held at a CVPR 2017 workshop aimed at NRSfM. It is our hope and belief that this data set and evaluation will help further the state of the art in NRSfM research, by providing insight and a benchmark. The data set is publicly available at http://nrsfm2017.compute.dtu.dk/dataset. This paper is structured by first giving an overview of the NRSfM problem, followed by an overview of related work w.r.t. other data sets. This is then followed by a presentation of our data set, including
an overview of the design considerations, cf. Section 3, which is followed by a presentation of our proposed protocol for evaluation, cf. Section 4. This leads to the results of our benchmark evaluation in Section 5. The paper is rounded off with a discussion and conclusions in Section 6.

THE NRSfM PROBLEM
In this section, we will provide a brief introduction to the NRSfM problem, followed by a more detailed overview of the ways this problem has been addressed. The intention is to establish a taxonomy to base our experimental design and evaluation upon. For a more in-depth review of NRSfM, we recommend the survey of Salzmann et al. [3].
The standard/rigid SfM problem, cf. e.g. [4], is an inverse problem aimed at finding the camera positions (and possibly internal parameters) as well as the 3D structure, typically represented as a static 3D point set Q, that best describe a sequence of 2D images of a rigid body. Here the 2D images are typically reduced to a sparse set of tracked 2D point features, corresponding to the 3D point set Q. The most often employed observation model, linking 2D image points to 3D points and camera motion, is either the perspective camera model or the weak perspective approximation hereof. The weak perspective camera model is derived from the full perspective model by simplifying the projective effect of 3D point depth, i.e. the distance between camera and 3D point.
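As an illustration of this approximation (our own numerical sketch, not code from any cited method; the helper names are ours), the two projection models can be compared directly. Weak perspective divides by the average depth of the point set rather than by each point's own depth:

```python
import numpy as np

def project_perspective(Q, f=1.0):
    """Full perspective projection: divide by each point's own depth."""
    return f * Q[:, :2] / Q[:, 2:3]

def project_weak_perspective(Q, f=1.0):
    """Weak perspective: divide all points by the average depth of the set."""
    return f * Q[:, :2] / Q[:, 2].mean()

# Points whose depth variation is small compared to their distance:
Q = np.array([[0.1, 0.2, 10.0],
              [-0.3, 0.1, 10.2],
              [0.2, -0.1, 9.8]])
# The two models nearly agree in this regime.
err = np.abs(project_perspective(Q) - project_weak_perspective(Q)).max()
```

The approximation is good exactly when the depth spread is small relative to the camera distance, which is why close-up objects with strong depth changes break it.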
The extension from rigid structure from motion to the non-rigid case is made by allowing the 3D structure, here points Q_f, to vary from frame to frame, i.e.

w_{f,p} = Π_f(Q_{f,p}),

where w_{f,p} is the observed 2D image point, Π_f the camera model at frame f, and Q_{f,p} the 3D position of point p at frame f. To make this NRSfM problem well-defined, a prior or regularization is often employed. These priors mostly target the spatial and temporal variations of Q_f. The fitness of the prior to the deformation in question is a crucial element in successfully solving the NRSfM problem, and a main difference among NRSfM methods is this prior.
In this study, we categorize NRSfM methods according to a three-category taxonomy, i.e. the deformable model used (statistical or physical), the camera model (affine, weak or full perspective) and the ability to deal with missing data. In the remainder of this section, this taxonomy will be elaborated upon and related to the literature, leading up to a discussion of how the NRSfM methods we evaluate, cf. TABLE 1, span the state of the art.

Deformable Models
The description of our taxonomy starts with the underlying structure deformation model category, divided into statistical and physical models.

Statistical
This set of algorithms applies a statistical deformation model with no direct connection to the physical process of structure deformation. Such models are in general heuristically defined a priori to enforce constraints that can reduce the ill-posedness of the NRSfM problem. The most used model in the NRSfM literature, the low-rank model, falls into this category, utilizing the assumption that 3D deformations are well described by linear subspaces (also called basis shapes). This property was first used in 2000 by Bregler, Hertzmann and Biermann [5] to instantiate the solution of NRSfM by solving a factorization problem, analogous to the approach of Tomasi and Kanade for the rigid case [6]. However, strongly nonlinear deformations, such as those appearing in articulated shapes, may drastically reduce the effectiveness of such models. Moreover, the low-rank model acts mainly as a constraint over the spatial distribution of the deforming point cloud and does not restrict the temporal variations of the deforming object.
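The low-rank assumption can be stated compactly: each frame's shape is a linear combination of K basis shapes, so the matrix stacking all shapes over time has rank at most K. A small numerical sketch (our own illustration, not any specific method's code):

```python
import numpy as np

rng = np.random.default_rng(0)
F, P, K = 50, 30, 3               # frames, points, number of basis shapes

B = rng.normal(size=(K, 3 * P))   # K basis shapes, each flattened to 3P values
C = rng.normal(size=(F, K))       # per-frame mixing coefficients
S = C @ B                         # one shape per row: frame f is C[f] @ B

# The F x 3P matrix of all shapes over time has rank at most K.
rank = np.linalg.matrix_rank(S)
```

Factorization-based NRSfM exploits precisely this rank deficiency of the (projected) measurement matrix to recover C, B and the cameras jointly.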
Given this observation, Akhter et al. [7] were the first to propose constraining the temporal deformations of the object, using a set of DCT bases, thus assuming that deformations act with low-frequency components. This principle was supported by a study indicating a correlation between 3D bases extracted by PCA on MoCap sequences of human motion and the DCT bases, i.e. the distribution of the linear weights closely resembles the DCT ones [8]. Even at the expense of introducing a new parameter, this principle of smoothing deformations in the temporal domain was able to achieve reasonable results in human motion modelling, even when applied to synthetically generated sequences with a large camera motion [7].
Differently, Gotardo et al. [9] had the intuition to use the very same DCT bases to model the camera and deformation motion instead, assuming those factors are smooth in a video sequence. This approach was later expanded by explicitly modeling a set of complementary rank-3 spaces, and by constraining the magnitude of deformations in the basis shapes [10]. An extension of this framework increased the generalization of the model to non-linear deformations via a kernel transformation on the 3D shape space using radial basis functions [11]. This switch of perspective addressed the main issue of increasing the number of available DCT bases, allowing more diverse motions, while not restricting the complexity of deformations. Later, further extensions and optimizations have been made to low-rank and DCT basis approaches. Valmadre and Lucey [12] noticed that the trajectory should be a low-frequency signal, thus laying the ground for an automatic selection of the DCT basis rank via penalizing the trajectory's response to one or more high-pass filters. Moreover, spatio-temporal constraints have been imposed on both temporal and spatial deformations [13].
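The low-frequency DCT idea can be sketched as follows (our own illustration of the principle, not code from [7] or [9]): a smooth trajectory is represented almost exactly by its first few DCT coefficients, so truncating the basis both compresses and regularizes it.

```python
import numpy as np

def dct_basis(F, K):
    """First K orthonormal DCT-II basis vectors over F frames (as columns)."""
    n = np.arange(F)
    D = np.cos(np.pi * (n[:, None] + 0.5) * np.arange(K)[None, :] / F)
    D[:, 0] *= 1.0 / np.sqrt(F)
    D[:, 1:] *= np.sqrt(2.0 / F)
    return D

F, K = 100, 10
Theta = dct_basis(F, K)

# One coordinate of one point over time: a smooth, low-frequency trajectory.
t = np.linspace(0.0, 1.0, F)
traj = np.sin(2 * np.pi * t)
coeffs = Theta.T @ traj          # project onto the truncated DCT basis
recon = Theta @ coeffs           # reconstruct from only K coefficients
rel_err = np.linalg.norm(recon - traj) / np.linalg.norm(traj)
```

In trajectory-space methods the unknown per-point trajectories are parametrized directly by such coefficients, which reduces the number of unknowns from F to K per coordinate.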
Recently, a new prior model based on the Kronecker-Markov structure of the covariance of the time-varying 3D points was shown to generalize several previously introduced priors very well [14]. Another recent improvement is given by Ansari et al.'s usage of DCT bases in conjunction with singular value thresholding for camera pose estimation [15].
Similar spatial and temporal priors have been introduced as regularization terms while optimizing a cost function solving the NRSfM problem, mainly using a low-rank model only. Torresani et al. [16] proposed a probabilistic PCA model for modelling deformations by marginalizing some of the variables, assuming Gaussian distributions for both noise and deformations. Moreover, in the same framework, a linear dynamical model was used to represent the deformation at the current frame as a linear function of the previous one. Brand [17] penalizes deformations over the mean shape of the object by introducing a sensible parameter for the degree of flexibility of the shape. Del Bue et al. [18] instead compute a more robust non-rigid factorization, using a 3D mean shape as a prior for NRSfM [19]. In a nonlinear optimization framework, Olsen et al. [20] include l2 penalties both on the frame-by-frame deformations and on the closeness of the reconstructed 3D points to their 2D projections. Of course, penalty costs introduce a new set of hyper-parameters that weight the terms, implying the need for further tuning, which can be impractical when cross-validation is not an option. Regularization has also been introduced in formulations of Bundle Adjustment for NRSfM [21], mainly by including smoothness of deformations via l2 penalties [22] or constraints over the rigidity of pre-segmented points in the measurements [23].
Another important statistical principle is enforcing that the low-rank bases are independent. In the coarse-to-fine approach of Bartoli et al. [24], basis shapes are computed sequentially by adding the basis that explains most of the remaining variance with respect to the previous ones. They also impose a stopping criterion, thus achieving automatic computation of the overall number of bases. The concept of basis independence clearly calls for a statistical model close to Independent Component Analysis (ICA). To this end, Brandt et al. [25] proposed a prior term to minimize the mutual information of each basis in the NRSfM model. Low-rank models are indeed compact but limited in the expressiveness of complex deformations; for this reason, an overcomplete representation can still be used by imposing sparsity over the selected bases [26]. In this way, 3D shapes in time can have a compact representation, and they can be theoretically characterized as a block-sparse dictionary learning problem. In a similar spirit, Hamsici et al. propose to use the input data for learning spatially smooth shape weights using rotation invariant kernels [27].
All these approaches for addressing NRSfM with a low-rank model have given rise to several non-linear optimization procedures, mainly using Alternating Least Squares (ALS), Lagrange multipliers and the alternating direction method of multipliers (ADMM). Torresani et al. first proposed to alternate between the solution of camera matrices, deformation parameters and basis shapes. This initial solution was then extended by Wang et al. [28] by constraining the camera matrices to be orthonormal at each iteration, while Paladini et al. [29] strictly enforced the matrix manifold of the camera matrices to increase the chances of converging to the global optimum of the cost function. All these methods were not designed to be strictly convergent; for this reason, a Bilinear Augmented Lagrange Multipliers (BALM) method [30] was introduced that is convergent while ensuring all the problem's constraints are satisfied. Furthermore, robustness to outlying data was then included to improve results in a proximal method with theoretical guarantees of convergence to a stationary point [31].
Despite the non-linearity of the problem, it is possible to relax the rank constraint with the trace norm and solve the problem with convex programming. Following this strategy, Dai et al. provided one of the first effective closed-form solutions to the low-rank problem [32]. Although their convex solution, resulting from the relaxation, did not provide the best performance, a subsequent iterative optimization scheme gave improved results. In this respect, Kumar et al. proposed a further improvement on their previous approach, where deformations are represented as a spatio-temporal union of subspaces rather than a single subspace [33]. Thus complex deformations can be represented as the union of several simple ones.
More recently, the Procrustean Normal Distribution (PND) model was proposed as an effective way to implicitly separate rigid and non-rigid deformations [34]. This separation provides a relevant regularization, since the rigid motion can be used to obtain a more robust camera estimation, while deformations are still sampled from a normal distribution, similar to previous work [16]. Such a separation is obtained by enforcing an alignment between the reconstructed 3D shapes at every frame. This should in practice factor out the rigid transformations from the statistical distribution of deformations. The PND model has since been extended to deal with more complex deformations and longer sequences [35].

Physical
Physical models represent a less studied class of NRSfM methods, though they should ideally be the most accurate for modelling NRSfM. Of course, applying the right physical model requires knowledge of the deformation type and object material, which is information not readily available a priori.
A first class of physical models assumes that the non-rigid object is piecewise rigid, i.e. a collection of pre-defined or estimated patches that are mostly rigid or only slightly deformable. One of the first approaches to use this strategy is that of Varol et al. [36]. By preselecting a set of overlapping patches from the 2D image points, and assuming each patch is rigid, homography constraints can be imposed on each patch, followed by global 3D consistency being enforced using the overlapping points. However, the rigidity of a patch, even if small, is a very hard constraint to impose, and it does not generalise well to every non-rigid shape. Moreover, dense point matches over the image sequence are required to ensure a set of overlapping points among all the patches. A relaxation of the piecewise rigid constraint was given by Fayad et al. [37], assuming each patch deforms with a quadratic physical model, thus accounting for linear and bending deformations. These methods all require an initial patch segmentation and a number of overlapping points; to this end, Russell et al. [38] optimize the number of patches and their overlap by defining an energy-based cost function. The method of Lee et al. [39] instead uses 3D reconstructions of multiple combinations of patches and defines a 3D consensus between sets of patches. This approach provides a fast way to bypass the segmentation problem and a robust mechanism to prune out wrong local 3D reconstructions.
Differently from these approaches, Taylor et al. [40] construct a triangular mesh connecting all the points, and consider each triangle as being locally rigid. Global consistency is here imposed to ensure that the vertices of each triangle coincide in 3D. Again, this approach is to a certain extent similar to [36], in that it requires a dense set of points in order to comply with the local rigidity constraint.
A strong prior, which helps dramatically to mitigate the ill-posedness of the problem, is obtained by considering the deformation isometric, i.e. the metric length of curves on the surface does not change when the shape is subject to deformations (e.g. paper and, to some extent, metallic materials). Using the assumption that a surface can be approximated as infinitesimally planar, Chhatkuli et al. [41] proposed a local method that frames NRSfM as the solution of partial differential equations (PDEs), which is able to deal with missing data as well. A further update [42] formalizes the framework in the context of Riemannian geometry, which led to a practical method for solving the problem in linear time and scaling to a large number of views and points. Furthermore, a convex formulation for NRSfM with inextensible deformation constraints was implemented using Second-Order Cone Programming (SOCP), leading to a closed-form solution to the problem [43]. Vicente and Agapito implemented soft inextensibility constraints [44] in an energy minimization framework, e.g. using recently introduced techniques for discrete optimization.
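The isometry assumption can be illustrated with a toy example (our own sketch, not code from the cited methods): bending a flat strip onto a cylinder preserves distances along the surface while shrinking Euclidean distances between far-apart points.

```python
import numpy as np

def edge_lengths(points, edges):
    """Euclidean length of each edge (i, j) between neighbouring points."""
    return np.array([np.linalg.norm(points[i] - points[j]) for i, j in edges])

u = np.linspace(0.0, np.pi, 10)
flat = np.stack([u, np.zeros_like(u), np.zeros_like(u)], axis=1)       # flat strip
bent = np.stack([np.sin(u), np.zeros_like(u), 1 - np.cos(u)], axis=1)  # bent onto a cylinder

edges = [(i, i + 1) for i in range(9)]
# Neighbouring-point distances are (approximately) preserved by the bend...
ratio = edge_lengths(bent, edges) / edge_lengths(flat, edges)
# ...while the end-to-end Euclidean distance shrinks considerably.
shrink = np.linalg.norm(bent[-1] - bent[0]) / np.linalg.norm(flat[-1] - flat[0])
```

Isometric NRSfM methods exploit exactly this invariance: local (geodesic) distances constrain the 3D shape even though global Euclidean distances change freely.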
Another set of approaches tries to directly estimate the deformation function using higher-order models. Del Bue and Bartoli [45] extended and applied 3D warps, such as the thin-plate spline, to the NRSfM problem. Starting from an approximate mean 3D reconstruction, the warping function can be constructed and the deformation at each frame can be solved for by iterating between camera and 3D warp field estimation. Finally, Agudo et al. introduced the use of Finite Element Models (FEM) in NRSfM [46], [47]. As these models are highly parametrized, requiring knowledge of the material properties of the object (e.g. the Young modulus), the FEM needs to be approximated in order to be efficiently estimated. However, in ideal conditions it might achieve remarkable results, since FEM is a consolidated technique for modelling structural deformations.

Missing Data
The initial methods for NRSfM assumed complete 2D point matches among the views observing a deformable object. However, given self-occlusions and standard occlusions, this is rarely the case. Most approaches for dealing with such missing data in NRSfM are framed as a matrix completion problem, i.e. estimating the missing entries of the measurement matrix W given known constraints (mainly matrix low-rankness). Torresani et al. [48] first proposed removing the rows and columns of the matrix corresponding to missing entries in order to solve the NRSfM problem. However, this strategy suffers greatly from even small percentages of missing data, since the subset of fully known entries can be very small. Dai et al. [32] complete the missing entries via convex optimisation by relaxing the rank constraint using the matrix trace norm. Indeed, this method can be robust to more missing entries, though it is computationally viable only for smaller-scale problems. Most of the iterative approaches include an update step for the missing entries [29], [30], where the missing entries become explicit unknowns to estimate. Gotardo et al. [9] instead strongly reduce the number of parameters by estimating only the camera matrix explicitly under severe missing data. This variable reduction is known as VARPRO in the optimization literature. It has recently been revisited in relation to several structure from motion problems [49].
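The matrix completion view can be sketched with a simple "hard impute" scheme, alternating rank truncation and re-insertion of the observed entries. This is a generic illustration of the idea, not the specific algorithms cited above:

```python
import numpy as np

def complete_low_rank(W, mask, rank, n_iter=500):
    """Fill missing entries of W (mask == False) assuming W is low rank."""
    X = np.where(mask, W, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # truncate to the target rank
        X[mask] = W[mask]                          # keep observed entries fixed
    return X

rng = np.random.default_rng(2)
W = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 40))   # rank-4 "measurement matrix"
mask = rng.random(W.shape) > 0.2                          # roughly 20% entries missing

W_hat = complete_low_rank(W, mask, rank=4)
rel_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
```

With moderate missing-data ratios and a correct rank this recovers the matrix well; as the text notes, performance degrades quickly when the observed subset becomes too small.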

Camera Model
Most NRSfM research focuses on the modelling and optimization aspects, and most methods assume a weak perspective camera model. However, in cases where the object is close to the camera and undergoing strong changes in depth, time-varying perspective distortions can significantly affect the measured 2D trajectories.
As low-rank NRSfM is treated as a factorization problem, a straightforward extension is to follow best practices from rigid SfM for the perspective camera. Xiao and Kanade [50] have, for example, developed a two-step factorization algorithm for reconstruction of 3D deformable shapes under the full perspective camera model. This is done using the assumption that a set of basis shapes is known to be independent. Vidal and Abretske [51] have also proposed an algebraic solution to the non-rigid factorization problem. Their approach is, however, limited to the case of an object being modelled with two independent basis shapes and viewed in five different images. Wang et al. [52] proposed a method able to deal with the perspective camera model, but under the assumption that its internal calibration is already known. They update the solutions from a weak perspective to a full perspective projection by refining the projective depths recursively, and then refine all the parameters in a final optimization stage. Finally, Hartley and Vidal [53] have proposed a closed-form linear solution for the perspective camera case. This algorithm requires the initial estimation of a multifocal tensor, which the authors report is very sensitive to noise. Llado et al. [54], [55] proposed a non-linear optimization procedure. It is based on the fact that it is possible to detect nearly rigid points on the deforming shape, which can provide the basis for a robust camera calibration.

Evaluated Methods
We have chosen a representative subset of the aforementioned methods, which are summarized according to our taxonomy in TABLE 1. This gives us a good representation of recent work, distributed according to our taxonomy with a decent span of deformation models (statistical/physical) and camera models (orthographic, weak perspective or perspective). This also takes into account in-group variations, such as DCT bases for statistical deformation and isometry for physical deformation. Even lesser-used priors, such as compressibility, are represented. While this is not a full factorial study, we think this reasonably spans the recent state of the art of NRSfM. Our choice has, of course, also been influenced by method availability, as we want to test the authors' original implementations, to avoid our own implementation bias/errors. All in all, we have included 16 methods in our evaluation. Note that we chose not to include the method of Taylor et al. [40], despite it being publicly available, due to its poor performance, i.e. it failed approximately two-thirds of the time.

DATASET
As stated, in order to compare state-of-the-art methods for NRSfM, we have compiled a larger data set for this purpose. Even though there is a lack of empirical evidence w.r.t. NRSfM, this does not imply that no data sets for NRSfM exist.
As an example, in [39], [9], [10], [11], [33], [8], [27] and [32], a combination of a few data sets is used, namely seven sequences of a human body from the CMU motion capture database [56], two MoCap sequences of a deforming face [57], [58], a computer-animated shark [57] and a challenging flag sequence [37]. To the best of our knowledge, this list represents the most used evaluation data sets for NRSfM with available ground truth.
The CMU data set [56] captures the motion of humans. Since the other frequently used data sets are also related to animated faces [57], [58], there is a high over-representation of humans in this state of the art, and a higher variability in the deformed scenes viewed is deemed beneficial. In addition, the shark sequence [57] is not based on real images and objects, but on computer graphics and pure simulation. As such, there is a need for new data sets, with reliable ground truth or reference data 1 , and a higher variability in the objects and deformations used.
As such, we here present a data set consisting of five widely different objects/scenes and motions. Based on mechanically driven, and therefore repeatable, object motions, we have defined six different camera motions employing two different camera models. This setup, all in all, gives 60 different sequences organized in a factorial experimental design, thus enabling a more stringent statistical analysis. In addition to this, since we have tight 3D surface models of our objects or scenes, we are able to determine occlusions of all 2D feature points. This in turn gives a realistic handling of missing data, which is often due to object self-occlusion.
1. With real measurements like ours, the 'ground truth' data also includes noise, which is why 'reference data' is a more correct term.
As indicated, these data sets are achieved by stop motion using mechanical animatronics. These are recorded in our robotic setup previously used for generating high-quality data sets, cf. e.g. [59]. We will here present details of our data capture pipeline, followed by a brief outline and discussion of design considerations.
The goal of the data capturing is to produce three types of related data:
Ground Truth: A series of 3D points that change over time.
Input Tracks: 2D tracks used as input for NRSfM.
Missing Data: Binary data representing which tracks are occluded at which frame.
We record the step-wise deformation of our animatronics from K static views, obtaining both image data and dense 3D surface geometry. We obtain 2D point features by applying standard optical flow tracking to the image sequence obtained from each of the K views, which is then reprojected onto the recorded surface geometry. The ground truth is then the union of these 3D tracks. By using optical flow for tracking instead of MoCap markers, we obtain a more realistic set of ground truth points. We create the input 2D points by projecting the recorded ground truth using a virtual camera in a fully factorial design of camera paths and camera models.
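The projection step that turns ground truth into 2D input can be sketched as follows; the helper `make_input_tracks` and its argument layout are our own illustration of the idea, not the actual pipeline code:

```python
import numpy as np

def make_input_tracks(Q, cams, perspective=True):
    """Project ground-truth 3D tracks Q (F x P x 3) into 2D tracks (F x P x 2)
    with a virtual camera (rotation, translation) per frame."""
    F, P, _ = Q.shape
    tracks = np.empty((F, P, 2))
    for f in range(F):
        R, t = cams[f]
        Qc = Q[f] @ R.T + t                     # transform into camera coordinates
        if perspective:
            tracks[f] = Qc[:, :2] / Qc[:, 2:3]  # perspective: divide by depth
        else:
            tracks[f] = Qc[:, :2]               # orthographic: drop depth
    return tracks

# One frame, two points, a static identity camera:
Q = np.array([[[1.0, 2.0, 4.0], [0.0, 0.0, 2.0]]])
cams = [(np.eye(3), np.zeros(3))]
persp = make_input_tracks(Q, cams, perspective=True)
ortho = make_input_tracks(Q, cams, perspective=False)
```

Running both camera models over the same ground truth is what makes the orthographic-versus-perspective comparison in the evaluation possible.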
In the following we will detail some of the central parts of the above procedure.

Animatronics & Recording Setup
Our stop-motion animatronics are five mechatronic devices capable of computer-controlled gradual deformation. They are shown in Fig. 2, and cover five types of deformation: Articulated Motion, Bending, Deflation, Stretching, and Tearing. We believe this covers a good range of interesting and archetypal deformations. It is noted that NRSfM has previously been tested on bending and tearing [40], [44], [43], [39], but without ground truth for quantitative comparison. Additionally, elastic deformations, like deflation and stretching, are quite commonplace, but have not appeared in any previous data sets, to the best of our knowledge.
The animatronics can hold a given deformation or pose for a long period of time, thus allowing us to record each deformation step from multiple views with our structured light scanner.

Recording Procedure
The recording procedure acquires, for each shape, a series of image sequences and surface geometries of its deformation over F frames. We record each frame from K static views with our aforementioned structured light scanner. As such, we obtain K image sequences with F images in each. We also obtain F dense surface reconstructions, one for each frame of the deformation. The procedure is summarized in pseudo code in Algorithm 1. Fig. 3 shows sample images of three views obtained using the above process.

3D Ground Truth Data
The next step is to take the acquired images I_{f,k} and surfaces S_f, and extract the ground truth points. We do this by applying optical flow tracking [61] to obtain 2D tracks, which are then reprojected onto S_f. The union of these reprojected tracks gives us the ground truth, Q. This process is summarized in pseudo code in Algorithm 2.

Projection using Virtual Camera
To produce the desired input, we project the ground truth Q using a virtual camera, similar to what has been done in [39], [9], [32], [58]. This step has two factors related to the camera that we wish to control for: path and camera model. To keep our design factorial, we define six different camera paths, which are all used to create the 2D input. They are illustrated in Fig. 4. We believe these are a good representation of possible camera motions, with both linear motion and panoramic panning. The camera model can be either orthographic or perspective. The factorial combination of these elements yields 12 input sequences for each ground truth. Additionally, as we have previously recorded the dense surface for each frame (see Sec. 3.2), we estimate missing data via self-occlusion. Specifically, we create a triangular mesh for each S_f and estimate occlusion via raycasting into the camera along the projection lines. This process is summarized in pseudo code in Algorithm 3.
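The occlusion idea can be illustrated with a much-simplified point-based version (the real pipeline raycasts against a triangulated surface; this sketch, with our own names and threshold, only shows the depth-comparison principle):

```python
import numpy as np

def occlusion_flags(points_cam, radius=0.05):
    """Simplified self-occlusion test in camera coordinates: a point is
    flagged occluded if another point projects to (almost) the same image
    location but lies closer to the camera."""
    proj = points_cam[:, :2] / points_cam[:, 2:3]
    z = points_cam[:, 2]
    occluded = np.zeros(len(points_cam), dtype=bool)
    for i in range(len(points_cam)):
        d = np.linalg.norm(proj - proj[i], axis=1)
        occluded[i] = np.any((d < radius) & (z < z[i] - 1e-6))
    return occluded

# Two points on the same viewing ray: the nearer one hides the farther one.
pts = np.array([[0.0, 0.0, 2.0],
                [0.0, 0.0, 4.0],
                [1.0, 0.0, 2.0]])
flags = occlusion_flags(pts)
```

Raycasting against the recorded mesh replaces the crude image-distance threshold here with an exact surface intersection test, but the output is the same kind of binary per-point, per-frame visibility mask.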

Discussion
While stop motion does allow for diverse data creation, it is not without drawbacks. Natural acceleration is easily lost when objects deform in a step-wise manner, and the recordings are unnaturally free of noise such as motion blur. However, without this technique, it would have been prohibitive to create data with the desired diversity and accurate 3D ground truth.
The same criticism could be levied against the use of a virtual camera: it lacks the shakiness and acceleration of a real-world camera. On the other hand, it allows us to precisely vary both the camera path and the camera model. This enables us to perform a factorial analysis, in which we can study the effects of both on NRSfM. As we show in Sec. 5, some interesting conclusions are drawn from this analysis. Most NRSfM methods are designed with an orthographic camera in mind. As such, investigating the difference between data under orthographic and perspective projection is of interest. Such an investigation is only practically possible using a virtual camera.

EVALUATION METRIC
In order to compare the methods of TABLE 1 on our data set, a metric is needed. The purpose is to project the high-dimensional 3D reconstruction error into (ideally) a one-dimensional measure. Several different metrics have been proposed for NRSfM evaluation in the past literature, e.g. the Frobenius norm [62], mean [27], variance-normalized mean [10] and RMSE [40]. All of the above-mentioned evaluation metrics are based on the L2-norm in one form or another. A drawback of this is that the L2-norm is very sensitive to large errors, often letting a few outliers dominate an evaluation. To address this, we incorporate robustness into our metric by introducing truncation of the individual 3D point reconstruction errors. In particular, our metric is based on an RMSE measure similar to the one used in Taylor et al. [40].
The robust truncation is achieved in a manner similar to the widely used box plot outlier detection strategy [63]. Consider E being the set of point-wise errors, e_{f,p} = ||X_{f,p} − Q_{f,p}||, and E_1, E_3 the first and third quartile of that set. Now define the whisker as w = (3/2)(E_3 − E_1); then any point that is more than a whisker outside of the interquartile range (E_3 − E_1) is considered an outlier. This strategy works well for approximately normally distributed data [64]. With this in mind, our truncation function is defined as follows,

T(e) = min(e, E_3 + w),    (2)

and thus the robust RMSE is defined as,

RMSE_robust(X, Q) = sqrt( (1/|E|) Σ_{e ∈ E} T(e)² ).    (3)

An NRSfM reconstruction is given in an arbitrary coordinate system, thus we must align the reference and the reconstruction before computing the error metric. This is typically done via Procrustes analysis [65], but as it minimizes the distance between two shapes in an L2-norm sense, it is also sensitive to outliers. Therefore we formulate our alignment process as an optimization problem based on the robust metric of Eq. 3. Thus the combined metric and alignment is given by,

min_{s, R, t} RMSE_robust(s R X + t, Q),    (4)

where s = scale, R = rotation and reflection, t = translation.
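The truncated metric described above can be implemented directly (the variable names are ours):

```python
import numpy as np

def robust_rmse(X, Q):
    """Robust RMSE between reconstruction X and reference Q (both N x 3):
    point-wise errors beyond the upper box-plot whisker are truncated."""
    e = np.linalg.norm(X - Q, axis=1)
    q1, q3 = np.percentile(e, [25, 75])
    cutoff = q3 + 1.5 * (q3 - q1)      # E_3 plus the whisker w = (3/2)(E_3 - E_1)
    e = np.minimum(e, cutoff)          # truncate outliers instead of dropping them
    return np.sqrt(np.mean(e ** 2))

rng = np.random.default_rng(3)
Q = rng.normal(size=(100, 3))
X = Q + 0.01 * rng.normal(size=Q.shape)
X[0] += 100.0                          # one gross outlier
plain = np.sqrt(np.mean(np.linalg.norm(X - Q, axis=1) ** 2))
robust = robust_rmse(X, Q)
# The outlier dominates the plain RMSE but barely moves the robust one.
```

Truncating rather than discarding outliers keeps the metric bounded while still penalizing every point, so a method cannot benefit from producing a few wildly wrong points.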
An implication of using a robust metric, as opposed to an L2-norm, is that the minimization problem of (4) cannot be solved by a standard Procrustes alignment, as done in [40]. As such, we optimize (4) using the Levenberg-Marquardt method, where s, R and t have been initialized via Procrustes alignment [66].
In summary, (4) defines the alignment and metric that have been used for the evaluation presented in Sec. 5.
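For reference, the closed-form Procrustes step used to initialize the Levenberg-Marquardt refinement can be sketched as follows (numpy only; a similarity-transform fit with names of our choosing, and the actual implementation may differ in details):

```python
import numpy as np

def procrustes_init(X, Q):
    """Closed-form similarity alignment of reconstruction X onto reference Q.

    X, Q: (N, 3) point sets in correspondence. Returns s, R, t such that
    s * X @ R + t approximates Q in the least-squares sense. This serves
    only as an initialization of the robust optimization of Eq. (4).
    """
    mu_x, mu_q = X.mean(axis=0), Q.mean(axis=0)
    Xc, Qc = X - mu_x, Q - mu_q
    U, S, Vt = np.linalg.svd(Xc.T @ Qc)  # orthogonal Procrustes problem
    R = U @ Vt                           # rotation (reflection allowed)
    s = S.sum() / (Xc ** 2).sum()        # optimal isotropic scale
    t = mu_q - s * mu_x @ R
    return s, R, t
```

Since R is only required to be orthogonal here (rotation or reflection, matching the definition below Eq. (4)), no determinant correction is applied.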
Since the choice of metric always has a streak of subjectivity to it, we wanted to investigate the sensitivity of our choice. We did this by repeating our evaluation with another robust metric, where the minimum track-wise distance between the ground truth and the reconstruction was used. The major findings and conclusions, as presented in Sec. 5, were the same. As such we conclude that our findings are not overly sensitive to the choice of metric. Note that, due to space limitations and clarity of presentation, this sensitivity study is not treated further in this text.

EVALUATION
With our data set and robust error metric, we have performed a thorough evaluation and analysis of the state-of-the-art in NRSfM, which is presented in the following. This is done in part as an explorative analysis and in part to answer some of what we see as the most pressing open questions in NRSfM. Specifically:

- Which algorithms perform the best?
- Which deformable models have the best performance/generalization?
- How well can the state-of-the-art handle data from a perspective camera?
- How well can the state-of-the-art handle occlusion-based missing data?

To answer these questions, we perform our analysis in a factorial manner, aligned with the factorial design of our data set. To do this, we view an NRSfM reconstruction as a function of the following factors:

Algorithm a_i: Which algorithm was used.
Camera Model m_j: Which camera model was used (perspective or orthographic).
Animatronics s_k: Which animatronics sequence was reconstructed.
Camera Path p_l: How the camera moved.
Missing Data d_n: Whether occlusion-based missing data was used.

We design our evaluation to be almost fully crossed, meaning we obtain a reconstruction for every combination of the above factors. The only missing part is that the authors of MultiBody [33] only submitted reconstructions for the orthographic camera model.
Our factorial experimental design allows us to employ a classic statistical method known as ANalysis Of VAriance (ANOVA) [67]. The ANOVA not only allows us to deduce the precise influence of each factor on the reconstruction, but also allows for testing their significance. To be specific, we model the reconstruction error in terms of the following bilinear model,

y = µ + a_i + m_j + s_k + p_l + d_n
    + as_{ik} + ap_{il} + ad_{in} + ms_{jk} + mp_{jl} + md_{jn}
    + sp_{kl} + sd_{kn} + pd_{ln} ,    (5)

where y = reconstruction error, µ = overall average error, and xy_{ij} = interaction term between factors x_i and y_j.
This model, Eq. (5), contains both linear and interaction terms, meaning the model reflects factor influences both as independent effects and as cross effects, e.g. as_{ik} is the interaction term for 'algorithm' and 'animatronics'. For each term, we test for significance by choosing between two hypotheses,

H_0 : c_n = 0 ,    (6)
H_1 : c_n ≠ 0 ,    (7)

with c_n being a term from (5), e.g. a_i or md_{jn}. Typically, H_0 is referred to as the null hypothesis, meaning the term c_n has no significant effect. ANOVA allows for estimating the probability of falsely rejecting the null hypothesis for each factor. This statistic is referred to as the p-value. A term is referred to as statistically significant if its p-value is below a certain threshold. In this paper we consider a significance threshold of 0.0005, or approximately 3.5σ. As such, we can clearly evaluate which factors are important for NRSfM and which are not.
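The correspondence between the 0.0005 threshold and roughly 3.5σ can be verified directly from the two-sided normal tail probability (a quick sanity check, nothing more):

```python
import math

# Two-sided tail probability of a standard normal distribution at 3.5 sigma:
# erfc(z / sqrt(2)) = 2 * P(Z > z).
p = math.erfc(3.5 / math.sqrt(2.0))
# p is approximately 4.7e-4, i.e. close to the 0.0005 threshold used here.
```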
Another interesting property of the ANOVA is that the coefficients of a given factor sum to zero, e.g. Σ_i a_i = 0, so each factor can be seen as adjusting the predicted reconstruction error away from the overall average. It should be noted that the 'algorithm'/'camera model' interaction am_{ij} has been left out of (5) due to MultiBody [33] only being tested with one camera model. The error model of (5) is not directly applicable to the errors of all algorithms, as not all state-of-the-art methods from TABLE 1 can deal with missing data. As such, we perform the evaluation in two parts: one where we disregard missing data and include all available methods from TABLE 1, and one where we use the subset of methods that handle missing data and utilize the full model of (5). The former is covered in Sec. 5.1 and the latter in Sec. 5.2.
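To make the sum-to-zero property concrete, here is a minimal two-factor version of such a decomposition (a toy sketch with made-up numbers, not our actual error tables): given a balanced table of mean errors indexed by two factors, the overall mean, main effects and interaction terms can be extracted as follows, and each group of coefficients sums to zero.

```python
import numpy as np

# Toy table of mean reconstruction errors:
# rows = 'algorithm', cols = 'camera model'. Numbers are illustrative only.
E = np.array([[10.0, 12.0],
              [15.0, 14.0],
              [20.0, 25.0]])

mu = E.mean()                          # overall average error
a = E.mean(axis=1) - mu                # 'algorithm' main effects a_i
m = E.mean(axis=0) - mu                # 'camera model' main effects m_j
am = E - mu - a[:, None] - m[None, :]  # interaction terms am_ij

# The decomposition reproduces the table exactly:
# E[i, j] == mu + a[i] + m[j] + am[i, j]
```

Each coefficient thus reads directly as an adjustment of the predicted error away from the overall average, which is how the tables in this section should be interpreted.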

Evaluation without missing data
In the following, we discuss the results of the ANOVA without taking 'missing data' into account, using the model of Eq. (5) without the terms related to d_n:

y = µ + a_i + m_j + s_k + p_l + as_{ik} + ap_{il} + ms_{jk} + mp_{jl} + sp_{kl} .    (8)

The results of the ANOVA using Eq. (8) are summarized in TABLE 2. All factors except ms_{jk} and mp_{jl} are statistically significant. As such, we can conclude that all the remaining factors have a significant influence on the reconstruction error. Therefore, we will explore the specifics of each factor in the following, starting with 'algorithm'.
TABLE 3 shows the average reconstruction error for each algorithm. The method MultiBody [33] has the lowest average reconstruction error over all experiments, followed by KSTA [11] and RIKS [27]. For more detailed insight, refer to TABLE 4, which shows the 'algorithm'/'animatronic' dependent reconstruction error. As can be seen, MultiBody [33] does not have the lowest error for all animatronics; e.g. KSTA [11] has significantly lower error on the Tearing and Articulated deformations. Both of these can roughly be described as rigid bodies moving relative to each other, and it would seem KSTA [11] is the best at handling such deformations.
Methods with a physical prior, like MDH [43] and SoftInext [44], seem not to perform very well, as is evident from TABLES 3, 4 and 5. MDH [43] is designed with an isometry prior, so one would expect it to perform well on the Bending deformation. Indeed, while its interaction term as_{ik} has its lowest value for the Bending deformation, the average reconstruction error is simply too high.
A similar trend can be observed in TABLE 5, which shows the 'algorithm'/'camera path' dependent reconstruction error. While MultiBody [33] has the lowest average error, it is surpassed on the Half Circle and Tricky camera paths by RIKS [27]. On the other hand, MultiBody has the lowest error under the Circle path by quite a significant margin.
From this analysis we can conclude that MultiBody performs the best on average, but is surpassed on certain camera paths and animatronic deformations by algorithms such as RIKS [27] and KSTA [11]. This clearly indicates that one needs to control for both deformation type and camera motion in future NRSfM comparisons, as the above conclusion could be changed by choosing the right combination of camera path and deformation. On the other hand, these findings show that NRSfM performance can be optimized by choosing the right camera path (e.g. Zigzag) and the right algorithm for the deformation in question.
The camera also has a significant impact on reconstruction error. Two factors relate to the camera: 'camera path' and 'camera model'. TABLE 8 shows that there is a significant difference in average error w.r.t. 'camera path'. It is interesting to note that the Circle path has one of the highest average errors, only surpassed by the Tricky camera path. The latter was specifically designed to be challenging; as such, it is surprising to find that the Circle and Tricky paths' average errors only differ by 3.08mm. In fact, MultiBody [33] seems to be the only method that benefits from the circular type of camera path, as can be seen in TABLE 5. TABLE 6 shows the average error of reconstructions for an orthographic and a perspective camera model. As can be seen, there is a difference of 7.20mm, which is significant but not as large as the differences w.r.t. 'algorithm' (TABLE 3) or 'camera path' (TABLE 8). This suggests that, while the error increases, the state-of-the-art in NRSfM can still operate under a perspective camera model. This is quite interesting, as most NRSfM methods are not designed with a perspective camera in mind. It would seem that an orthographic or weak-perspective camera acts as a reasonable approximation at the scale of our animatronics.
There is also a significant difference between the average reconstruction errors of the animatronics, as TABLE 7 shows. Articulated has by far the highest average reconstruction error, making it the most difficult to reconstruct for the current state-of-the-art in NRSfM. Since most approaches use low-rank methods, a highly structured motion such as the Articulated one is difficult to handle with a low-rank prior, especially if points are densely sampled on all joints. On the other hand, Deflation seems to be quite easy to handle for most of the state-of-the-art methods.

Evaluation with Missing Data
As previously mentioned, we are interested in 'missing data' and its effect on NRSfM. We thus here use Eq. (5) to evaluate the subset of methods capable of handling missing data, as shown in TABLE 1. It should be noted that while MDH [43] is nominally capable of handling missing data, it has not been included in this part of the study. The reason is that the provided code only reconstructs frames with a minimum ratio of visible data, so our error metric cannot be applied. As such, we have 8 methods in total in this category.
We treat 'missing data' as a categorical factor having two states: with or without missing data. This is because the percentage of occlusion-based missing data varies with the animatronic in question, which would make it difficult to distinguish between the influence of the 'missing data' factor and the 'animatronics' factor. The results of the ANOVA are summarized in TABLE 9, and all factors except ms_{jk}, mp_{jl} and md_{jn} are statistically significant. This means that 'missing data' has a significant influence on the reconstruction error. TABLE 12 shows the interaction between 'algorithm' and 'missing data'. As expected, the mean errors without missing data are very similar to the averages in TABLE 3, with KSTA [11] having the lowest expected error. However, with missing data, MetricProj [29] actually has the lowest average reconstruction error. This is due to its small increase in error of 5.85mm when operating under occlusion-based missing data. In comparison, KSTA [11], CSF2 [10] and CSF [9] are much more unstable, with average increases in error of 9.65mm, 18.15mm and 13.49mm respectively. Common to these three methods is that they assume a Discrete Cosine Transform (DCT) basis as their prior. Indeed, we see a similar increase of 16.52mm for ScalableSurface, and this method also uses a DCT basis.
These results suggest that while DCT-based approaches are quite accurate without missing data, they are not very robust when operating under occlusion-based missing data. Thus, they would likely not be very robust when applied to real-world deformations, where occlusion-based missing data is unavoidable. This indicates that future research should focus on making DCT basis methods more robust, or on modifying the DCT model to better generalize to missing data. Finally, the BALM [30] method exhibits some peculiar behavior, as its average error actually decreases by 3.33mm, contrary to expectation.
TABLE 11 shows the average error as an interaction between 'animatronic' and 'missing data'. The main difference between the two groups of animatronics is that the ratio of missing data is consistently low for the in-plane deformations. This would suggest that the ratio of missing data has an impact on the reconstruction error. TABLE 10 shows the average error as an interaction between 'camera path' and 'missing data'. The Tricky path has by far the highest average error. This is expected, as the small camera movement ensures that a portion of the tracked points is consistently hidden. As such, while Tricky and Circle were approximately equally difficult without missing data, this is no longer the case with missing data, as Circle's average error only increases by 4.9mm. Indeed, all other camera paths have approximately the same increase in error with missing data. These paths also ensure that all observed points are equally visible. So while these camera paths nominally have approximately the same percentage of missing data as the Tricky path, the spatio-temporal distribution is different. These results suggest that the distribution of missing data is as important as its ratio in affecting the reconstruction error. Indeed, this is in line with the observations made by Paladini et al. [29].
The aforementioned observations demonstrate the importance of testing against occlusion-based missing data, as it contains a spatio-temporal structure that a randomly removed subset lacks. Many NRSfM methods treat missing data as a matrix fill-in problem, i.e. recreating missing values by interpolation of spatio-temporally close observations. Thus, it is easy to see that it is conceptually much easier to interpolate random, evenly distributed missing data than the spatio-temporally clustered structure of occlusion-based missing data. It is noted that KSTA [11] and CSF [9] were both evaluated using randomly subsampled missing data in the original works, and were found to have approximately the same reconstruction error from 0% to 50% missing data. These results are obviously quite different from the conclusions of our study, and we hypothesize that the spatio-temporal structure of our occlusion-based missing data is the primary cause of this.

DISCUSSION AND CONCLUSION
To summarize our findings, we would firstly like to mention that the algorithm with the lowest error on average without missing data was found to be MultiBody [33]. There is, however, a large variation between the different algorithms' performance depending on the factors chosen. As such, our study does not conclude that MultiBody [33] is definitively better than all other methods in general. As an example, for some camera paths RIKS [27] had lower average error than MultiBody [33]. Also, with missing data, MetricProj [29] has the lowest reconstruction error. Other observations include that methods with a DCT basis were found to have a large increase in error with occlusion-based missing data.
Our study also has findings that support and form hypotheses of where future NRSfM work could head. In relation to this, it should be mentioned that even though some of these hypotheses have been stated elsewhere before, it is a strength of this work and our data set that it confirms them. Firstly, it is clear that methods using the weak perspective approximation to the perspective camera model only incur a small penalty for doing so on average. This camera model seems like a good approximation, although it should be noted that our data set does not challenge the algorithms extremely in this regard, with only an average 1.6-fold change in depth.
Another main avenue of investigation was the effect of missing data. Here we found that this aspect has a large impact on the reconstruction error. This is somewhat at odds with previous findings, and we speculate that this has to do with our missing data having structure originating from object self-occlusion, as opposed to missing data generated by random sampling. In particular, occlusion-based missing data increases the reconstruction error of all methods except BALM [30]. Our study thus indicates this to be a fruitful area of investigation for NRSfM research.
Another observation is that the physically based methods did quite poorly compared to the methods using a statistically based deformation model. This is in a sense counter-intuitive, provided that the physical models capture the deformation physics well. This, in turn, leads us to the observation that stronger efforts toward better physically based deformation models could be beneficial.
As stated, many of these observations support hypotheses held in the NRSfM community, and it strengthens them that we have here provided empirical support. On the other hand, this study also helps to validate the suitability of our compiled data set. In this regard, it should be noted that both deformation types and camera paths have a statistically significant impact on the reconstruction error, regardless of the algorithm used. This indicates that our proposed taxonomy and the data set design have value.
All in all, we have here presented a state of the art data set for NRSfM evaluation. We have applied 16 different NRSfM methods to this data set, methods that span the state of the art of NRSfM. This evaluation validates the usability of our proposed, and publicly available, data set, and gives several insights into the current state of the art of NRSfM, including directions for further research.

Henrik Aanaes is associate professor in computer vision at the Technical University of Denmark, where he is, among other things, heading an effort for making large high quality data sets for 3D computer vision. His interests mainly lie in 3D computer vision and its applications, where he has e.g. also worked with NRSfM.

Fig. 1. Mobile structured light scanner used to acquire 3D data for the data set.

Fig. 2. Animatronic systems used for generating specific types of non-rigid motion.

Algorithm 1: Process for recording image data for tracking and dense surface geometry for an animatronic.
1  Let F be the number of frames
2  Let K be the set of static scan views
3  for f ∈ F do
4      Deform animatronic to pose f
5      for k ∈ K do
6          Move scanner to view k
7          Acquire image I_{f,k}
8          Acquire structured light scan S_{f,k}
9      end
10     Combine scans S_{f,k} for full, dense surface S_f
11 end

Algorithm 3: Creation of input tracks W_{c,p} and missing data D_{c,p} from ground truth Q for each combination of camera path p and camera model c.
1  Let F be the number of frames
2  Let P be the set of camera paths shown in Fig. 4
3  Let C be either perspective or orthographic
4  Let Q_f be the ground truth at frame f
5  Let S_f be the surface at frame f
6  for S_f ∈ {S_1 ... S_F} do
7      Estimate mesh M_f from S_f
8  end
9  for c ∈ C do
10     for p ∈ P do
11         for f ∈ F do
12             Set camera pose to p_f
13             Project Q_f using model c to get points w_f
14             Do occlusion test of q_f against M_f to get missing data d_f
15         end
16         W_{c,p} = {w_1 ... w_F}
17         D_{c,p} = {d_1 ... d_F}
18     end
19 end

Fig. 4. Camera path taxonomy. The box represents the deforming scene and the wiggles illustrate the main direction of deformation, e.g. the direction of stretching.

Sebastian Hoppe Nesgaard Jensen is a Ph.D. student employed at the Image Analysis and Computer Graphics section at the Technical University of Denmark. A technical expert, he has worked extensively on building different datasets and with the robotic setup used for data acquisition.

Alessio Del Bue is the head of the Visual Geometry and Modelling (VGM) Lab at Istituto Italiano di Tecnologia (IIT), Genova, Italy. Starting his research on NRSfM in 2004, he has contributed to the field with novel non-linear optimization methods, shape priors, and studies of the motion manifold of rigid, non-rigid and articulated objects.

Mads Emil Brix Doest is a Ph.D. student employed at the section for Image Analysis and Computer Graphics at the Technical University of Denmark. His research is focused on optical scanners and appearance acquisition.

TABLE 1
Methods included in our NRSfM evaluation with annotations of how they fit into our taxonomy.

TABLE 2
ANOVA table for NRSfM reconstruction error without missing data. Sources are as defined in (5). All factors are statistically significant at the 0.0005 level except ms_{jk} and mp_{jl}.

TABLE 3
Linear term µ + a_i sorted in ascending numerical order; this is the average error for the given algorithm. Algorithms are referred to by their alias in TABLE 1. All numbers are given in millimeters.

TABLE 4
Interaction term µ + a_i + s_k + as_{ik}. This is equivalent to the algorithms' average error on each animatronic. The lowest error for each animatronic is marked in bold. Algorithms are referred to by their alias in TABLE 1. All numbers are given in millimeters.

TABLE 5
Interaction term µ + a_i + p_l + ap_{il}. Algorithms are referred to by their alias in TABLE 1. All numbers are given in millimeters.

TABLE 6
Linear term µ + m_j sorted in ascending numerical order; this is the average error for the given camera model. All numbers are given in millimeters.

TABLE 7
Linear term µ + s_k sorted in ascending numerical order; this is the average error for the given animatronic. All numbers are given in millimeters.

TABLE 8
Linear term µ + p_l sorted in ascending numerical order; this is the average error for the given camera path. All numbers are given in millimeters.

TABLE 9
ANOVA table for NRSfM reconstruction error with missing data. Factors are as defined in (5). All factors are statistically significant at the 0.0005 level except ms_{jk}, mp_{jl} and md_{jn}.

TABLE 10
Interaction term µ + p_l + d_n + pd_{ln}. Numbers are given in millimeters.

TABLE 11
Interaction term µ + s_k + d_n + sd_{kn}. Numbers are given in millimeters.

TABLE 12
Interaction between 'algorithm'/'missing data'; µ + a_i + d_n + ad_{in}. This is the average error for each algorithm, either with or without occlusion-based missing data. All numbers are given in millimeters.