Robust deformable shape reconstruction from monocular video with manifold forests

Existing approaches to recover structure of 3D deformable objects and camera motion parameters from an uncalibrated images assume the object’s shape could be modelled well by a linear subspace. These methods have been proven effective and well suited when the deformations are relatively small, but fail to reconstruct the objects with relatively large deformations. This paper describes a novel approach for 3D non-rigid shape reconstruction, based on manifold decision forest technique. The use of this technique can be justified by noting that a specific type of shape variations might be governed by only a small number of parameters, and therefore can be well represented in a low-dimensional manifold. The key contributions of this work are the use of random decision forests for the shape manifold learning and robust metric for calculation of the re-projection error. The learned manifold defines constraints imposed on the reconstructed shapes. Due to a nonlinear structure of the learned manifold, this approach is more suitable to deal with large and complex object deformations when compared to the linear constraints. The robust metric is applied to reduce the effect of measurement outliers on the quality of the reconstruction. In many practical applications outliers cannot be completely removed and therefore the use of robust techniques is of particular practical interest. The proposed method is validated on 2D points sequences projected from the 3D motion capture data for ground truth comparison and also on real 2D video sequences. Experiments show that the newly proposed method provides better performance compared to previously proposed ones, including the robustness with respect to measurement noise, missing measurements and outliers present in the data.


Introduction
Simultaneous recovery of three-dimensional (3D) sparse feature points representing evolving non-rigid 3D object (simply referred to as 3D structure or shape in the rest of the paper) and a relative camera motion over time from images obtained from a single uncalibrated camera is a challenging, underconstrained problem. The complexity of this problem can be made apparent by realising that reconstruction of the landmarks' 3D positions cannot be uniquely derived based on the knowledge of the locations of their corresponding projections in a single image alone. This can be seen from a simple observation that any 3D point along the projection line, linking the optical centre of the camera and the selected image point, can be equally considered as a valid 3D landmark if no additional information is available. In the stereovision the additional information comes from another image of a static scene taken from a different position, the knowledge of the correspondence between feature points in both images and the known camera motion. In that case the 3D landmarks' reconstruction is obtained by triangulation. For the problem described in this paper not only the camera motion is not known but also, in general, the 3D shape (represented by 3D landmarks) is changing between successive images. This is a hard problem because, as explained in Sect. 3, the number of unknowns defined by the 3D landmark's coordinates and the camera motion parameters is increasing faster than the size of the measurement data, consisting of the corresponding 2D image points. Therefore, when more measurements are available the harder the problem is as the difference between number of unknown and known is increasing. This is an example of the so-called ill-posed problem. To solve this kind of problems additional information need to be embedded into the problem or/and the problem needs to be reformulated. This process is called regularisation. The fundamental objective of the regularisation is to limit the number of feasible solutions by introducing constrains reflecting our prior knowledge about the problem, e.g. by forcing the solution to have a specific form or belong to a specific subspace or a manifold. Methods addressing deformable shape reconstruction from a monocular video differ essentially by the way such regularisation is introduced.
The methods proposed for dealing with this problem can be categorised, by the type of the regularisation technique used, into three major classes: The low-rank shape models [22], shape trajectory approaches [2,14,15], and template-based methods [17,27,39]. In all these methods the regularisation is achieved by reduction in the problem dimensionality. For example, in the low-rank models, the rank is defined by the number of elementary "shape's building blocks" or basis shapes from which the reconstructed shape is constructed. In principle, higher rank increases the flexibility of the model, leading to possibly more accurate reconstructions, but at the cost of the method increased susceptibility to the observation noise. Low-rank shape was firstly introduced in [6], where the factorisation algorithm is adopted to solve deformable shape reconstruction problem. As a time-varying object usually cannot arbitrarily deform, the idea of this model is to represent a deformable shape as a linear combination of basis shapes. Due to its simplicity, shape basis model has been widely used to tackle the problem of Non-Rigid Structure from Motion (NRSfM) [1,4,42]. However, the shape bases are different in each sequence, thus need to be estimated for every sequence. Besides, for relatively complex deformable shapes, a large number of bases are required to fit the model. Considering those drawbacks, a trajectory-based algorithm was proposed in [2] exploiting a duality theorem in 3D structure representation which models independent 3D point trajectories. The main advantage of this representation is that the basis trajectories can be predefined, thus removing a large number of unknowns from the estimation problem. Template-based reconstruction is an alternative method which usually relies on a known reference frame and works well especially for inextensible surfaces. Since this is still an ill-conditioned problem [21], the most commonly used constraints in the reconstruction involve preserving Geodesic distances as the surface deforms and thus regularise the problem by solving either convex optimisation problem [7,29] or in closed-form sets of quadratic equations [19,30]. The existing 3D reconstruction technologies have been successfully used in many different areas, ranging from medical imaging and biometrics to computer gaming and film production. A variant of that methodology, mostly dealing with static scenes, called simultaneous localisation and mapping (SLAM) is used for robot navigation where reconstruction is used to build a 3D representation of the environment and the camera pose estimation is equivalent to robot positioning in that environment [12,20].
However most of the existing approaches are restricted by the fact that they try to explain the complex deformations using a linear model. To move away from the linear representations of deformable shapes, recent methods have integrated the manifold learning algorithm [28] to regularise the shape reconstruction problem by constraining the shapes as to be well represented by the learned manifold. In simple terms, a manifold can be thought of as a smooth surface/curve embedded in a relevant multi-dimensional space. The advantage of using a manifold constraint, if such constraint adequately represents properties of the reconstructed objects, is a compact representation leading to robust regularisation. If for example, for a hypothetical problem, it is known that a valid shape could be accurately represented by points on a curve, the manifold method would effectively have the dimensionality of one, whereas the linear method would still require, possibly high-dimensional subspaces, to accurately model all turns and twists of the curve. In this case, there is also no guarantee that the reconstructed object would belong to that curve. Rabaud and Belongie firstly claimed in their work [24] that the possible 3D shapes of an object may not lie in a linear lowdimensional manifold. Based on a low-rank shape model, the work assumed that shapes lie on a d-dimensional manifold, and every neighbourhood of shape approximately lies on a d-dimensional linear subspace. In order to minimise the cost function which consists of the reprojection error and smoothing terms. The initial values are calculated by Rigid Shape Chain also introduced in [24], with sequences clustered into several rigid shapes. After initialisation, the optimisation on the shapes is performed using two criteria: the cost function and the shape manifold dimensionality constraint for which the locally smooth manifold learning technique has been used. Later they proposed a method focusing on a globally linear manifold and use shape embedding as initialisation [25]. Similarly, the work in [45] attempts to learn the 3D reconstruction of human motion with an assumption that human poses lie in a union of subspaces.
Other manifold-based methods were inspired by the basis trajectory model. Gotardo and Martinez demonstrated the "kernel trick" which used for nonlinear dimensionality reduction [31] can also be applied to standard NRSfM problem [15]. Recently Hamsici et al. [16] modelled the shape coefficients in a manifold feature space. This method has ability to recover shapes from a newly observed image. The mapping is learned from the corresponding 2D measurement data of upcoming reconstructed shapes, rather than a fixed set of trajectory bases. They introduced rotation-invariant kernels (RIK) to provide similarity measure for two 3D shapes based on their 2D projections. The problem still remains though that the 2D observations can be completely different when the images are taken from different view angles. Similarly because of different depths, similar 2D images may not represent similar 3D shapes. In comparison, [15] defines a nonlinear model while [16] models 3D shapes in a linear space; [15] uses point trajectory bases as input data for building a kernel function while [16] uses shapes directly from 2D images.
The problem becomes more difficult when the observations are incomplete. Most algorithms assume that all feature points are detected in all images. This is unlikely to happen in practice as some of the feature points will not be detected in all images. This could be because of the feature point detection problems or because some parts of the 3D object may not be visible from all the camera positions, which means some of the entries in the measurement matrix may be unknown. The methods addressing this problem can be divided into three categories: imputation, alternation and nonlinear optimisation. The problem was firstly addressed by filling the missing entries using complete subset of the data in rigid [36] and non-rigid reconstruction problem [44]. These imputation algorithms are simple but cannot handle well real data, which often tend to be very noisy. To overcome this, the alternation algorithms solve the problem based on a closed-form solution using a rank constraint imposed on the measurement matrix without estimating the missing values in advance [18,23]. The idea is to iteratively update the motion and shape in terms of observed measurements. Another commonly used method for addressing the missing data problem is to employ nonlinear minimisation of suitably designed cost function. The measurements can be gradually refined to produce jointly optimal 3D structures and camera motion. This is known as bundle adjustment which has been studied for many years [38]. The major benefit of this method is that the additional constraints can be effectively included in the cost function, though the inherently high number of degrees of freedom may lead to failure of obtaining reliable reconstruction results.
Despite many years of research, one significant problem still remains, the most existing approaches cannot cope well with outliers. Earlier work suggested that outliers have to be removed before doing any further processing [8]. Vidal et al. presented a geometrical algorithm for 3D motion segmentation which dealt with the data by using RANSAC to detect the outliers [40]. The outliers are usually caused by matching errors between two frames, and this may severely affect the trajectory methods because the trajectories passing through the feature points do not belong to any of the trajectory spaces. Additionally, most methods have been using least-square estimator, which is well known not to be robust to outliers. Simply removing the outliers in advance may not be feasible in practice, especially when real-time processing is required. Developing robust estimations is necessary for obtaining reliable solutions. The work in [13] presented a rank-constrained factorisation algorithm that effectively calculates a low-rank approximation of a measurement matrix in the presence of the data outliers. The problem is solved by replacing squared residual error function by L 1 norm which is often used to reduce sensitivity of the model. Figure 1 provides an overview of the proposed reconstruction system. In this paper, it is assumed that the feature points have been detected in the images and the 2D point correspondences are provided as input to the reconstruction algorithms. Although the paper is focused on the solution of the reconstruction problem, when both points and their correspondences are given, the imperfection in both point position and correspondence estimation are indirectly addressed by introduction of a robust metric. The reminder of the paper is organised as follows: Sect. 2 highlights the novel contributions of the paper. Section 3 describes the problem of deformable shape reconstruction and presents the notation used throughout the paper. Section 4 introduces the generic concept of the manifold forest. Subsequently Sect. 5 provides detailed description of the proposed manifold forest implementation aiding the shape reconstruction. In both these sections, an effort has been made to explain advantages of using manifold forest in shape reconstruction. Section 6 describes a novel robust approach dealing with missing data and outliers. Finally, Sects. 7 and 8 present experimental results and conclusions.

Novelty
Although the problem of deformable shape and motion recovery has been studied for many years, one of the severe limitations of the most existing approaches is that they mainly address the problem of small deformations. The main reason for their failure when recovering objects with large, complex deformations can be attributed to the reliance on a linear shape model. This paper focuses on modelling nonlinear deformable objects with large complex deformations, such as deformable cloth. In this case, most existing methods based on linear space are no longer suitable.
Contrary to other techniques using manifold in the shape reconstruction, our manifold is learned based on the 3D shapes rather than on 2D observations. The proposed implementation is based on the manifold forest method described in [11]. The main advantage of using manifold forest as com- Fig. 1 The pipeline of a complete 3D objects reconstruction system pared for example to standard diffusion maps [9] is the fact that in the manifold forest the neighbourhood topology is learned from the forests data clustering rather than being defined by the Euclidean distance. To the best of authors' knowledge, random forests technique has never been applied in the context of non-rigid shape reconstruction using. The idea of integrating nonlinear manifold-based approaches into 3D deformable reconstruction was firstly introduced by the authors in [34], where the shape prior is introduced in the form of the diffusion maps. In that work, the structure of data is estimated using Euclidean distances between pairs of data items, whereas the method proposed in this paper learns the structure from the data, based on random forests techniques.
This paper updates and extends the work in [33] with the following four main differences: (a) The method presented in this paper has an additional step in the algorithm, solving the problem when some elements of the measurement matrix are missing; (b) Considering the majority of algorithms are based on minimising squared residual of an error function which makes them sensitive to outliers, another improvement is to reduce the effect of outliers by replacing the L 2 estimator by robust M-estimator [26]; (c) A modification of the method is described when only a relatively small number of training shapes is available. This was firstly introduced by the authors in [32] but without random forests manifold learning technique; (d) More comprehensive set of experiments is presented in the experimental section.

Basic formulation
Throughout this paper, vectors and matrices are denoted as lower-and upper-case bold letters, whereas sets are repre-sented by calligraphic letters. We assume that 2D points (features) are obtained from F frames under an orthographic camera projection model. The problem consists of the recovery of shapes S = {S 1 , S 2 , . . . S F } and camera . . , Y F }, thus can be formulated as the 2D reprojection error minimisation problem: where, P represents orthographic camera projection matrix, Y t is a 2× P matrix of detected 2D feature points' coordinates in the t th image, and S t ∈R 3×P contains coordinates of P 3D points describing shape represented in the t th frame, · indicates Frobenius norm. The camera translation has been eliminated by expressing 2D observations Y with respect to the data points centroid calculated in each observed image.
The goal is to recover camera orientations R and the concatenated time-varying shapes S, based only on the 2D measurement Y. It is an under-constrained problem since the shape and motion are both changing with time. The number of unknown variables (3F + 3FP) is higher than the size of observed data (2FP). To deal with this, a low-rank shape model has proved to be successful as shown in [37], where the shape S t is represented as a linear combination of K unknown but fixed basis shapes B = {B 1 , B 2 , . . . B K }: where K F, P. The deformation coefficients θ tl are adjustable over time t. This low-rank shape model can be obtained by performing singular value decomposition (SVD) of the measurement matrix Y, for which the measurement matrix can be decomposed and represented by pose R, basis shapes B and time-varying coefficients θ tl , and it can be rearranged as: Since basis shapes B ∈ R 3K ×P and M ∈ R 2F×3K , the rank of the measurement matrix Y is 3K at most, in the absence of noise. The matrices M and B are computed by factorising the measurement matrix Y. The solution is not unique and is defined up to an ambiguity matrix Q ∈ R 3K ×3K . According to [41], the limitation of the closed-form solution in this approach is that the motion matrix is nonlinear, when an inaccurate set of basis shapes have been chosen, it may not be possible to remove the affine ambiguity. Our model departs from the linear shape model. The shape basis in the proposed method are selected from the learned shape manifold. The shape S t is represented as a linear combination of n +1 (where n is the dimension of the manifold introduced in Sect. 4.1) basis shapes B tl , adaptively selected from the learned manifold: S t = n+1 l=1 θ tl B tl . Unlike the low-rank shape model, where all the reconstructed shapes are represented as a linear combination of unknown but fixed K basis shapes, in the proposed method, the basis shapes may be different for each frame. Such approach adds an extra flexibility to the reconstruction process allowing a better adaptation of the method to the temporal shape changes. Although it may seem that this increases the number of parameters in the model, it should be clarified that all the basis shapes are selected from the manifold and are not estimated as a part of the optimisation process. The parameters to be estimated in the proposed approach include only the camera motion and shape coefficients, representing the shape in the local linear barycentric coordinates system approximating the manifold at the location corresponding to the current estimate of S t .

Manifold forests
In this paper, the manifold forests are constructed upon diffusion maps with the neighbourhood topology learned through random forest data clustering. It generates efficient representations of complex geometric structures even when the observed samples are non-uniformly distributed. This section gives an introduction to diffusion maps and randomised decision forests first, and then describes the application of random forests in learning diffusion map manifolds.

Diffusion maps
In many problems, data are difficult to represent or analysed due to their high-dimensional structure. However, in same cases, complex data might be governed by a small number of parameters. The goal of the manifold learning is to find the embedding function, mapping the data set X form a high, N = 3P-dimensional space to a reduced, n-dimensional space. The diffusion map is a graph-based nonlinear technique with quasi-isometric mapping from original shape space onto a lower-dimensional diffusion space. Unlike linear methods, nonlinear approaches are able to handle a wider range of data variability, preserving local structures at the same time. The problem with linear manifold methods is that the input data may have complex nonlinear dependencies and preserving global or indeed local structures in the data may not be possible utilising linear projections.
Assuming X is a dataset with M samples, the goal of dimensionality reduction problems is to find a mapping of δ is a kernel scale. k-nearest neighbour (kNN) sparsification scheme can also be applied, retaining k edges for each graph vertex and removing other connections to avoid outliers. Coifman et al. presented a justification behind using normalised graph Laplacian [9] by connecting them to diffusion distance. Each entry of the diffusion operator G is con- W is a renormalised version of the affinity matrix W using an anisotropic normalised graph Laplacian, such that The convergence of optimal embedding for diffusion maps is proven in [9] and is found via eigenvectors ϕ and their corresponding n biggest eigenvalues λ of the operator G, such that 1 = λ 0 > λ 1 ≥ · · · ≥ λ n , where ϕ j (X i ) represents ith element (corresponding to the ith training sample X i ) of the jth eigenvector of G.

Randomised decision forest
Random forest [10] has become a popular method given its capability to handle high-dimensional data, avoid overfitting, and enabling simple parallel implementation. The decision trees in our method are built by making decisions in each node of the tree based on randomly selected features. A random decision forest is an ensemble of such decision trees. The trees are different and independent from each other. Although other choices are possible, this paper is focused only on the binary decision forest. Given a set of training data X with M samples: The trees are randomised, by randomly selecting a single feature at each internal node. The decision function at the internal node is used to decide whether the data X i reaching that node should be assigned to its left or right child node. The threshold α m of the decision function at node m is selected as result of the maximisation of the information gain: with the generic information gain I m defined as: where |·| indicates a cardinality for the dataset. X m denotes the training data X reaching node m. X L m , X R m are the subsets assigned to the left and right child nodes of node m. In this paper it is assumed that data is adequately represented by the Gaussian distribution [11]. In that case the differential entropy H (X m ) can be calculated analytically as: where | (X m )| is the determinant of the covariance matrix estimated from the X m training data. By substituting (7) into (6), the information gain can be rewritten as: The trees are trained until the number of samples in a leaf is less than the pre-specified limit or the depth of the tree has exceeded the pre-defined depth.
Once the random forest has been trained, the new sample can be simply put through each tree. Depending on the result of the decision function at each internal node, the new data is sent to the left or right child node until it arrives at a leaf. The samples ending up in the same leaf are likely to be statistically similar and are expected to represent the same neighbourhood of the manifold. As such similarity measure is statistical in nature, thus the results is averaged over many decision trees. If the samples end up in the same leaf for the majority of the trees they are considered to be drawn from the similar location on the manifold.

Forest model for manifold learning
In the proposed method, the affinity model in manifold learning is built by applying random forest clustering. The data partition is defined based on the leaf node l(·) the input data X i would reach. The entries of the affinity matrix W t for tree t are calculated as, where the distance L can be obtained using different models. For instance, Gaussian or binary affinity for the data ending up in the same leaf node are defined as follows [11]: Gaussian affinity model Binary affinity model In this paper binary affinity model is used as it is simple and efficient, and can be considered to be parameter free. In a forest, each randomly trained clustering tree produces a disjoint partition of the data, and they are independent and different with respect to each other. A graphic representation of building affinity matrix using binary model is shown in Fig. 2. However, as affinity matrix computed for a single randomly trained tree is not representative of the correct similarities of the data, the ensemble of T trees is used to calculate a much smoother affinity matrix W. The affinity matrix representing a forest of size T is calculated by averaging over all affinity matrices from each single tree: The comparison results of diffusion maps and manifold forests are shown in Fig. 3. Figure 3a illustrates synthetically generated data of a 3D parabola surface given by the equation f (x, y) = x 2 +y 2 2 . Its corresponding embeddings in 2D reduced space using diffusion maps and random forest manifold with the Gaussian affinity model are shown in   Fig. 3b, c, respectively. It can be seen that the embedding obtained based on the manifold forest achieves somewhat better representation of the data in the lower-dimensional space. This is specifically well illustrated when comparing the distribution of the embedded points representing the base and the rim of the parabola, as these points seem to be more equally distributed when random forest manifold embedding is used.
Moreover, in diffusion maps, the method of sparsifying an affinity matrix is based on retaining the k-nearest neighbourhoods among the data in the graph. However, this may cause two problems: Choosing appropriate number of nearest points is not easy since it depends on the data structure, and it can create connected edges in the graph for the points which may be outliers. Forming affinity matrix using random forest technique would efficiently solve such problems as the points are only connected if they are in the same cluster. Figure 4 illustrates the embedding of shapes using diffusion maps and manifold forests from cardboard data [39] together with corresponding representative shapes extracted from 1000 training samples. The embedding results obtained by applying manifold forests seem to be more evenly distributed than points embedded using diffusion maps, especially for the shapes located along the rim of the manifold.
One of the main advantages of using random forest manifold is that it implicitly addresses one of the main difficulties in the manifold learning, namely it optimally defines the data neighbourhood structure. The optimality criterion is defined through the splitting decision used in the nodes of the trees, in this paper optimality is defined through maximisation of the information gain. Additionally, random forest implicitly selects optimal features for the data clustering, where the optimum criterion is defined by the node decision rule. In the case of the random forest implemented in this paper the maximum information gain decision rule is used which favours features for which data splitting at a node leads to compact class distributions. This may explain a better performance of the random forest manifold when compared to the original diffusion maps with the neighbourhoods defined by the k-nearest neighbours. This can be seen in Fig. 4 (cardboard  (a) (b) Fig. 4 The reduced space of cardboard dataset. a The embedding of shapes using diffusion maps only. b The embedding of shape using manifold forests dataset) where the mapping to the lower-dimensional space, using the random forest manifold, produces visibly more uniformly distributed points. The uniformity of the data points distribution can be also measured quantitatively by estimating the entropy characterising the data distribution in the parametric space. To start, for the parametric space, a histogram representing areas of all the Delaunay triangles for the training dataset is calculated. For the well-distributed data, this histogram is expected to be compact as all triangles would have similar area. The compactness of the histogram is measured here using entropy, with entropy equal to 6.85 and 5.08 for the data embedding in 2D space using diffusion maps and manifold forests, respectively. Both, the subjective and objective measures show that using the random forest manifold, produces more uniformly distributed points and therefore could supports more accurate data interpolation.

Random forests in 3D reconstruction
Once the manifold has been built from the training dataset, the shape reconstruction can be obtained from the learned shape manifold and the observed 2D measurements. In this section, an overview of the proposed manifold-based reconstruction algorithm is given first, followed by a description of out-of-sample and inverse mapping problems.
As known from [41], enforcing only the rotation constraints cannot guarantee the unique solution for the camera motion and the basis shapes. To solve this, the designed shape prior can help to attract a shape towards the manifold and therefore avoid incorrect reconstructions.
A summary of the algorithm for the shape recovery of a non-rigid object and estimation of camera motion is given in Algorithm 1. Initial shapes S and camera motion R are estimated by running a few iterations of the optimisation process using linear basis shapes model [35]. The method is not significantly sensitive to the initial solution as it can iteratively update the shapes by projecting them on the learned manifold until convergence. For each initial shape, Nyström extension [3] is used for embedding these new samples into the reduced space. Intuitively, if the points in the reduced space are relatively close, the corresponding shapes in the highdimensional space should represent similar shapes. Based on this observation, the reconstructed shape at each frame can be represented as weighted sum of n + 1 basis shapes from the learned manifold. The coefficients of corresponding basis shapes are calculated as barycentric coordinates of n +1 neighbouring points from Delaunay triangulation of the training dataset. Once the basis shapes and their coefficients have been obtained, an optimisation is applied to minimise the image reprojection error with an additional smoothing term and basic rotation constraint over all frames (see Eq. 15). However, the quality of the reconstruction depends on the accuracy of initial shapes. Updating basis shapes in each iteration can help to circumvent the problem. The basis shapes are being kept updated as long as 2D measurement error r t exceeds the predefined threshold r T (10 −3 in our case) or the error between two adjacent frames is relatively large which implies that the current results are unlikely to explain the shapes well.

Mapping out-of-sample points
The manifold forests method briefly described in Sect. 4 is used to find a meaningful representation of the data, but the mapping is only able to provide an embedding for the data present in the given training set. In our algorithm, it is necessary to calculate embedding for shapes which are not presented in the training set. Suppose a new shape S t ∈ R N becomes available after the manifold had been learned, instead of re-learning the manifold, as it is too com-Algorithm 1 Outline of manifold forest-based reconstruction Input: Stream of 2D observations, manifold forest Ψ of training dataset X (Sect. 4.3) Output: 3D deformable shapes S and camera motion R for each frame. 1: Initialisation of estimating initial shapes S and camera motion R . 2: while ( r > r T ) or r t − r t−1 > 10 −3 do 3: Shape projection onto manifold (shape Embedding) (Sect. 5.1) 4: Find n + 1 closest points b l , l = 1 . . . n + 1 in low-dimensional space, where n is the dimensionality of the reduced space. 5: Shape update (Sect. 5.2) 6: Nonlinear optimisation by minimising 2D measurement error and shape smooth term to obtain updated shapes S t and camera motion R t ,t = 1 . . . F (Sect. 5.3) 7: end while putationally expensive, an efficient way is to interpolate the shape based on the already learned manifold. For each new shape, such embedding is calculated based on the Nyström extension. Knowing that for every sample in the training dataset: Having a shape S t , not present in the training set X , an embedding of this new shape is calculated from: where G(S t , X j ) is calculated in the same way as the diffusion operator (see Sect. 4.1). The distance between unseen sample and the training samples is calculated using the binary affinity model, (11). In principle these calculations still could be expensive as the summation in (14) is done over all training samples. Random forest approach provides very effective way for implementing the out-of-sample mapping. For this the unseen sample is put through the forest, and subsequently only the training samples from leafs where the unseen sample ended up are used in the sum in (14).

Inverse mapping for shape update
Given a point s t ∈ R n in the reduced space, finding its inverse mapping S t = −1 (s t ) from the reduced space back to the input space is a typical pre-image problem. As claimed in [3], the exact pre-image might not exist if the shape S t is not included in the training set. However, according to the properties of diffusion maps, if the points in the reduced space are relatively close, the corresponding shapes in high-dimensional space should represent similar shapes since they have small diffusion distances. Based on this, the point s t can be approximated as a linear combination of its weighted neighbouring points in reduced space, such that s t = n+1 l=1 θ tl x tl , where x tl is the l th neighbouring points of s t obtained by computing Delaunay triangulation [5] on the training dataset, and the weights θ tl are computed as the barycentric coordinates of s t . Once the weights are estimated, the shape S t can be calculated as well based on a set of weighted training samples S t = n+1 l=1 θ tl X tl , where the training samples X tl are the pre-images of x tl , and are equivalent to the basis shapes in (2).

Nonlinear refinement
The cost function E() to be minimised consists of the reprojection error, shape smoothing term and rotation constraint, where ε rot = R t · R t T − I 2 penalises deviation from orthonomality of all R t . γ S and γ R are regularisation constants, and S t is expressed as linear combination of weighted neighbouring training shapes X tl (see Sect. 5.2). A nonlinear optimisation using Levenberg-Marquardt algorithm is applied to minimise the cost function with analytically calculated Jacobian. However, the underlying problem is that the quality of the optimisation result strongly depends on the accuracy of initial shapes. To avoid this, we update the basis shapes in each iteration until 2D measurement error is less than the defined threshold (10 −3 in our case) and the error between two adjacent frames is relatively small. This effectively means that for any given t the basis shapes {X tl } can change during the iteration minimising E().

Nonlinear refinement with reduced training set
Building a dense manifold requires large number of training samples. In practice it may be difficult to obtain such dense training set. To address this problem, this section briefly describes a variant of the proposed method which can handle situation when only limited training data is provided. The idea is to modify the cost function with additional term.
In this case the basis shapes will be estimated, rather than matched to the local training samples. The similar method was introduced in [32], but manifold was learned without using the random forest.
The main idea is to partition a set of estimated shapes into K clusters, in which the shapes have similar structure, with each shape cluster denoted by T i , i ∈ 1 . . . K . The clusters are obtained by performing the Delaunay triangulation in the reduced space. The points in the reduced space belong to the same Delaunay triangle (i.e. cluster), can be modelled in the same linear manifold embedded in R N , and therefore all corresponding reconstructed shapes (represented by that cluster) can be approximated by a linear combination of the same set of unknown but fixed basis shapes. Thus all the shapes in the cluster i can be represented as is spanning the tangent linear subspace representing all the shapes from the cluster i.
The parameters θ tl , B i l and R t are optimised simultaneously by minimising the following modified cost function, where the additional constraint applied to the i th set of basis shapes is, Figure 5 shows the embedding using reduced number of training samples. 40 shapes are randomly selected from the cardboard dataset. In the figure, the result of the Delaunay triangulations is visualised by blue line segments. One frame from the testing sequence is chosen to demonstrate how the shape is updated in each iteration. Yellow lines illustrate the trajectory of the embedding of the shape moving through different triangles. Red and black dots represent the embedding of reconstructed and the ground truth shape respectively.

Reconstruction with missing data and outliers
The fact that the measurements can be affected by outliers and may not be complete, means that the reconstruction algorithm must be robust to measurement data deficiencies in real application. The algorithm described so far assumes the measurements Y are complete, all the feature points are identified in all the images in the sequence. In real sequences, some of the points cannot be detected in all the images due to the occlusions, feature detection problems, or tracking failures and therefore acquiring complete set of measurements is unlikely, and the measurement may be affected by outliers due to errors in the correspondence search.
Two methods are introduced in this section, for solving missing data and outliers problems, respectively. A nonlinear approach for missing data is presented first. This method efficiently solves the problem by simultaneously optimising the missing entries, shape and motion. In the second part, a method based on applying robust M-estimator reduces the effect of outliers by replacing the L 2 norm in cost function (Eq. 15) by Cauchy function.

Nonlinear approach for missing data
If the point tracks are not visible in all images and the object has relatively small deformations, it has been proposed in [35] that instead of considering more complex and timeconsuming optimisation algorithms, using a linear method based on principal component analysis (PCA) can recover the missing entries before shape and motion estimation. The major benefit of this imputation algorithm is its simple implementation and fast computation. The PCA as a linear method, is only able to cope well with simple deformations. Although the method is not suitable when the deformations are relatively large or complex, it still can be used for providing a starting point for the optimisation when the following nonlinear approach is applied.
The diffusion maps based method can be easily extended to handle the case with missing data. To facilitate this, Eq. 15 is modified with the cost function rewritten as E({R t }, {θ tl }, {Y t * }), depending explicitly on the missing Cauchy L2 Fig. 6 Graphic representations of L 2 estimator and Cauchy function observations Y t * . As the results, the cost function is simultaneously minimised with respect to rotation, shape coefficients and the missing observations. It should be pointed out that only the missing observation Y t * are optimised not all 2D measurements Y t .

Robust estimator for data with outliers
It is well known that least-squares methods are sensitive to outliers as even a single outlier in the observations can strongly influence the values of the estimated parameters.
Assuming that in the proposed model outliers affect only observations, the analysis can be restricted to the 2D reprojection error (the first term in Eq. 15). Defining the mismatch between the real observation (measurement) and the projection of the shape estimate at time t as a residual E t : θ tl X tl (18) the camera motion and 3D shape are calculated by minimising the sum of squared residuals, F , where i is the image coordinate index and j is a feature point index with P reconstructed points. In practice, the outliers will be presented in the observations and therefore the use of L 2 norm may lead to significant reconstruction errors. In this paper it is proposed to address this problem by using a function which penalises the large residual errors less than the L 2 norm and in this way "desensitise" the re-projection error with respect to outliers. The adopted approach uses the socalled M-estimators, in this case the new re-projection error is defined as: The results presented in the next section were obtained for c = 1. Figure 6 illustrates the L 2 estimator and Cauchy function. The L 2 estimator is non-robust as it strongly depends on large errors, while Cauchy function considerably reduce their influence which make it less sensitive to outliers.

Results and discussion
A number of experiments were carried out to evaluate the proposed method. Several state-of-the-art algorithms were evaluated and compared in these experiments:

RF:
The proposed random forest method; DM: The diffusion map-based method. The DM method is similar to the RF except the manifold learning was implemented without random forest. [34]; PTA: The discrete cosine transform (DCT)-based point trajectory approach [2]; CSF: The column space fitting method [14]; KSFM: The kernel non-rigid structure from motion approach [15]; IPCA: The incremental principal components analysisbased method [35].
The testing data used for evaluation include: two articulated face sequences, surprise and talking, both captured using 3D scanner with 3D tracking of 83 facial landmarks and two surface models, cardboard and cloth [39]. This paper does not focus on feature detection and tracking. In the experiments described here the 3D points are known and these were projected onto the image sequences under the orthographic camera model and subsequently used as features. Diffusion maps require training process, so training dataset for face sequences were taken from the BU-3DFE [43] and for surface sequences the data were obtained from [39]. All the training data have been rigidly co-registered, and the same testing data have been used with the methods which do not require training.
During the experiments ten trials are run for each level of noise, missing data and outliers in different sequences. The reconstructed shapes are aligned using a single global rotation based on Procrustes alignment [2]. For evaluating the results, the same procedure as in [15] is used. The normalised means of the 3D error are compared over all frames and all points: where Δ t x ,Δ ty ,Δ tz are the standard deviations of x,y and z coordinates of ground truth shape at t th frame and e t p is the Euclidean distance between point p at frame t in the reconstructed and ground truth data. Table 1 shows the 3D reconstruction error for PTA, CSF, KSFM, IPCA, DM and RF. The manifold-based method DM  The optimal number of bases n, for which the 3D errors are shown in the table, is given in brackets for each tested method The best results for each sequence are in bold and RF on average provide better results than other trajectorybased methods. The relative normalised means (Eq. 19) of the 3D error [14] are compared over all frames and all points. For RF method the initialisation error and the error produced by the proposed algorithm with and without nonlinear refinement are given. The errors shown in the table correspond to the selected optimal dimensionality parameter n, in case of RF method this corresponds to the dimensionality of the estimated manifold. This selection is achieved by running the trials with n varying from 2 to 15. The best selected n for each tested method is shown in brackets. IPCA uses different number of basis shapes for constructing offline and online shapes, thus n is not provided in the table. As shown in the table, the previously proposed methods are able to provide comparable results, to the DM and RF nonlinear manifold-based methods, for objects with small deformations, e.g. faces. This is because these objects exhibit mostly a rigid motion, the deformations are only seen around the lips and chin. But those methods provide relatively large error on highly deformable shape sequences (e.g. Cloth). As expected, the proposed RF method delivers the most accurate reconstruction in all tested cases. This mainly, can be explained as being due to the fact that the estimated affinities generate relatively uniform distribution of the training shapes in the reduced space. This subsequently effects the interpolation of the new shapes in the manifold, leading to more accurate reconstructions. Note that even though the initial error is big, after optimisation process, the results demonstrate good convergence since the 3D errors are relatively small. An important observation is that, in the trajectory-based methods, the optimal number of bases n has to be independently estimated for each sequence. Choosing too big n may lead to an ill-conditioned problem, but the point trajectory cannot be comprehensively represented if n is too small, while the results from the proposed method are more predictable. However, it should be noticed that the comparison for the Walking, IndianDance, Capoeira, Stretch and Dance sequences between the proposed method and the other methods may be seen as unfair, as better reconstruction accuracy of the proposed method comes at the cost of required availability of a representative training dataset.

The influence of embedding dimensionality
The accuracy of reconstruction is affected by the dimensionality of the reduced space n, corresponding to number of shape basis. The test described in this section looked at the relation between manifold dimensionality and the shape reconstruction error. The four sequences are tested individually with various dimensionalities (n = 3, 5, 7, 10, 15) of the reduced space. The forests have been trained with 600 trees. The results in Fig. 7a show that the shape reconstruction error is reduced with increasing dimension n of the reduced space. As expected, a higher number of bases is required to describe a complex shape deformation, e.g. cloth sequences. Figure 7b shows the results obtained on the cloth sequence, comparing performance of the proposed method against previously proposed methods. The error calculated for PTA, CSF and KSFM varies with the number of bases, and overfitting occurs when n > 10 which indicates that the problem becomes ill-conditioned. DM and RF methods are "more stable" as the solution is strongly constrained by the requirement that it belongs to the manifold.

Sensitivity to noise and missing data
In order to assess the performance of the reconstruction algorithms when the observed data is corrupted by noise, the next experiment compared the RF method against previously proposed methods in terms of shape reconstruction error expressed as a function of level of noise in the observed data. We follow the process in [41] to simulate the noisy data, for which the measurements were perturbed by Gaussian  Fig. 8a. It can be noticed that although the 3D reconstructed error of all five algorithms increases with the higher level of noise, the nonlinear method RF are obviously superior and achieve much smaller standard deviations, whereas others are quite sensitive with large mean error and error dispersion. Missing data problem happens very often in real cases due to feature points track loss or occlusion. To simulate the measurement with missing data, 10,20,30,40 and 50 % of the 2D entries in Y were randomly discarded. The results shown in Fig. 8b are calculated with the missing data ratio of up to 50 %, the average (maximum) 3D error using RF method is 0.1798 (0.2018) which is still acceptable. As the missing data problem is not addressed in [2], PTA is not used for comparison in this experiment.
Observing that in practice, measurements are likely to be affected by missing data and noise at the same time, the following experiment aims to evaluate the performance with measurement noise and different percentage of missing data. Figure 9a shows the results obtained by the nonlinear approach presented in Sect. 6.1. For comparison Fig. 9b shows results of the same experiment using the adaptive linear approach [35].

Sensitivity to outliers in the measurements
For real reconstruction cases, outliers present in the measurements is an inevitable problem that many previous algorithms did not address. It is well known that the outliers can severely affect the results. A small number of outliers may lead to completely meaningless reconstructed shapes. The robust method for mitigating the affect has been presented in Sect. 6.2. The experiment described in this section is designed to test the robustness of the proposed method using both original (Eq. 15) and the improved cost function applying robust influence function (M-estimator) for the reprojection error. To simulate the data corrupted by the outliers, a number of  Fig. 10 Comparison results using original and improved RF methods on Cardboard sequence with outliers the feature points are randomly selected in a test sequence as percentage of the total number of points, subsequently these points are replaced with randomly selected points in the same image frame. The effects of outliers on the reconstruction accuracy were tested at 4, 8, 12, 16 and 20 %. Figure 10 shows the 3D reconstructed errors as a function of the percentage of outliers given by the improved method using robust estimator and the original RF method. The robust method is able to handle the outliers and provide relatively accurate reconstructed shapes. Note that the outliers may also corrupt the training data, but the training is usually done offline, and therefore it is easier to filter the data and remove any outliers in this case.

Effect of forest model parameters
The analysis in this section is focused on how different choices of forests design parameters effect reconstructed results. The model has been evaluated in terms of two parameters: Tree depth and the forest size. For each experiment, the same testing sequences Cardboard and Cloth are used to represent small and large deformation shapes respectively.
The effect of varying tree depth has been investigated first. Here, all the forests have been trained with the fixed number of trees T = 500, and the maximum tree depths are varying from 2 to 7. As the forest size is sufficiently large, the variability due to randomness of parameter selection for each tree is averaged out, therefore we only show the results from one trial for each tree depth in Table 2 as the repeated experiments produce very similar results. The results show that in general increasing the tree depth decreases the error. Larger trees can better separate the shapes, thus the shapes ending up at the same leaf node are more similar. However for the data with small deformation (e.g.cardboard), the 3D error levels off and does not strongly depend on the tree depth. This is because that data exhibit relatively small deformations, and is governed by a small number of degrees of freedom, thereby even a small tree can well separate the shapes. It should be noticed that although larger trees can improve the results, it may lead to costly computation.
The proposed method has been also tested with respect to varying number of trees (T =10, 50, 100, 500, 1000) with fixed maximum tree depth of 5 in each experiment. The results for two sequences are shown in Table 3. The average 3D error and the standard deviation were estimated based on ten trials. As observed from the table, for both sequences applying more trees in the training process produces more accurate reconstructed results. Increasing number of trees in the forest can help the affinity Matrix W to better approximate to the true pairwise graph affinity.

Qualitative evaluation
The objective of this section is to provide a qualitative evidence for the assessment of the shape reconstruction quality. Following on the experiment summarised in Table 2, for com-  Fig. 11 shows three randomly selected reconstructed shapes from the Cloth sequence using PTA, CSF, KSFM, IPCA, DM and RF methods. The proposed algorithms have been also tested on a real video sequence showing paper being bended. A frames' sample from this video is shown in the top row of Fig. 12. In the video, 81 features were tracked along 61 frames showing approximately two periods of bending movement. The training data in this experiment is the cardboard dataset obtained from [39], which is the same as used in the previous evaluations when using cardboard sequence. The results show a comparison of our reconstructed shapes with the results obtained from MP, PTA, KSFM methods. Figure 13 shows 3D reconstruction results obtained for a facial expression sequence. The first row indicates the input images with

Results of using reduced training set
This section presents experiments of 3D reconstruction when only small number of training samples are being used. The method is labelled as RF2 to distinguish it from RF. Two sequences are used for testing, they are: cloth and an articulated human motion sequence IndianDance from CMU motion capture database. 1 Since no separate training data are provided for human motion sequence, every 10th frame is selected to build training dataset (e.g. frame 1,11,21…are 1 The data were obtained from http://mocap.cs.cmu.edu. selected), whereas frames 5,15,25…are selected as a testing set. It should be stressed that the results provided for the IndianDance sequence are indicative only, as an objective tests should use independent training and test data, which are currently not available. In this experiment, the number of training shapes for RF2 and RF for the cloth sequence are 100 and 1000 respectively, and 100 for IndianDance sequence when RF2 is used. For the cloth sequence the reconstruction error for RF2 with the optimal selection of the number of basis shapes (9 in this case) is 4.71 %. This should be compared with the results reported in Table 1. It can be concluded that RF and RF2 are somewhat comparable in terms of the reconstruction accuracy for Cloth sequence. As expected, RF outperform the RF2, but RF2 uses much smaller training set than RF.  The performance of the RF2 method strongly depends on the selection of the shapes included in the reduced training set. It is beneficial when selected shapes generate well-shaped triangles in the Delaunay triangulation. In the performed test on the IndianDance sequence the selection of the training samples was not optimised with respect to results of the Delaunay triangulation. As human movement contains large number of degrees of freedom, the reconstruction results are affected if corresponding shapes are being clustered in badly shapes triangles (e.g. 'skinny triangles') in the reduced space. Even so the reconstructed error for Indian-Dance is 12.95 % obtained with seven basis shapes, which is still acceptable.
The visualised results of reconstructed shape extracted from the IndianDance sequence using KSFM (58.86 % overall 3D reconstruction error) and RF2 methods are illustrated in Fig. 14.

Limitation
While the proposed method is able to reconstruct complex deformable shapes with some success, some limitations still (1) (2) Fig. 15 The reconstruction success (1) and failures (2) examples obtained for the Capoeria sequence. The ground truth and the reconstructed shapes are shown as dots and circles, respectively remain. For example, the method could fail when the shape to be reconstructed is not adequately represented in the available training dataset. Figure 15 shows examples of the successful and failed reconstructions. In this case, the Capoeria sequence is used for testing, whereas the above mentioned IndianDance sequence is used for training. The shapes are accurately reconstructed when they are sufficiently well represented by the training samples, but the reconstruction can fail when the true shape is significantly different from the shapes in the training set (see Fig. 15).

Conclusions
A new approach for recovery non-rigid shape based on manifold forests is described in the paper. The nonlinear manifold has been build upon diffusion maps with random forests used to estimate local manifold neighbourhood topology. The method achieves good performance especially for large and complex deformable objects, when compared with the existing approaches. In many practical applications there are only limited number of shape examples to be used for training. To address this problem, a modification of the described RF method has been also proposed which is able to accurately reconstruct shapes even though only a small number of training shapes could be available. Additionally, the existence of outliers in the observations could be a limitation imposed by practical applications on some of the previously proposed reconstruction methods, as often outliers are not modelled explicitly. To address this problem, a further extension of the prosed RF method has been described in this paper, with a robust cost function used to measure the re-projection error. The evaluation results on simulated and real data presented in the paper demonstrate the validity of the proposed methods.
It should be mentioned that the comparison of the proposed method with respect to the other methods may be seen as unfair, as better reconstruction accuracy of the proposed method comes at the cost of required availability of a representative training dataset. As manifold learning has shown to be a very powerful approach for analysis of the shapes, we believe the manifold-based method is a suitable groundwork for reconstruction of deformable shapes.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.