Learning-Based Non-rigid Video Depth Estimation Using Invariants to Generalized Bas-Relief Transformations

We present a method to locally reconstruct dense video depth maps of a non-rigidly deformable object directly from a video sequence acquired by a static orthographic camera. The estimation of depth is performed locally on spatiotemporal patches of the video, and then, the full depth video is recovered by combining them together. Since the geometric complexity of a local spatiotemporal patch of a deforming non-rigid object is often simple enough to be faithfully represented with a parametric model, we artificially generate a database of small deforming rectangular meshes rendered with different material properties and light conditions, along with their corresponding depth videos, and use such data to train a convolutional neural network. Since the database images are rendered with an orthographic camera model, linear deformations along the optical axis cannot be recovered from the training images. These are known in the literature as generalized bas-relief (GBR) transformations. We address this ambiguity problem by employing the invariant-theoretic normalization procedure in order to obtain complete invariants with respect to this group of transformations, and use them in the loss function of a neural network. We tested our method on both synthetic and Kinect data and experimentally observed that the reconstruction error is significantly lower than the one obtained using conventional non-rigid structure from motion approaches and state-of-the-art video depth estimation techniques.


Introduction
The human visual system has a remarkable ability to discern the three-dimensional shape of an observed scene. Although its internal mechanisms are still not entirely clear, it is known that our visual system relies simultaneously on several cues to perceive shape [15,40], such as the local changes in shading and texture of an object, or the temporal change in appearance of an object seen from different angles. Researchers in the last few decades have attempted, with varying degrees of success, to formulate mathematical models that approximately describe such mechanisms in order to emulate them on machines. Some of the most prominent outcomes of these efforts are: algorithms for structure from motion (SfM) [45], monocular depth estimation [39], photometric stereo (PS) [48], shape from shading [21], and shape from texture [5]. Each of these methods has a vast literature of its own. Interesting surveys can be found in [1,30,35,37].

Matteo Pedone (matteo.pedone@oulu.fi), Abdelrahman Mostafa (abdelrahman.mostafa@oulu.fi), Janne Heikkilä (janne.heikkila@oulu.fi) — Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
The aforementioned methods have been typically approached under the strong assumption that the observed scene is static. In recent years, we have witnessed an increasing interest in extending such techniques to the case of dynamic scenes made of rigidly and non-rigidly moving objects. However, there are currently still many challenges to be faced due to the ill-posedness of the inverse problems related to the estimation of three-dimensional shape of a dynamic scene from two-dimensional images. Research efforts in this direction have led to the introduction of methods for dynamic photometric stereo [18,23,47], non-rigid structure from motion (NRSfM) [43], and more recently video depth estimation [31].
Although NRSfM methods differ significantly in terms of the inputs they rely on and the outputs they produce, compared to dynamic PS or to video depth estimation, we can nonetheless argue that NRSfM algorithms are often preceded by tracking algorithms that in fact rely on information extracted from the pixel values of a video sequence. Given the recent success of deep-learning algorithms in many different settings, it is natural to investigate the feasibility of deep-learning-based approaches that operate directly on a video sequence acquired from a static camera as input, and that automatically yield a dense representation of the scene depth for each frame as output. This potentially has the advantage of overcoming the frequently imposed assumptions on the illumination conditions of the scene and on the reflectance properties of the observed object (as in PS), and of avoiding to explicitly solve ill-posed geometric problems, as in NRSfM.

Fig. 1 […] The corresponding depth maps estimated directly from the video with our method, after training a CNN with a synthetically generated database. The depth maps are shown both as planar images displayed with a hot colormap, and as surfaces in 3D space, texture mapped with the original video frames and illuminated with an artificial light source for the sake of improved visibility
The main challenge is clearly represented by the difficulty of acquiring large scale training data in which each sample would consist of a video sequence of an object (or a portion of it), along with its corresponding 3D representation in space. Although nowadays it is theoretically possible to acquire RGB video sequences along with their corresponding depth maps using inexpensive equipment, producing a sufficiently large database involving non-rigid motion of many different subjects, materials, textures, and light conditions would be a daunting and time consuming process.
In this paper, we try to address this problem by showing that it is possible to train a convolutional neural network to perform the direct estimation of a depth video from the pixel intensities of a video sequence by generating a database consisting of short, rendered, synthetic video sequences, each depicting the movement of a small patch of a non-rigidly deforming object, along with the corresponding depth video (Fig. 1). Potential applications can be found in industry, computer graphics, and safety systems, where the method could be used, respectively, for folds/wrinkles detection and analysis in automated garment manufacturing processes, acquisition of dynamic wrinkle maps for realistic rendering of animated human/animal skin, and sea state monitoring with cameras mounted on buoys, saildrones, boats, or naval ships.
The contributions of this paper are the following:

- We propose a network architecture to estimate the depth map for each frame of a video sequence of a deforming object acquired by a static orthographic camera, as well as a computationally fast mathematical model to synthetically generate the training data.
- Since the training samples are often ambiguous with respect to certain linear transformations along the optical axis, known as generalized bas-relief (GBR) transformations, we derive two novel sets of complete invariants for this particular subgroup of affine transformations: point-cloud GBR-invariants and differential GBR-invariants. We use GBR-invariants to formulate a loss function in the training stage.
- We utilize the theory of the moving frame to obtain GBR-invariants in a relatively simple and constructive manner, thereby following the opposite of the strategy often found in the computer vision literature, in which unmotivated formulas are first introduced and then proven to satisfy an invariance property.
- We obtain better performance than common state-of-the-art non-rigid structure from motion methods when reconstructing depth videos of deforming objects that are representable as surfaces without holes.
The rest of this paper is organized as follows. In Sect. 2, we give a brief overview of related work in the literature. Section 3 describes the proposed method from a general point of view and introduces the main assumptions that are invoked in the next sections. Section 4 describes the ambiguity scenario that emerges in our application, where each training sample may correspond to an infinitely large class of ground truths. In Sects. 5 and 6, we provide the mathematical background to describe such a problem, and in Sects. 7 and 8, we propose a strategy to address it. In Sect. 9, we describe in detail the generation of our database of spatiotemporal patches with their corresponding ground-truth depth videos. The architecture of the convolutional neural network (CNN) utilized for 3D shape estimation is described in Sect. 10. Section 11 contains implementation details related to the estimation of the depth video. Sections 12 and 13, respectively, show the experimental results and give concluding remarks.

Related Work
We now give an overview of some of the works in the literature that can be considered related to ours. In SfM, it is assumed that the image-plane positions of a set of interest points are observed from several points of view; under the assumption of a static scene, the geometric relationships between corresponding points can be utilized to recover the original 3D structure of the scene, as well as the camera pose. The problem of estimating non-rigid structure from motion (NRSfM) considers the case where the scene itself can move non-rigidly; hence, the geometric relationships that are utilized in ordinary SfM, where the scene is static, are more difficult to exploit. Researchers have introduced priors on the camera model [7], on the type of trajectories of points in space-time [3], or on the shape of the object [14,26,28,44,49] in order to make the problem mathematically tractable. Algorithms for NRSfM mainly differ from each other in terms of the priors they impose on the shape and motion of the objects. However, most current methods rely on the assumption that the scene is observed through an orthographic camera (see Chapter 5 of [38] for some exceptions using the perspective camera model), and they are essentially generalizations of the popular factorization method, which was originally employed for traditional orthographic SfM [43].
Agudo et al. [2] show that it is feasible to solve the NRSfM problem from video frames of an orthographic monocular camera if one assumes that the non-rigid motion can be accurately described by the equations governing the motion of elastic bodies. After determining the 3D locations of the tracked points, the authors propose a scheme to construct a triangular mesh approximating the global shape of the surface. An extensive comparison between NRSfM methods can be found in [22].
In photometric stereo, one or more images of the same subject need to be acquired under different illuminations in order to estimate its 3D shape, and some authors have addressed the problem of moving objects by using specialized hardware [18,23,47]. In addition to that, the physical surface of the observed object is assumed to have certain reflectance properties (usually Lambertian).
In monocular depth estimation, one solely relies on the information carried by a single image in order to extract the depth at each pixel, which is generally insufficient to discern far pixels from near pixels; therefore, such methods typically rely on statistics about the appearance of the image at several scales and on training data [39].
Attempts at generalizing these techniques to the case of spatiotemporal data resulted in some promising works. Kumar et al. [27] recently proposed a method based on superpixels and motion constraints in order to produce animated depth maps for a video sequence of moving subjects. In contrast with other recent approaches in the same category, their method does not rely on machine learning; moreover, the estimated depth maps are always piecewise planar within the superpixel boundaries, affecting the accuracy of the reconstruction.
In [46], a machine learning system is presented that uses pairs of consecutive frames of a video sequence to produce an accurate perspective depth map and camera position. The method works on static scenes and therefore is not able to handle the more general problem of NRSfM.
In [25], the authors show a deep-learning-based approach to perform NRSfM from 2D hand-annotated landmarks and obtain good performance on articulated motion and when recovering the different structures of rigid and static objects of the same type photographed from different perspectives.
In [31], the authors propose a deep-learning-based technique that is capable of obtaining good-quality video depths from challenging scenes taken with a handheld camera. Their method is able to handle a moderate amount of non-rigid motion in the scene.
Despite the promising results of existing methods, to the best of our knowledge, there is still a lack of available methods in the literature explicitly targeted at recovering the three-dimensional geometry of a non-rigidly deforming object directly from a video sequence.

Overview of the Proposed Method
We consider the problem of reconstructing the spatiotemporal depth map (depth video) from the video sequence of a non-rigidly moving object seen from a static orthographic camera. Our approach relies on the following assumptions:

1. The scene/object is observed by a static orthographic camera.
2. The deformation of the object observable within fixed spatiotemporal windows of the video sequence is non-negligible.
3. Locally, the 4D structure of the object (i.e., its 3D shape evolving in time) can be approximated with a parametric model controlled by a relatively small number of parameters.
The orthographic camera assumption is justified in those scenarios where the field of view is narrow and the observed object is sufficiently far from the camera [29]. Note that these assumptions, despite being restrictive, are analogous to the assumptions made by current state-of-the-art NRSfM methods [28]. The proposed method essentially consists in estimating the depth video locally, within local "patches" in the space-time domain, and then in reconstructing the entire depth map of the scene by "stitching" together the local depth videos. The estimation is done by training a neural network to learn the animated depth map within small spatiotemporal patches of a video that have size 64 × 64 × 16 pixels in our implementation. Given the practical difficulty of acquiring a large amount of real data consisting of small video clips of deforming subjects, along with their corresponding depth videos, we generate a database of synthetic data. The data consists of deforming planar meshes rendered with variable textures, reflectance properties, and light conditions, and we obtain their respective depth videos by setting the gray intensity of each pixel to the corresponding value of the z−coordinate of the surface in camera coordinates. The database is utilized to train a CNN to directly estimate 64 × 64 × 16 depth videos from RGB video sequences of the same size. In the final step, the depth videos of each patch are combined together in order to reconstruct the entire depth video. The reconstruction scheme that we utilize is inspired by the well-known method of signal reconstruction using the constant overlap-add (COLA) decomposition. Since the training data does not contain videos in which the surface presents holes, our method is mainly suitable to recover surfaces with genus 0. 
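The overlap-add reconstruction described above can be sketched as follows. This is a minimal 2D spatial version with hypothetical names (`predict` stands in for the trained CNN), whereas the actual implementation overlaps 64 × 64 × 16 spatiotemporal patches:

```python
import numpy as np

def stitch_depth_patches(video, predict, patch=16, hop=8):
    """Reconstruct a full depth map from overlapping patch predictions
    using a constant-overlap-add (COLA) style weighted average.
    `predict` maps a (patch x patch) window to a depth patch of the
    same size; windows overlap by (patch - hop) pixels."""
    H, W = video.shape
    # Separable Hann window; the small offset keeps border weights nonzero.
    win = np.hanning(patch)[:, None] * np.hanning(patch)[None, :] + 1e-8
    depth = np.zeros((H, W))
    weight = np.zeros((H, W))
    for y in range(0, H - patch + 1, hop):
        for x in range(0, W - patch + 1, hop):
            d = predict(video[y:y+patch, x:x+patch])
            depth[y:y+patch, x:x+patch] += win * d
            weight[y:y+patch, x:x+patch] += win
    # Normalizing by the accumulated window weights blends the
    # overlapping predictions into a seamless depth map.
    return depth / np.maximum(weight, 1e-8)
```

The windowed average smoothly blends inconsistent predictions in the overlap regions instead of producing visible patch seams.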
Note also that the assumption of an orthographic camera removes the need to train the network with different camera parameters; however, it also introduces an ambiguity problem, where the same video sequence may correspond to different depth maps. This issue is addressed in detail in Sect. 4.

Ambiguity-invariant Representation of Depth Maps
Adopting an orthographic camera model has the advantage of avoiding the need to train the network with different camera parameters; however, scenes rendered with orthographic projection are prone to ambiguities. In the particular case of Lambertian surfaces illuminated by directional light sources, the resulting ambiguity is the generalized bas-relief (GBR) ambiguity [6], which has been studied extensively in the context of shape from shading, and which is essentially a composition of shears and stretches along the optical axis z. Although the rendered surfaces in our database are not always perfectly Lambertian, the reflectance model adopted in the rendering stage is not sufficient to resolve the ambiguity entirely [11], even in the presence of texture and specular highlights. Therefore, any rendered scene in our database is virtually indistinguishable from all the other hypothetical rendered versions undergoing arbitrary GBR transformations (Fig. 2). This issue would surely impair the training stage; hence, it is crucial to find a representation of a smooth depth-map surface that is invariant to the group of GBR transformations, which form exactly the subgroup of affine transformations that lead to ambiguities in orthographic camera images.
Once such an invariant representation is obtained, it is possible to utilize it in the loss function of the training stage of the network. This has the big advantage of letting the network treat all the GBR-transformed versions of one depth map as a single equivalence class: in fact, the invariants of two depth maps related by a GBR transformation will be exactly the same; hence, the loss function will always yield zero in such cases.
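As a sketch of this idea, the loss can be computed on invariant representations rather than on raw depths. Here `standardize` is only an illustrative invariant for the z-scaling/z-translation subgroup (the complete GBR-invariants are derived in the following sections), and all names are hypothetical:

```python
import numpy as np

def invariant_loss(pred_depth, gt_depth, iota):
    """Compare two depth maps through an invariant map `iota`, so that
    all transformed versions of the ground truth yield the same loss."""
    return np.mean((iota(pred_depth) - iota(gt_depth)) ** 2)

def standardize(z):
    """Toy invariant: standardizing z removes z-scaling and
    z-translation (a subgroup of the GBR group)."""
    return (z - z.mean()) / z.std()
```

With this loss, a prediction that differs from the ground truth only by a scaling and an offset along z incurs zero penalty.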
In the following sections, we provide a self-contained explanation of how to derive GBR-invariant representations for depth maps. Since there is no unique way of defining this representation, we consider two main types of invariants, calculated by, respectively, treating depth maps as point clouds, and as smooth 2D functions in 3D space. Finding complete (i.e., maximally discriminative) invariants for a given transformation is more often than not a non-trivial mathematical problem. In order to simplify this step considerably, we employ the normalization construction, a well-known technique in the mathematical literature of invariant theory [33] aimed at deriving algorithmically the invariants with respect to a given transformation. Before proceeding, we provide the unfamiliar reader with the basic concepts needed to apply the normalization construction. The material presented in Sects. 5 and 6 is a brief review of selected topics in the mathematical literature of invariant theory, needed for the scope of this paper. Readers who are familiar with the theory can skip directly to Sects. 7 and 8, which contain novel material, and in which we give the formulation of the invariants used in this paper for our applications (Theorems 1-2).

Fig. 2 Example of the generalized bas-relief ambiguity. From left to right: two versions of the same surface, of which the second one is a GBR-transformed version of the first one, and their corresponding views from an orthographic camera located at (0, 0, 1). The GBR transformation changes the orientation of the surface normals, which in turn slightly changes the albedo pattern of the surface. However, the second image can be mistakenly interpreted as its non-transformed version rendered with the same texture with slightly modulated pixel intensities

Brief Review of Group Actions
A group is a set G equipped with an associative binary operation between elements of G, and such that an identity element and inverse elements exist for each g ∈ G. The types of groups we are concerned with in this article are mainly groups where the elements are matrices that represent linear geometric transformations, and where the binary operation is given by the ordinary matrix multiplication. To be more precise, we will deal with certain types of Lie groups, that are essentially groups with a manifold structure, and therefore, their elements can be uniquely identified through a smooth parametrization. A typical basic example of a Lie group representation is the group of 2 × 2 rotation matrices parametrized by the angle θ . Groups can act on a given set X by producing transformed versions of the elements of X .

Definition 1
A group action of G on a set X is a map ∗ : G × X → X satisfying the following properties: e ∗ x = x and g1 ∗ (g2 ∗ x) = (g1 g2) ∗ x for all x ∈ X and g1, g2 ∈ G, where e is the identity element and the juxtaposition g1 g2 denotes the group operation.
A concrete example is that of the group of 2 × 2 rotation matrices acting on points x ∈ R^2, given by R ∗ x = Rx, where x is assumed to be a column vector. Such an action produces new rotated versions of x that lie on a circle centered at the origin and passing through x. In general, given an element x ∈ X, the subset Gx := {y ∈ X : y = g ∗ x for some g ∈ G} is called the orbit of x under the action of G. Given a group action on R^3, we can always define three important types of induced actions: the induced action on point clouds, the induced action on smooth functions, and the prolonged action on smooth functions.

Definition 2

Given a group action of G on R^3, the induced action on point clouds is the map ∗ : G × R^{3×N} → R^{3×N} defined by g ∗ (p1, . . . , pN) = (g ∗ p1, . . . , g ∗ pN), where p1, . . . , pN ∈ R^3 and N is the number of points.
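For instance (a numerical sketch using 2D rotations rather than GBR transformations), the induced action on a point cloud simply applies the group element to every column of the point matrix:

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix: a one-parameter Lie group element."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Point cloud stored column-wise: columns are the points p_1, ..., p_N.
P = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0]])

# Induced action on the point cloud: g * (p_1, ..., p_N) = (g*p_1, ..., g*p_N),
# realized here as a single matrix product.
g = rotation(np.pi / 2)
P_rot = g @ P
```

Each column of `P_rot` is the corresponding column of `P` rotated by 90 degrees; every point of an orbit lies on a circle centered at the origin.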

Definition 3
Given a group action of G on R^3, the induced action on smooth functions is the map ∗ : G × C^∞(R^2, R^3) → C^∞(R^2, R^3) defined pointwise by (g ∗ f)(x) = g ∗ f(x). Before introducing the next type of induced action, it is important to recall that the n-th order jet (prolongation) of a smooth function f at a point x is the vector f^[n](x) = (∂_α f(x))_{|α| ≤ n} whose entries are given by all the partial derivatives of f with multi-indices α ∈ N_0^m up to order n, evaluated at x. Note that the n-th order prolongation of a function at a certain point is therefore an element of a space denoted as J^n, whose dimension is higher than that of the codomain of f for n > 1.

Definition 4
Given a group action of G on R^3, the n-th order prolonged action on smooth functions is the map ∗ : G × J^n → J^n defined by g ∗ f^[n] = (g ∗ f)^[n], i.e., the prolonged action maps the jet of f to the jet of the transformed function g ∗ f. Whenever one considers the action of a generic element g ∈ G, it is often unnecessary to explicitly include g in the formulas; hence, it is common practice to adopt the notation x̄ = g ∗ x.

The Moving Frame and the Normalization Procedure
In this section, we give a brief review of the theory of the moving frame and the normalization procedure, which are powerful tools by which invariant quantities can be obtained in a constructive manner. When a group action on a set is considered, it is often useful to derive an invariant operator with respect to that action. This is essentially equivalent to finding an operator that maps each element of the set to a canonical representative of its orbit. Finding such an operator is usually not straightforward; therefore, we now introduce some well-known techniques that will allow us to derive complete invariants constructively. One key observation is that it is often easier to find an anti-equivariant map with respect to the group action than to directly find an invariant map. The invariant can then be constructed easily from the knowledge of the anti-equivariant map. We clarify this fact in detail in the following.
Definition 5 (moving frame) Given a group action of G on a set X, a map ρ : X → G is called a moving frame if it is anti-equivariant with respect to the action, i.e., if it satisfies the following property:

ρ(g ∗ x) = ρ(x) g^{-1}.   (4)

The term moving frame originates from the work of the French mathematician Élie Cartan [9,10], who developed the theory underlying the normalization technique in the context of differential geometry. Later, his theory and techniques were given a rigorous foundation and were extended to a more abstract setting by Fels and Olver [16,17], although the original term used by Cartan, "repère mobile" (i.e., "moving frame"), survives also in contemporary mathematical literature. Assuming one has found a moving frame for a given action, it is straightforward to find a corresponding invariant operator.

Lemma 1 (invariantization)
Given a moving frame ρ : X → G for the action of G on X, the operator ι : X → X defined as follows:

ι(x) = ρ(x) ∗ x   (5)

is invariant with respect to the group action, i.e., ι(g ∗ x) = ι(x) for all g ∈ G, and (5) is called the invariantization formula.

Proof
The proof of the above statement follows easily from the identities below:

ι(g ∗ x) = ρ(g ∗ x) ∗ (g ∗ x)   (by Eq. (5))
         = (ρ(x) g^{-1}) ∗ (g ∗ x)   (by Eq. (4))
         = ρ(x) ∗ x = ι(x)   (by Eq. (5)).
Now that we can constructively derive invariants from the knowledge of the moving frame, it is natural to ask if it is in turn possible to obtain the moving frame constructively. This can often be accomplished by using the normalization procedure, which consists in defining a system of k equations, where k is the number of parameters of the Lie group in question:

ψ_i(g ∗ x) = 0,  i = 1, . . . , k,

and solving for the group parameter g. Geometrically speaking, the normalization equations define a cross section for the action of G, which is essentially an implicit surface that intersects all the G-orbits of X. Since the equations are expressed in terms of g and x, solving for the variable g yields the group element that "projects" any x ∈ X onto its corresponding point on the cross section (Fig. 3). Since the choice of the cross section is arbitrary, there is typically no unique way of defining the normalization equations. One guiding principle is that of defining a set of equations ψ_i that makes the calculations as simple as possible. However, defining the normalization equations and finding the moving frame is typically easier than trying to find an invariant operator by trial and error and proving its completeness.

Fig. 3 The element x ∈ X is mapped by the group action to points in its orbit Gx. The anti-equivariance property of the moving frame is illustrated by the blue and orange arrows that depict, respectively, the action of the left and right terms of (4) on the element g ∗ x. It is evident that x under the action of ρ(x) is projected to the same element as g ∗ x under the action of ρ(g ∗ x). Such an element ι(x) is the G-invariant of x and lies in the cross section C
As a toy example, one can consider the action p̄ = R_θ p of the 2 × 2 rotation matrices on the points in R^2. Since the group of rotation matrices is parametrized only by θ, it is sufficient to define only one normalization equation. One sensible choice is p̄_x = 0, which yields p_x cos θ − p_y sin θ = 0, and whose solution is given by θ = arctan2(p_x, p_y); this represents the explicit form of the moving frame ρ(p). As an exercise, the reader can verify that by the invariantization formula we obtain ι(p) = R_{arctan2(p_x, p_y)} p = (0, ‖p‖)^T, which is a complete invariant to rotation, as expected. In the following sections, we utilize the normalization technique to derive complete invariants for the action of the group of GBR transformations on point clouds and on smooth 2D functions in R^3. Since it is outside the scope of this paper to give an exhaustive overview of the theory of Lie group actions and the moving frame, the interested reader can refer to [32] for an accessible introduction to the subject, and to [33,34] for a more detailed treatment of the subject.
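The toy example above can be implemented in a few lines; `moving_frame` solves the normalization equation p̄_x = 0 for θ, and `invariantize` applies the invariantization formula (5):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def moving_frame(p):
    """Solve the normalization equation p_x' = 0 for theta."""
    return rot(np.arctan2(p[0], p[1]))

def invariantize(p):
    """iota(p) = rho(p) * p, which lands on the cross section {p_x = 0}."""
    return moving_frame(p) @ p
```

Running `invariantize` on any rotated copy of a point returns the same canonical representative (0, ‖p‖), illustrating Lemma 1.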

GBR-invariants for Point Clouds
Let us consider the Lie group of GBR transformations parametrized by (α, β, λ, τ), where α, β ∈ R are the two parameters for shearing along the z-axis, λ ∈ R+ is the parameter for scaling (stretching) along the z-axis, and τ ∈ R is the parameter for z-translation. Consider a point cloud with N points (p1, . . . , pN) ∈ R^{3×N} and the action of the GBR transformations on R^{3×N}. For convenience of notation, we will denote a given point cloud in the following three equivalent ways: P = (p1, . . . , pN) = [X; Y; Z], where X, Y, Z ∈ R^{1×N} are the rows collecting, respectively, the x-, y-, and z-coordinates of the points. A GBR-transformed point cloud will then be given by the concatenation of z-translation, z-shear, and z-scaling:

P̄ = [X; Y; λ(Z + αX + βY + τ 1^T)],   (8)

where 1 denotes the N × 1 column-vector of 1's. From Equation (8), it is possible to verify that the GBR group is equivalent to the (semidirect) product of the three aforementioned subgroups of transformations. In such cases, it is possible to use a result from [24] that essentially states that a complete invariant for the action of a semidirect product is given by composing the invariants for each subgroup action.
Since it is well known that complete invariants with respect to translation and scaling can be easily obtained by, respectively, translating the z-centroid of the point cloud to 0, and by normalizing the standard deviation of the z-coordinates of the point cloud, the original problem now reduces to that of finding a point-cloud invariant for the action of the subgroup of z-shears parametrized by (α, β). Let us consider a z-sheared point cloud:

P̄ = [X; Y; Z̄],  Z̄ = Z + αX + βY.   (9)

We now employ the normalization procedure described in Sect. 6 and define a cross section with the following (linear) normalization equations:

σ_{X,Z̄} = 0,  σ_{Y,Z̄} = 0,   (10)

where σ_{·,·} stands for the sample covariance operator. One might argue that it would have been possible to replace the two normalization equations in (10) with z̄_1 = z̄_2 = 0, which is considerably simpler than our current formulation involving cross-covariances; however, note that such a system of normalization equations would lead to solutions which only contain information about two arbitrarily chosen points of the point cloud. Clearly this would have an impact in terms of robustness, and for this reason we opted for two normalization equations in which all the points of the point cloud appear. By plugging the corresponding terms in (9) into (10), and solving for (α, β), one obtains:

(α, β)^T = −Σ_{X,Y}^{-1} (σ_{X,Z}, σ_{Y,Z})^T,   (11)

where Σ_{X,Y} stands for the covariance matrix of X and Y. We can plug the right term of Equation (11) into the invariantization formula (5) and obtain:

ι_shear(P) = [X; Y; Z + αX + βY],  with (α, β) given by (11).   (12)

Lemma 1 guarantees that ι_shear(P̄) = ι_shear(P); thus, the point cloud ι_shear(P) is shear-invariant. In order to obtain the GBR-invariant representation of the point cloud, it is sufficient to apply (12) to the z-centered version of P and then standardize the result across z. This is stated more precisely in the following theorem:

Theorem 1 Given a 3D point cloud P ∈ R^{3×N} with N points, the resulting point cloud obtained by:

ι(P) = [X; Y; (Z̃ − μ_{Z̃} 1^T) / σ_{Z̃}],   (13)

where:

[X; Y; Z̃] = ι_shear([X; Y; Z − μ_Z 1^T]),   (14)

μ_Z = (1/N) Z 1,  σ_Z^2 = (1/N)(Z − μ_Z 1^T)(Z − μ_Z 1^T)^T,   (15)

and ι_shear is defined as in Equation (12), is invariant to the action of the group of GBR transformations.
Fig. 4 […] (13). Note that the invariant point cloud obtained from the noisy data appears multiplicatively biased (squeezed) along z, compared to its non-noisy version, due to an over-estimation of the scaling parameter in the moving frame equations. (Bottom row) The corresponding differential invariants obtained by (23) and visualized as normalized-Hessian maps, in which the three components z_xx, z_yy, z_xy of ∇²z ‖∇²z‖_F^{-1} were, respectively, assigned to the L*, a*, b* channels of the CIE L*a*b* color space

Note that the quantity ι(P) is singular whenever the determinant of the covariance matrix Σ_{X,Y} vanishes, or when σ_Z = 0. Note, however, that if the point cloud is an arbitrary depth map scattered on a regular grid, Σ_{X,Y} is always constant and non-singular; thus, the only degenerate case happens when the point cloud P is entirely scattered across a plane perpendicular to the z-axis. For this reason, when σ_Z ≈ 0, small perturbations in P can make the computation of the normalized point cloud ι(P) unstable. Note also that ι implicitly calculates four global parameters (the moving frame ρ) which are then applied to all points in order to project P onto the cross section. If the point cloud P is perturbed, then all the moving frame estimates for α, β, λ, τ may contain noise, and the point cloud does not get projected exactly onto the cross section. From Equations (12)-(15), we can deduce that ι might be affected by additive and/or multiplicative bias (see Fig. 4). For this reason, in the next section, we consider also another type of invariants (differential invariants) in which the moving frames are defined locally and the invariant operators are calculated for each location of the surface.
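A minimal numpy sketch of the point-cloud invariant of Theorem 1 follows (function names are ours; "standardize" here means subtracting the mean and dividing by the standard deviation of the z-row):

```python
import numpy as np

def gbr_invariant(P):
    """GBR-invariant point cloud (sketch of Theorem 1): center z,
    remove the z-shear with the covariance-based moving frame,
    then standardize z."""
    X, Y, Z = P
    Z = Z - Z.mean()                                  # normalize z-translation
    C = np.cov(np.vstack([X, Y]))                     # 2x2 covariance of X, Y
    a, b = -np.linalg.solve(C, [np.cov(X, Z)[0, 1],
                                np.cov(Y, Z)[0, 1]])  # shear moving frame
    Z = Z + a * X + b * Y                             # normalize z-shear
    Z = (Z - Z.mean()) / Z.std()                      # normalize z-scaling
    return np.vstack([X, Y, Z])
```

Feeding the function two point clouds related by an arbitrary GBR transformation returns the same canonical point cloud, which is exactly the property exploited in the loss function.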

GBR-invariants for Smooth 2D Functions
Let us consider the space S ⊂ C^∞(R^2, R^3) of smooth functions (spatial patches) f : R^2 → R^3 defined as follows, as well as f̄, the GBR-transformed version of f:

f(x, y) = (x, y, z(x, y))^T,  f̄(x, y) = (x, y, λ(z(x, y) + αx + βy + τ))^T,   (16)

where z ∈ C^∞(R^2) is the corresponding depth map, α, β, τ ∈ R, and λ ∈ R+. We use the normalization construction described in Sect. 6 when considering the prolonged action of the GBR transformations on f^[n], the n-th order jets of f. Note that to alleviate the notation, we will write partial derivatives with the subscript notation, e.g., f_xy = ∂_xy f. The prolonged action on the n-jets of f is thus given by:

f̄^[n] = (f̄, f̄_x, f̄_y, f̄_xx, f̄_xy, f̄_yy, . . .),   (17)

where the vector in the right term contains partial derivatives up to the n-th order. By plugging the definition of f̄ into (17), we obtain:

f̄^[n] = ( x                 1           0           0       0       0       ⋯
           y                 0           1           0       0       0       ⋯
           λ(z+αx+βy+τ)      λ(z_x+α)    λ(z_y+β)    λz_xx   λz_xy   λz_yy   ⋯ ).   (18)
Since the GBR group is parameterized by four parameters, we consider the following system of four normalization equations:

$$\bar z = 0, \qquad \bar z_x = 0, \qquad \bar z_y = 0, \qquad \bar z_{xx}^2 + \bar z_{xy}^2 + \bar z_{yx}^2 + \bar z_{yy}^2 = 1. \tag{19}$$

The reader could argue that the normalization equation $\bar z_{xx}^2 = 1$ would have led to simpler calculations; however, including the remaining terms of the Hessian matrix both increases the robustness and avoids the degenerate cases in which only $z_{xx} = 0$. Plugging the explicit formulas for the partial derivatives in (18) into (19) yields:

$$\lambda z + \alpha x + \beta y + \tau = 0, \qquad \lambda z_x + \alpha = 0, \qquad \lambda z_y + \beta = 0, \qquad \lambda^2 \big( z_{xx}^2 + 2 z_{xy}^2 + z_{yy}^2 \big) = 1, \tag{20}$$

and solving for the parameters $\alpha, \beta, \lambda, \tau$, one obtains the moving frame:

$$\lambda = \frac{1}{\|\nabla^2 z\|_F}, \qquad \alpha = -\frac{z_x}{\|\nabla^2 z\|_F}, \qquad \beta = -\frac{z_y}{\|\nabla^2 z\|_F}, \qquad \tau = \frac{x z_x + y z_y - z}{\|\nabla^2 z\|_F},$$

where $\nabla^2 z$ denotes the Hessian matrix of $z$. By the invariantization lemma, plugging the parameter values of the moving frame into (18) yields a differential invariant for the GBR-action:

$$I_f = \begin{pmatrix} x & 1 & 0 & 0 & 0 & 0 & \cdots \\ y & 0 & 1 & 0 & 0 & 0 & \cdots \\ 0 & 0 & 0 & \dfrac{z_{xx}}{\|\nabla^2 z\|_F} & \dfrac{z_{xy}}{\|\nabla^2 z\|_F} & \dfrac{z_{yy}}{\|\nabla^2 z\|_F} & \cdots \end{pmatrix}. \tag{21}$$

It is possible to verify that the entries of $I_f$ that contain derivatives of degree higher than 2 are obtainable from the first three nonzero invariants in (21). Furthermore, we recognize $z_{xx}, z_{xy}, z_{yx}, z_{yy}$ as the entries of the $2 \times 2$ Hessian matrix of $z$; therefore, we can conclude that a complete differential invariant for the GBR-action on $S$ is given by the normalized Hessian $\nabla^2 z / \|\nabla^2 z\|_F$ (22). We can now make the following statement.

Theorem 2
Given a smooth function $f \in S$ as defined in (16), the normalized Hessian

$$\iota_f(x, y) = \frac{\nabla^2 z(x, y)}{\|\nabla^2 z(x, y)\|_F}, \tag{23}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, is a complete differential invariant for $f$ with respect to the action of the group of GBR transformations.
Note that $\iota_f(x, y)$ has singularities only at those points where all the second derivatives vanish. The only surfaces for which $\iota_f$ is singular at every point are the planar ones. In our implementation, where the domain of $z$ is discretized, we estimate the partial derivatives of $z$ in (23) with second-derivative filters. It is interesting to note that if $z$ is locally perturbed in the neighborhood of a point $(x, y)$, only the invariants $\iota_f$ at locations spatially close to $(x, y)$ are affected by the perturbation (see Fig. 4).
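A minimal discrete implementation of the normalized Hessian (23) can be sketched as follows; it estimates the second derivatives with finite differences (`np.gradient` applied twice, a simple stand-in for the second-derivative filters mentioned above) and masks the singular locations where the Hessian vanishes.

```python
import numpy as np

def normalized_hessian(z, eps=1e-12):
    """Discrete version of the differential invariant (23): the Hessian of z
    divided pointwise by its Frobenius norm. Derivatives are estimated with
    finite differences; rows index y, columns index x."""
    zy, zx = np.gradient(z)          # first derivatives along y and x
    zyy, zyx = np.gradient(zy)       # second derivatives of zy
    zxy, zxx = np.gradient(zx)       # second derivatives of zx
    frob = np.sqrt(zxx**2 + zxy**2 + zyx**2 + zyy**2)
    frob = np.where(frob < eps, np.nan, frob)  # singular where Hessian ~ 0
    return np.stack([zxx, zxy, zyy]) / frob
```

Since finite differences annihilate the affine part $\alpha x + \beta y + \tau$ and the scale $\lambda$ cancels in the normalization, the output is numerically identical for a depth map and its GBR-transformed version.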

Dataset Generation
In this section, we give a detailed explanation of how we generate a database of video sequences to train a neural network to estimate depth information. Consider a deforming surface in space-time $f : (-1, 1)^2 \times (0, 1) \to \mathbb{R}^3$ having the following form:

$$f(\mathbf{x}, t) = E_t \big[ (\mathbf{x}, 0)^T + d(\mathbf{x}, t) \big], \tag{24}$$

where the term $E_t$ represents Euclidean transformations, i.e., roto-translations, smoothly parameterized by $t$, whose effect is that of changing the orientation and position of the coordinate frame across time, and the quantity inside brackets in (24) can be seen as a 2D surface in $\mathbb{R}^3$ obtained by applying a displacement $d$ to each point of the bi-unit square $(-1, 1)^2$ (Fig. 5). Note that the function $f$ can be interpreted as a homotopy (continuous deformation) between the surfaces $f(\cdot, 0)$ and $f(\cdot, 1)$. To define the vector displacement function $d$, we utilize an approach inspired by ideas found in the computer graphics literature for simulating, in real time, physical phenomena like ocean waves [42] or turbulent fluid flows [8]. These methods typically aim at physical accuracy; however, since we are only interested in producing a reasonably realistic-looking appearance of a non-rigidly deforming surface, we adopt the following simple strategy. Let us consider the channel-wise Fourier transform operator $\mathcal{F}$ for vector-valued functions $v : \mathbb{R}^2 \to \mathbb{R}^3$, the two-dimensional Gaussian function $G_\Sigma(\mathbf{u}) = e^{-\mathbf{u}^T \Sigma \mathbf{u}}$, where $\Sigma$ is a $2 \times 2$ symmetric positive definite matrix, and a vector-valued phase function $\varphi(\mathbf{u}, t)$, through which $d$ is defined in (25)-(26); there, $\operatorname{Re}\mathcal{F}^{-1}$ is the real part of the inverse 2D Fourier transform operator applied "channel-wise," and $w : \mathbb{R}^2_{\geq 0} \to [0, 1]$ is a two-dimensional circularly symmetric function whose radial profile is rapidly increasing and such that $w(0) = 0$, which guarantees that the integral of $d$ is zero. The intuition behind the term $\varphi$ in (25) is that of modulating the phase of each component of the frequency spectrum so as to generate a wavy surface that deforms across the time dimension.
Fig. 5 A deforming surface generated by (24), for the case where $E_t$ is a translation of 0.3 along z. Intuitively, the bi-unit square is morphed into a deforming surface by the map $\mathbf{x} + d(\mathbf{x}, t)$, which is then re-oriented in space through a rigid transformation $E_t$, resulting in the green surface in the figure. The vector displacement $d(\cdot, t)$ is depicted as a color texture whose red, green, and blue components correspond, respectively, to the x, y, z components of the displacement (middle gray denotes zero displacement). The appearance of the vector displacement texture varies smoothly in time thanks to the effect of the term $\varphi$ in (25)-(26).

The scalar parameters $\kappa$ (intensity), $\nu$ (constraint), $\zeta$ (folding), and $\xi$ (flexibility), respectively, control the overall strength of the displacement, how much the square patch remains anchored at its border, how much it tends to create folds, and how much it behaves like a soft or hard surface. The matrix $\Sigma$ controls the distribution of orientations of the displacements, while $\phi, \theta : \mathbb{R}^2 \to \mathbb{R}^3$ are functions that determine the speed by which each frequency component changes in time and their initial phase angle (Fig. 6). In our implementation, $\phi$ and $\theta$ are initialized randomly.
Our approach of defining a moving surface by means of (26) can be seen as a computationally cheap way to emulate the appearance of a planar surface with given physical properties, animated by means of a realistic cloth simulation. We stress that in our application it is not strictly necessary to achieve physical correctness of the movement; for this reason, we adopted colored noise to generate the geometry of the patch and its movement. Note that in a practical scenario, the domain of f would obviously be discretized into a 3D array.
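The colored-noise construction can be sketched as follows. This is only an illustrative approximation of (25)-(26), not the paper's exact formulas: it weights a spectrum by an isotropic Gaussian envelope, rotates each frequency's phase over time at a random per-frequency speed, and inverts with the real part of the inverse 2D FFT, channel-wise. All parameter names here (`sigma`, `theta`, `phi`) are ours.

```python
import numpy as np

def displacement_field(n=64, t=0.0, sigma=8.0, seed=0):
    """Illustrative colored-noise displacement in the spirit of (25)-(26):
    a Gaussian-shaped spectrum whose per-frequency phases rotate over time,
    inverted channel-wise via the real part of the inverse 2D FFT."""
    rng = np.random.default_rng(seed)
    fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    r2 = fx**2 + fy**2
    envelope = np.exp(-r2 * sigma**2)   # isotropic stand-in for G_Sigma
    envelope[0, 0] = 0.0                # w(0) = 0: zero-mean displacement
    d = np.empty((n, n, 3))
    for c in range(3):                  # one channel per x, y, z component
        theta = rng.uniform(0, 2 * np.pi, (n, n))  # initial phase angles
        phi = rng.uniform(0.5, 2.0, (n, n))        # per-frequency speeds
        spectrum = envelope * np.exp(1j * (theta + phi * t))
        d[..., c] = np.real(np.fft.ifft2(spectrum))
    return d
```

Because the phases vary continuously with `t`, successive frames of the displacement texture change smoothly, as in Fig. 5.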
After the geometry of a moving surface is defined, we render 153600 video clips of space-time surfaces f by varying the parameters in (26), the light source parameters, the texture of the surface, its reflectance properties, and the variance of the additive noise added after rendering. For rendering, we adopt the Phong reflectance model [36]. More accurate reflectance models could be used to increase realism, at the expense of computational speed and additional parameters to handle. The textures used in our implementation are 256 × 256 crops extracted at random positions from the images of the DTD database [13]. Each cropped image is texture-mapped onto the bi-unit square, which is then deformed according to the above procedure. The render pass produces for each clip a stack of 16 images of 64 × 64 pixels that we call f_render. In addition to the render pass, we produce another 64 × 64 × 16 array f_depth of corresponding depth maps (Fig. 7). A single database entry is represented by a pair (f_{i,render}, f_{i,depth}), where the index i ranges from 1 to 153600 in our implementation. The number of frames used in the training stage is fixed to 16; this number was chosen empirically by running experiments in which we trained the network with different frame counts (8, 16, and 32) and found 16 to be a reasonable compromise between depth estimation accuracy and space requirements.

Fig. 6 Illustration of the effect of the parameters of (26) on the surfaces used as training data. From left to right: a patch obtained by setting ν, ζ, ξ to very low values and Σ = I; the same patch with increased constraint; with non-identity Σ; with increased flexibility; with increased folding.

Fig. 7 Two examples of scenes rendered with our method. From left to right, the 3D mesh, rendered object, and depth map are depicted.
Note also that excessively increasing the number of frames would mean populating the database with videos of very long duration, which arguably would not be suitable representations of local information. On the other hand, videos containing only a few frames may not capture slow movements effectively.

Network Architecture
In order to estimate a depth video from its corresponding sequence of grayscale video frames, we use an architecture similar to the 3D U-Net [12] (see Fig. 8). The input to the network is a 16-frame video with frames of 64 × 64 pixels. The output of the network is the corresponding depth video, which has size 64 × 64 × 16. To enlarge the receptive field and obtain richer features, a context module [50] with atrous spatial pyramid pooling is introduced, which performs parallel dilated convolutions with different dilation rates. These feature maps are then concatenated, and the output of the module is the convolution of the concatenated maps. For the first two stages in the network, three dilation rates are used (1, 2, 3). For the following two stages, two dilation rates are used (1, 2), as the feature map dimensions become smaller. We use 3 × 3 × 3 kernels for all convolutions. The leaky ReLU activation function is used for all layers except the last one, which has a linear activation function to predict the depths. The motivation behind this choice of architecture was to use the 3D U-Net to extract feature maps across the time domain and learn local motion, which we observed to work better than using a 2D U-Net with the frames treated as channels. Moreover, atrous convolution effectively enlarged the receptive field and experimentally led to an increase in performance.
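The receptive-field gain from atrous convolution can be checked with a little arithmetic. The sketch below uses the standard formula for a stack of stride-1 dilated convolutions (each layer adds $(k-1)\cdot d$ to the field of view per axis); the example stack is ours, for comparison only, not the exact network layout.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (per axis) of stacked stride-1 dilated convolutions:
    each layer with kernel size k and dilation d adds (k - 1) * d."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Three 3x3x3 layers: plain convolutions see 7 voxels per axis,
# while dilation rates (1, 2, 3) extend this to 13 at the same cost.
```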

Reconstruction of a Depth Map from Local Patches
The proposed network architecture is designed to recover only depth videos whose frames have dimension 64 × 64. When the input video has larger dimensions, an additional step is required in order to reconstruct the full-sized depth map from the local patches. To address this issue, we split each frame of the input video into 64 × 64 squares with constant overlap and run our algorithm to estimate the depth in each block. Since the recovered depth maps are ambiguous with respect to a GBR transformation, we also run our algorithm on a version of the input video downsized along the spatial dimensions to 64 × 64. This yields depth maps that represent a coarse version of the full-sized depth map. We upscale the coarse depth map to the original size of the video frames, and to each 64 × 64 patch we apply the best (in the least-squares sense) GBR transformation that aligns it with the corresponding patch at the coarse level. The resulting patches are then multiplied by a two-dimensional triangle function (i.e., the outer product of a triangle function with itself) and added together. Note that such a reconstruction scheme has strong analogies with the well-known constant overlap-add (COLA) reconstruction often used for recovering signals from their short-time Fourier transform [4]. The temporal dimension is processed in an analogous way.

Fig. 8 The network architecture: the input is a grayscale video and the output is the corresponding sequence of depth maps. Below each layer, the number of channels is reported.
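The triangular-window blending described above can be sketched as follows. This is a simplified illustration assuming the patches have already been GBR-aligned to the coarse level and laid out with 50% overlap; with that hop size, the 2D triangular windows sum exactly to one in fully covered regions, which is the COLA property.

```python
import numpy as np

def triangle_window(n):
    """Triangular window satisfying constant overlap-add at hop n // 2."""
    i = np.arange(n)
    return 1.0 - np.abs((i + 0.5 - n / 2) / (n / 2))

def overlap_add(patches, image_shape, n=64):
    """Blend n x n depth patches laid out with 50% overlap.

    `patches` maps top-left corners (r, c) to already GBR-aligned patches;
    the 2D triangular window (outer product of two 1D triangles) makes the
    blending weights sum to one wherever four patches overlap."""
    w2 = np.outer(triangle_window(n), triangle_window(n))
    out = np.zeros(image_shape)
    for (r, c), p in patches.items():
        out[r:r + n, c:c + n] += w2 * p
    return out
```

The same windowing applied along the time axis handles the temporal dimension analogously.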

Experiments
In this section, we describe the experiments that were carried out to validate our method. We split this discussion in two parts in which we, respectively, test the performance of our method on synthetic images and on real data.

Experiments on Synthetic Data
Two main experiments were run on synthetic images. We generated a database of 153600 entries as explained in Sect. 9 and split it into three partitions: 80% for training, 10% for validation, and 10% for testing. We used this database to train a CNN with the architecture described in Sect. 10.
In the first experiment, we considered two different loss functions during the training stage. The first one is based on the differential invariants and is defined as

$$\mathcal{L}_\partial = \mathrm{MSE}(\iota_f, \iota_{f^*}),$$

where $f$ and $f^*$ are, respectively, the estimated and the ground truth depth maps, and $\iota$ is the differential invariant operator defined in Equation (23). The second loss function is defined analogously; the only difference with $\mathcal{L}_\partial$ is that the differential invariant $\iota$ is replaced with the point cloud invariant of Equation (13). Since the differential invariants can be singular at some locations, we consider only those pixels where the quantities $\iota_f$ and $\iota_{f^*}$ are well defined. To quantitatively evaluate the performance, we align the estimated depth maps with the corresponding ground truth by a linear transformation and calculate the mean absolute spatially normalized error:

$$\mathrm{MAE}_{sn}(f, f^*) = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{MAE}\big(f(\cdot, t), f^*(\cdot, t)\big)}{\sigma_{f^*(\cdot, t)}}, \tag{27}$$

where $T$ is the number of frames (16 in our experiments) in each spatiotemporal patch, and $\sigma_{f^*(\cdot, t)}$ denotes the standard deviation of the ground truth depth values of frame $t$. The use of Equation (27) is justified by the fact that the depth videos estimated with our method are scale-ambiguous; thus, a scale-invariant error metric like $\mathrm{MAE}_{sn}$ allows one to measure the quality of the reconstruction with respect to the ground truth depth, up to a multiplicative scalar. We repeat the same experiment by re-using the alignment parameters of the first frame for all the other frames. In a practical scenario, one could alternatively recover the alignment parameters by tracking points on a planar surface of the scene. The results of the above experiment are summarized in Table 1.
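A direct implementation of the spatially normalized error metric can be sketched as follows, assuming depth videos stored as (H, W, T) arrays and per-frame normalization by the ground-truth standard deviation.

```python
import numpy as np

def mae_sn(f, f_star):
    """Mean absolute spatially normalized error over an (H, W, T) depth video:
    per-frame MAE divided by the ground-truth standard deviation of that
    frame, averaged over the T frames."""
    T = f.shape[-1]
    err = 0.0
    for t in range(T):
        err += np.mean(np.abs(f[..., t] - f_star[..., t])) / f_star[..., t].std()
    return err / T
```

For instance, a constant depth offset of c on a ground truth with unit per-frame standard deviation yields an error of exactly c.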
In the second experiment, we synthetically generate 1000 samples (f_{i,render}, f_{i,depth}) of size 256 × 256 × 16 using the method in Sect. 9, with significantly different parameters in Equation (26) than the ones used in the training stage. We then automatically select 1089 points scattered uniformly on the 3D mesh and track their xy-coordinates on the image plane. We use these coordinates as inputs for two popular state-of-the-art non-rigid structure from motion methods, CSF2 [20] and KSTA [19], and obtain the z-coordinates of the tracked points for each frame. Note that according to the benchmark published in [22], the most recent state-of-the-art method [28] performs very similarly to CSF2 [20]. To obtain a dense 256 × 256 depth map, we use scattered interpolation at each frame, then apply a linear transformation to the estimated depth maps to align them with the respective ground truths, and calculate the MAE_sn of each depth video. We finally compare the results with the depth videos obtained directly from the video sequences using our method and the current state-of-the-art method for consistent video depth estimation (CVDE) [31]. Since the videos have larger dimensions than the ones used in training, we apply the reconstruction strategy described in Sect. 11 to obtain full-sized depth videos. The results are summarized in Table 2. Figure 9 shows a visual comparison of the output generated by the considered algorithms.
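The alignment of an estimated depth map to a reference by the best GBR (or, more generally, linear) transformation reduces to ordinary linear least squares, since the transformed depth is linear in the parameters. A minimal sketch over a regular pixel grid (function and variable names are ours):

```python
import numpy as np

def fit_gbr(z_est, z_ref):
    """Least-squares GBR alignment of a depth map to a reference:
    finds (lambda, alpha, beta, tau) minimizing
    || lambda * z_est + alpha * x + beta * y + tau - z_ref ||^2."""
    h, w = z_est.shape
    yy, xx = np.mgrid[0:h, 0:w]
    A = np.column_stack([z_est.ravel(), xx.ravel(), yy.ravel(),
                         np.ones(h * w)])
    params, *_ = np.linalg.lstsq(A, z_ref.ravel(), rcond=None)
    lam, alpha, beta, tau = params
    aligned = lam * z_est + alpha * xx + beta * yy + tau
    return params, aligned
```

When the reference is an exact GBR transform of the estimate, the fit recovers the transformation parameters exactly (up to numerical precision).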

Experiments on Real Data
In this experiment, we used real data, consisting of a grayscale video along with the corresponding depth maps, acquired with a Microsoft Kinect v2, which is equipped with a pair of RGB and depth perspective cameras. The video sequence was obtained by recording, in an indoor environment, a curtain moved by a person located at about 2 meters from the camera. The video clip contained 60 frames at 512 × 424 pixel resolution, recorded at 30 frames per second. For each pixel, a finite depth value was recorded by the sensor. As in the previous section, we compare our method, trained with the same database as in the previous experiment, with three other methods in the literature: CSF2, KSTA, and CVDE. In our method, the depth videos are estimated directly from the input video sequence. For the two NRSfM methods, CSF2 and KSTA, we performed point tracking on the video sequence using the minimum eigenvalue algorithm [41], which yielded the highest number of correctly tracked points, and we manually calibrated the parameters of the two methods to obtain the best results. The depth videos were then obtained as in the previous experiment, i.e., by recovering for each frame a dense depth map from scattered interpolation of the estimated depth values of the tracked points. CVDE needs a similar pre-processing step in which correspondences between frames are automatically found using optical flow. In this experiment, we had to establish 30 correspondences manually in the non-static region of interest of the video, since the automatic algorithm was unable to detect them. The results are given in Table 3, and the estimated depth maps are shown in Fig. 10.

Conclusion
We presented a deep-learning-based approach to recover a depth video directly from a video sequence of a non-rigidly deforming object acquired with a static orthographic camera. The problem of training the network was addressed by synthetically generating a large database of short videos depicting the local shape of deforming 3D meshes rendered with realistic textures, and paired with their corresponding depth videos. Since with an orthographic camera model there are potentially infinitely large classes of depth maps corresponding to the same rendered image, we addressed this issue by mathematically deriving maximally discriminative representations of surfaces that are invariant to the group of generalized bas-relief transformations, and used them in the loss function during the training stage. We tested our method on both synthetic and real videos and experimentally observed that the quality of the estimated depth videos outperforms that of state-of-the-art NRSfM and video depth estimation algorithms for scenes that are mostly dominated by non-rigid motion.