
1 Introduction

Spatial transformers [1] are a class of differentiable layers that can be inserted into a standard convolutional neural network architecture to enable invariance to certain geometric transformations of the input data, as well as warping for reconstruction error [2]. In this work, we build upon the 2D transformation layers originally proposed in spatial transformer networks [1] and provide various novel extensions that perform geometric transformations often used in geometric computer vision. Most of these layers have no internal parameters that need learning; they allow backpropagation and can be inserted into a neural network so that any fixed, differentiable geometric operation can be performed on the data. This opens up an exciting new path to blend ideas from geometric computer vision into deep learning architectural designs, allowing the exploitation of problem-specific domain knowledge.

Geometric computer vision has heavily relied on generative parametric models of inverse computer graphics to enable reasoning about and understanding of real physical environments, which provide rich observations in the form of images or video streams. These fundamentals and principles are very well understood and form the backbone of large-scale point cloud reconstruction from multi-view image data, visual odometry, and image registration. In this work, we provide a comprehensive library that allows the implementation of various image registration and reconstruction methods using these geometric transformation modules within the framework of convolutional neural networks. This means that certain hand-engineered elements of classic geometric vision methods can be replaced by a module that is learnt end-to-end within a neural network. Our library is implemented in Torch [3] and builds upon the open source implementation of spatial transformer networks [4].

2 gvnn: Geometric Vision with Neural Networks

We introduce gvnn, a Torch package dedicated to performing transformations that are often used in geometric computer vision applications within a neural network. These transformations are implemented as fixed differentiable computational blocks that can be inserted within a convolutional neural network and are useful for manipulating the input data according to domain knowledge from geometric computer vision. Below we describe the various novel transformation layers that form the core of the library, built on top of the open source implementation [4] of spatial transformer networks.

Let us assume that \(\mathcal {C}\) represents the cost function being optimised by the neural network. For a regression network it may take the form \(\mathcal {C} = \frac{1}{2}||\mathbf {y}_{pred} - \mathbf {y}_{gt}||^{2}\), where \(\mathbf {y}_{pred}\) is the prediction vector produced by the network and \(\mathbf {y}_{gt}\) is the corresponding ground truth vector. The chain rule then allows us to propagate derivatives from the loss function back to the input of any layer in the network.

2.1 Global Transformations

We begin by extending the 2D transformations introduced in the original spatial transformer networks (STN) to their 3D counterparts. These transformations encode the global movement of the whole image i.e. the same transformation is applied to every pixel in the image or any 3D point in the world.

SO3 Layer. Rotations in our network are represented by the so(3) vector (or, equivalently, the \(\mathfrak {so}(3)\) skew-symmetric matrix), a compact \(3\times 1\) vector representation \(\mathbf {v} = (v_{1}, v_{2}, v_{3})^{T}\) that is turned into a rotation matrix via the SO3 exponential map, i.e. \(\mathsf {R}(\mathbf {v}) = \exp ([\mathbf {v}]_{\times })\). The backpropagation derivatives with respect to \(\mathbf {v}\) can be conveniently written as [5]

$$\begin{aligned} \frac{\partial \mathcal {C}}{\partial \mathbf {v}}= & {} \frac{\partial \mathcal {C}}{\partial \mathsf {R}(\mathbf {v})}\cdot \frac{\partial \mathsf {R}(\mathbf {v})}{\partial \mathbf {v}} \end{aligned}$$
(1)

where

$$\begin{aligned} \frac{\partial \mathsf {R}(\mathbf {v})}{\partial v_{i}}= & {} \frac{v_{i}[\mathbf {v}]_{\times } + [\mathbf {v} \times (\mathsf {I}-\mathsf {R})e_{i}]_{\times } }{||\mathbf {v}||^{2}} \mathsf {R} \end{aligned}$$
(2)

\([\cdot ]_{\times }\) turns a \(3\times 1\) vector into a skew-symmetric matrix and \(\times \) denotes the cross product. \(\mathsf {I}\) is the identity matrix and \(e_{i}\) is its \(i^{th}\) column. We have also implemented different parameterisations of rotation, e.g. quaternions and Euler angles, as additional layers. Below we show the code snippet that performs backpropagation on this layer.

(Code listing for the backpropagation of this layer omitted.)
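As a rough illustration of the same computation (a numpy sketch rather than the library's Torch code; the function names and the small-angle threshold eps are assumptions of this sketch), the forward exponential map and the backward pass of Eqs. (1) and (2) can be written as:

import numpy as np

def skew(v):
    """Return the 3x3 skew-symmetric matrix [v]_x of a 3-vector v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def so3_exp(v, eps=1e-12):
    """Forward pass: Rodrigues' formula R = exp([v]_x)."""
    theta = np.linalg.norm(v)
    K = skew(v)
    if theta < eps:
        return np.eye(3) + K          # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def so3_backward(v, grad_R, eps=1e-12):
    """Backward pass: dC/dv given dC/dR, following Eqs. (1)-(2) and [5]."""
    R = so3_exp(v)
    theta_sq = float(v @ v)
    grad_v = np.zeros(3)
    for i in range(3):
        e_i = np.eye(3)[:, i]
        if theta_sq < eps:
            dR_dvi = skew(e_i) @ R    # limit of Eq. (2) as v -> 0
        else:
            dR_dvi = ((v[i] * skew(v)
                       + skew(np.cross(v, (np.eye(3) - R) @ e_i)))
                      / theta_sq) @ R
        grad_v[i] = np.sum(grad_R * dR_dvi)   # chain rule: <dC/dR, dR/dv_i>
    return grad_v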

SE3 Layer. The SE3 layer adds translations on top of the SO3 layer where translations are represented by a \(3\times 1\) vector \(\mathbf {t}\), and together they make up the \(3\times 4\) transformation, i.e. \(\mathsf {T} = [\mathsf {R} | \mathbf {t}] \in \) SE3.

Sim3 Layer. The Sim3 layer builds on top of the SE3 layer and has an extra scale factor s to allow for any scale changes associated with the transformation \(\mathsf {T} = \begin{bmatrix} s\mathsf {R}&\mathbf {t} \\ \mathsf {0}&1 \end{bmatrix}\).
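For concreteness, a minimal numpy sketch of assembling this matrix (illustrative only; the function name is an assumption):

import numpy as np

def sim3_matrix(s, R, t):
    """Assemble the 4x4 Sim3 transform [[s*R, t], [0 0 0 1]] from a scalar
    scale s, a 3x3 rotation R and a translation 3-vector t."""
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T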

3D Grid Generator. The 3D grid generator is an extension of the 2D grid generator proposed in the original STN. It additionally takes a depth map as input to map the image pixels to the corresponding 3D points in the world, and transforms these points with \(\mathsf {T}\) coming from the SE3 layer. Note that we have used a regular grid in this layer, but it is possible to extend this to the general case where the grid locations can also be learnt.
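A minimal numpy sketch of this operation (illustrative only, not the library's Torch code; it assumes a pinhole camera with intrinsics \(f_x, f_y, p_x, p_y\) as used by the projection layer below):

import numpy as np

def grid_generator_3d(depth, fx, fy, px, py, T):
    """Backproject an HxW depth map to 3D points on a regular pixel grid and
    transform them with the 3x4 matrix T = [R | t] coming from the SE3 layer."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # regular 2D grid
    X = (u - px) / fx * depth                        # inverse pinhole projection
    Y = (v - py) / fy * depth
    P = np.stack([X, Y, depth, np.ones_like(depth)], axis=0).reshape(4, -1)
    return (T @ P).reshape(3, H, W)                  # transformed 3D points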

Projection Layer. The projection layer maps the transformed 3D points, \(\mathbf {p} = (u, v, w)^{T}\), onto the 2D image plane using the focal lengths and the camera centre location, i.e.

$$\begin{aligned} \pi \left( \begin{array}{c} u \\ v \\ w \end{array} \right)= & {} \left( \begin{array}{c} f_x\frac{u}{w} + p_x \\ f_y\frac{v}{w} + p_y \end{array}\right) \end{aligned}$$
(3)

where \(f_x\) and \(f_y\) represent the focal lengths of the camera along the X and Y axes, and \(p_x\) and \(p_y\) are the coordinates of the camera centre. The backpropagation derivatives can be written as

$$\begin{aligned} \frac{\partial C}{\partial \mathbf {p}}= & {} \frac{\partial C}{\partial \pi (\mathbf {p})}\cdot \frac{\partial \pi (\mathbf {p})}{\partial \mathbf {p}} \end{aligned}$$
(4)

where

$$\begin{aligned} \frac{\partial \pi \left( \begin{array}{c} u \\ v \\ w \end{array} \right) }{\partial \left( \begin{array}{c} u \\ v \\ w \end{array} \right) }= & {} \left( \begin{array}{ccc} f_x\frac{1}{w} &{} 0 &{} -f_x\frac{u}{w^2}\\ 0 &{} f_y\frac{1}{w} &{} -f_y\frac{v}{w^2} \end{array} \right) \end{aligned}$$
(5)

In fact, if the focal lengths are also involved in the optimisation, it is straightforward to include them in the network for any geometric camera calibration style optimisation. Note that special care must be taken to ensure that w is not very small. Fortunately, in many geometric vision problems w corresponds to the z-coordinate of a 3D point and is measured in metres; when using Kinect or ASUS Xtion cameras this is always greater than 10 cm.
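The forward and backward passes of this layer are compact enough to sketch directly; the following numpy sketch (illustrative only, not the library's Torch code) mirrors Eqs. (3) and (5):

import numpy as np

def project(p, fx, fy, px, py):
    """Forward pass of the projection layer, Eq. (3): 3D point (u, v, w) -> pixel."""
    u, v, w = p
    return np.array([fx * u / w + px, fy * v / w + py])

def project_jacobian(p, fx, fy):
    """Jacobian of Eq. (3) with respect to (u, v, w), as in Eq. (5)."""
    u, v, w = p
    return np.array([[fx / w, 0.0,    -fx * u / w ** 2],
                     [0.0,    fy / w, -fy * v / w ** 2]])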

2.2 Per-pixel Transformations

In many computer vision problems, particularly those related to understanding dynamic scenes, it is often necessary to have per-pixel transformations to model the motion of the different elements in the scene. In the following, we propose different layers for modelling per-pixel transformations for both RGB and RGB-D inputs.

RGB Based. In the context of RGB data, the classic optic flow problem is a case of per-pixel transformation that models the movement of pixels across time. We implement both the well-known minimal parameterisation in the form of per-pixel translation and the more recently studied over-parameterised formulations that encapsulate knowledge of the scene geometry in the flow.

Minimal Parameterisation Optic Flow. In its minimal parameterisation, optic flow \((t_x, t_y)\) models the movement of pixels in the 2D image plane, i.e.

$$\begin{aligned} \left( \begin{array}{c} x^{\prime } \\ y^{\prime } \\ \end{array} \right)= & {} \left( \begin{array}{c} x + t_x\\ y + t_y \end{array} \right) \end{aligned}$$
(6)

This is the most well-known and studied parameterisation of optic flow in the literature and needs only 2 parameters per pixel. In general, an extra penalty on the flow gradient is imposed to ensure that the flow varies smoothly across a pixel neighbourhood. Patraucean et al. [2] implement exactly this to model optic flow and use a Huber penalty for smoothness. We include this as part of our library, together with the recent over-parameterised extensions.
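As a hedged numpy sketch (the array layout and function name are assumptions), the per-pixel sampling coordinates of Eq. (6), which are then consumed by the bilinear sampler, can be formed as:

import numpy as np

def flow_sampling_grid(flow):
    """Given a 2xHxW flow field (t_x, t_y per pixel), return the 2xHxW
    sampling coordinates (x', y') of Eq. (6) used by the bilinear sampler."""
    _, H, W = flow.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    return np.stack([x + flow[0], y + flow[1]], axis=0)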

Over-Parameterised Optic Flow. Work on the popular differential epipolar constraint [6] and the recent over-parameterised formulations of [7] and [8] has shown that incorporating knowledge about the scene geometry and motion can greatly improve the per-pixel flow estimates. For instance, if a pixel lies on a planar surface, its motion can be modelled by an affine transformation. While [8] use a 9-DoF per-pixel transformation that includes knowledge of the homography, we describe the affine parameterisation used in [7].

$$\begin{aligned} \left( \begin{array}{c} x^{\prime } \\ y^{\prime } \\ \end{array} \right)= & {} \left( \begin{array}{ccc} a_0 &{} a_1 &{} a_2 \\ a_3 &{} a_4 &{} a_5 \end{array} \right) \left( \begin{array}{c} x \\ y \\ 1 \\ \end{array} \right) \end{aligned}$$
(7)

It is interesting to note that the popular 2-DoF translational optic flow described earlier is a special case of the affine transformation.

$$\begin{aligned} \left( \begin{array}{c} x^{\prime } \\ y^{\prime } \\ \end{array} \right)= & {} \left( \begin{array}{ccc} 1 &{} 0 &{} t_x\\ 0 &{} 1 &{} t_y \end{array} \right) \left( \begin{array}{c} x \\ y \\ 1 \\ \end{array} \right) \end{aligned}$$
(8)

We provide implementations of the 6-DoF affine transformation as well as a per-pixel SE(2) transformation; extensions to the 9-DoF parameterisation of [8] are straightforward.

(Code listing omitted.)
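For illustration only (a numpy sketch, not the library's Torch code; the array layout and function name are assumptions), the per-pixel affine case of Eq. (7) can generate sampling coordinates as follows:

import numpy as np

def affine_flow_grid(theta):
    """Per-pixel affine (6-DoF) transformation of Eq. (7): theta is a 6xHxW
    array holding (a0, ..., a5) for every pixel; returns the 2xHxW sampling
    coordinates (x', y')."""
    _, H, W = theta.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    xp = theta[0] * x + theta[1] * y + theta[2]
    yp = theta[3] * x + theta[4] * y + theta[5]
    return np.stack([xp, yp], axis=0)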

Slanted Plane Depth Disparity. Similar ideas have been used in [9] to obtain the disparity of a stereo pair. They exploit the fact that scenes can be decomposed into piecewise slanted planes, so the disparity of a pixel can be expressed by a plane equation. This results in an over-parameterised 3-DoF formulation of disparity.

$$\begin{aligned} d= & {} ax + by + c \end{aligned}$$
(9)

Again, this over-parameterisation greatly improves the results. Note that this formulation can be easily generalised and lifted to higher dimensions in the spirit of Total Generalised Variation (TGV) [10], but we have only implemented the 3-DoF formulation.
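An analogous numpy sketch for the 3-DoF slanted-plane disparity of Eq. (9) (illustrative only; the function name is an assumption):

import numpy as np

def slanted_plane_disparity(plane):
    """3-DoF slanted-plane disparity of Eq. (9): plane is a 3xHxW array
    holding (a, b, c) per pixel; returns the HxW disparity d = a*x + b*y + c."""
    _, H, W = plane.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    return plane[0] * x + plane[1] * y + plane[2]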

We would like to stress that these layers are particularly tailored towards warping images, which can be used as a direct feedback signal via the image reconstruction error in unsupervised training [2, 11].

RGB-D Based. Our layers can be easily adapted to RGB-D data to enable 3D point cloud registration and alignment via per-pixel rigid transformations. Such transformations have been used extensively in the computer graphics community for some time and were exploited by [12–14] for non-rigid alignment. We extend similar ideas and implement a 3D transformation for each pixel, applied to a 3D vector \(\mathbf {x}\) whose spatial coordinates come from a depth map. In principle, such alignment is general and not limited to 3D spatial points, i.e. any per-pixel 3D feature can be transformed. This is particularly useful when aligning feature maps, as used in sketch and style transfer with deep learning [15].

Per-pixel Sim3 Transformation. We extend the global Sim3 transformation that models scale s, rotation \(\mathsf {R}\), and translation \(\mathbf {t}\) to a per-pixel Sim3 transformation, i.e. \(\mathsf {T}_{i} = \begin{bmatrix} s_{i}\mathsf {R}_{i}&\mathbf {t}_{i} \\ \mathsf {0}&1 \\ \end{bmatrix}\) where \(\mathsf {R}_{i} \in \) SO3.

$$\begin{aligned} \left( \begin{array}{c} x_i^{\prime } \\ y_i^{\prime } \\ z_i^{\prime } \end{array} \right)= & {} \mathsf {T}_i\left( \begin{array}{c} x_i \\ y_i \\ z_i \\ 1 \\ \end{array} \right) \end{aligned}$$
(10)

This allows for an attention-like mechanism in the spirit of [1] in 3D, where specific voxel areas can be cropped and zoomed, and also for modelling any 3D registration that requires scale.

Per-pixel 10 DoF Transformation. In many non-rigid alignments the rotation need not happen around the origin but around an anchor point \(\mathbf {p}_i\) which is also jointly estimated. In this case, the transformation extends to 10 degrees of freedom [12].

$$\begin{aligned} \begin{array}{c} \mathbf {x}_{i}^{\prime } \end{array}= & {} \begin{array}{c} s_i(\mathsf {R}_i(\mathbf {x}_i - \mathbf {p}_i) + \mathbf {p}_i) + \mathbf {t}_i \end{array} \end{aligned}$$
(11)

Additionally, smoothness constraints can be added to ensure that transformations are locally smooth in just the same way as Huber penalty is imposed for smoothing 2D optic flow.
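A vectorised numpy sketch of Eq. (11) applied over a full point map (array shapes and the function name are assumptions of this sketch):

import numpy as np

def per_pixel_10dof(x, s, R, p, t):
    """Apply the 10-DoF transform of Eq. (11) to a 3xHxW point map x.
    s: HxW per-pixel scales, R: HxWx3x3 per-pixel rotations,
    p: 3xHxW anchor points, t: 3xHxW translations."""
    rotated = np.einsum('hwij,jhw->ihw', R, x - p)   # R_i (x_i - p_i) per pixel
    return s * (rotated + p) + t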

Fig. 1. Our siamese network is inspired by the popular VGG-16 network [16]: \(3\times 3\) convolutions are used in most layers, and the network operates on a \(320\times 240\) image resolution. Each convolution layer is followed by a PReLU non-linearity [17]. We explicitly avoid any pooling and use a stride of 2 in every convolution layer for any downsampling.

2.3 M-Estimators

The standard least-squares loss function often employed in parameter fitting greatly affects the quality of the solution obtained at convergence: built on the assumption that the noise in the data follows a Gaussian distribution, it treats inliers and outliers uniformly. This is undesirable because even one bad sample in the data can sway the optimisation to an unexpected convergence point. Outlier samples should therefore be culled or down-weighted to prevent the optimisation and estimation process from being influenced by them. Fortunately, this has long been studied in computer vision, going back to the early 90s and the work of Black et al. [18–20], who pioneered the use of robust cost functions, often termed M-estimators, for estimating a statistically robust mean of the data. We adapted the standard \(\mathcal {L}_2^{2}\) loss function with various popular M-estimators. The table below shows various M-estimators, \(\rho (x)\), and their corresponding derivatives, \(\psi (x)\).

(Table of M-estimators and their derivatives omitted.)
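The exact set of robust functions shipped with the library is not listed here; as representative, standard examples from the robust statistics literature (the tuning constants k and c are user-chosen thresholds), three common choices are:

Huber: \(\rho (x) = \frac{1}{2}x^{2}\) if \(|x| \le k\) and \(k|x| - \frac{1}{2}k^{2}\) otherwise; \(\psi (x) = x\) if \(|x| \le k\) and \(k\,\mathrm{sign}(x)\) otherwise.

Cauchy: \(\rho (x) = \frac{c^{2}}{2}\log \left( 1 + (x/c)^{2}\right) \); \(\psi (x) = \frac{x}{1 + (x/c)^{2}}\).

Tukey biweight: \(\rho (x) = \frac{c^{2}}{6}\left( 1 - \left( 1 - (x/c)^{2}\right) ^{3}\right) \) if \(|x| \le c\) and \(\frac{c^{2}}{6}\) otherwise; \(\psi (x) = x\left( 1 - (x/c)^{2}\right) ^{2}\) if \(|x| \le c\) and 0 otherwise.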

The use of M-estimators has already started to trickle into the deep learning community, e.g. Patraucean et al. [2] use a Huber loss function in the smoothness term to regularise the optic flow. We believe our library will further encourage people to use loss functions that are more pertinent to tasks where Gaussian noise assumptions fall apart.

3 Application: Training on RGB-D Visual Odometry

We perform early experiments on visual odometry for both SO3 and SE3 motion, the latter involving depth-based warping. We believe this is the first attempt towards an end-to-end system for visual odometry with deep learning. Since we are aligning images à la dense image registration methods, this allows us to perform sanity checks on the different layers, e.g. the SE3 layer, the 3D grid generator, and the projection layer, all within the same network and optimisation scheme. Note that we could also have chosen to minimise the re-projection error of sparse keypoints, as in classic bundle adjustment. However, that approach does not lend itself to generic iterative image alignment, where each iteration provides a warped version of the reference image that can be fed back into the network for an end-to-end RNN-based visual odometry system. Moreover, our approach is naturally suited to unsupervised learning in the spirit of [2, 11].

3.1 Network Architecture

Our architecture is composed of a siamese network that takes in a pair of consecutive frames, \(\mathcal {I}_{ref}\) and \(\mathcal {I}_{live}\), captured at time instances t and \(t+1\) respectively, and returns a 6-DoF pose vector, \(\delta _{pred}\), whose first three elements correspond to rotation and last three to translation, that transforms one image to the other. In the case of pure rotation, the network predicts a \(3 \times 1\) vector. It is assumed that the scene is mostly static and rigid, and that the motion perceived in the image is induced only by the camera movement. However, instead of naïvely comparing the predicted 6-DoF vector, \(\delta _{pred}\), with the corresponding ground truth vector, \(\delta _{gt}\), we build upon the work of Patraucean et al. [2] and warp the images directly using our customised 3D spatial transformer module, computing the image alignment error as our cost function. This allows us to compare the transformations in the right space: a naïve comparison of 6-DoF vectors would require a tunable parameter to weigh the translation and rotation errors appropriately when defining the cost function, since they are two different entities. Searching for the right weighting can quickly become tedious and may not generalise well. Although [21] are able to minimise a cost function by appropriately weighing the rotation and translation errors within an optimal hand-eye coordination loop, this is not always possible. Discretising the poses, as done in [22], may hamper the accuracy of pose estimation. On the other hand, computing the pixel error via warping, as often done in classic dense image alignment methods [23, 24], allows us to compare the transformations in the space of pixel intensities without having to tune any external parameters. Moreover, dense alignment methods have the added advantage of accurately recovering the transformations by minimising the sum of squared differences of pixel values at corresponding locations in the two images, i.e.

$$\begin{aligned} \mathcal {C}= & {} \frac{1}{2} \sum _{i=1}^{N}\bigg (\mathcal {I}_{ref}(\mathbf {x}_{i}) - \mathcal {I}_{live}(\pi (\mathsf {T}_{lr} \mathsf {\hat{p}}(\mathbf {x}_{i})))\bigg )^{2} \end{aligned}$$

where \(\mathbf {x}_{i}\) is the homogenised 2D location of the \(i^{th}\) pixel in the reference image and \(\mathsf {\hat{p}}(\mathbf {x}_{i})\) is the corresponding \(4\times 1\) homogenised 3D point, obtained by projecting the ray from the given pixel location \((x, y)\) into the 3D world via classic inverse camera projection and the depth, \(\mathsf {d}(x,y)\), at that pixel location.

$$\begin{aligned} \mathbf {x} = \left( \begin{array}{c} x \\ y \\ 1 \end{array} \right) ,&\mathsf {\hat{p}}(x,y) = \left( \begin{array}{c} \mathsf {K}^{-1}\mathbf {x} \cdot \mathsf {d}(x,y)\\ 1 \end{array} \right) \end{aligned}$$
(12)
$$\begin{aligned} \mathsf {K} = \begin{bmatrix} f_x&0&p_x \\ 0&f_y&p_y \\ 0&0&1 \end{bmatrix} ,&\pi \left( \begin{array}{c} u \\ v \\ w \end{array} \right) = \left( \begin{array}{c} f_x\frac{u}{w} + p_x \\ f_y\frac{v}{w} + p_y \end{array} \right) \end{aligned}$$
(13)

\(\mathsf {K}\) is the camera calibration matrix, \(f_x\) and \(f_y\) denote the focal lengths of the camera (in pixels) while \(p_x\) and \(p_y\) are the coordinates of the camera centre. \(\pi \) is the projection function that maps a 3D point onto the 2D plane and \(\mathsf {T}_{lr}\) (or \(\mathsf {T}_{pred}\)) is a \(3\times 4\) matrix that transforms a 3D point in the reference frame to the live frame. In this work, we bridge the gap between learning and geometry based methods with our 3D spatial transformer module, which explicitly defines these operations as layers that act as computational blocks with no learning parameters but allow backpropagation from the cost function to the input layers.

Figure 2 shows an example of our customised STN for 3D transformations. The siamese network predicts a \(6\times 1\) vector that is turned into a \(3\times 4\) transformation matrix \(\mathsf {T}_{pred}\) via the SE3 layer. This matrix transforms the points generated by the 3D grid generator, which additionally takes the depth image as input and turns it into 3D points via inverse camera projection with \(\mathsf {K}^{-1}\) as in Eq. 12. These transformed points are then projected back onto the 2D image plane via the projection layer (i.e. the \(\pi \) function) and further used to bilinearly interpolate the warped image as in the original STN [1].
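To make the chain of operations concrete, the following self-contained numpy sketch (illustrative only, not the library's Torch code) implements the dense photometric cost given above; border clamping in the bilinear sampler is a simplifying assumption:

import numpy as np

def bilinear_sample(img, xs, ys):
    """Bilinearly sample img (HxW) at continuous coordinates xs, ys (HxW).
    Out-of-bounds samples are clamped to the border for simplicity."""
    H, W = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    ax, ay = np.clip(xs - x0, 0, 1), np.clip(ys - y0, 0, 1)
    return ((1 - ay) * ((1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1])
            + ay * ((1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]))

def photometric_cost(I_ref, I_live, depth_ref, K, T):
    """Dense SSD alignment cost: backproject reference pixels with their
    depth, transform with the 3x4 matrix T, project into the live frame
    (Eq. 13) and compare intensities."""
    H, W = I_ref.shape
    fx, fy, px, py = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    X = (x - px) / fx * depth_ref                     # inverse projection, Eq. 12
    Y = (y - py) / fy * depth_ref
    P = np.stack([X, Y, depth_ref, np.ones((H, W))], axis=0).reshape(4, -1)
    Q = (T @ P).reshape(3, H, W)                      # points in the live frame
    u = fx * Q[0] / Q[2] + px                         # projection pi, Eq. 13
    v = fy * Q[1] / Q[2] + py
    warped = bilinear_sample(I_live, u, v)
    return 0.5 * np.sum((I_ref - warped) ** 2)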

Fig. 2. We train a siamese network to regress the relative pose vector between two consecutive frames, \(\mathcal {I}_{ref}\) and \(\mathcal {I}_{live}\). This pose vector is turned into a \(3\times 4\) transformation matrix that transforms the 3D points coming from the 3D grid generator; the transformed points are then projected onto the 2D plane via the projection layer and used to generate the warped image. Note that the 3D grid generator needs an explicit depth map as input to generate the 3D points for any 3D warping.

Our siamese network is inspired by the popular VGG-16 network [16] and uses \(3\times 3\) convolutions in all but the last two layers, where \(2\times 1\) and \(2\times 2\) convolutions are used to compensate for the \(320\times 240\) input resolution, as opposed to the \(224\times 224\) used in the original VGG-16. Figure 1 shows our siamese network, where the two heads are fused early to ensure that the relevant spatial information is not lost by the depth of the network. We also avoid any pooling operations throughout the network, again to ensure that the spatial information is preserved. All convolutional layers, with the exception of the last three, are followed by a non-linearity. We found PReLUs [17] to work better than ReLUs for our network, both in terms of convergence speed and accuracy, and therefore used them in all experiments. We also experimented with the recently introduced ELUs [25] but did not find any significant difference compared to PReLUs. The weights of all convolution layers are initialised with the MSRA initialisation proposed in [17]. However, the last layer has its weights initialised to zero, to ensure that the relative pose between the consecutive frames is initialised to the identity transformation, as is common in many dense image alignment methods.
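As an illustration of this initialisation scheme, here is a hedged sketch written in PyTorch (rather than the Torch framework used by the library); the module structure, the treatment of biases, and the final layer being a fully-connected pose head are assumptions of this sketch:

import torch.nn as nn

def init_pose_network(net, pose_head):
    """Sketch: MSRA/Kaiming initialisation [17] for the convolution layers,
    and zero initialisation of the final pose layer so that the predicted
    6-DoF vector starts at zero, i.e. the identity transform."""
    for m in net.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)      # bias handling is an assumption
    nn.init.zeros_(pose_head.weight)
    nn.init.zeros_(pose_head.bias)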

While one could use as the cost function the pixel difference between the live image predicted with the transformation returned by the siamese network and the live image itself, we chose instead to take the pixel difference between the live image predicted with the predicted transformation and the live image predicted with the ground truth transformation. This is because, if there is significant motion between the input frames, warping may lead to missing pixels in the predicted image, which would be unnecessarily penalised if compared against the live image directly, since there is no explicit way to block out the corresponding pixels in the live image. However, if the predicted images from the predicted and ground truth transformations are compared, at the optimal predicted transformation both have the same missing pixels, which implicitly blocks those pixels out. Moreover, any external artefact in the images in the form of motion blur, intensity changes, or image noise would affect the registration, since the cost function is a pixel-wise comparison. Our way of comparing the pixels, on the other hand, ensures that at convergence the cost function is as close to zero as possible and handles missing pixels appropriately. Ultimately, we only need a way to compare the predicted and ground truth transformations in the pixel space. We show early results of training on SO3 (pure rotation) and SE3 (rotation and translation) motion.

SO3 Motion: Pure Rotation. To experiment with pure rotation, we gathered IMU readings of a camera undergoing rapid hand-held motion: we used [26] to capture an outdoor dataset but dropped the translation readings. This is only to ensure that the transformations applied to the images correspond to real hand-held motion observed in the real world. We use the rotation matrices to synthetically generate new images in the dataset and feed the corresponding pairs through the network. We perform early experiments that serve as sanity checks for the different layers working together in a network. Figure 3 shows how our system is able to register the images over a given training episode. The first row shows a high residual in the image registration, but as the network improves with training the residual gradually decreases: the last row shows that the network is capable of registering images involving very large motion. Note that the prediction images at the start of training have no missing pixels (since the network is initialised with the identity transformation) but gradually move towards the ground truth image.

Fig. 3. Training results on pure rotation (SO3) motion. The graphs show how the training error decreases as the number of epochs increases; this serves as a sanity check for our network, which includes many of the new layers we propose in this library. The improvement during training is qualitatively evident from the difference images: early stages of the optimisation show a high residual in the registration, but as more epochs are devoted to the optimisation the residual error gracefully decreases.

SE3 Motion: Rotation and Translation. SE3 motion needs depth to enable the registration of two images involving both rotation and translation. This is possible with our SE3 layer combined with the 3D grid generator, which additionally takes a depth map as input and produces the interpolation coordinates to be further used by the bilinear interpolation layer. We use ICL-NUIM [27], generate a long trajectory of 9.5K frames, and use this as our training set. Figure 4 shows sample frames from this new trajectory. Since we need per-pixel depth for this experiment, we opted for a synthetic dataset for convenience. In the future, we would like to test our approach on real-world data.

Similar to the pure rotation (SO3) motion, we show early results on SE3 motion involving rotation and translation. Figure 5 shows the network’s ability to learn to align the predicted image with the ground truth image using depth that is given as an additional input to the 3D grid generator.

Fig. 4. Sample frames from our new ICL-NUIM trajectory.

Fig. 5. Sample results on the new trajectory generated with the ICL-NUIM dataset. The SE3 layer allows warping an image under 3D motion, which is evident from the registration error in the residual image. Note that the relative motion between consecutive frames is generally slow throughout the trajectory.

4 Future Work

We have only shown training on visual odometry as a sanity check of our layers and their ability to blend in with standard convolutional neural networks. In the future, we would like to train neural networks with both feed-forward and feedback connections on large amounts of training data. This data could come from standard Structure from Motion [28], large-scale synthetic datasets, e.g. SceneNet [29], or large-scale RGB or RGB-D videos for unsupervised learning.

5 Conclusions

We introduced a new library, gvnn, that allows the implementation of various standard computer vision applications within a deep learning framework. In its current form, it allows end-to-end training for optic flow, disparity or depth estimation, visual odometry, small-scale bundle adjustment, super-resolution, and place recognition with geometric invariance, in both supervised and unsupervised settings. In the future, we plan to extend this library to include various lens distortion models, camera projection models, IMU-based transformation layers, signed distance functions, level sets, and classic primal-dual methods [30] as RNN blocks, to allow embedding higher-order priors in the form of TGV [10]. We hope that our library will encourage researchers to use it and contribute towards making it a comprehensive and complete resource for geometric computer vision with deep learning, in the same way that the popular rnn package [31] has fostered research on recurrent neural networks in the community. Upon publication, we will release the full source code and sample application examples at https://github.com/ankurhanda/gvnn.