1 Introduction

Deep learning is enjoying tremendous success in almost all research areas of computer vision. However, most deep models are trained with man-made supervision signals, which are all too often prepared through a tedious and expensive manual labeling process. Many researchers therefore believe that a more promising paradigm is given by unsupervised learning, as most of the readily available data simply comes in unlabeled form. This work contributes to this direction by providing an unsupervised solution to the ubiquitous vision problem of image matching. Specifically, relying on only natural video sequences, the present model learns to establish 2D-2D correspondences across consecutive frames.

Fig. 1.

We train a deep convolutional network for frame interpolation, which can be done without manual supervision by exploiting the temporal coherence that is naturally contained in real-world video sequences. The learned CNN is then used to compute a sensitivity map for each output pixel. This sensitivity map, i.e. the gradients w.r.t. the input, indicates how much each input pixel influences a particular output pixel. The two input pixels (one per input frame) that have the maximum influence are considered as an image correspondence (i.e. a match). Though indirect, the resulting model learns how to perform dense correspondence matching by simply watching video.

Our key insight lies in the understanding that frame interpolation implicitly solves for dense correspondences between the input image pair. It is well known that dense matching can be regarded as a sub-problem of frame interpolation, as the interpolation could be immediately generated by correspondence-based image warping once dense inter-frame matches are available [3]. It then comes as no surprise that if we were able to train a deep neural network for frame interpolation, its application would implicitly also generate knowledge about dense image correspondences. Retrieving this knowledge is known as analysis by synthesis [42], a paradigm in which learning is described as the acquisition of a measurement synthesizing model, and inference of generating parameters as model inversion once correct synthesis is achieved. In our context, synthesis simply refers to frame interpolation. We then, for the analysis part, show that the correspondences can be recovered from the network through gradient back-propagation, which produces sensitivity maps for each interpolated pixel. The procedure is summarized in Fig. 1, explaining how the reciprocal mapping between frame interpolation and dense correspondences is encoded in the forward and backward propagation through one and the same network architecture. We call our approach MIND, which stands for Matching by INverting a Deep neural network.

The key benefit of MIND lies in the fact that the deep convolutional network for frame interpolation can be trained from ordinary video sequences without any man-made ground truth signals. The training data in our case is given by triplets of images, each one consisting of two input images and one output image that represents the ground-truth interpolated frame. A correct example of a ground truth output image is an image that—when inserted in between the input pair of images—forms a temporally coherent sequence of frames. Such temporal coherence is naturally contained in regular video sequences, which allows us to simply use triplets of sequential images from almost arbitrary video streams for training our network. The first and the third frame of each triplet are used as inputs to the network, and the second frame as the ground truth interpolated frame. Most importantly, since the inversion of our network returns frame-to-frame correspondences, it therefore learns how to do image matching without any requirement for manually designed models or expensive ground truth correspondences. In other words, the presented approach learns image matching by simply “watching videos”.

The paper is organized as follows. Section 2 reviews relevant prior work. Section 3 explains the present analysis-by-synthesis approach, including both the analysis part of how MIND works and the synthesis part of the deep convolutional architecture for frame interpolation. Section 4 demonstrates the surprising performance of the present purely unsupervised learning approach, which is comparable to that of several traditional, empirically designed methods. Section 5 finally discusses our contribution and provides an outlook onto future work.

2 Related Work

Deep learning meets image matching: Image matching is a classical problem in computer vision. Here we limit the discussion to recent works that address image matching through learning based approaches. Roughly speaking, there exist two lines of research for this topic: the first one consists of making use of features or representations learned by deep neural networks, which are either originally trained for other tasks such as object recognition [13, 26], or specially designed and trained for the purpose of image matching [1, 21, 33]. The second major line of research employs deep neural networks to compute the similarity between image patches [30, 43, 44]. In contrast to our work, the cited contributions mainly address sub-modules of image matching (feature extraction or matching cost computation), rather than providing end-to-end solutions. An exception is given by FlowNet [14], which presents an interesting deep learning based approach for dense optical flow computation. It does however depend on ground truth flow for training the network.

Temporal coherence learning: Unsupervised learning is a broad topic in the field of machine learning. Our discussion here focuses on works that exploit temporal coherence in natural videos, sometimes also called temporal coherence learning [4, 29, 41]. As a recent representative work, Wang et al. [39] exploit temporal coherence by visual tracking in videos, and report that the learned representation achieves competitive performance compared to some supervised alternatives. While temporal coherence learning mostly aims at learning features or representations, some recent works on reconstructing and predicting video frames in an unsupervised setting [31] are closely related to our work as well. Srivastava et al. [35] use an encoder LSTM to map input sequences into a fixed length representation, and use the latter for reconstructing the input or even predicting future frames. Goroshin et al. [17] consider videos as one-dimensional, time-parametrized trajectories embedded in a low dimensional manifold. They train deep feature hierarchies that linearize the transformations observed in natural video sequences for the purpose of frame prediction. Though related to our work, these works are not aiming at image matching. It will be interesting to apply our concept of matching by inverting to the above models for temporal coherence learning.

Inversion of artificial neural networks: Note that inverting a learned network is traditionally defined as reconstructing the input from the output of an artificial neural network [22]. Mahendran et al. [27] and Dosovitskiy et al. [10] apply this concept to understand what information is preserved by a network. In our context, inverting a network means back-propagation through a learned network in order to obtain the gradient map with respect to the input signals. Interestingly, the idea has already been introduced in the work of Simonyan et al. [34], emphasizing that the retrieved sensitivity maps may serve to identify image-specific class saliency. Similarly, Bach et al. [2] employ gradient maps as a measure for the contribution of single pixels to nonlinear classifiers, thus helping to explain how decisions are made.

3 Methodology

This section describes the analysis-by-synthesis approach for dense image matching: we first explain the analysis part, i.e. how to obtain correspondences given the trained neural network and the interpolated image. For the synthesis part, we then describe the detailed architecture of the deep convolutional network designed for frame interpolation.

3.1 Matching by Inverting a Deep Neural Network

Assuming that we have a well-trained deep neural network for frame interpolation at hand, the core technical question behind our work is how to recover the correspondences between the input pair of images from it. As explained previously, dense correspondence matching may be regarded as a sub-problem of frame interpolation, which is why we should be able to trace back the matches starting from the interpolated frame generated during the forward-propagation through the trained network. Our task then consists of back-tracking each pixel in the output image to exactly one pixel in each of the two input images. Note that this back-tracking does not mean reconstructing the input images from the output one. Instead, we only need to find the pixels in each input image which have the maximum influence on each pixel of the output image.

We perform back-tracking by applying a technique similar to the one adopted by Simonyan et al. [34]. For each pixel in the output image, we compute the gradient of its value with respect to each input pixel, thus telling us how strongly it is influenced by individual pixels at the input. The gradient is computed by back-propagation, and leads to sensitivity or influence maps at the input of the network.

From a more formal perspective, our approach may be explained as follows. Let \(\mathbf {I}_{2} = \mathcal {F}(\mathbf {I}_{1},\mathbf {I}_{3})\) denote a non-linear function (i.e. the trained deep neural network) that describes the mapping from two input images \(\mathbf {I}_{1}\) and \(\mathbf {I}_{3}\) to an interpolated image \(\mathbf {I}_{2}\) lying approximately at the “center” of the input frames. Thinking of \(\mathcal {F}\) as a vectorial mapping, it can be split up into \(h\,\times \, w\) non-linear sub-functions, each one producing the corresponding pixel in the output image

$$\begin{aligned} \mathcal {F}(\mathbf {I}_{1},\mathbf {I}_{3}) = \left( \begin{matrix} f^{11}(\mathbf {I}_{1},\mathbf {I}_{3}) & \ldots & f^{1w}(\mathbf {I}_{1},\mathbf {I}_{3}) \\ \vdots & & \vdots \\ f^{h1}(\mathbf {I}_{1},\mathbf {I}_{3}) & \ldots & f^{hw}(\mathbf {I}_{1},\mathbf {I}_{3}) \end{matrix} \right) _{h\times w}. \end{aligned}$$
(1)

In order to produce the sensitivity maps, we apply back-propagation to compute the Jacobian matrix with respect to each input image individually. The Jacobian with respect to the first image is given by

$$\begin{aligned} \frac{\partial \mathcal {F}}{\partial \mathbf {I}_{1}} = \left( \begin{matrix} \frac{\partial f^{11}}{\partial \mathbf {I}_{1}} & \ldots & \frac{\partial f^{1w}}{\partial \mathbf {I}_{1}} \\ \vdots & & \vdots \\ \frac{\partial f^{h1}}{\partial \mathbf {I}_{1}} & \ldots & \frac{\partial f^{hw}}{\partial \mathbf {I}_{1}} \end{matrix} \right) _{h\times w}, \end{aligned}$$
(2)

illustrating that this derivative results in one \(h\times w\) matrix for each one of the \(h\times w\) pixels at the output. The Jacobian with respect to \(\mathbf {I}_{3}\) is given in a similar way. Let us define the absolute gradients of the output pixel (i, j) with respect to each one of the input images, evaluated for the concrete inputs \(\mathbf {\hat{I}}_{1}\) and \(\mathbf {\hat{I}}_{3}\). They are given by

$$\begin{aligned} \mathcal {G}^{i,j}_{\mathbf {I}_{1}}(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3}) = {\text {abs}}\left( \left. \frac{\partial f^{ij}}{\partial \mathbf {I}_{1}} \right| _{(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3})} \right) , \qquad \mathcal {G}^{i,j}_{\mathbf {I}_{3}}(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3}) = {\text {abs}}\left( \left. \frac{\partial f^{ij}}{\partial \mathbf {I}_{3}} \right| _{(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3})} \right) , \end{aligned}$$
(3)

where \({\text {abs}}\) replaces each entry of a matrix by its absolute value. The gradient maps produced in this way notably represent the sought sensitivity or influence maps that now serve to derive the coordinates of each correspondence. We extract the most responsible point in each gradient map and connect those two points in order to return the correspondence.

In the spirit of unsupervised learning, we opted for the simplest possible choice, namely taking the coordinates of the maximum entry in \(\mathcal {G}^{i,j}_{\mathbf {I_{1}}}(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3})\) and \( \mathcal {G}^{i,j}_{\mathbf {I_{3}}}(\mathbf {\hat{I}}_{1},\mathbf {\hat{I}}_{3})\), respectively. Let us denote these points with \(c^{ij}_{\mathbf {I}_{1}}\) and \(c^{ij}_{\mathbf {I}_{3}}\). By computing the two gradient maps for each point in the output image and extracting each time the most responsible point, we thus obtain the following two lists of points

$$\begin{aligned} \mathcal {C}_{\mathbf {I}_{1}} = \left\{ c^{ij}_{\mathbf {I}_{1}} \right\} \quad \text {and} \quad \mathcal {C}_{\mathbf {I}_{3}} = \left\{ c^{ij}_{\mathbf {I}_{3}} \right\} , \quad i=1,\ldots ,h,\; j=1,\ldots ,w. \end{aligned}$$
(4)

The set of correspondences \(\mathcal {S}\) is then given by combining same-index elements from \(\mathcal {C}_{\mathbf {I}_{1}}\) and \(\mathcal {C}_{\mathbf {I}_{3}}\), eventually resulting in

$$\begin{aligned} \mathcal {S} &= \left\{ s^{ij} \right\} , \quad i=1,\ldots ,h,\; j=1,\ldots ,w \nonumber \\ &= \left\{ \left\{ c^{11}_{\mathbf {I}_{1}}, c^{11}_{\mathbf {I}_{3}} \right\} , \ldots , \left\{ c^{hw}_{\mathbf {I}_{1}}, c^{hw}_{\mathbf {I}_{3}} \right\} \right\} . \end{aligned}$$
(5)
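To make the back-tracking procedure of Eqs. (2)–(5) concrete, the following sketch shows how a single correspondence could be recovered with an automatic-differentiation framework. It is a minimal illustration only: we use PyTorch-style autograd for brevity (our implementation uses Caffe, cf. Sect. 4.1), the function name match_pixel and the tensor layout are hypothetical, and summing the absolute gradients over the colour channels is an assumption of this sketch rather than part of the formulation above.

```python
import torch

def match_pixel(net, I1, I3, i, j):
    """Recover the correspondence of output pixel (i, j) by inverting the
    interpolation network: back-propagate the pixel value to both inputs
    and take the location of maximum absolute gradient in each input.
    `net` is assumed to map two (1, 3, h, w) tensors to one (1, 3, h, w) tensor."""
    I1 = I1.detach().clone().requires_grad_(True)
    I3 = I3.detach().clone().requires_grad_(True)
    I2 = net(I1, I3)                                 # forward pass: interpolation
    out = I2[0, :, i, j].sum()                       # scalar output value at (i, j)
    g1, g3 = torch.autograd.grad(out, (I1, I3))      # backward pass, cf. Eq. (2)
    G1 = g1[0].abs().sum(dim=0)                      # sensitivity maps, cf. Eq. (3)
    G3 = g3[0].abs().sum(dim=0)
    c1 = divmod(int(torch.argmax(G1)), G1.shape[1])  # most responsible pixel in I1
    c3 = divmod(int(torch.argmax(G3)), G3.shape[1])  # most responsible pixel in I3
    return c1, c3                                    # one correspondence, cf. Eq. (5)
```

Calling this routine for every output pixel yields the two lists of Eq. (4) and hence the correspondence set \(\mathcal {S}\).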

3.2 Deep Neural Network for Frame Interpolation

The architecture of our frame-interpolation network is inspired by FlowNetSimple as presented in Fischer et al. [14]. As illustrated in Fig. 2, it consists of a Convolutional Part and a Deconvolutional Part. The two parts serve as “encoder” and “decoder” respectively, similar to the auto-encoder architecture presented by Hinton and Salakhutdinov [20]. The basic block within the Convolutional Part—denoted Convolution Block—follows the common pattern of the convolutional neural network architecture:

[Figure a: layer pattern of the Convolution Block]

The Parametric Rectified Linear Unit [19] is adopted in our work. Following the suggestions from VGG-Net [9], we set the size of the receptive field of all convolution filters to three, with a stride and a padding of one, and repeat the convolution layer three times to better model the non-linearity.

The Deconvolution Part consists of Deconvolution Blocks, each one including a convolution transpose layer [38] and two convolution layers. The convolution transpose layer has a receptive field of four, a stride of two, and a padding of one. The pattern of the Deconvolution Block is as follows:

[Figure b: layer pattern of the Deconvolution Block]

In order to maintain fine-grained image details in the interpolated frame, we make a copy of the output features produced by Convolution Blocks 2, 3, and 4, and concatenate them as an additional input to Deconvolution Blocks 4, 3, and 2, respectively. This concept is illustrated by the side arrows in Fig. 2, and similar ideas have already been used in prior work [11, 14]. Recent works [18, 36] indicate that the ‘side arrows’ may also help to better train the deep network.

Note that our network is fully convolutional, which allows us to feed it with images of different resolutions. This is an important advantage, as different datasets may use different height-to-width ratios. The output blob size for each block in our network is listed in Table 1.
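As a rough illustration of the two block types described above, the following sketch assembles them from standard layers. It should be read as an approximation under stated assumptions: the exact layer patterns are given in the block figures above, and the channel counts, the activation placement inside the Deconvolution Block, and the downsampling between Convolution Blocks (omitted here) are not specified by this sketch.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Convolution Block: three 3x3 convolutions (stride 1, padding 1),
    # each followed by a PReLU.  Downsampling between blocks is omitted here.
    layers = []
    for k in range(3):
        layers += [nn.Conv2d(in_ch if k == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.PReLU()]
    return nn.Sequential(*layers)

def deconv_block(in_ch, out_ch):
    # Deconvolution Block: a 4x4 convolution transpose (stride 2, padding 1)
    # followed by two 3x3 convolutions; the activation placement is an assumption.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.PReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.PReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.PReLU(),
    )
```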

Fig. 2.

Architecture of our network. The network takes two RGB images as input and produces the interpolated RGB image. Please note that Dconv Block 4 takes the outputs from both Conv Block 2 and Dconv Block 5 as input. Dconv Block 3 and Dconv Block 2 have a similar input configuration. (Color figure online)

Table 1. The table lists the output blob size of each block in our network. Note that we stack two RGB images into one input blob, and thus the depth is 6. The output of the network is an RGB image and thus the depth equals 3. The indicated widths are for the network trained on KITTI. The ones for the Sintel data are easily obtained, the only difference being that the input images are scaled to 256 \(\times \) 128 rather than 384 \(\times \) 128.

4 Experiments

In this section, we first explain the implementation details behind MIND, such as the training data and the loss function. We then present qualitative examples as a proof of concept for MIND, followed by a discussion of the generalization ability of the trained CNN. We finally evaluate MIND in terms of quantitative matching performance and compare it to traditional image matching methods.

4.1 Implementation Details

Training data: Quantity and quality of training data are crucial for training a deep neural network. However, our case is particularly easy as we can simply use huge amounts of real-world video. In this work, we focus on training with the KITTI RAW videos [15] and Sintel videos, and show that the resulting learned network performs reasonably well. The network is first trained with the KITTI RAW video sequences, which are captured by driving around the city of Karlsruhe, through rural areas, and over highways. The dataset contains 56 image sequences with a total of 16,951 frames. For each sequence, we take every three consecutive frames (in both forward and backward direction) as a training triplet, where the first and the third image serve as inputs to the network and the second image as the corresponding output. These images are then augmented by vertical flipping, horizontal flipping, and a combination of both. The total number of sample triplets is 133,921. We then fine-tune the network on examples selected from the original Sintel movie. We manually collected 63 video clips with a total of 5,670 frames from the movie. After grouping and data augmentation, we finally obtain 44,352 sample triplets. Note that, compared to the KITTI sequences, which are recorded at relatively uniform velocity, the Sintel sequences represent more difficult training examples in the context of our work, as they contain a lot of fast and artificially rendered motion captured at a frame rate of only 24 fps. A significant portion of the Sintel samples therefore does not contain the required temporal coherence. We will discuss this issue further in Sect. 4.2.
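The construction of training triplets from a video sequence can be sketched as follows. The function name make_triplets is hypothetical, and the flipping-based augmentation applied afterwards is not shown.

```python
def make_triplets(frames):
    """Form training triplets (input 1, target, input 2) from a list of
    consecutive video frames, in both forward and backward direction."""
    triplets = []
    for t in range(len(frames) - 2):
        a, b, c = frames[t], frames[t + 1], frames[t + 2]
        triplets.append((a, b, c))   # forward:  interpolate b from (a, c)
        triplets.append((c, b, a))   # backward: same target, inputs reversed
    return triplets
```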

Loss function: Several previous works [17, 39] mention that minimizing the L2 loss between the output frame and the training example may lead to unrealistic and blurry predictions. We have not been able to confirm this throughout our experiments, but found that the Charbonnier loss \(\rho (x)=\sqrt{(x^2 + \epsilon ^2)}\) commonly employed for robust optical flow computation [37] leads to an improvement over the L2 loss. We employ it to train our network, with \(\epsilon \) set to 0.1.
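For completeness, a minimal sketch of the Charbonnier loss as used here, written in PyTorch for brevity; averaging over pixels (rather than summing) is an assumption of this sketch.

```python
import torch

def charbonnier_loss(pred, target, eps=0.1):
    # Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), applied per pixel
    # and averaged; eps = 0.1 as in our training setup.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```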

Training details: The training is performed using Caffe [23] on a machine with two K40c GPUs. The weights of the network are initialized by Xavier’s approach [16] and optimized by the Adam solver [24] with a fixed momentum of 0.9. The initial learning rate is set to 1e-3 and then manually decreased once the loss stops improving. For training on the KITTI RAW data, the images are scaled to 384 \(\times \) 128. For training on the Sintel dataset, the images are scaled to 256 \(\times \) 128. The batch size is 16. We trained on KITTI RAW from scratch for about 20 epochs, and then fine-tuned on the Sintel movie images for 15 epochs. We did not observe over-fitting during training, and terminated the training after 5 days.
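The training setup described above can be summarized by the following hypothetical training-loop sketch (PyTorch-style for brevity; our implementation uses Caffe). Xavier initialization and the manual learning-rate decay are noted in the comments but not implemented; the data loader and its batching are assumptions.

```python
import torch

def train(net, loader, epochs=20, lr=1e-3):
    # Adam with beta1 = 0.9 (the fixed momentum mentioned above) and an
    # initial learning rate of 1e-3; batch size 16 is assumed to be handled
    # by `loader`.  Weights are assumed to be Xavier-initialized beforehand;
    # the learning rate is decreased manually once the loss stops improving.
    opt = torch.optim.Adam(net.parameters(), lr=lr, betas=(0.9, 0.999))
    for epoch in range(epochs):
        for I1, I2_gt, I3 in loader:          # training triplets
            pred = net(I1, I3)                # interpolate the middle frame
            loss = torch.sqrt((pred - I2_gt) ** 2 + 0.1 ** 2).mean()  # Charbonnier, eps = 0.1
            opt.zero_grad()
            loss.backward()
            opt.step()
```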

Execution time: MIND can be applied to different scenarios (e.g. sparse or dense matching). We focus here on semi-dense image matching in order to obtain results comparable with other methods. We compute the correspondences across the input images for each corner of a predefined raster grid with a spacing of 4 pixels in the interpolated image. Note that MIND currently depends on a large amount of computational resources, as it performs back-propagation through the entire network for every pixel that needs to be matched. For an image of size 384 \(\times \) 128, each forward pass through our network takes 40 ms on a PC with a K40c GPU, and each backward pass takes 158 ms. For each image pair, we need to perform one forward pass to first obtain the interpolation. We then need to perform \(384\times 128/4/4 = 3072\) backward passes to find the correspondences, resulting in a total of about 486 s (\(\sim \)8 min).
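The semi-dense matching procedure thus amounts to one forward pass followed by one backward pass per grid point. A hypothetical driver built on the match_pixel idea from Sect. 3.1 could look as follows; retaining the computation graph allows the single forward pass to be reused for all 3072 backward passes.

```python
import torch

def match_grid(net, I1, I3, step=4):
    """Semi-dense matching on a raster grid with `step`-pixel spacing:
    one forward pass, then one backward pass per grid point."""
    I1 = I1.detach().clone().requires_grad_(True)
    I3 = I3.detach().clone().requires_grad_(True)
    I2 = net(I1, I3)                                   # single forward pass
    h, w = I2.shape[-2:]
    matches = []
    for i in range(0, h, step):
        for j in range(0, w, step):
            out = I2[0, :, i, j].sum()
            g1, g3 = torch.autograd.grad(out, (I1, I3), retain_graph=True)
            G1, G3 = g1[0].abs().sum(dim=0), g3[0].abs().sum(dim=0)
            matches.append((divmod(int(torch.argmax(G1)), w),
                            divmod(int(torch.argmax(G3)), w)))
    return matches
```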

4.2 Qualitative Examples for Interpolation and Matching

We present here visual examples as a proof of concept for how the present approach works on both tasks, frame interpolation and image matching.

Fig. 3.

Examples of frame interpolation (best viewed in colour). From left to right: examples on KITTI, Sintel, the ETH Multi-Person Tracking dataset [12], and the Bonn Benchmark on Tracking [25], respectively. In each column, the first image is an overlay of the two input frames. The second one is the interpolated image obtained by our network. For the first example, we use the network trained on KITTI itself. For all others, we use the network fine-tuned on Sintel data. (Color figure online)

Examples of frame interpolation: We show examples of frame interpolation in Fig. 3. The first two columns show examples on KITTI and Sintel images, taken from the validation datasets originally collected for the purpose of monitoring the network training process. It can be seen that the trained CNNs recover the motion correctly for both the KITTI and Sintel image pairs. It can furthermore be noticed that some fine-grained details are not preserved well in either example, despite the special considerations in the architecture of the convolutional network (cf. Sect. 3.2). Nevertheless, we would like to emphasize that the goal of the present work is not to provide a state-of-the-art frame-interpolation algorithm. As we will see, the preservation of fine-grained image details is in fact not a necessary condition for good-quality image matching.

Examples of image matching: Here we present examples to demonstrate how MIND obtains correspondences given the trained CNNs for frame interpolation. Examples taken from the KITTI and Sintel videos are shown in Fig. 4. By computing the gradient of manually marked pixels in the interpolated image, MIND successfully obtains correct correspondences between the two input images. It can be seen that correct correspondences are obtained even in some fast-motion areas where fine-grained image details are missed, e.g. the area of the character’s shaking hand in the Sintel example.

We further show one failure case taken from the Sintel images. In Fig. 5, it can be observed that the interpolation fails, as the motion of the small dragon and the character’s hand has not been recovered correctly. It then comes as no surprise that MIND fails to extract correct matches for almost all of the selected points. However, it is worth noting that match No. 4 has better quality than the others, for which the corresponding gradient maps are less distinctive. The matching score/confidence returned by MIND is inspired by this behavior and is defined as the ratio between the maximum gradient intensity and the mean gradient intensity within a small area around the maximal gradient location.
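A minimal sketch of this matching score, assuming a gradient map G stored as a 2D tensor and a 20 \(\times \) 20 window as in Fig. 4; the border clamping and the small constant in the denominator are our own additions.

```python
def matching_score(G, p, half=10):
    # Ratio between the maximum gradient intensity G[p] and the mean gradient
    # intensity within a (2*half) x (2*half) window around p = (row, col).
    r, c = p
    h, w = G.shape
    window = G[max(0, r - half):min(h, r + half),
               max(0, c - half):min(w, c + half)]
    return float(G[r, c] / (window.mean() + 1e-12))
```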

As illustrated in Sect. 4.4, the general performance of MIND, especially on KITTI images, is good. The failure case in Fig. 5 represents an extreme case from the Sintel sequences, dominated by fast and highly non-rigid motion in the scene.

Fig. 4.

Two matching examples for image pairs taken from the KITTI RAW video and the Sintel movie clip (best viewed in colour). For each example, the corresponding row of images shows input image 1, the interpolated image, and then input image 2 (from left to right). The red points mark five sample correspondences. The two rows below each example show the gradient/saliency maps for each match (from left to right) in each input image (maps for input image 1 on top, and maps for input image 2 at the bottom). The figures also indicate the coordinates of the maximal gradient location (P) along with the corresponding matching score (S). The matching score is defined as the ratio between the maximum gradient intensity and the mean gradient intensity within a 20 \(\times \) 20 area around P. (Color figure online)

4.3 Generalization Ability of the Trained CNN

We first demonstrate the generalization ability of the trained CNN by applying it to images taken from the ETH Multi-Person Tracking dataset [12] and the Bonn Benchmark on Tracking [25], which have not been used for either training or fine-tuning. The results are shown in Fig. 3, from which we can see that the trained CNN again recovers the motion correctly. This provides evidence that, by “watching videos”, the present CNN indeed learns the ability to interpolate frames and match images, rather than merely “remembering” KITTI- or Sintel-like images.

Fig. 5.

Failure example of MIND for an image pair taken from the Sintel movie clip (best viewed in colour). The gradient/saliency maps (from left to right) correspond to the matches labelled 1, 2, ..., 5, respectively. (Color figure online)

Fig. 6.

Examples of MIND on DICOM images. There are two examples shown in different rows. For each example, the columns from left to right show the overlay of the input image-pair, the 1st input image, the interpolation returned by the CNN, and the 2nd input image, respectively. The red points in columns 2, 3 and 4 indicate the matches obtained by MIND. (Color figure online)

The generalization ability is further illustrated by applying MIND to DICOM images of coronary angiograms. As a numerical evaluation of the generalization ability, we compare the CNN-based interpolation results to a traditional warping-based interpolation method [3] driven by state-of-the-art optical flow, i.e. DeepFlow [40], and to a recently proposed phase-based interpolation method [28]. The comparison is similar to the “Ground truth comparisons” outlined in [28]. The averaged SSD (sum of squared distances) for each method is 6.00, 6.23 and 5.55, respectively, suggesting that the trained CNN performs frame interpolation quantitatively well. Two examples are shown in Fig. 6. It can be seen that these images are substantially different from natural ones. Though failing to preserve perfect image details, the CNN, which is trained on natural images, performs impressively well on the DICOM images. The good generalization ability of the CNN is underlined by the results on both frame interpolation and image matching.

4.4 Quantitative Performance of Image Matching

We compare the matches produced by MIND against those of several empirically designed methods: the classical Kanade–Lucas–Tomasi (KLT) feature tracker [5], HoG descriptor matching [7] (which is widely employed to boost dense optical flow computation), and the more recent DeepMatching approach [40], which relies on a multilayer convolutional architecture and achieves state-of-the-art performance. As observed in [40], comparing different matching algorithms is delicate because they usually produce different numbers of matches for different parts of the image. For the sake of a fair comparison, we adjust the parameters of each algorithm to make them produce as many matches as possible, with a distribution as homogeneous as possible across the input images. For DeepMatching, we use the default parameters. For MIND, we extract correspondences for each corner of a uniform grid with a spacing of 4 pixels. For KLT, we set the minEigThreshold to 1e-9 to generate as many matches as possible. For HoG, we again set the pixel sampling grid width to 4. We then sort the matches according to suitable metrics and select the same number of “best” matches for each algorithm. In this way, the four algorithms produce the same number of matches with similar coverage over each input image.

Table 2. Matching performance on the KITTI 2012 flow training set. DeepM denotes DeepMatching. Metrics: Average Point Error (APE) (the lower the better), and Accuracy@T (the higher the better). Bold numbers indicate best performance, underlined numbers 2nd best.
Table 3. Matching performance on the MPI-Sintel training set (Final pass). DeepM denotes DeepMatching. Metrics: Average Point Error (APE) (the lower the better), and Accuracy@T (the higher the better). Bold numbers indicate best performance, underlined numbers 2nd best.

The comparisons are performed on both the KITTI [15] and MPI-Sintel [8] training sets, where ground truth correspondences can be extracted from the available ground truth flow fields. We perform all of our experiments at the same image resolution as the one used by our network. On KITTI, the images are scaled to 384 \(\times \) 128, and for MPI-Sintel, to 256 \(\times \) 128. We use the network trained on the KITTI RAW sequences for the matching experiment on the KITTI Flow 2012 training set. We then use the network fine-tuned on Sintel movie clips for the experiments on the MPI-Sintel Flow training set. The four algorithms are evaluated in terms of the Average Point Error (APE) and the Accuracy@T. The latter is defined as the proportion of “correct” matches from the first image with respect to the total number of matches [32]. A match is considered correct if its pixel match in the second image is closer than T pixels to the ground truth.
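The two evaluation metrics can be computed as in the following sketch, where pred and gt are hypothetical (N, 2) arrays holding, for each of the N evaluated matches, the predicted and ground-truth point coordinates in the second image.

```python
import numpy as np

def ape_and_accuracy(pred, gt, T=10):
    # Average Point Error and Accuracy@T as defined above.
    err = np.linalg.norm(pred - gt, axis=1)   # Euclidean end-point error per match
    return float(err.mean()), float((err < T).mean())
```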

As can be observed in Tables 2 and 3, DeepMatching produces matches with the highest quality in terms of all metrics, on both the MPI-Sintel and KITTI sets. Notably, MIND performs very close to DeepMatching on KITTI and outperforms KLT tracking and HoG matching by a considerable margin in terms of Accuracy@10 and Accuracy@20. It is surprising to see that MIND, an unsupervised learning based approach, works so well. The performance on MPI-Sintel, however, drops somewhat due to the difficulty of the artificial motion contained in these sequences. Though the APE measure indicates better performance than HoG and KLT, it is only safe to conclude that MIND remains competitive in terms of overall performance on MPI-Sintel, which is examined further in the next section.

4.5 Ability to Initialize Optical Flow Computation

To further understand the matching quality produced by MIND, we replace the DeepMatching part of DeepFlow [40] with MIND to see whether MIND matches are able to boost optical flow performance in a similar way as DeepMatching, HoG, or KLT matches. Similar to the evaluation in [40], we feed DeepFlow with the matches obtained by each matching method in the previous section. The parameters (e.g. the matching weight) of DeepFlow are tuned accordingly to make the best use of the pre-computed matches. Note that we scale down the input images to 384 \(\times \) 128 for KITTI and 256 \(\times \) 128 for MPI-Sintel. We then upsample the obtained flow field to the original resolution by bilinear interpolation in order to compare results at full resolution.

The results on the KITTI Flow 2012 training set are reported in Table 4. It can be seen that using the matches obtained by any of the four algorithms improves the flow performance compared to the case where no matches are used for initialization. Notably, MIND again comes closest to DeepMatching in terms of all metrics, thus underlining the good matching quality obtained by MIND (better than KLT and HoG, and comparable to DeepMatching). Table 5 shows the results obtained on the MPI-Sintel training dataset. As on KITTI, the pre-computed matches indeed help to improve the optical flow results, especially in terms of the APE and s40+ metrics, while flow initialized by DeepMatching remains best overall. The results initialized from MIND matches, however, rank behind those initialized by HoG or KLT matches, which again suggests the importance of temporal coherence for training our network. The reason why KLT works better than in the evaluation presented in [40] is that we run KLT on the downscaled images rather than the full-resolution ones, which helps KLT to better deal with large displacements.

From the quantitative evaluations of matching and flow performance, we conclude that MIND works well on the KITTI Flow training set and achieves performance comparable to the state of the art defined by DeepMatching. On the MPI-Sintel Flow training set, MIND still obtains performance comparable to the traditional HoG and KLT methods. The latter should still be interpreted as a good result, especially considering that the quality of the training data extracted from the artificial and perhaps unrealistic Sintel images is limited. A closer look at the training data collected from the Sintel video indeed suggests that the assumption of temporal coherence does not hold well there.

Table 4. Flow performance on KITTI 2012 flow training set (non-occluded areas). out-x refers to the percentage of pixels where flow estimation has an error above x pixels.
Table 5. Flow performance on MPI-Sintel flow training set. s0-10 is the APE for pixels with motions between 0 and 10 pixels. Similarly for s10-40 and s40+.

5 Conclusions

We have shown that the present work enables artificial neural networks to learn accurate image matching from ordinary videos alone. Though MIND currently does not provide the computational efficiency required for real-world applications, it promises great potential for more natural solutions to further related problems. It is also our hope that the present work helps to promote the concept of analysis by synthesis towards broad acceptance. Our future work will focus on making the present approach more applicable in real-world scenarios, in terms of both computational efficiency and reliability.