Learning Contextual Dependencies for Optical Flow with Recurrent Neural Networks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10114)

Abstract

Pixel-level prediction tasks, such as optical flow estimation, play an important role in computer vision. Recent approaches have attempted to use the feature learning capability of Convolutional Neural Networks (CNNs) to tackle dense per-pixel predictions. However, CNNs have not been as successful in optical flow estimation as they are in many other vision tasks, such as image classification and object detection. It is challenging to adapt CNNs designed for high-level vision tasks to handle pixel-level predictions. First, CNNs do not have a mechanism to explicitly model contextual dependencies among image units. Second, the convolutional filters and pooling operations result in reduced feature maps and hence produce coarse outputs when upsampled to the original resolution. These two aspects limit the ability of CNNs to delineate object details, which often results in inconsistent predictions. In this paper, we propose a recurrent neural network to alleviate this issue. Specifically, a row convolutional long short-term memory (RC-LSTM) network is introduced to model contextual dependencies of local image features. This recurrent network can be integrated with CNNs, giving rise to an end-to-end trainable network. The experimental results demonstrate that our model can learn context-aware features for optical flow estimation and achieve competitive accuracy with the state-of-the-art algorithms at a frame rate of 5 to 10 fps.

1 Introduction

Convolutional Neural Networks (CNNs) [1] have brought a revolution to the computer vision community with their powerful feature learning capability based on large-scale datasets. They have been immensely successful in high-level computer vision tasks, such as image classification [2, 3] and object detection [4, 5]. CNNs are good at extracting abstract image features by using convolution and pooling layers to progressively shrink the feature maps, which produces translation-invariant local features and allows the aggregation of information over large areas of the input images.

Recently, researchers have been attempting to employ CNNs to tackle pixel-level prediction tasks, such as semantic segmentation [6, 7] and optical flow estimation [8, 9]. These tasks differ from the previous high-level tasks in that they not only require precise single pixel prediction, but also require semantically meaningful and contextually consistent predictions among a set of pixels within objects. Optical flow estimation is even more difficult because it requires finding the x-y flow field between a pair of images, which involves a very large continuous labeling space.

There are significant challenges in adapting CNNs to handle dense per-pixel optical flow estimation. First, CNNs do not have a mechanism to explicitly model contextual dependencies among image pixels. Although the local features learned with CNNs play an important role in classifying individual pixels, it is similarly important to consider factors such as appearance and spatial consistency while assigning labels in order to obtain precise and consistent results. Second, the convolution and pooling operations result in reduced feature maps, and hence produce coarse outputs when upsampled to the original resolution to produce pixel-level labels. These two aspects limit the ability of CNNs to delineate object details, and can result in blob-like shapes, non-sharp borders, and inconsistent labeling within objects.

In this paper, Recurrent Neural Networks (RNNs) are incorporated to alleviate this problem. RNNs have achieved great success in modeling temporal dependencies for sequential data, and have been widely used in natural language processing [10], image captioning [11], etc. Long short-term memory (LSTM) [12] is a special RNN structure that is stable and powerful for modeling long-range dependencies without suffering from the vanishing gradient problem of vanilla RNN models [13]. We propose a row convolutional LSTM (RC-LSTM), which has convolution operators in both the input-to-state and state-to-state transitions to handle structured inputs. We treat an image as a sequence of rows and use our RC-LSTMs to explicitly model the spatial dependencies among the rows of pixels, which encodes the neighborhood context into the local image representations.

The proposed RC-LSTM structure can be integrated with CNNs to enhance the learned feature representations and produce context-aware features. In our experiments, we integrate our RC-LSTM with FlowNet [8], the state-of-the-art CNN-based model for optical flow estimation, to form an end-to-end trainable network. We test the integrated network on several datasets, and the experimental results demonstrate that our RC-LSTM structure can enhance the CNN features and produce more accurate and consistent optical flow maps. Our model achieves accuracy competitive with the state-of-the-art methods and has the best performance among the real-time ones.

2 Related Work

2.1 Optical Flow

Optical flow estimation has been one of the key problems in computer vision. Starting from the original approaches of Horn and Schunck [14] as well as Lucas and Kanade [15], many improvements have been introduced to deal with the shortcomings of previous models. Most of those methods formulate optical flow estimation as an energy minimization problem, which is usually solved in a coarse-to-fine scheme [16, 17, 18]. Due to the complexity of the energy minimization, such methods suffer from local minima and may not be able to estimate large displacements accurately. The methods in [19, 20] integrate descriptor matching into a variational approach to deal with the large displacement problem. The method in [21] emphasizes sparse matching and uses edge-preserving interpolation to obtain dense flow fields, which achieves state-of-the-art optical flow estimation performance.

2.2 CNNs for Pixel-Level Prediction

CNNs have become the method of choice and achieved state-of-the-art performance in many high-level vision tasks, e.g. image classification. Recently, researchers have started attempting to use CNNs for pixel-level labeling problems, including semantic segmentation [6, 7, 22, 23], depth prediction [24], and optical flow estimation [8]. Traditional CNNs require fixed-size inputs, and their fully connected layers transform the feature maps into vector representations that are difficult to reconstruct into 2D predictions. Therefore, it is not straightforward to adapt CNNs designed to produce a single high-level label per image to the task of pixel-level prediction.

One simple way to use CNNs in such applications is to apply a CNN in a “sliding window” manner and predict a single label for the current image patch [22]. This works well in many situations, but the main drawback is the high computational cost. The Fully Convolutional Network (FCN) proposed in [6] can take an input of arbitrary size and produce an output of the same size with efficient inference. The key insight is to transform the fully connected layers into convolution layers that produce coarse predictions. Deconvolution layers are then incorporated to iteratively refine the predictions to the original size. This method achieves significant improvements in segmentation performance on several datasets.

A similar scheme is utilized in FlowNet [8] to predict an optical flow field with convolution layers to extract compact feature maps and deconvolution layers to upsample them to the desired resolution. The difference is that not only the coarse predictions, but the whole coarse feature maps are “de-convolved”, allowing the transfer of more information to the final prediction. Two FlowNet architectures are proposed and compared in [8]. The FlowNetS architecture simply stacks the image pair together and feeds them through a generic CNN. The FlowNetC architecture includes a layer that correlates feature vectors at different image locations. Although the performance of FlowNet is not as good as the state-of-the-art traditional non-CNN methods, it opens the direction of learning optical flow with CNNs and can be considered as the state-of-the-art CNN-based model.

The central issue in the methodology of [6, 8] is that the convolution and pooling operations result in reduced feature maps and hence produce coarse outputs when upsampled to the original resolution. In addition, CNNs lack smoothness constraints to model label consistency between pixels. To address this problem, Zheng et al. [7] combine CNNs with a probabilistic graphical model based on Conditional Random Fields (CRFs). Mean-field approximate inference for the CRFs is formulated as a Recurrent Neural Network, which enables training the CRFs end-to-end together with the CNNs. This CRF-RNN network is integrated with FCN [6] and achieves more precise segmentation results. However, the mean-field approximate inference in [7] is designed for discrete labeling, making it inapplicable to refining optical flow maps, which is a continuous labeling task with a very large label space.

2.3 RNNs and LSTMs

Recurrent Neural Networks (RNNs) have achieved great success in modeling temporal dependency for chain-structured data, such as natural language and speeches [10]. Long short-term memory (LSTM) [12] is a special RNN structure that is stable and powerful for modeling long-range temporal dependencies without the vanishing gradient problem.

RNNs have been extended to model spatial and contextual dependencies among image pixels [25, 26, 27]. The key insight is to define different connection structures among pixels within an image and build spatial sequences of pixels. Multi-dimensional RNNs are proposed in [28] and applied to offline Arabic handwriting recognition. 2D-RNNs [29], tree-structured RNNs [30], and directed acyclic graph RNNs [23] are proposed to model different connections between image pixels for different vision tasks. Applying RNNs to specifically defined graph structures on images is different from the idea of CRF-RNN [7]. Instead of implicitly encoding neighborhood information with a pairwise term in an energy minimization framework, the RNNs enable explicit information propagation via the recurrent connections.

Our approach integrates ideas from these methods. We propose to model the contextual dependencies among the rows of pixels with a row convolutional LSTM (RC-LSTM). Similar to [31], our RC-LSTM utilizes convolution operators in both the input-to-state and state-to-state transitions instead of the full matrix multiplications of the standard LSTM, which better handles structured inputs. The feature vectors of the pixels in each row form one input to the RC-LSTM, and the rows are processed in sequence. The RC-LSTM enables information propagation/message passing among the rows of pixels, which encodes the contextual dependencies into the local feature representations. This RC-LSTM structure is integrated with convolution and deconvolution layers, giving rise to an end-to-end trainable network. To the best of our knowledge, our work is the first attempt to integrate CNNs with RNNs for optical flow estimation.

3 Approach

To predict dense pixel-level optical flow from a pair of images, the images are processed by three types of network components: convolution layers, deconvolution layers, and our RC-LSTM network. Functionally, the convolution layers transform raw image pixels into compact and discriminative representations. The deconvolution layers then upsample the feature maps to the desired output resolution. On top of these, the proposed RC-LSTM models the contextual dependencies of local features and produces context-aware representations, which are used to predict the final optical flow map.

In this section, we will first briefly review the LSTM, and then introduce our RC-LSTM for modeling contextual dependencies among image rows. After that, we will explain how our RC-LSTM model can be integrated with CNNs to form an end-to-end trainable network.

3.1 LSTM Revisited

Long short-term memory (LSTM) [12] is a special RNN structure that is stable and powerful for modeling long-range temporal dependencies. Its innovation is the introduction of the “memory cell” \(c_{t}\) to accumulate the state information. The cell is accessed, written and cleared by several controlling gates, which enables LSTM to selectively forget its previous memory states and learn long-term dynamics without the vanishing gradient problem of simple RNNs. Given \(x_t\) as the input of an LSTM cell at time t, the cell activation can be formulated as:
$$\begin{aligned} \begin{aligned}&i_t = \sigma (W_{xi}x_t+W_{hi}h_{t-1}+b_i)\\&f_t = \sigma (W_{xf}x_t+W_{hf}h_{t-1}+b_f)\\&o_t = \sigma (W_{xo}x_t+W_{ho}h_{t-1}+b_o)\\&g_t = \phi (W_{xc}x_t+W_{hc}h_{t-1}+b_c)\\&c_t = f_t \odot c_{t-1} + i_t \odot g_t\\&h_t = o_t \odot \phi (c_t) \end{aligned} \end{aligned}$$
(1)
where \(\sigma \) stands for the sigmoid function, \(\phi \) stands for the tanh function, and \(\odot \) denotes the element-wise multiplication. In addition to the hidden state \(h_t\) and memory cell \(c_t\), LSTM has four controlling gates: \(i_t\), \(f_t\), \(o_t\), and \(g_t\), which are the input, forget, output, and input modulation gate respectively.

The input gate \(i_t\) controls what information in \(g_t\) is accumulated into the cell \(c_t\), while the forget gate \(f_t\) allows \(c_t\) to maintain or selectively forget the information in the previous state \(c_{t-1}\). Whether the updated cell state \(c_t\) is propagated to the final hidden state \(h_t\) is controlled by the output gate \(o_t\). Multiple LSTMs can be temporally concatenated to form more complex structures to solve many real-life sequence modeling problems.
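The cell update in Eq. (1) can be written compactly in code. The following is a minimal NumPy sketch of a single LSTM step; the dictionary layout of the weights and biases (W, b) is our own illustrative choice and not part of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eq. (1).

    x_t, h_prev, c_prev: 1-D arrays (current input, previous hidden state, previous cell state).
    W: dict of weight matrices W['xi'], W['hi'], ..., W['hc'].
    b: dict of bias vectors b['i'], b['f'], b['o'], b['c'].
    """
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])  # output gate
    g = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # input modulation gate
    c = f * c_prev + i * g                                  # accumulate cell state
    h = o * np.tanh(c)                                      # expose gated cell state
    return h, c
```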

3.2 RC-LSTM

The aforementioned LSTM is designed to model temporal dependencies for a data sequence. To apply LSTMs to an image, a spatial order of the image pixels needs to be defined. The simplest way is to consider each pixel as an individual input and reshape the image into a sequence of pixels to feed into the LSTM model. However, the interactions among pixels go beyond this chain-structured sequence. This simple approach loses the spatial relationships, because adjacent pixels may not be neighbors in the sequence.
Fig. 1. The structure of the RC-LSTM.

In order to model the spatial relationships among pixels, we consider each row of the image as one input and treat the image as a sequence of rows. This results in structured inputs, which the standard LSTM has difficulty handling. We propose a row convolutional long short-term memory (RC-LSTM) to cope with the structured inputs and learn the contextual dependencies among the rows of pixels. Our RC-LSTM is illustrated in Fig. 1. The feature vectors of the pixels in the rth row form one input matrix \(X_r\in \mathcal {R}^{m\times n}\) to the RC-LSTM, where n is the number of pixels in each row and m is the feature dimension. The RC-LSTM determines the hidden state \(H_r\) from the input \(X_r\) and the states \(H_{r-1}, C_{r-1}\) of the previous row. Convolution operations are used in both the input-to-state and state-to-state transitions, and the RC-LSTM can be formulated as:
$$\begin{aligned} \begin{aligned}&i_{r} = \sigma (w_{xi} \otimes X_{r}+w_{hi} \otimes H_{r-1} + b_i)\\&f_{r} = \sigma (w_{xf} \otimes X_{r}+w_{hf} \otimes H_{r-1} + b_f)\\&o_{r} = \sigma (w_{xo} \otimes X_{r}+w_{ho} \otimes H_{r-1} + b_o)\\&g_{r} = \phi (w_{xc} \otimes X_{r}+w_{hc} \otimes H_{r-1} + b_c)\\&C_{r} = f_{r} \odot C_{r-1} + i_{r} \odot g_{r}\\&H_{r} = o_{r} \odot \phi (C_{r}) \end{aligned} \end{aligned}$$
(2)
where the w-s are convolution kernels of size \(1\times k\) and \(\otimes \) denotes convolution. The distinguishing feature of our RC-LSTM is that the inputs \(X_r\), cell states \(C_r\), hidden states \(H_r\), and the gates \(i_r, f_r, g_r, o_r\) are all matrices. In this sense, our RC-LSTM can be considered a generalization of the traditional vector-based LSTM to handle structured inputs. Due to the convolution operations, our model also has fewer parameters and less redundancy than the matrix multiplications in Eq. 1.
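As a concrete illustration of Eq. (2), the following NumPy sketch implements one RC-LSTM step for a single image row. It assumes "same" zero padding so that the row width n is preserved; the helper conv1d_rows and the dictionary layout of the kernels are illustrative choices, not part of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_rows(X, W):
    """Multi-channel 1 x k convolution along the row (width) axis.

    X: (c_in, n) features of one image row; W: (c_out, c_in, k) kernel bank.
    Zero padding ('same') keeps the output width equal to n.
    """
    c_out, c_in, k = W.shape
    out = np.zeros((c_out, X.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            out[o] += np.convolve(X[i], W[o, i], mode='same')
    return out

def rc_lstm_step(X_r, H_prev, C_prev, Wx, Wh, b):
    """One RC-LSTM step over image row r, following Eq. (2).

    X_r: (m, n) input features of row r.
    H_prev, C_prev: (d, n) hidden and cell states of the previous row.
    Wx, Wh: dicts of 1 x k kernels for input-to-state and state-to-state convolutions.
    b: dict of bias vectors, broadcast across the row.
    """
    def gate(name, act):
        pre = conv1d_rows(X_r, Wx[name]) + conv1d_rows(H_prev, Wh[name]) + b[name][:, None]
        return act(pre)

    i = gate('i', sigmoid)   # input gate, shape (d, n)
    f = gate('f', sigmoid)   # forget gate
    o = gate('o', sigmoid)   # output gate
    g = gate('c', np.tanh)   # input modulation gate
    C = f * C_prev + i * g   # cell state matrix for row r
    H = o * np.tanh(C)       # hidden state matrix for row r
    return H, C
```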
Fig. 2. Illustration of top-to-bottom message passing among the image pixels for k = 1, 3, and 5.

The RC-LSTM model enables message passing in one direction: from the top to the bottom of the image. The pixel \(v_{(r,c)}\) at location (r, c) receives the information propagated from its ancestors in the previous row, which are a small neighborhood of k pixels around the pixel \(v_{(r-1,c)}\). Figure 2 illustrates the message passing among the pixels. Our RC-LSTM explicitly models the contextual dependencies among the pixels using this 1-direction message passing.

Spatial dependencies of a pixel come from surrounding pixels in all directions within an image, and the 1-direction message passing might not be enough to model all these dependencies. This can be addressed by using the RC-LSTM 4 times: row by row from top to bottom, from bottom to top, column by column from left to right, and from right to left. We call this method 4-direction message passing RC-LSTM, which is able to model more complete contextual dependencies.
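A sketch of the row-by-row scan and the 4-direction variant is given below. It assumes an RC-LSTM step function like the one sketched above, with its parameters already bound (e.g. via functools.partial). In practice each direction would carry its own parameters, and the rule for merging the four outputs is our assumption (here, channel-wise concatenation).

```python
import numpy as np

def rc_lstm_scan(F, step, d):
    """Scan an RC-LSTM over the rows of a feature map, top to bottom.

    F: (c, h, w) feature map; step(X_row, H, C) -> (H, C) with states of shape (d, row_width).
    Returns the (d, h, w) map of hidden states, one per row.
    """
    c, h, w = F.shape
    H = np.zeros((d, w))
    C = np.zeros((d, w))
    out = np.zeros((d, h, w))
    for r in range(h):
        H, C = step(F[:, r, :], H, C)
        out[:, r, :] = H
    return out

def rc_lstm_4dir(F, step, d):
    """4-direction message passing: top-down, bottom-up, left-to-right, right-to-left."""
    down  = rc_lstm_scan(F, step, d)
    up    = rc_lstm_scan(F[:, ::-1, :], step, d)[:, ::-1, :]
    right = rc_lstm_scan(F.transpose(0, 2, 1), step, d).transpose(0, 2, 1)
    left  = rc_lstm_scan(F[:, :, ::-1].transpose(0, 2, 1), step, d).transpose(0, 2, 1)[:, :, ::-1]
    return np.concatenate([down, up, right, left], axis=0)  # merge rule is an assumption
```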

3.3 Integration with CNNs

Our RC-LSTMs can be integrated with any CNN structure, other networks (e.g. auto-encoders [32]), and even hand-crafted features, to model the spatial dependencies of the local features and produce context-aware feature representations. Figure 3 shows a generic framework that integrates the RC-LSTM model with a general CNN. The RC-LSTM model takes the CNN feature maps as input, refines the local features via message passing, and outputs context-aware feature maps, which can be further processed by CNN layers or directly used to predict the output flows.
Fig. 3. The overall framework of an end-to-end trainable network. RC-LSTMs can be inserted at any point to refine the feature maps by modeling the spatial dependencies of local features and producing context-aware representations.

In our experiments, we integrate our RC-LSTM with FlowNetS [8], which has a generic network structure and has been shown to generalize better. The image pair is stacked as the input of the FlowNetS, and the network has 10 convolution layers and 4 deconvolution layers. The detailed structure of the integrated network can be found in Sect. 4.1.
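Schematically, the forward pass of the integrated network can be summarized as follows. The function and argument names are placeholders for illustration only; the actual layer configuration is given in Table 2.

```python
def predict_flow(image_pair, cnn_encoder, deconv_decoder, rc_lstm, flow_predictor, upsample):
    """Schematic forward pass of RC-LSTM + FlowNetS (names are illustrative placeholders).

    The RC-LSTM refines the feature map produced by the last deconvolution layer,
    just before the final prediction layer (see Table 2, layer "RC-LSTM 1/4dir").
    """
    feats = cnn_encoder(image_pair)        # conv1 ... conv6_1 on the stacked image pair
    decoded = deconv_decoder(feats)        # deconv5 ... deconv2 with intermediate predictions
    context_aware = rc_lstm(decoded)       # row/column message passing over the feature map
    flow = flow_predictor(context_aware)   # pr2: per-pixel (u, v) at quarter resolution
    return upsample(flow)                  # bring the flow map back to input resolution
```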

4 Experiments

4.1 Datasets and Experiment Setup

Optical flow evaluation requires dense per-pixel ground truth, which is very difficult to obtain from real-world images. Therefore, the amount of real images with dense ground-truth optical flow is very small. In the Middlebury dataset [33], the optical flow ground truth of 6 image pairs is generated by tracking hidden fluorescent paint applied to the scene surfaces, with computer-controlled lighting and motion stages for the camera and scene. The KITTI dataset [34] is larger (389 image pairs), and its semi-dense optical flow ground truth is obtained by using 4 cameras and a laser scanner.
Fig. 4. Sample images from the (a) Middlebury, (b) Sintel Clean, (c) Flying Chairs, and (d) Sintel Final datasets. The Final version of Sintel (d) adds motion blur and atmospheric effects to the Clean version (b).

Fig. 5. Optical flow color coding scheme: the vector from the center to a pixel is encoded using the color of that pixel.

In order to facilitate the evaluation of optical flow algorithms, some larger synthetic datasets with a variety of motion types have been generated [8, 35]. In our experiments, we use both real-world images and synthetic images to evaluate our method. The datasets used in our experiments are introduced below, and example images from some of these datasets are shown in Fig. 4.

  • Middlebury dataset [33] contains 8 training image pairs and 8 testing image pairs with dense optical flow ground truth. Six image pairs are real-world images and the others are synthetic. The image resolution ranges from \({316 \times 252}\) to \(640 \times 480\), and the displacements are small, usually less than 10 pixels. The ground-truth flows of the training images are publicly available, while the ground truth of the testing images is withheld; researchers upload their results to an evaluation server.

  • KITTI dataset [34] contains 194 training image pairs and 195 testing image pairs. On average, 50% of the pixels in each image have ground-truth optical flow. The images are captured using cameras mounted on an autonomous car in real-world scenes, and the image resolution is about \(1240\times 376\). This dataset contains strong projective transformations and special types of motions. As with the Middlebury dataset, the ground-truth flows of the training images are publicly available, while the testing results are evaluated on a server.

  • Sintel dataset [35] contains computer-rendered artificial scenes of a 3D movie with dense per-pixel ground truth. It includes large displacements and pays special attention to achieving realistic image properties. The dataset provides “Clean” and “Final” versions. The Final version includes atmospheric effects (e.g. fog), reflections, and motion blur, while the Clean version does not include these effects. Each version contains 1041 training image pairs and 552 testing image pairs. The image resolution is \(1024 \times 436\). The ground-truth flows of the training images are publicly available, while the testing results are evaluated on a server.

  • Flying Chairs dataset [8] is a large synthetic dataset built to provide sufficient data for training CNNs for optical flow estimation. It contains 22232 training image pairs and 640 testing image pairs. The images have resolution \(512 \times 386\) and are generated by rendering 3D chair models on background images from Flickr. The ground-truth flows of the whole dataset are publicly available.

We train our network on the training set of the Flying Chairs dataset, and test it on the Sintel, KITTI, and Middlebury datasets as well as the testing set of Flying Chairs. Note that we only train our model on the training set of Flying Chairs, and we do not train or fine-tune it on the other datasets. Therefore, both the “train” and “test” splits of the other datasets can serve as testing data in our experiments; the same scheme is used in [8]. The datasets used to test our method are summarized in Table 1.
Table 1. A summary of the datasets used to TEST our method.

Dataset            | Image pairs # | Evaluation method
Sintel Clean Train | 1041          | Compare results with GT
Sintel Clean Test  | 552           | Upload to evaluation server
Sintel Final Train | 1041          | Compare results with GT
Sintel Final Test  | 552           | Upload to evaluation server
KITTI Train        | 194           | Compare results with GT
Middlebury Train   | 8             | Compare results with GT
Flying Chairs Test | 640           | Compare results with GT

Table 2. The detailed structure of the Proposed-1dir/4dir networks (RC-LSTM + FlowNetS). The illustrated input/output resolutions are based on an input image of size \(512 \times 384\). pr stands for prediction, and pr’ stands for the upsampled pr.

Layer name     | Kernel sz | Str | I/O Ch#   | Input res | Output res | Input
conv1          | 7 × 7     | 2   | 6/64      | 512 × 384 | 256 × 192  | Stacked image pair
conv2          | 5 × 5     | 2   | 64/128    | 256 × 192 | 128 × 96   | conv1
conv3          | 5 × 5     | 2   | 128/256   | 128 × 96  | 64 × 48    | conv2
conv3_1        | 3 × 3     | 1   | 256/256   | 64 × 48   | 64 × 48    | conv3
conv4          | 3 × 3     | 2   | 256/512   | 64 × 48   | 32 × 24    | conv3_1
conv4_1        | 3 × 3     | 1   | 512/512   | 32 × 24   | 32 × 24    | conv4
conv5          | 3 × 3     | 2   | 512/512   | 32 × 24   | 16 × 12    | conv4_1
conv5_1        | 3 × 3     | 1   | 512/512   | 16 × 12   | 16 × 12    | conv5
conv6          | 3 × 3     | 2   | 512/1024  | 16 × 12   | 8 × 6      | conv5_1
conv6_1        | 3 × 3     | 1   | 1024/1024 | 8 × 6     | 8 × 6      | conv6
pr6+loss6      | 3 × 3     | 1   | 1024/2    | 8 × 6     | 8 × 6      | conv6_1
deconv5        | 4 × 4     | 2   | 1024/512  | 8 × 6     | 16 × 12    | conv6_1
pr5+loss5      | 3 × 3     | 1   | 1026/2    | 16 × 12   | 16 × 12    | conv5_1+deconv5+pr6’
deconv4        | 4 × 4     | 2   | 1026/256  | 16 × 12   | 32 × 24    | conv5_1+deconv5+pr6’
pr4+loss4      | 3 × 3     | 1   | 770/2     | 32 × 24   | 32 × 24    | conv4_1+deconv4+pr5’
deconv3        | 4 × 4     | 2   | 770/128   | 32 × 24   | 64 × 48    | conv4_1+deconv4+pr5’
pr3+loss3      | 3 × 3     | 1   | 386/2     | 64 × 48   | 64 × 48    | conv3_1+deconv3+pr4’
deconv2        | 4 × 4     | 2   | 386/64    | 64 × 48   | 128 × 96   | conv3_1+deconv3+pr4’
RC-LSTM 1/4dir | 1 × 3     | 1   | 194/200   | 128 × 96  | 128 × 96   | conv2+deconv2+pr3’
pr2+loss2      | 3 × 3     | 1   | 200/2     | 128 × 96  | 128 × 96   | RC-LSTM 1/4dir
upsample       | -         | -   | 2/2       | 128 × 96  | 512 × 384  | pr2

Two architectures of our RC-LSTM described in Sect. 3.2, i.e. 1-direction and 4-direction message passing, are integrated with FlowNetS to build end-to-end trainable networks. We do not include the variational refinement because it essentially uses the CNN results to initialize a traditional flow estimation [19] and is not end-to-end trainable. Specifically, our RC-LSTMs are plugged into FlowNetS after the last deconvolution layer and before the final prediction layer. The detailed network structure can be found in Table 2, and our implementation is based on Caffe [36]. The kernel size is set to \(1\times 3\). For convenience, we call the integrated networks Proposed-1dir and Proposed-4dir. The networks are evaluated both qualitatively and quantitatively, as described below.

The endpoint error (EPE) is used to evaluate the performance of the different methods, and is defined as:
$$\begin{aligned} EPE = \frac{1}{N} \sum \sqrt{ (u_i - u_{GTi})^2 + (v_i - v_{GTi})^2} \end{aligned}$$
(3)
where N is the total number of image pixels, \((u_i, v_i)\) and \((u_{GTi}, v_{GTi})\) are the predicted flow vector and the ground truth flow vector for pixel i, respectively.
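For reference, the EPE of Eq. (3) can be computed with a few lines of NumPy; this sketch assumes the predicted and ground-truth flows are stored as (h, w, 2) arrays of (u, v) components.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average endpoint error (Eq. 3) over all pixels.

    flow_pred, flow_gt: (h, w, 2) arrays holding the (u, v) flow components per pixel.
    """
    diff = flow_pred - flow_gt
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()
```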
The optical flow field of an image pair is also visualized as an image by color-coding the flow field as in [33, 35]. Flow direction is encoded with color and flow magnitude is encoded with color intensity. Figure 5 shows the color coding scheme: the vector from the center to a pixel is encoded using the color of that pixel. We use the open source tool provided in [33] to generate color-coded flow maps. The estimated flow maps on Flying Chairs test, Sintel train, and Middlebury train are visualized and compared with the visualized ground truth.
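All color-coded flow maps in our figures are produced with the official tool of [33]; the snippet below is only a rough HSV-based approximation of that coding (hue encodes direction, saturation encodes magnitude), included to illustrate the idea, not to reproduce the exact Middlebury color wheel.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_color(flow, max_mag=None):
    """Approximate flow visualization: hue <- direction, saturation <- magnitude.

    flow: (h, w, 2) array of (u, v) components. This mimics the general coding
    scheme of Fig. 5 but is not the exact color wheel of the Middlebury tool [33].
    """
    u, v = flow[..., 0], flow[..., 1]
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = np.arctan2(v, u)                          # flow direction in [-pi, pi]
    hue = (ang + np.pi) / (2.0 * np.pi)             # map direction to [0, 1]
    if max_mag is None:
        max_mag = mag.max() + 1e-8                  # normalize by the largest displacement
    sat = np.clip(mag / max_mag, 0.0, 1.0)
    hsv = np.stack([hue, sat, np.ones_like(hue)], axis=-1)
    return hsv_to_rgb(hsv)                          # (h, w, 3) RGB image in [0, 1]
```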
Fig. 6. The estimated optical flow maps from the Flying Chairs dataset (top two) and the Middlebury dataset (bottom two).

Fig. 7. Predicted optical flows on the Sintel Final dataset. In each row from left to right: overlaid image pair, ground-truth flow, and the predicted results of EpicFlow [21], FlowNetS [8], and Proposed-1dir RC-LSTM. The endpoint error (EPE) is shown at the top-right corner of each prediction. In most cases (Rows 1–8), the proposed method produces visually better results with lower EPE than FlowNetS. Although the EPEs of the proposed method are somewhat worse than those of EpicFlow, our model often produces better object details, see Rows 3–5, 7–8, and 10.

4.2 Results

Figure 6 compares the Proposed-1dir and Proposed-4dir networks to FlowNetS on the Flying Chairs and Middlebury datasets. The results produced by FlowNetS are often blurry and inconsistent, losing some of the object details (e.g. chair legs). Both of our proposed networks produce better flow maps, which are more accurate and contain more consistent objects with finer details. This demonstrates that our RC-LSTM is able to enhance the CNN features of FlowNetS by explicitly modeling the spatial dependencies among pixels and producing context-aware features for optical flow estimation. It can also be seen that the Proposed-4dir network produces lower endpoint errors (EPE) than the Proposed-1dir network.

In Fig. 7, we qualitatively compare our Proposed-1dir network to FlowNetS as well as to the state-of-the-art method, EpicFlow [21], on more examples from the Sintel Final dataset. The images of EpicFlow and FlowNetS are from the results in [8]. It can be seen from the figure that in most cases (Rows 1–8), the Proposed-1dir method produces visually better results and lower endpoint errors (EPE) than FlowNetS. Although the EPEs of the proposed method are somewhat worse than those of EpicFlow, our model often produces better object details, see Rows 3–5, 7–8, and 10 in Fig. 7.
Table 3. Average endpoint errors (in pixels) of our networks compared to several well-performing methods on several datasets. Since we trained our networks only on the Flying Chairs dataset, we can test our models on both the train and test splits of the other datasets.

Method        | Sintel Clean Train | Sintel Clean Test | Sintel Final Train | Sintel Final Test | KITTI Train | Middlebury Train | Chairs Test | Time CPU (s) | Time GPU (s)
EpicFlow [21] | 2.40 | 4.12 | 3.70 | 6.29 | 3.47  | 0.31 | 2.94 | 16 | -
DeepFlow [20] | 3.31 | 5.38 | 4.56 | 7.21 | 4.58  | 0.21 | 3.53 | 17 | -
EPPM [37]     | -    | 6.49 | -    | 8.38 | -     | -    | -    | -  | 0.2
LDOF [19]     | 4.29 | 7.56 | 6.42 | 9.12 | 13.73 | 0.45 | 3.47 | 65 | 2.5
FlowNetS [8]  | 4.50 | 7.42 | 5.45 | 8.43 | 8.26  | 1.09 | 2.71 | -  | 0.05
Proposed-1dir | 3.77 | 6.72 | 4.93 | 7.94 | 7.55  | 1.02 | 2.64 | -  | 0.08
Proposed-4dir | 3.77 | 6.69 | 4.90 | 7.91 | 7.54  | 1.01 | 2.56 | -  | 0.2

Table 3 shows the quantitative comparison of our Proposed-1dir and Proposed-4dir methods with several well-performing methods on the Sintel Clean, Sintel Final, KITTI, Middlebury, and Flying Chairs datasets. Both our Proposed-1dir and Proposed-4dir methods consistently outperform FlowNetS on all of these datasets. This demonstrates that our RC-LSTM is able to produce more powerful context-aware features for optical flow estimation. The running time of FlowNetS and our models is measured using an NVIDIA TITAN GPU on the Middlebury dataset, and the times of the other methods are taken from [8].

On the Sintel and KITTI training sets, our models outperform the LDOF [19] method. Our models are also comparable to the real-time method EPPM [37], while our Proposed-1dir is two times faster. Although the proposed methods do not perform as well as the state-of-the-art EpicFlow [21], Fig. 7 shows that our model often produces more consistent results. It is more interesting to look at the quantitative results on the Flying Chairs test set. Since our models are trained on the training set of Flying Chairs, they are expected to perform better on this dataset than on others. Table 3 shows that both our models outperform all the state-of-the-art methods there.

The overall experimental results show that our models achieve the best performance on the Flying Chairs dataset and generalize well to the other existing datasets. Note that training data is essential to the performance of CNN models, yet the Flying Chairs dataset contains unrealistic images of 3D chair models rendered on background images. These results indicate that our method is very promising and may perform even better if sufficiently large datasets with more realistic images become available.

5 Conclusion

In this paper, we proposed a row convolutional long short-term memory (RC-LSTM) network to model the contextual dependencies of image pixels. The RC-LSTM is integrated with FlowNetS to enhance its learned feature representations and produce context-aware features for optical flow estimation. The experimental results demonstrated that our model produces more accurate and consistent optical flows than the competing CNN-based models. Our model also achieved accuracy competitive with state-of-the-art methods on several datasets.


Acknowledgement

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under the Grant RGP36726. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan GPU for this research.

References

  1. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
  2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
  3. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015)
  4. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587 (2014)
  5. Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.C., et al.: DeepID-Net: deformable deep convolutional neural networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2403–2412 (2015)
  6. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015)
  7. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537 (2015)
  8. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: learning optical flow with convolutional networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766 (2015)
  9. Teney, D., Hebert, M.: Learning to extract motion from videos in convolutional neural networks. arXiv preprint arXiv:1601.07532 (2016)
  10. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning (ICML), pp. 1764–1772 (2014)
  11. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137 (2015)
  12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
  13. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning (ICML), pp. 1310–1318 (2013)
  14. Horn, B.K., Schunck, B.G.: Determining optical flow. In: Technical Symposium East, pp. 319–331. International Society for Optics and Photonics (1981)
  15. Lucas, B.D., Kanade, T., et al.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence (IJCAI), vol. 81, pp. 674–679 (1981)
  16. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24673-2_3
  17. Wedel, A., Cremers, D., Pock, T., Bischof, H.: Structure- and motion-adaptive regularization for high accuracy optic flow. In: IEEE International Conference on Computer Vision (ICCV), pp. 1663–1668 (2009)
  18. Sun, D., Roth, S., Black, M.J.: A quantitative analysis of current practices in optical flow estimation and the principles behind them. Int. J. Comput. Vis. 106, 115–137 (2014)
  19. Brox, T., Malik, J.: Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 500–513 (2011)
  20. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: large displacement optical flow with deep matching. In: IEEE International Conference on Computer Vision (ICCV), pp. 1385–1392 (2013)
  21. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1164–1172 (2015)
  22. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929 (2013)
  23. Shuai, B., Zuo, Z., Wang, G., Wang, B.: DAG-recurrent neural networks for scene labeling. arXiv preprint arXiv:1509.00552 (2015)
  24. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS), pp. 2366–2374 (2014)
  25. Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3367–3375 (2015)
  26. van den Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International Conference on Machine Learning (ICML) (2016)
  27. Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., Bengio, Y.: ReNet: a recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393 (2015)
  28. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 545–552 (2009)
  29. Shuai, B., Zuo, Z., Wang, G.: Quaddirectional 2D-recurrent neural networks for image labeling. IEEE Sig. Process. Lett. 22, 1990–1994 (2015)
  30. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015)
  31. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems (NIPS) (2015)
  32. Rolfe, J.T., LeCun, Y.: Discriminative recurrent sparse auto-encoders. arXiv preprint arXiv:1301.3775 (2013)
  33. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 92, 1–31 (2011)
  34. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  35. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33783-3_44
  36. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  37. Bao, L., Yang, Q., Jin, H.: Fast edge-preserving patchmatch for large displacement optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3534–3541 (2014)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. College of Computer Science, Zhejiang University, Hangzhou, China
  2. School of Computing Science, Simon Fraser University, Burnaby, Canada
