1 Introduction

For machines to interact successfully in the real world, anticipating actions and events and planning accordingly is essential. Despite recent advances in deep and reinforcement learning, this remains a difficult task, largely due to the demand for large annotated datasets. If we limit the task to anticipating future appearance, annotations are no longer needed. This gives machines an advantage, as they can exploit the vast collection of available unlabeled videos, which is perfectly suited for unsupervised learning methods. To anticipate future appearance based on current visual information, a machine needs to recognize entities and their parts, as well as to develop an internal representation of how movement unfolds over time.

We make the observation that time is continuous and that the video frame rate is therefore an arbitrary discretization determined solely by the camera sensor. Instead of predicting the next discrete frame from a given input video frame, we aim to predict a future frame at an arbitrary continuous temporal distance \(\varDelta t\) from the current input frame. We achieve this by conditioning the video frame prediction on a time-related input variable.

In this work we explore one-step, long-term video frame prediction from a single input frame. This is beneficial both in terms of computational efficiency and because it avoids the propagation and accumulation of prediction errors that occurs when each subsequent frame is predicted iteratively from the previously predicted frame. Our work falls into the autoencoding category, where the current video frame is presented as input and an image resembling the anticipated future is provided as output. Our proposed method consists of an encoding CNN (Convolutional Neural Network), a decoding CNN, and a separate branch, parallel to the encoder, which models time and allows us to generate predictions at a given temporal distance into the future.

1.1 Related Work

Predicting Future Actions and Motion. In the context of action prediction, it has been shown that high-level embeddings can be used to anticipate future actions up to one second before they begin [23]. Predicting a future event by retrieving similar videos and transferring their information is proposed in [28]. In [8], a hierarchical representation is used for predicting future actions. Predicting a future activity based on analyzing object trajectories is proposed in [6]. In [3], the authors forecast human interaction by relying on body-pose trajectories. In the context of robotics, in [7] human activities are anticipated by considering object affordances. While these methods focus on predicting high-level information, namely the action that will be performed next, we focus on predicting low-level information: the appearance of a future video frame at a given temporal displacement from a given input video frame. This has the added value of requiring less supervision.

Anticipating future movement in the spatial domain, as close as possible to the real movement, has also been considered previously. These methods start from an input image at the current time stamp and predict motion, in the form of optical flow or motion trajectories, at the next frame of a video. In [9], images are aligned to their nearest neighbor in a database and the motion prediction is obtained by transferring the motion from the nearest neighbor to the input image. In [12], structured random forests are used to predict optical flow vectors at the next time stamp. In [11], the use of LSTMs (Long Short-Term Memory networks) is proposed for predicting Eulerian future motion. A custom deep convolutional neural network is proposed in [27] for future optical flow prediction. Rather than predicting the motion at the next video frame through optical flow, the authors of [25] propose to predict motion trajectories using variational autoencoders. This is similar to predicting optical flow vectors, but given the temporal consistency of the trajectories, it offers greater accuracy. Dissimilar to these methods, which predict future motion, we aim to predict video appearance at a given continuous temporal displacement in the future from an input video frame.

Predicting Future Appearance. One intuitive trend towards predicting future information is predicting future appearance. In [26], the authors propose to predict both appearance and motion for street scenes using top cameras. Predicting patch-based future video appearance by relying on large visual dictionaries is proposed in [14]. In [29], future video appearance is predicted in a hierarchical manner, by first predicting the video structure and subsequently the individual frames. Similar to these methods, we also aim at predicting the appearance of future video frames; however, we condition our prediction on a time parameter that allows us to perform the prediction efficiently, in one step.

Rather than predicting future appearance from input appearance information, hallucinating possible images has been a recent focus. The novel work in [24] relies on the GAN (Generative Adversarial Network) model [13] to create not only the appearance of an image, but also the possible future motion. This is done using spatio-temporal convolutions that discriminate between foreground and background. Similarly, in [17] a temporal generative neural network is proposed for generating more robust videos. These generative models can be conditioned on certain information to generate feasible outputs given the specific conditioning input [15]. Dissimilar to them, we rely on an autoencoding model. Autoencoding methods encode the current image in a representation space that is suitable for learning appearance and motion, and decode such representations to retrieve the anticipated future. Here, we propose to use video frame appearance for predicting future video frames. However, we condition the prediction on a given time indicator, which allows us to predict future appearance at arbitrary temporal distances in the future.

2 Time-Dependent Video Frame Prediction

To tackle the problem of anticipating future appearance at arbitrary temporal distances, we deploy an encoder-decoder architecture. The encoder has two separate branches: one receiving the input image, and one receiving the desired temporal displacement \(\varDelta t\) of the prediction. The decoder takes the input from the encoder and generates a feasible prediction for the given input image and the desired temporal displacement. This is illustrated in Fig. 1. The network receives as inputs an image and a variable \(\varDelta t\), \(\varDelta t \in \mathbb {R}^+\), indicating the time difference from the time of the provided input image, \(t_0\), to the time of the desired prediction. The network predicts an image at the anticipated future time \(t_0 + \varDelta t\). We use an architecture similar to the one proposed in [20]. However, while their architecture encodes RGB images and a continuous angle variable to produce RGBD output, our architecture is designed to take as input a monochromatic image and a continuous time variable, \(\varDelta t\), and to produce a monochromatic image, resembling a future frame, as output.

Fig. 1. Our proposed architecture consists of two parts: (i) an encoder part consisting of two branches, the first one taking the current image as input and the second one taking as input an arbitrary time difference \(\varDelta t\) to the desired prediction, and (ii) a decoder part that generates an image, as anticipated, at the desired input time difference \(\varDelta t\).

More specifically, the architecture consists of the following:

  1. an encoding part composed of two branches:

     • an image encoding branch defined by 4 convolutional layers, 3 pooling layers and 2 fully-connected layers at the end;

     • a time encoding branch consisting of 3 fully-connected layers.

     The final layers of the two branches are concatenated together, forming one bigger layer that is then provided to the decoding part.

  2. a decoding part composed of 2 fully-connected layers, 3 “unpooling” (upscaling) layers, and 3 “deconvolutional” (transpose convolutional) layers.

The input time-indicator variable is continuous and allows for appearance anticipations at arbitrary time differences. Training is performed by presenting to the network batches of \(\{I_x, \varDelta t, I_y\}\) tuples, where \(I_x\) is an input image at current relative time \(t_0\), \(\varDelta t\) is a continuous variable indicating the time difference to the future video frame, and \(I_y\) is the actual video frame at \(t_0 + \varDelta t\).
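As an illustration of how such tuples can be assembled from a single clip, consider the following sketch; the function name, the NumPy-based frame storage and the random sampling of the offset are assumptions of ours, not details reported above.

```python
import numpy as np

def sample_tuples(frames, fps, max_offset, rng=np.random.default_rng()):
    """Assemble (I_x, dt, I_y) training tuples from one video clip.

    frames:     array of shape (num_frames, H, W), grayscale
    fps:        frame rate of the source video
    max_offset: largest frame offset used as a prediction target
    """
    frame_ms = 1000.0 / fps                       # temporal spacing between frames, in ms
    tuples = []
    for i in range(len(frames) - max_offset):
        k = int(rng.integers(1, max_offset + 1))  # random future offset, in frames
        I_x = frames[i]                           # input frame at relative time t0 = 0
        dt = np.float32([k * frame_ms])           # continuous time difference, in ms
        I_y = frames[i + k]                       # ground-truth frame at t0 + dt
        tuples.append((I_x, dt, I_y))
    return tuples
```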

Predictions are obtained in one step. For every input image \(I_x\) and continuous time difference variable \(\varDelta t\), the pair \(\{I_x, \varDelta t\}\) is given to the network as input, and an image \(I_y\) representing the anticipated appearance after a time interval \(\varDelta t\) is directly obtained as output. No iterative steps are performed.
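The following is a minimal sketch of this one-step prediction interface, assuming a PyTorch model with the two-branch input of Fig. 1; the function name, argument order and tensor layout are illustrative assumptions.

```python
import torch

@torch.no_grad()
def predict_future(model, I_x, dt_ms, device="cpu"):
    """Single forward pass: anticipated frame at t0 + dt_ms from one input frame."""
    image = torch.as_tensor(I_x, dtype=torch.float32, device=device)
    image = image.view(1, 1, 120, 120)                 # (batch, channel, height, width)
    dt = torch.tensor([[dt_ms]], dtype=torch.float32, device=device)
    prediction = model(image, dt)                      # one step, no iterative feedback
    return prediction.view(120, 120).cpu().numpy()
```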

3 Experiments

3.1 Experimental Setup

We evaluate our method by generating images of anticipated future appearances at multiple time distances, and comparing them both visually and through MSE (Mean Squared Error) with the true future frames. We also compare to a CNN baseline that iteratively predicts the future video frame at \(k \varDelta t\) \((k=1, 2, ...)\) temporal displacements, from previous predictions.

Training Parameters. During training, we use the Adam optimizer [5] with an \(L_2\) loss and a dropout rate of 80%. Training is performed for up to 500,000 epochs with randomized minibatches of 16 samples, where each sample contains one input image at current relative time \(t_0=0\), a temporal displacement \(\varDelta t\), and the real target frame at that temporal displacement. On a Titan X GPU, training took approximately 16 h with, on average, about 100,000 training samples (varying per action category). We argue that the type of action can be detected automatically and is better incorporated by training one network per action category. Thus, we opt to perform separate preliminary experiments for each action instead of training one heavy network to anticipate video frames for all possible actions.
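The sketch below illustrates this training setup (Adam, \(L_2\) loss, minibatches of \(\{I_x, \varDelta t, I_y\}\) tuples); the learning rate and the data-loader interface are assumptions, and dropout is assumed to be applied inside the model.

```python
import torch
import torch.nn.functional as F

def train(model, loader, steps=500_000, lr=1e-4, device="cpu"):
    """Training sketch: Adam optimizer with an L2 (MSE) reconstruction loss.

    `loader` is assumed to yield minibatches of (I_x, dt, I_y) tensors, with
    I_x and I_y shaped (batch, 1, 120, 120); lr=1e-4 is an assumed value.
    """
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < steps:
        for I_x, dt, I_y in loader:
            I_x, I_y = I_x.to(device), I_y.to(device)
            dt = dt.to(device).view(-1, 1).float()   # shape (batch, 1) for the time branch
            prediction = model(I_x, dt)
            loss = F.mse_loss(prediction, I_y)       # L2 loss on the anticipated frame
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break
    return model
```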

Network Architecture. Given that the input, and thus also the output, image size is \(120\,\times \,120\,\times \,1\) (\(120\,\times \,120\) grayscale images), in our encoder we stack convolutional and pooling layers that yield consecutive feature maps of the following decreasing sizes: \(120\,\times \,120\), \(60\,\times \,60\), \(30\,\times \,30\) and \(15\,\times \,15\), with an increasing number of feature maps per layer, namely 32, 64 and 128 respectively. Fully-connected layers of sizes 7,200 and 4,096 are added at the end. The separate branch of the encoder that models time consists of 4 fully-connected layers of size 64, where the last layer is concatenated with the last fully-connected layer of the encoder convolutional neural network. This yields an embedding of size 4,160 that is presented to the decoder. Kernel sizes used for the convolutional operations start at \(5\,\times \,5\) in the first layers and decrease to \(2\,\times \,2\) and \(1\,\times \,1\) in the deeper layers of the encoder.

For the decoder, the kernel sizes are the same as for the encoder, but applied in the opposite order. The decoder consists of interchanging “unpooling” (upscaling) and “deconvolution” (transpose convolution) layers, yielding feature maps of the same sizes as in the image-encoding branch of the encoder, only in the opposite order. For simplicity, we implement pooling as a convolution with \(2\,\times \,2\) strides and unpooling as a 2D transpose convolution.
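To make the layer dimensions concrete, the following is a hedged PyTorch sketch of the architecture described above. The layer sizes follow the numbers reported in the text (32, 64 and 128 feature maps, fully-connected layers of 7,200 and 4,096 units, a time branch of 64-unit layers and a 4,160-dimensional embedding); the activation functions, padding, output non-linearity and the exact placement of the \(2\,\times \,2\) and \(1\,\times \,1\) kernels are assumptions.

```python
import torch
import torch.nn as nn

class TimeConditionedPredictor(nn.Module):
    """Sketch of the time-conditioned encoder-decoder (details partly assumed)."""

    def __init__(self):
        super().__init__()
        # Image-encoding branch: feature maps of size 120 -> 60 -> 30 -> 15 with
        # 32, 64 and 128 channels; pooling is implemented as a 2x2 strided convolution.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=2, stride=2), nn.ReLU(),    # 120 -> 60
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=2, stride=2), nn.ReLU(),    # 60 -> 30
            nn.Conv2d(64, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=2, stride=2), nn.ReLU(),  # 30 -> 15
            nn.Conv2d(128, 128, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 7200), nn.ReLU(),
            nn.Linear(7200, 4096), nn.ReLU(),
        )
        # Time-encoding branch: fully-connected layers of size 64.
        self.time_encoder = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Decoder: 2 fully-connected layers, then alternating "unpooling"
        # (stride-2 transpose convolutions) and transpose-convolution layers.
        self.decoder_fc = nn.Sequential(
            nn.Linear(4096 + 64, 7200), nn.ReLU(),    # 4,160-dimensional embedding
            nn.Linear(7200, 128 * 15 * 15), nn.ReLU(),
        )
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2), nn.ReLU(),  # 15 -> 30
            nn.ConvTranspose2d(128, 64, kernel_size=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2), nn.ReLU(),    # 30 -> 60
            nn.ConvTranspose2d(64, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2), nn.ReLU(),    # 60 -> 120
            nn.ConvTranspose2d(32, 1, kernel_size=5, padding=2), nn.Sigmoid(), # assumed output activation
        )

    def forward(self, image, dt):
        appearance = self.image_encoder(image)                  # (batch, 4096)
        time_code = self.time_encoder(dt)                       # (batch, 64)
        embedding = torch.cat([appearance, time_code], dim=1)   # (batch, 4160)
        features = self.decoder_fc(embedding).view(-1, 128, 15, 15)
        return self.decoder_conv(features)                      # (batch, 1, 120, 120)
```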

3.2 Dataset

We use the KTH human action recognition dataset [18] for evaluating our proposed method. The dataset consists of 6 different human actions, namely: walking, jogging, running, hand-clapping, hand-waving and boxing. Each action is performed by 25 actors. There are 4 video recordings for each action performed by each actor. Inside every video recording, the action is performed multiple times and information about the time when each action starts and ends is provided with the dataset.

To evaluate our proposed method, we randomly split the dataset by actors into a training set with 80% of the actors and a testing set with the remaining 20%. This ensures that no actor appears in both the training and the testing split, so that the network must generalize to people with different appearances rather than overfit to the characteristics of specific actors. The dataset provides video segments of each motion in two directions, e.g. walking from right to left and from left to right. This makes it a good setup for checking whether the network is able to understand human poses and locations and correctly anticipate the direction of movement. The dataset was preprocessed as follows: frames of original size \(160\,\times \,120\) px were cropped to \(120\,\times \,120\) px, and the starting/ending times of each action were adjusted accordingly to match the new cropped area. Time was estimated based on the video frame rate and the respective frame number.
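For reference, a small sketch of this preprocessing and of the frame-to-time conversion is given below; a horizontally centered crop and the 25 fps frame rate of the KTH videos are our assumptions about details not fixed in the text.

```python
import numpy as np

def crop_frame(frame):
    """Crop a 160x120 KTH frame to 120x120 (a horizontally centered crop is assumed)."""
    height, width = frame.shape            # expected (120, 160)
    left = (width - 120) // 2
    return frame[:, left:left + 120]

def frame_time_ms(frame_index, fps=25.0):
    """Timestamp of a frame in milliseconds, estimated from the frame rate."""
    return 1000.0 * frame_index / fps
```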

3.3 Experimental Results

Our method is evaluated as follows: an image at a considered time \(t_0=0\) and a time difference \(\varDelta t\) are given as input. The output represents the anticipated future frame at time \(t_0+\varDelta t\), where \(\varDelta t\) is the number of milliseconds after the provided input image.

Fig. 2. Comparison of predictions for (a) a person walking to the left, (b) a person walking to the right, (c) a person waving their hands and (d) a person slowly clapping with their hands. The third set of images in each group represents the actual future frame, i.e. the groundtruth.

The sequential encoder-decoder baseline is evaluated by presenting solely an image, considered at time \(t_0=0\), and expecting an image anticipating the future at \(t_0 + \varDelta t_b\) as output. This image is then fed back into the network to produce an anticipation of the future at time \(t_0 + k \varDelta t_b\), \(k=1, 2, 3, ...\).
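A sketch of this iterative rollout is shown below, assuming a baseline model that maps a frame tensor directly to the frame one fixed step \(\varDelta t_b\) ahead; the function and variable names are illustrative.

```python
import torch

@torch.no_grad()
def baseline_rollout(baseline_model, I_0, num_steps):
    """Iterative baseline: each prediction at t0 + k*dt_b is fed back as the next input."""
    predictions = []
    current = I_0                            # frame tensor at t0, shape (1, 1, 120, 120)
    for _ in range(num_steps):
        current = baseline_model(current)    # prediction errors accumulate across iterations
        predictions.append(current)
    return predictions
```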

For simplicity, we consider \(t_0 = 0\) ms and refer to \(\varDelta t\) simply as t. It is important to note that our method models time as a continuous variable. This enables the model to predict future appearances at previously unseen time intervals, as in Fig. 3. The model is trained on the temporal displacements defined by the framerate of the training videos. Due to the continuity of the temporal variable, it can successfully generate predictions for: (i) temporal displacements found in the videos (e.g. t={40 ms, 80 ms, 120 ms, 160 ms, 200 ms}), (ii) unseen temporal displacements within the range of values found in the training videos (e.g. t={60 ms, 100 ms, 140 ms, 180 ms}) and (iii) unseen temporal displacements beyond the maximal value encountered during training (e.g. t=220 ms).

Fig. 3. Prediction of seen and unseen temporal displacements.

Figure 2(a) illustrates a person moving from right to left, from the camera viewpoint, at walking speed. Despite the blurring, especially around the left leg when predicting for \(t\,=\,120\) ms, our network correctly estimates the location of the person and the position of the body parts. Figure 2(b) illustrates a person walking from left to right. Our proposed network correctly localizes the person and the body parts. The network is able to estimate the body pose, and thus the direction of movement, and correctly predicts the displacement of the person to the right for any given time difference. The network captures the characteristics of the human gait, as it correctly predicts the alternation in the position of the legs. The anticipated future frame is realistic but not always perfect, as it is hard to perfectly estimate walking velocity solely from one static image. This can be seen at \(t\,=\,200\) ms in Fig. 2(b): our network predicts one leg further behind, while the actor, as seen in the groundtruth, is moving slightly faster and has already moved that leg past the knee of the other leg.

Fig. 4. Long-distance predictions. For larger temporal displacements, artifacts become visible. The anticipated location of the person begins to differ from the groundtruth towards the end of the total motion duration.

Our proposed network is able to learn an internal representation that encodes the stance of the person, such that it correctly predicts the location of the person as well as their new body pose after the given temporal displacement. The baseline network has no notion of time and therefore relies on iterative predictions, which affects its performance. Figure 2 shows that the baseline network loses the ability to correctly anticipate body movement after some time. In Fig. 2(a), the baseline network correctly predicts the position of the legs up to \(t\,=\,80\) ms; after that, it still predicts the global displacement of the person correctly, but body part movements are not anticipated correctly. At \(t\,>\,160\) ms the baseline network shows a large loss of detail, enough to prevent it from correctly modelling body movement. As a result, it displays fused legs where they should be separated as part of the next step the actor is making. Our proposed architecture correctly models both the global displacement of the person and the body pose, even at \(t\,=\,200\) ms (Fig. 4).

Figure 2(c) displays an actor handwaving. Our proposed network successfully predicts the upward movement of the arms and generates images accordingly. Here, however, more artifacts are noticeable due to the bidirectional motion of the hands during handwaving, which is ambiguous. It is important to note that although every future anticipation is independent of the others, they are all consistent: it does not happen that the network predicts one movement for \(t_1\) and a different movement for \(t_2\) that is inconsistent with the \(t_1\) prediction. This is a strong indicator that the network learns an embedding of appearance changes over time, learns the necessary filters over the relevant image areas, and synthesizes correct future anticipations.

As expected, not every action is equally challenging for the proposed architecture. Table 1 illustrates MSE scores averaged over multiple time differences, t, and over different predictions from the KTH test set. MSE scores were computed on dilated edges of the groundtruth images to only analyze the area around the person and remove the influence of accumulated variations of the background. A Canny edge detector was applied to the groundtruth images; the edges were dilated by 11 px and used as a mask for both the groundtruth image and the predicted image. MSE values were computed solely on the masked areas. We compare our proposed method with the baseline CNN architecture. The average MSE scores, given in Table 1, show that our proposed method outperforms the encoder-decoder CNN baseline by an average margin of 13.41, which we attribute to the iterative prediction process of the baseline network.
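A sketch of this masked evaluation using OpenCV is given below; the Canny thresholds and the interpretation of the 11 px dilation as an \(11\,\times \,11\) structuring element are assumptions not specified in the text.

```python
import cv2
import numpy as np

def masked_mse(groundtruth, prediction, dilation_px=11, canny_low=100, canny_high=200):
    """MSE restricted to a dilated edge mask computed on the ground-truth frame.

    Both images are uint8 grayscale of equal size; the Canny thresholds are assumed.
    """
    edges = cv2.Canny(groundtruth, canny_low, canny_high)
    kernel = np.ones((dilation_px, dilation_px), np.uint8)
    mask = cv2.dilate(edges, kernel) > 0                     # area around the person
    diff = groundtruth.astype(np.float64) - prediction.astype(np.float64)
    return float(np.mean(diff[mask] ** 2))
```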

Table 1. Average MSE over multiple time distances and multiple video predictions, on the different action categories of KTH. We compare our method with the iterative baseline CNN, and show that our method on average performs better than the baseline in terms of MSE (lower is better).

3.4 Ambiguities and Downsides

There are a few key factors that make prediction more difficult and cause either artifacts or loss of details in the predicted future frames. Here we analyze these factors.

(i) Ambiguities in body pose happen when the subject is in a pose that does not contain inherent information about the future. A typical example is a person waving, moving their arms up and down. If an image with the arms at a near-horizontal position is fed to the network as input, this can result in small artifacts, as visible in Fig. 2(c), where for larger time intervals t there are visible artifacts that belong to a downward arm movement. A more extreme case is shown in Fig. 5(a), where not only does the network predict the movement incorrectly, but it also generates many artifacts with a significant loss of detail, which increases with the time difference, t.

Fig. 5. Examples of poorly performing future anticipations: (a) loss of details in waving, (b) loss of details in jogging, (c) extreme loss of details in running, (d) loss of details with low contrast and (e) artifacts in boxing.

(ii) Fast movement causes loss of detail when the videos provided for training do not offer a high enough framerate. Examples of this can be seen in Figs. 5(b) and (c), where the increased speed of jogging and the even higher speed of running cause a significant loss of detail. Although our proposed architecture can generate predictions at arbitrary time intervals t, the network is still trained on discretized time intervals derived from the video framerate. These may not be sufficient for the network to learn a good model. We believe this causes the loss of detail and the artifacts, and that using higher-framerate videos during training would alleviate this.

(iii) Decreased contrast between the subject and the background describes the case where the intensity values corresponding to the subject are similar to those of the background. This leads to an automatic decrease of the MSE values and makes convergence of the network more difficult for such cases, which in turn causes loss of detail and artifacts. This can be seen in Fig. 5(d). Such an effect would be less prominent if color images were used during training.

(iv) Excessive localization of movements happens when the movements of the subject are small and localized. A typical example is the boxing action in the KTH dataset. Since the hand movement is close to the face and the hand is only sporadically extended, the network has more difficulty tackling this. Although the network predicts a feasible movement, artifacts often appear for larger time intervals t, as visible in Fig. 5(e).

Although the situations enumerated above lead our proposed architecture to predictions that display loss of detail and artifacts, most of them can be tackled by increasing the framerate or the resolution of the training videos, or by using RGB information.

4 Conclusion

In this work, we present a convolutional encoder-decoder architecture with a separate input branch that models time in a continuous manner. The aim is to provide anticipations of future video frames for arbitrary positive temporal displacements \(\varDelta t\), given a single image at current time \((t_0\,=\,0)\). We show that such an architecture can successfully learn time-dependent motion representations and synthesize accurate anticipations of future appearance for arbitrary time differences \(\varDelta t > 0\). We compare our proposed architecture against a baseline consisting of an analogous convolutional encoder-decoder architecture that has no notion of time and relies on iterative predictions. We show that our method outperforms the baseline both in terms of visual similarity to the groundtruth future video frames and in terms of mean squared error with respect to them. We additionally analyze the drawbacks of our architecture and present possible solutions to tackle them. This work shows that convolutional neural networks can inherently model time without an explicit time-domain representation. This is a novel notion that can be extended further and that generates high-quality anticipations of future video frames for arbitrary temporal displacements, without explicitly modelling the time period between the provided input video frame and the requested anticipation.