Flow-aware synthesis: A generic motion model for video frame interpolation

A popular and challenging task in video research, frame interpolation aims to increase the frame rate of video. Most existing methods employ a fixed motion model, e.g., linear, quadratic, or cubic, to estimate the intermediate warping field. However, such fixed motion models cannot adequately represent the complicated non-linear motions found in the real world or in rendered animations. Instead, we present an adaptive flow prediction module to better approximate the complex motions in video. Furthermore, interpolating just one intermediate frame between consecutive input frames may be insufficient for complicated non-linear motions. To enable multi-frame interpolation, we introduce time as a control variable in our generic adaptive flow prediction module when interpolating frames between the original ones. Qualitative and quantitative experimental results show that our method produces high-quality results and outperforms existing state-of-the-art methods on popular public datasets.


Introduction
Video frame interpolation aims to synthesize one or more intermediate frames between original frames. It is a fundamental and important task, especially in the fields of video research and film production.
Due to restrictions of camera sensors and network bandwidth, many videos on the Internet have a low frame rate, especially old films and sports videos, and frame interpolation methods can greatly enhance their temporal quality. In addition, interpolation is widely used in many other applications, including video compression [1,2], medical imaging [3], and view synthesis [4].
Thanks to deep neural networks, we have witnessed remarkable improvements in video frame interpolation [5-15]. However, it is still very challenging due to diverse factors, e.g., variations in lighting conditions, occlusion, and non-linear motion. Most state-of-the-art frame interpolation methods explicitly assume motions of objects or the background between consecutive frames to be linear [5,6,15-17], i.e., the velocity of each object moving from one frame to another is constant in screen space. Nevertheless, many motions in the real world observed in video frames exhibit complex non-linear behaviour, leading to the problem we illustrate with a simple example, the 2D path of a ball, in Fig. 1. Given the positions of the ball at four times t = 0, . . . , t = 3, a linear predictive model would wrongly estimate the location of the ball at time t = 1.5, due to the constant velocity assumption in the linear motion model. Recently, some higher order motion models have been employed, such as a quadratic [14] or even a cubic motion model [8], to help overcome this issue. In general, the motions of the camera and objects in the scene are irregular, due to variations in forces in the real world. This means that quadratic and cubic models still cannot adequately represent the complex motion patterns in the real world, which are beyond any fixed model. Examples similar to the one in Fig. 1 can also be found that show the limitations of these fixed motion models.
In this paper, we present a flow-aware video frame interpolation (FAI) method to address generic non-linear motions. Without making any physical assumptions about the motion, we propose an adaptive flow prediction module which can dynamically estimate the motions for each frame sequence from the input optical flows between successive frames. By doing so, we can bypass the limitations of fixed motion models and allow the network to learn complex motion patterns in a data-driven manner.
As a further issue, interpolating only one intermediate frame between two adjacent frames is usually insufficient for a good representation of non-linear motion; multiple intermediate frames are needed to better represent its details. It is straightforward for methods with fixed physical motion models to interpolate multiple intermediate frames: a time-dependent physical equation of motion can be used to calculate the required optical flow, so intermediate frames can be synthesized by warping the reference frame or features accordingly. In contrast, it is harder for interpolation methods without a physical motion model to interpolate frames at an arbitrary time. To address this problem, we use time as the control variable for our adaptive flow prediction module to allow it to learn the motion parameterized by time. By doing so, the model can predict the required flow at an arbitrary time t ∈ [0, 1] and thus supports interpolating multiple intermediate frames.
The proposed method is a generic motion model for video frame interpolation. We use four consecutive frames as the input to estimate complex motions in this paper, but the method is scalable to more complex motions by using more input frames. The contributions of this paper can be summarized as:
• an adaptive flow prediction module which overcomes the limitation of linear motion models and better represents the complex motions in real-world scenarios;
• the use of the temporal interval as a control variable to predict motion, enabling our network to synthesize intermediate frames at an arbitrary time between the two input frames; and
• a generic motion model for video interpolation, which can be further extended to support more complex motions with slight modification.

Flow-based methods
Flow-based methods are intuitive and have attracted much attention in recent years. The basic idea is to estimate the optical flow between consecutive original frames, and then warp the original frames or features using the required flow, which is computed from the estimated optical flow, to synthesize the intermediate frames.
Our method belongs to this type of method. As a pioneer of flow-based methods, Liu et al. [17] proposed the use of a fully-convolutional network, called deep voxel flow (DVF), to predict 3D optical flow vectors across space and time for each pixel, and then synthesize the target frame by trilinear interpolation. To address inaccurate optical flows and occlusion issues, Super SloMo [16] employed two U-Nets to refine the required flow and learn soft visibility maps for blending occluded regions. CyclicGen [18] further made use of edge information and designed a novel cycle consistency loss to produce better details with less training data.
Recently, rather than implicitly training an optical flow network within the framework, more and more methods have directly employed an estimator that is well-trained with ground truth optical flow from other large-scale datasets, to get more accurate optical flow. For example, Bao et al. [5] integrated FlowNet [19] within the proposed MEMC-Net, Xue et al. [15] adopted SpyNet [20] in their proposed ToFlow, and DAIN [6] utilized PWC-Net [21] as its flow estimator. To further enhance details and address challenging scenarios, e.g., occlusion and large motions, Niklaus and Liu [11] and Yuan et al. [22] proposed to warp not only the input frames but also the contextual features extracted by ResNet [23]. To address blending of occluded regions, DAIN [6] leveraged depth information, while SoftSplat [12] learned blending weights for each overlapped pixel from a depth-related importance mask. Park et al. [24] proposed the BMBC method to estimate the bilateral motions using a bilateral cost volume to obtain more accurate optical flow.
All of the above methods explicitly or implicitly assume the motions of objects or the background in the input frames are linear. A linear model can well approximate simple motions, but it is insufficient for more complex motions, as noted. To represent more complex motions, Xu et al. [14] used a quadratic motion model, and Chi et al. [8] presented a cubic motion model. However, the movements of objects and background are irregular in the real world, due to variations in forces. No matter whether linear or quadratic or cubic motion models are used, all are still fixed physical motion models that cannot well represent the extremely complex motion patterns in the real world. Unlike the above methods, our method can dynamically estimate the motion model for each frame sequence from the input optical flows between successive frames.
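To make the fixed models discussed above concrete, the following minimal sketch evaluates a constant-velocity (linear) flow and a constant-acceleration (quadratic) flow of the kind used in QVI [14] for a single pixel. The function names and the toy trajectory `true_disp` are illustrative assumptions and are not taken from this paper.

```python
def linear_flow(f01, t):
    """Constant-velocity model: displacement from frame 0 to time t."""
    return tuple(t * c for c in f01)


def quadratic_flow(f0m1, f01, t):
    """Constant-acceleration model built from the flows 0 -> -1 and 0 -> 1."""
    return tuple(0.5 * (a + b) * t * t + 0.5 * (a - b) * t
                 for a, b in zip(f01, f0m1))


def true_disp(tau):
    """Hypothetical trajectory: cubic in x, quadratic in y (relative to frame 0)."""
    return (tau ** 3, tau ** 2)


f01, f0m1 = true_disp(1.0), true_disp(-1.0)
print("true     :", true_disp(0.5))
print("linear   :", linear_flow(f01, 0.5))
print("quadratic:", quadratic_flow(f0m1, f01, 0.5))
```

On this toy path the quadratic model recovers the y component exactly but both fixed models miss the cubic x component, which is the kind of mismatch a learned, sequence-specific model is meant to avoid.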

Kernel-based methods
Unlike flow-based methods that explicitly estimate motions, kernel-based methods deal with motions in an implicit way. They directly estimate kernels for convolving with input frames to produce intermediate frames. As a pioneer, Long et al. [25] proposed to regress the target frame from the two input frames using a CNN, but the results tended to be blurry. To produce more visually pleasing frames, Niklaus et al. [13] proposed to estimate a 2D convolution kernel for each pixel to capture the motion. The output frame can be synthesized by locally convolving the two input frames with the estimated kernels. This approach can produce intermediate frames with sharper edges, but it is extremely memory-consuming, as it estimates an independent large kernel (41×41) for every output pixel. To improve memory efficiency, Niklaus et al. [26] proposed the use of 1D separable convolution kernels instead of 2D kernels, but this still fails to synthesize plausible results for motion larger than the kernel size. Recently, AdaCoF [9] proposed to estimate not only the kernel weights but also offsets for each pixel to support larger motions. DSepConv [7] adopted deformable separable convolution [27,28] to replace conventional convolution to address large motions with a smaller kernel size.
Since all the kernel-based methods use convolution kernels to implicitly model the motion between frames, they cannot directly interpolate multiple intermediate frames between the two input frames. Although we can recursively feed the interpolated frames back into their model to produce multiple intermediate frames, this approach leads to error accumulation. Also, this solution implicitly assumes a linear motion model, so cannot well represent complex motions in the real world, as discussed above.

Other methods
Besides flow-based and kernel-based methods, several other novel methods have been proposed to interpolate frames. Meyer et al. [29] proposed a phase-based method that combines phase information across the levels of a multi-scale pyramid. It provides an efficient alternative to optical flow, but large motions of high frequency components cannot be well represented by the estimated phase. In order to alleviate this issue, Meyer et al. [10] proposed PhaseNet to combine the phase-based motion representation with a neural network decoder, to improve robustness. Recently, FeFlow [30] was devised to synthesize intermediate frames in a structure-to-texture manner. It divides the video frame interpolation task into two steps: structure-guided interpolation and texture refinement; an attention mechanism is employed in their method to better handle occlusions. CAIN [31] adopted channel attention to spread the information in feature maps into multiple channels and extracted motion clues from them.
Like kernel-based methods, these methods also have to adopt recursive processing to interpolate multiple intermediate frames, again leading to error accumulation and implicitly assuming a linear motion model, while our method can dynamically estimate non-linear motion models for each frame sequence and directly interpolate multiple intermediate frames.

Overview
The goal of frame interpolation is to increase the frame rate of video. Given four consecutive video frames $I_{-1}$, $I_0$, $I_1$, and $I_2$, our goal is to interpolate a frame $I_t$ that is temporally between $I_0$ and $I_1$. The overall pipeline of our method is shown in Fig. 2. Our method can be divided into two stages: adaptive flow prediction and frame synthesis.
To better use the input information, we regard both $I_0$ and $I_1$ as reference frames. As shown in Fig. 2, for each reference frame, we estimate a group of basic optical flows from it to the other three input frames, denoted by $\{f_{0\to -1}, f_{0\to 1}, f_{0\to 2}\}$ for $I_0$ and $\{f_{1\to -1}, f_{1\to 0}, f_{1\to 2}\}$ for $I_1$. To model the complicated non-linear motion, each group of basic flows and the time $t$ are then fed into the proposed adaptive flow prediction (AFP) module to predict the optical flow from the reference frame to the target frame, represented as $f_{0\to t}$ and $f_{1\to t}$ for the two reference frames, respectively. Here, the input time $t$ is used to control the time at which the required flow is predicted, so our method can interpolate multiple intermediate frames as needed.
If we directly warp the reference frames ($I_0$ and $I_1$) with the required flows to produce the warped frames ($\hat{I}_0$ and $\hat{I}_1$) and fuse them to give the final results, the results tend to be blurred. Therefore, we further use a pyramid context extractor to extract multi-scale contextual features for the reference frames, denoted $\{F^1_0, F^2_0, F^3_0\}$ and $\{F^1_1, F^2_1, F^3_1\}$ for $I_0$ and $I_1$, respectively. We then employ a forward warping layer [12] to warp not only the reference frames ($I_0$ and $I_1$) but also their multi-scale contextual features. Finally, the warped reference frames ($\hat{I}_0$, $\hat{I}_1$) and the warped pyramid contextual features are fed into the frame synthesis network to produce a residual map between the ground truth and the average blend of the frames $\hat{I}_0$ and $\hat{I}_1$. Our final prediction of the interpolated frame is obtained by summing the residual map and the average blend of $\hat{I}_0$ and $\hat{I}_1$. We provide the details of each component of our approach in the following sections.

Adaptive flow prediction
The goal of the adaptive flow prediction (AFP) module is to dynamically estimate the motions for each frame sequence from the basic optical flows between successive input frames, to better represent complex motions than traditional linear, quadratic, or cubic motion models. We employ the off-the-shelf method PWC-Net [21] to produce the basic flows, as it is a state-of-the-art optical flow estimation method widely used in recent research. For each reference frame ($I_0$ and $I_1$), we estimate three different basic optical flows. Note that to support interpolating multiple intermediate frames, we need to control the target time of the required flow. Therefore, we also expand the target time $t$ into a tensor of shape $H \times W \times 1$, where $H$ and $W$ are the height and width of the input frames, and concatenate it with the three optical flows as the input to the AFP module to generate the flows $f_{0\to t}$ and $f_{1\to t}$. Mathematically, this procedure can be represented as
$$f_{0\to t} = \mathrm{AFP}(f_{0\to -1}, f_{0\to 1}, f_{0\to 2}, t) \qquad (1)$$
$$f_{1\to t} = \mathrm{AFP}(f_{1\to -1}, f_{1\to 0}, f_{1\to 2}, t) \qquad (2)$$
By analyzing the motion patterns within the input basic flows, the AFP module estimates the respective required bi-directional optical flows from reference frames $I_0$ and $I_1$ to the intermediate target frame $I_t$. Since multilayer perceptrons (MLPs) are known to be powerful function approximators, we employ them to approximate the complex non-linear motions of the input consecutive frames. As shown in Eq. (1), the input basic flow group $\{f_{0\to -1}, f_{0\to 1}, f_{0\to 2}\}$ and the output required flow $f_{0\to t}$ are spatially aligned with the same reference frame $I_0$; a similar observation holds for Eq. (2). Therefore, we can use 1 × 1 convolutions to realize the temporal MLP. Without any spatial resampling, the AFP module only needs to learn the motion patterns from the input basic flow group and to estimate the required flow values.
The architecture of our AFP module is shown in Fig. 3; it consists of six convolution layers. The first five layers use Leaky Rectified Linear Units (LeakyReLU) [32] as the activation function; the last layer acts as the output layer, so no activation function is applied. The kernel size in each convolution layer is set to one, as discussed above. The shape of the feature map in each layer is $H \times W \times C$: the spatial size of the feature maps remains unchanged within the AFP module, while the number of channels $C$ increases at the beginning and decreases to two at the end, as the predicted flow for each pixel should be a 2D vector. The AFP module learns to predict $f_{0\to t}$ from the stacked flows and time $t$, so we refer to the proposed method (FAI) as a flow-aware synthesis method.
Note that our method is scalable to addressing more complex motions. With slight modification, further basic flows can be fed into our AFP module to extend its capability to approximate more complex motions. This can be described as
$$f_{0\to t} = \mathrm{AFP}(f_{0\to t_1}, \ldots, f_{0\to t_n}, t) \qquad (3)$$
where $t_1$ to $t_n$ denote a sequence of times, assuming frames at those time steps can be used as inputs. $f_{1\to t}$ can be determined in a similar way.
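Because every layer has kernel size one, the AFP module acts as the same small MLP applied independently at each pixel. The sketch below illustrates this in plain Python under several assumptions: the channel widths in `WIDTHS`, the LeakyReLU slope, and the random untrained weights are hypothetical placeholders, since the text only states that there are six layers and that the output has two channels.

```python
import random


def leaky_relu(x, slope=0.1):  # the slope value is an assumption
    return x if x > 0 else slope * x


def make_layers(widths, seed=0):
    """Random weights for a chain of 1x1 convolutions (untrained placeholder)."""
    rng = random.Random(seed)
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
                  for _ in range(c_out)]
        bias = [0.0] * c_out
        layers.append((weight, bias))
    return layers


def pixel_mlp(vec, layers):
    """Apply the layer chain to the channel vector of a single pixel."""
    for i, (weight, bias) in enumerate(layers):
        vec = [sum(w * v for w, v in zip(row, vec)) + b
               for row, b in zip(weight, bias)]
        if i < len(layers) - 1:  # no activation on the output layer
            vec = [leaky_relu(v) for v in vec]
    return vec


def afp_forward(flows, t, layers):
    """flows: three H x W grids of (u, v) pairs; returns an H x W grid of 2-vectors."""
    height, width = len(flows[0]), len(flows[0][0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Concatenate the three basic flows with the broadcast time channel.
            feat = [c for f in flows for c in f[y][x]] + [t]
            row.append(pixel_mlp(feat, layers))
        out.append(row)
    return out


# Hypothetical widths: 7 input channels (3 flows x 2 + time), 6 layers, 2 outputs.
WIDTHS = [7, 16, 32, 32, 16, 8, 2]
layers = make_layers(WIDTHS)
```

Calling `afp_forward(flows, 0.5, layers)` returns one 2D vector per pixel, and pixels with identical input flows receive identical outputs, reflecting the absence of spatial resampling.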

Frame synthesis
Having predicted the required optical flow using the AFP module, we need to warp the reference frames ($I_0$ and $I_1$) to the target time step. To better use the information inside the reference frames, we warp not only the frames themselves but also their multi-scale contextual features to the target time step. We employ the pyramid feature extraction network proposed by Niklaus and Liu [12] to extract multi-scale contextual features of the reference frames at three scales, $F^1$, $F^2$, and $F^3$. The multi-scale contextual features for $I_0$ and $I_1$ are denoted by $\{F^1_0, F^2_0, F^3_0\}$ and $\{F^1_1, F^2_1, F^3_1\}$, respectively. Our AFP module generates flows $f_{0\to t}$ and $f_{1\to t}$ aligned with the reference frames $I_0$ and $I_1$, respectively. Therefore, forward warping is a more suitable way to obtain the warped frames and contextual features than backward warping. We adopt the differentiable forward warping layer proposed in softmax splatting [12], so the whole framework can be trained jointly. Note that we resize the predicted flows ($f_{0\to t}$ and $f_{1\to t}$) to each scale in the pyramids of multi-scale contextual features, and rescale the flow vector values accordingly, to allow us to warp the pyramidal contextual features.
Forward warping can leave holes due to occlusion. To fill in the missing information and enhance the details in the final synthesized frame, we employ GridNet [33] as our frame synthesis network. GridNet contains three rows and six columns. Inspired by Niklaus and Liu [11], we adopt bilinear upsampling in GridNet to avoid checkerboard artifacts, and incorporate parametric rectified linear units to stabilize training. Specifically, we concatenate the warped input frames ($\hat{I}_0$, $\hat{I}_1$) and the first level of contextual features ($F^1$) as input to the first row of GridNet, and feed the second ($F^2$) and third ($F^3$) level contextual features into the second and third rows of GridNet, respectively. To encourage convergence, we let the frame synthesis network learn the residual map between the ground truth target frame and the average blend of the warped reference frames ($\hat{I}_0$ and $\hat{I}_1$).
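The resizing and rescaling of the predicted flows for coarser pyramid levels can be sketched as follows. This is a minimal illustration only: average pooling as the resize operator, the integer `factor`, and the function name `downscale_flow` are assumptions, since the text does not specify the resampling method.

```python
def downscale_flow(flow, factor):
    """Average-pool an H x W grid of (u, v) vectors and divide the vectors by `factor`."""
    height, width = len(flow), len(flow[0])
    out = []
    for y in range(0, height - height % factor, factor):
        row = []
        for x in range(0, width - width % factor, factor):
            block = [flow[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            mean_u = sum(vec[0] for vec in block) / len(block)
            mean_v = sum(vec[1] for vec in block) / len(block)
            # A displacement of d pixels at full resolution is d / factor
            # pixels on the coarser grid, so the values shrink as well.
            row.append((mean_u / factor, mean_v / factor))
        out.append(row)
    return out


full_res = [[(4.0, -2.0)] * 4 for _ in range(4)]
half_res = downscale_flow(full_res, 2)
```

Here a uniform flow of (4, -2) pixels on a 4 × 4 grid becomes (2, -1) pixels on the 2 × 2 grid, so the same physical motion is described at both scales.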
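To illustrate why forward warping leaves holes, the sketch below implements a deliberately simplified nearest-neighbour forward warp that averages colliding pixels. It is not the softmax splatting layer [12] adopted in the method; the function name and toy inputs are hypothetical.

```python
import math


def forward_warp(image, flow):
    """Push each source pixel along its flow; return the warped image and a hole mask."""
    height, width = len(image), len(image[0])
    acc = [[0.0] * width for _ in range(height)]
    cnt = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            u, v = flow[y][x]
            tx = math.floor(x + u + 0.5)  # nearest target column
            ty = math.floor(y + v + 0.5)  # nearest target row
            if 0 <= tx < width and 0 <= ty < height:
                acc[ty][tx] += image[y][x]
                cnt[ty][tx] += 1
    warped = [[acc[y][x] / cnt[y][x] if cnt[y][x] else 0.0
               for x in range(width)] for y in range(height)]
    holes = [[cnt[y][x] == 0 for x in range(width)] for y in range(height)]
    return warped, holes


image = [[1.0, 2.0, 3.0, 4.0]]
shift_right = [[(1.0, 0.0)] * 4]
warped, holes = forward_warp(image, shift_right)
```

Shifting every pixel one step to the right leaves the first column without any source pixel, which is exactly the kind of gap the synthesis network has to fill.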

Loss functions
Inspired by Refs. [9,26], we consider two different types of loss functions in our method: color loss $L_1$ and combination loss $L_{com}$. The color loss and combination loss make our network focus on quantitative quality and visual quality, respectively.

Color loss
The color loss $L_1$ is defined as the $L_1$ norm of the pixelwise color difference. In practice, following recent works [6,24], we optimize the $L_1$ norm using the Charbonnier penalty function [34]. Mathematically, it can be written:
$$L_1 = \rho(I_t - I_{gt}), \qquad \rho(x) = \sqrt{x^2 + \epsilon^2}$$
where $\epsilon$ is set to $10^{-6}$ in our experiments.
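As an illustration of the color loss, the sketch below implements the Charbonnier penalty in its common form, the square root of the squared difference plus a small constant squared, with the constant set to 10^-6 as in the text. Averaging over pixels and the function name are assumptions made for this example.

```python
import math


def charbonnier_loss(pred, target, eps=1e-6):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over flattened pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - q) ** 2 + eps ** 2) for p, q in zip(pred, target))
    return total / len(pred)


loss_same = charbonnier_loss([0.2, 0.4], [0.2, 0.4])  # close to eps
loss_diff = charbonnier_loss([0.0, 1.0], [0.0, 0.0])  # close to 0.5
```

For large differences the penalty behaves like the absolute error, while near zero it stays smooth and differentiable, which is the usual motivation for preferring it to a plain absolute value.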

Combination loss
It has been shown that introducing perceptual loss into image generation tasks can produce more visually pleasing results and sharper edges [35,36]. The basic idea is to supervise the synthesized results in the feature domain. Various feature extractors $\phi$ can be utilized to map the synthesized frame into feature space to compute the perceptual loss. We empirically adopt the relu4_4 layer of the VGG-19 network here. However, using perceptual loss by itself may lead to color distortion. Thus, we combine the $L_1$ loss and perceptual loss together to form a combination loss:
$$L_{com} = L_1 + \lambda \, L_{per}\big(\phi(I_t), \phi(I_{gt})\big)$$
where $L_{per}$ denotes the perceptual loss computed in the feature space of $\phi$, and $\lambda$ is set to $2 \times 10^{-5}$ in our experiments.

Experiments
In this section, we provide implementation details and a comparison with other state-of-the-art methods on widely used datasets. We also conduct several ablation study experiments to evaluate the effectiveness of the modules in our method.

Implementation details
Our training dataset consisted of two parts. The first contained 25 video clips with a frame rate of 240 fps and resolution of 720×1280, collected from YouTube. These video clips were diverse in terms of action and scene type. However, the cameras in these videos were mostly still. Thus, we randomly selected consecutive frames from the GOPRO [37] and Adobe240 [38] datasets as the second part of the training dataset. These were recorded with handheld cameras and therefore contain more complex motions. The final training dataset consisted of 14,819 training samples, each with 25 consecutive frames, following QVI [14]. Our model took the 1st, 9th, 17th, and 25th frames as inputs ($I_{-1}$, $I_0$, $I_1$, and $I_2$) to synthesize 7 frames $I_t$ at time steps $t = 0.125, 0.25, 0.375, \ldots, 0.875$, with the 10th to 16th frames serving as the ground truth $I_{gt}$. During the training phase, we resized the frames to 360×640 and randomly cropped them to 256×256. We also performed data augmentation by randomly flipping frames vertically and horizontally. In order to increase the diversity of our dataset, we randomly reversed the temporal order with probability 0.5. Our experiments were performed on a single NVIDIA TITAN RTX GPU.
Since we adopt forward warping, there are some gaps in $\hat{I}_0$ and $\hat{I}_1$. We observed degraded performance and difficulty in convergence when jointly training the whole network. In order to preserve the functionality of our AFP module, we chose to train AFP first, supervised by weak ground truth optical flows $f_{0\to t}$ and $f_{1\to t}$. The training loss for AFP was $L_1$. Next, we trained the whole network with the AFP module fixed. To train our network, we used AdaMax [39] with β = (0.9, 0.999) and a minibatch size of 8 samples. The initial learning rate was set to $2 \times 10^{-4}$ and reduced by a factor of 0.5 every 30 epochs. We trained our network with the flow estimation and AFP modules fixed for 70 epochs and then fine-tuned the flow estimation for another 10 epochs. We will release our source code upon publication.
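The way each 25-frame sample is split into inputs and targets can be sketched as below. The helper name `split_sample` and its parameters are hypothetical; the indices are 1-based to match the text.

```python
def split_sample(num_frames=25, stride=8):
    """Return the 1-based input indices and (index, t) pairs for the target frames."""
    inputs = list(range(1, num_frames + 1, stride))  # 1st, 9th, 17th, 25th
    ref0, ref1 = inputs[1], inputs[2]                # reference frames I_0 and I_1
    targets = [(i, (i - ref0) / (ref1 - ref0)) for i in range(ref0 + 1, ref1)]
    return inputs, targets


inputs, targets = split_sample()
for index, t in targets:
    print(f"frame {index:2d} -> t = {t:.3f}")
```

With the default settings, the seven target frames between the 9th and 17th frames map to equally spaced time steps that are multiples of 1/8.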
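The step-wise learning rate schedule described above can be written compactly as follows. This is a sketch under the assumption that epochs are counted from zero and that the decay is applied at epochs 30 and 60; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=2e-4, gamma=0.5, step=30):
    """Step decay: multiply the base rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# 70 epochs with the flow estimator fixed, then 10 epochs of fine-tuning.
schedule = [learning_rate(epoch) for epoch in range(80)]
```

Under these assumptions the rate is 2 × 10^-4 for the first 30 epochs, 1 × 10^-4 for the next 30, and 5 × 10^-5 for the remaining 20.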

Overview
We evaluated our approach on four widely used datasets, including two multi-frame interpolation datasets, GOPRO and Adobe240, and two single-frame interpolation datasets, DAVIS [40] and Vimeo90K septuplet [15].
For quantitative evaluation, we used peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [41], and interpolation error (IE) [42] between $I_t$ and $I_{gt}$ on the evaluation datasets, adopting root-mean-squared (RMS) difference for IE in our experiments.
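For reference, PSNR and the RMS interpolation error can be computed as in the following sketch, which operates on flattened lists of pixel values. The peak value of 255 and the function names are assumptions; SSIM is omitted because it requires windowed statistics.

```python
import math


def mean_squared_error(a, b):
    """Mean squared difference between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = mean_squared_error(a, b)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def interpolation_error(a, b):
    """Root-mean-squared difference between two frames."""
    return math.sqrt(mean_squared_error(a, b))


pred, truth = [10.0, 20.0], [13.0, 16.0]
print(psnr(pred, truth), interpolation_error(pred, truth))
```

Higher PSNR and lower IE both indicate a prediction closer to the ground truth, which is why the two metrics move in opposite directions in the comparisons.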

Multi-frame interpolation datasets
The GOPRO dataset consists of 33 high-quality videos, while Adobe240 consists of 133 videos; both were recorded by high-speed hand-held cameras and designed for benchmarking deblurring tasks. The frame rate is 240 fps and the resolution is 720×1280. Following QVI, we extracted 4275 samples of 25 consecutive frames from GOPRO and 8702 from Adobe240, and randomly separated the samples into training and testing parts: for GOPRO, 2775:1500; for Adobe240, an equal split. Following the test settings in QVI [14], we kept the resolution at 720×1280 for GOPRO and resized frames from Adobe240 to 360×640 during testing. For each sample, the 1st, 9th, 17th, and 25th frames ($I_{-1}$, $I_0$, $I_1$, and $I_2$) are used to synthesize the frames $I_t$ corresponding to the 10th to 16th frames.

Single-frame interpolation datasets
DAVIS is a video dataset originally designed for segmentation tasks, with a frame rate of 30 fps. Xu et al. [14] previously extracted 2849 quintuples ($I_{-1}$, $I_0$, $I_1$, $I_2$ as inputs and $I_{0.5}$ as the target) from DAVIS. We used this data and resized the frames to 480×856 for our evaluation. The Vimeo90K septuplet data was originally designed for video denoising, deblocking, and super-resolution, and contains 7824 samples of 7 consecutive frames with a resolution of 256×448. We took the 1st, 3rd, 5th, and 7th frames as inputs ($I_{-1}$, $I_0$, $I_1$, and $I_2$) to synthesize the 4th frame, corresponding to $I_{0.5}$ in our experiments.

Overview
We compared our method with five state-of-the-art interpolation methods: SepConv-$L_1$ [26], Super SloMo [16], QVI [14], DAIN [6], and AdaCoF [9]. We used the authors' released code for Super SloMo, QVI, DAIN, and AdaCoF, together with versions retrained on our training dataset. Since the training code of SepConv is not publicly available, we could not retrain it and directly used the released model in our experiments. Note that our model interpolates frames at an arbitrary input time. We compared the above methods on both the multi-frame and single-frame interpolation datasets.

Quantitative evaluation
A quantitative comparison for the multi-frame interpolation datasets is shown in Table 1. Following QVI [14], we evaluated different methods in two settings. Evaluation metrics for the 4th frame are denoted "center", while the average over all 7 interpolated frames is denoted "whole". We can see that on GOPRO and Adobe240 datasets, the proposed method consistently and significantly outperforms all the other methods which use a linear physical motion model assumption. Moreover, our method achieves 0.3 dB and 0.8 dB PSNR gains respectively compared with QVI which assumes a quadratic model for the motions. Similarly, our method also performs favorably against the other methods except for QVI on the single-frame interpolation datasets, as shown in Table 2.

Properties of methods
Following the analysis method in DSepConv [7], we analyse the properties of different methods in the following ways:
• number of parameters (Param.), in millions;
• number of input frames (Input);
• sub-networks used, including flow, kernel, context, and mask.
The results are shown in Table 3. Enc-Dec denotes the self-trained flow estimation module with an encoder-decoder network architecture. LH denotes the learned hierarchical feature extraction module defined in DAIN. In addition, the kernel in flow-based methods with bilinear interpolation operations is denoted bilinear(k), where k is the kernel size. In particular, we can see that our method uses the fewest parameters, only about half of the number used in other methods.

Qualitative evaluation
A qualitative evaluation can better show the visual differences in the results of different methods. A comparison of our method with other state-of-the-art methods on some challenging scenarios is shown in Fig. 4. These scenarios contain complex motions with both translational and rotational components; our adaptive flow prediction module can effectively address such complicated non-linear motions.
We can see in Fig. 4 that the results of our method are more visually pleasing, e.g., in the top row, our method synthesizes the flamingo legs more clearly than other methods. In the 3rd row, the frame generated by our method contains the whole wheel and less distortion in the crosswalk compared to results generated by other methods. Also, our method can better address occlusions, as shown in the bottom row, where our synthesized frame contains a clear background and fewer artifacts near edges compared to the results of other methods.
Moreover, as intended, the combination loss $L_{com}$ produces better visual results than the color loss $L_1$, as shown in Fig. 4, while $L_1$ outperforms $L_{com}$ quantitatively, as shown in Table 1.

Quality consistency evaluation
Besides average frame quality, consistency of frame quality along the time axis is also important to video quality. If the frame quality varies too much over time, viewers will perceive the video as flickering, leading to discomfort. To evaluate consistency of quality, we computed PSNR values at each time step of the results from different methods on the Adobe240 dataset: see Fig. 5. For all methods, the PSNR values at the center time index (t = 4) tend to be lower than at the marginal time indices (t = 1, t = 7). This is because the time difference between the input frames and the target frames is smaller at the marginal indices than at the center. The PSNR curve for our method is consistently above the curves of other methods, and it is also smoother, indicating that our method can produce both better frame quality and more consistent results.

Ablation study
To evaluate the effectiveness of the individual components of our model, we conducted ablation studies using the multi-frame interpolation datasets. We considered the following variants of our method: (1) Linear w/o t: without the adaptive flow prediction module; (2) Linear w/ t: without the adaptive flow prediction module but with time $t$ as input; (3) Ours w/o t: without $t$ as input; (4) Ours: the full proposed model. In (1) and (2), since there is no adaptive flow prediction module to predict the required flows $f_{0\to t}$ and $f_{1\to t}$ from the reference frames to the target frame, we simply apply linear flow combination to produce the intermediate flows as $t f_{0\to 1}$ and $(1 - t) f_{1\to 0}$. In (1) and (3), the network cannot synthesize a frame at an arbitrary time and can only generate a single in-between frame $I_{0.5}$, but we can recursively interpolate 7 frames for (1). Note that all the variants are trained with the $L_1$ loss.
The performance of the above variants was evaluated on the GOPRO, Adobe240, DAVIS, and Vimeo90K datasets, with results as shown in Table 4. We can see that the linear variants of our method (Linear w/o t and Linear w/ t) cannot compete with our full method, as they cannot adequately represent the complex motions in the dataset. Moreover, the performance of Ours w/o t for the center frame is slightly better than that of Ours, because the network can pay more attention to the quality of the center frame if we do not require it to interpolate other intermediate frames. Besides the average frame quality, the quality consistency of frames is also important, as mentioned above. Therefore, we also evaluated the quality consistency of our method and its variants. We used them to interpolate the seven intermediate frames, computed the PSNR value for each frame on the Adobe240 dataset, and plotted the PSNR values against the time index in Fig. 6. We can see that, regardless of whether the $L_1$ or $L_{com}$ loss is used, our method always produces more consistent results for all the consecutive frames than the linear variants of our method.
To visualize the difference between the flows estimated by the linear model and by our proposed AFP module, we show two examples in Fig. 7 from the DAVIS dataset that contain non-linear motions. The top row shows the optical flow $f_{0\to 0.5}$, while the second row shows $f_{1\to 0.5}$. We estimate flows between the input frames, e.g., $\{I_0, I_{0.5}\}$ and $\{I_1, I_{0.5}\}$, as the approximate ground truth (first column). The optical flows in the second and last columns are produced by the linear model and our proposed AFP module, respectively. We visualize the optical flow in HSV color space, where hue indicates direction. We can see that the hue of the visualization predicted by AFP is closer to the ground truth, indicating that AFP can better address non-linear motions than a conventional linear motion model.
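The two mechanisms used by the linear variants can be sketched as follows: scaling the bidirectional flows to the target time, and recursively inserting midpoints to reach seven intermediate frames from a model that only predicts the middle frame. Function names and the recursion depth parameter are illustrative assumptions.

```python
def linear_flows(f01, f10, t):
    """Linear variant: scale the flows 0 -> 1 and 1 -> 0 to the target time t."""
    return tuple(t * c for c in f01), tuple((1.0 - t) * c for c in f10)


def recursive_times(levels=3, lo=0.0, hi=1.0):
    """Time stamps obtained by repeatedly inserting midpoints between frames."""
    if levels == 0:
        return []
    mid = 0.5 * (lo + hi)
    return (recursive_times(levels - 1, lo, mid)
            + [mid]
            + recursive_times(levels - 1, mid, hi))


flow_0t, flow_1t = linear_flows((8.0, -4.0), (-8.0, 4.0), 0.25)
times = recursive_times(3)
```

Three levels of midpoint insertion yield the seven time steps from 0.125 to 0.875, but each level consumes frames synthesized by the previous one, which is where error accumulation enters.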
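A common way to produce such a visualization maps the direction of each flow vector to hue. The sketch below follows that convention for a single vector; mapping magnitude to saturation, the normalizer `max_mag`, and the function name are assumptions, since the text only states that hue indicates direction.

```python
import colorsys
import math


def flow_to_rgb(u, v, max_mag):
    """Map one flow vector to RGB: hue encodes direction, saturation encodes magnitude."""
    hue = (math.atan2(v, u) / (2.0 * math.pi)) % 1.0
    sat = min(math.hypot(u, v) / max_mag, 1.0) if max_mag > 0 else 0.0
    return colorsys.hsv_to_rgb(hue, sat, 1.0)


right = flow_to_rgb(1.0, 0.0, 1.0)   # hue 0
left = flow_to_rgb(-1.0, 0.0, 1.0)   # hue 0.5
still = flow_to_rgb(0.0, 0.0, 1.0)   # zero saturation
```

With this mapping, opposite directions receive complementary colors and static regions appear white, so a hue mismatch against the reference flow signals a direction error.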

Conclusions
We have presented a flow-aware multi-frame interpolation method to address the complicated non-linear motions in the real world by dynamically learning the motions for each frame sequence with our proposed adaptive flow prediction module. By introducing time as a control variable for the adaptive flow prediction module, our method can interpolate multiple intermediate frames from consecutive input frames, so that the interpolated videos better present complex non-linear motions. Our generic motion model is scalable and can be extended to support more complicated motions with more input frames. Extensive qualitative and quantitative experiments indicate that our method outperforms existing state-of-the-art methods on widely used datasets.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.