Homography-guided stereo matching for wide-baseline image interpolation

Image interpolation has a wide range of applications such as frame rate-up conversion and free viewpoint TV. Despite significant progress, it remains an open challenge, especially for image pairs with large displacements. In this paper, we first propose a novel optimization algorithm for motion estimation, which combines the advantages of both global optimization and a local parametric transformation model. We perform optimization over dynamic label sets, which are modified after each iteration using the prior of piecewise consistency to avoid local minima. Then we apply it to an image interpolation framework including occlusion handling and intermediate image interpolation. We validate the performance of our algorithm experimentally, and show that our approach achieves state-of-the-art performance.


Introduction
Image interpolation is a process that generates a new image using available images, which is useful for frame rate-up conversion [1], view synthesis [2], etc. In some applications, the available images have a wide baseline. Here, baseline means the translation and rotation that a camera undergoes to capture image pairs. For example, in virtual street roaming applications, users can teleport themselves from one street spot to another by clicking a directional arrow. In order to make the transition between discrete views smooth, it is important to realistically interpolate intermediate views between wide-baseline image pairs, since the sampled street views are usually far from each other.
Nie et al. [2] discussed the definition of various kinds of baselines, and divided them into three categories based on the median distance between the KITTI images [3]: small-baseline, medium-baseline, and wide-baseline. The basic idea of most image interpolation algorithms is to estimate the motion field of the input views and map them to the desired position. Traditional interpolation methods were usually designed for small-baseline images [4], and recent large displacement optical flow methods [5] can be regarded as medium-baseline algorithms. Due to the large translations and rotations involved, it is still a challenging problem to estimate the motion field for wide-baseline image pairs.
One classical approach to motion estimation is to consider it as a labeling problem, which can be formulated as a global optimization problem in a Markov random field. In other words, we need to select the best motion vector from a set of potential motion vectors for each pixel in the source image, to minimize an energy defined using prior assumptions such as brightness constancy and spatial smoothness. However, since the space of all possible motion vectors is usually too large, global optimization over the full image grid in this space has excessive computational requirements. To reduce the amount of computation, some approaches use a search window as the candidate label set [6]. However, for wide-baseline image pairs, the window size must be very large to avoid falling into local minima, which makes the optimization prohibitively slow. Other approaches use approximate nearest neighbors in feature space to prune the set of potential motions [5]. But the proposed set is still excessive, because it needs to maintain a high recall of the target motions, so they have to perform optimization on a sampled image grid and use interpolation [7] to get the motion field of the full image grid.
An alternative strategy to estimate the motion is to compute parametric transformation models locally, which can transform each pixel to its target position in the target image [2]. It is an efficient strategy for dealing with wide-baseline image pairs. However, this strategy cannot guarantee that the estimated motion field is piecewise smooth, which may lead to artifacts such as stretching, overlapping, and holes. Therefore, methods using this strategy usually need an extra global optimization stage to further eliminate artifacts.
In this paper, we propose a novel method of motion estimation, which combines the advantages of both global optimization and local parametric transformation models.
We formulate the problem in terms of global optimization in a Markov random field. Rather than using a constant set of candidate motions like previous methods [5, 6], we adjust the candidate set iteratively, guided by homography fitting and propagation. More specifically, we first initialize the set of candidate motions for each pixel by approximate nearest neighbor search in feature space. Unlike DiscreteFlow [5], where the candidate set is excessive, the size of our candidate set can be very small. Then, we perform global optimization over the full image grid with the proposed candidate set. As the small candidate set may not include the target motion, we propose a novel strategy to update the candidate set iteratively through local refinement under a piecewise parametric model. Our approach requires neither a large candidate set to guarantee that the target motion is included, nor a coarse-to-fine scheme to gradually refine the estimated motions.
In summary, the main contributions of this paper are as follows:
• a novel optimization framework for motion estimation based on homography-guided belief propagation,
• application of the proposed motion estimation method to an image interpolation framework,
• experiments showing that our approach deals well with the wide-baseline image interpolation problem, and
• a demonstration that our approach also performs well for traditional small-baseline image pairs, through experiments on a typical optical flow dataset.
The rest of this paper is organized as follows: we first review related work in Section 2. In Section 3 we introduce our approach, including candidate set initialization, the inference algorithm, and the modification strategy for the candidate set. In Section 4, our algorithm is validated and compared to other approaches experimentally. Finally, we conclude and discuss the limitations of this paper in Section 5.

Related work
As mentioned above, the basic idea of image interpolation algorithms is motion estimation. In other words, image interpolation is a high-level application of motion estimation techniques. So we first review relevant low-level motion estimation algorithms, and then mainly review related work on image interpolation, including frame rate-up conversion and view synthesis.

Motion estimation
Optical flow methods are typical motion estimation algorithms, most of which are designed for small-baseline image pairs. Since the original work of Horn and Schunck [8], there has been a huge body of literature on optical flow [9–12]. One typical approach is to consider it as a labeling problem, as mentioned in Section 1. The motion field can be estimated by solving an energy minimization problem based on brightness constancy and spatial smoothness [13–15]. Since the space of all possible labels is usually too large or even infinite [16, 17], some strategies have been proposed to reduce the label set. The simplest is to use a search window centered at the initial label [6], but it is prone to converging to local minima, especially when there are large displacements between image pairs. DiscreteFlow [5] pruned the label set by proposing a diverse set of candidate labels using approximate K-nearest-neighbor search and random sampling around the reference pixel. Veksler [18] efficiently decreased the computational cost of the graph cuts stereo correspondence technique by using the results of a simple local stereo algorithm to limit the disparity search range. The particle belief propagation technique [19] applied Markov chain Monte Carlo sampling to the current belief estimation using a Gaussian proposal distribution. Besse et al. [20] defined a new family of algorithms, called PMBP, which combines the best features of both PatchMatch and particle belief propagation; they leveraged PatchMatch to produce particle proposals effectively. Other methods are based on PMBP [21, 22]. Li et al. [21] proposed a method called SPM-BP to tackle the computational bottleneck of PMBP. Hornáček et al.
[22] showed that optimization over a high-dimensional, continuous state space can be carried out using an adaptation of PMBP. We too use belief propagation as the base algorithm to optimize the objective function. But instead of using PatchMatch, we utilize homography estimation to propose new labels, which performs better than PMBP-based methods.
There are also many other types of optical flow estimation algorithms. For example, recent advances in deep learning have significantly influenced the literature on optical flow estimation. However, it is beyond the scope of this paper to review the entire literature; for a more detailed survey of optical flow estimation, please refer to Refs. [23, 24].

Frame rate-up conversion
Frame rate-up conversion is a typical application of image interpolation, where one interpolates intermediate frames between adjacent video frames to increase the frame rate of a video. In this situation, objects undergo very small displacements, since sequential video frames are very similar. Owing to their simplicity, block matching algorithms are commonly used in frame rate-up conversion [25]. These methods divide a frame into non-overlapping blocks and search for the most similar block in the following frame. At the pixel level, Mahajan et al. [26] moved the image gradients through a given time step and solved a Poisson equation to reconstruct the interpolated frame. Stich et al. [27] found edges and homogeneous regions in images for matching, yielding a dense motion field between images. Meyer et al. [28] proposed propagating phase information across oriented multi-scale pyramid levels for video interpolation. CNN-based methods also show good performance for this application. Long et al. [29] trained a deep CNN to directly predict the interpolated frames, but the results are usually blurred. Some methods take advantage of accurately estimated pixel-wise optical flow to improve performance [1, 4]. Other methods formulate frame interpolation as convolution over local patches and estimate the convolution kernels for each output pixel [30, 31]. However, these methods are designed for small-baseline image pairs, and they are ineffective for wide-baseline image interpolation.

View synthesis
View synthesis is the process of generating a new view using existing views taken from multiple cameras. In this situation, there may be large displacements because of large translation or rotation of a camera. Recently, large-displacement optical flow methods have been proposed. Some methods initialize the variational model by sparse feature correspondences or an approximate nearest neighbor field [32], which helps to escape from local minima. These methods have been improved by more sophisticated feature matching algorithms [7]. From a different angle, Bao et al. [33] obtained large displacement optical flow by increasing the smoothness of PatchMatch [34]. However, these methods do not perform very well for wide-baseline image interpolation. Image-based rendering techniques [35–38] have been proposed to get better results in wide-baseline view synthesis. Chaurasia et al. [37] reconstructed a 3D model of a scene, and compensated for reconstruction errors by depth synthesis. However, they may sometimes fail to reconstruct the 3D scene, e.g., if there are too few images. Some researchers have applied deep learning methods to the view synthesis problem [39–42]. For example, Zhou et al. [39] trained a convolutional neural network to generate an appearance flow vector that specifies which pixels in the input image can be used to reconstruct the output. However, learning-based methods require a large amount of training data and much training time. Nie et al. [2] proposed a method that only needs two images as input. They oversegment the source image into superpixels, and estimate for each superpixel a homography, which transforms each superpixel to the target position. However, without explicitly enforcing a spatial smoothness constraint, artifacts may occur because of the discontinuity between different superpixels.
Although a mesh warping framework is used to further eliminate artifacts, some artifacts such as stretching and holes still remain. Our method is similar to Ref. [2]: we both use the assumption that each superpixel represents a small plane, and our method also includes homography fitting and propagation. But unlike them, we formulate the whole process of motion estimation as an energy minimization problem, which explicitly enforces spatial smoothness and achieves better performance than Ref. [2].

Background
Our aim is to generate intermediate images between two given images I_1 and I_2. To that end, we compute a forward displacement vector from I_1 to I_2 for each pixel in I_1 and a backward displacement vector from I_2 to I_1 for each pixel in I_2. Our approach considers this to be a labeling problem, where the label is the displacement vector of each pixel. We solve this problem by minimizing an energy function in a Markov random field (MRF) over dynamic candidate label sets. Inspired by belief propagation (BP) [43], we propose a novel optimization scheme guided by homography fitting and propagation to avoid local minima. The pipeline is shown in Fig. 1. First of all, for each pixel, we generate an initial candidate label set whose size is very small: see Section 3.3. Then, to tackle the problem of insufficient candidates caused by the limited size of the label set, we propose new labels using homography estimation, and modify the candidate label sets after each iteration of the optimization: see Section 3.4.
Before presenting the details of the algorithm, we first introduce the formulation of our motion estimation approach and some essential concepts of BP in Section 3.2.

Formulation of motion estimation
Without loss of generality, we only consider estimation of the forward displacement vectors from I_1 to I_2, since the backward displacement from I_2 to I_1 can be obtained in exactly the same way. Our goal is to estimate the motion field w for I_1, where w(p) = (u(p), v(p)) is the displacement vector at pixel p and p = (x, y) represents pixel coordinates in image I_1. Since we formulate this problem as global optimization in an MRF, we can also consider w(p) to be a label for pixel p. The energy function to be minimized is formulated as Eq. (1); it includes a data term E_d and a smoothness term E_s:

E(w) = Σ_p E_d(p, w(p)) + λ Σ_{(p,q)∈N} E_s(w(p), w(q))    (1)

The data term represents the similarity between the matched pixels corresponding to the motion field, and the smoothness term constrains the labels of adjacent pixels to be similar. Here, N is the set containing all neighboring pixel pairs on a four-connected image grid, and λ weights the smoothness term.
Let C(p) = {w_1^p, …, w_L^p} be the candidate label set of each pixel p in image I_1, which contains L candidate labels. For simplicity, here we set the size of every pixel's label set to the same value L, although the sizes can differ in our algorithm.
Belief propagation is an inference algorithm which works by passing messages around the 4-connected image grid iteratively [43]. It updates an L-dimensional message m_{p→q}^t(w_i^q), 1 ≤ i ≤ L, sent from each pixel p to each neighbor q at each iteration t ∈ [0, T]. The messages are computed as

m_{p→q}^t(w_i^q) = min_{w_j^p ∈ C(p)} ( E_d(p, w_j^p) + λ E_s(w_j^p, w_i^q) + Σ_{s ∈ N(p)\q} m_{s→p}^{t−1}(w_j^p) )    (2)

where N(p)\q denotes the neighbors of p other than q. Then, with the obtained m_{p→q}^t, we can compute a belief vector b_p^t(w_i^p) for each pixel p at each iteration t:

b_p^t(w_i^p) = E_d(p, w_i^p) + Σ_{s ∈ N(p)} m_{s→p}^t(w_i^p)    (3)

The value of b_p^t(w_i^p) represents an approximation to the probability that the correct label for p is w_i^p. After T iterations, the final belief vector b_p^T(w_i^p) can be calculated for each pixel, and we can select the best label w*(p) for every pixel p from its label set C(p) by minimizing b_p^T(w_i^p) pixelwise.

Fig. 1 Pipeline of our approach. Label set initialization composes the label sets using N nearest neighbors in feature space. In addition to iteratively optimizing the objective function, the optimization phase marks the worst candidate in each label set by cost and replaces it by a new label proposal in each iteration.

How we choose the label set C(p) is very important. The set cannot be too large, because optimization would be prohibitively slow. But a fixed small candidate label set may easily cause convergence to local minima. Therefore, our approach uses a compact dynamic candidate label set. We initialize a very small label set for each pixel, and modify the label sets iteratively during BP to avoid local minima.
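The min-sum message update of Eq. (2) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `message`, the toy labels, and the parameter values are ours, and the truncated-L1 smoothness term anticipates Eq. (5).

```python
import numpy as np

def message(p_labels, q_labels, data_cost_p, incoming, lam, tau_s):
    """Min-sum BP message m_{p->q}: for each candidate label of q,
    minimise over p's candidates the sum of p's data cost, the
    (truncated-L1) smoothness cost, and the messages p received
    from its neighbours other than q (already summed in `incoming`)."""
    base = data_cost_p + incoming                        # shape (Lp,)
    # truncated L1 between every (label of p, label of q) pair
    pair = np.abs(p_labels[:, None, :] - q_labels[None, :, :]).sum(-1)
    pair = lam * np.minimum(pair, tau_s)                 # shape (Lp, Lq)
    return (base[:, None] + pair).min(axis=0)            # shape (Lq,)

# toy example: 3 candidate displacements at p, 2 at q
p_lab = np.array([[0., 0.], [5., 0.], [0., 3.]])
q_lab = np.array([[0., 0.], [5., 0.]])
m = message(p_lab, q_lab, data_cost_p=np.array([1., 0.5, 2.]),
            incoming=np.zeros(3), lam=0.1, tau_s=20.0)
```

In a full inference loop this update runs for all four neighbors of every pixel at each iteration, and the belief of Eq. (3) is the data cost plus all incoming messages.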

Initialization
We use a multi-scale K-nearest-neighbor search strategy to initialize the candidate label sets, as shown in Fig. 2. First, we construct image pyramids with N_L levels, where N_L = 4 in our experiments, for both I_1 and I_2, by downsampling the original images using bilinear interpolation. Let I_i^ℓ (i = 1, 2) be the downsampled image of I_i at pyramid level ℓ. We compute a feature descriptor for each pixel in I_1^ℓ and I_2^ℓ to help find correspondences: for a wide-baseline image pair, the brightness of an object may change during the transition between views, so a feature descriptor is more robust when finding nearest matches. To overcome local scale and rotation changes in the wide-baseline scenario, we use per-pixel scale-invariant feature transform (SIFT) descriptors [6] as the dense feature descriptor. After we obtain the feature maps D_1^ℓ for I_1^ℓ and D_2^ℓ for I_2^ℓ, we search for the K nearest neighbors in D_2^ℓ of every descriptor in D_1^ℓ under the L1 distance. We thus get K labels, corresponding to the K nearest neighbors, for each pixel in I_1^ℓ at level ℓ, and we upsample them to the original scale of image I_1 to propose K initial labels for each pixel in I_1. We collect the initial labels proposed at each level to form the initial candidate label set of each pixel in I_1, of size N = K N_L. In our experiments, we use K = 2 labels per level to get 8 candidates for each pixel. Note that the multi-scale scheme is only used during initialization. The optimization stage does not require a coarse-to-fine scheme to prevent local minima, since we use the homography-guided modification strategy introduced in the next section.
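The per-level KNN proposal step might look as follows. This is a simplified sketch under our own assumptions: brute-force search stands in for the approximate nearest-neighbor search the paper uses, both feature maps are assumed to share the same size, and in the full method this runs once per pyramid level with the resulting displacements scaled back to the original resolution.

```python
import numpy as np

def knn_proposals(feat1, feat2, k=2):
    """Propose k candidate displacement labels per pixel of feat1 (H, W, D)
    by k-nearest-neighbour search over feat2 (H, W, D) under L1 distance.
    Brute force for clarity; approximate NN search is used in practice."""
    h, w, d = feat1.shape
    f1 = feat1.reshape(-1, d)
    f2 = feat2.reshape(-1, d)
    dist = np.abs(f1[:, None, :] - f2[None, :, :]).sum(-1)   # (HW, HW)
    nn = np.argsort(dist, axis=1)[:, :k]                     # (HW, k)
    ys, xs = np.divmod(np.arange(h * w), w)                  # source coords
    ny, nx = np.divmod(nn, w)                                # match coords
    # displacement labels (u, v) = target - source, per candidate
    labels = np.stack([nx - xs[:, None], ny - ys[:, None]], axis=-1)
    return labels.reshape(h, w, k, 2)

# tiny check: matching a feature map against itself yields zero displacement
f = np.arange(4, dtype=float).reshape(2, 2, 1)
labels = knn_proposals(f, f, k=1)
```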

Objective function
We first introduce the specific data term and smoothness term used. We use the truncated L1 distance between the SIFT descriptors (computed in the initialization phase) of the pixels matched by the displacement as the data term, to account for matching outliers, and the truncated L1 distance between labels of neighboring pixels as the smoothness term, to account for motion discontinuities:

E_d(p, w(p)) = min( ||D_1(p) − D_2(p + w(p))||_1 , τ_d )    (4)
E_s(w(p), w(q)) = min( ||w(p) − w(q)||_1 , τ_s )    (5)

where τ_d and τ_s are truncation thresholds. Given this specific energy function, optimization can be performed. As mentioned in Section 3.2, a small candidate label set may easily lead to local minima, so we propose a novel optimization scheme to tackle the problem. Inspired by BP [43], we also solve the minimization problem by passing messages. But after message passing at each iteration, we perform a homography check and a label set modification to prevent local minima. In order to conduct the homography check and the label set modification, we first over-segment image I_1 into superpixels S = {S_1, …, S_K}, following Ref. [44], and we regard each superpixel as a small plane, which corresponds to a region in I_2 via a homography, as in Ref. [2], since a superpixel is small and usually has homogeneous color.

Homography check
As introduced in Section 3.2, we compute a belief vector b_p^t(w) for every pixel p after each iteration t, and select the currently best label w*_t(p) from C(p) for p. With prior knowledge of the plane approximation for each superpixel, we can fit a homography H_i for each superpixel S_i from the best labels of all the pixels in S_i using RANSAC [45]. The homographies help to generate new labels while modifying the label sets, as explained later. To ensure the validity of labels suggested by homographies, we first need to identify whether each homography is reliable.
After H_i is obtained, we can project each pixel p in S_i to a new location p′ in I_2 using H_i:

p′ = H_i p    (6)

Let q = p + w*_t(p) be the location corresponding to the current best label. Then we can define a delta function using the Euclidean distance Dis(p′, q) between p′ and q,

δ(p) = 1 if Dis(p′, q) < r, and δ(p) = 0 otherwise    (7)

to determine whether a pixel is an inlier (δ(p) = 1) or an outlier (δ(p) = 0), where r is a threshold. We show the process in Fig. 3.
Then we can compute the reliability Re(S_i) of the fitted homography H_i of a superpixel S_i as the proportion of inlier pixels in S_i:

Re(S_i) = (1 / |S_i|) Σ_{p ∈ S_i} δ(p)    (8)

where |S_i| is the number of pixels in S_i. We can then determine whether the fitted homography H_i of a superpixel S_i is reliable using a threshold ζ, and find the set R of superpixels whose fitted homographies are reliable.
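The inlier test of Eqs. (6)-(8) can be sketched as follows. The RANSAC fitting of H_i itself is not shown (e.g., OpenCV's `cv2.findHomography` could be used); `reliability`, the threshold value, and the toy data are our own illustrative choices.

```python
import numpy as np

def reliability(H, pixels, best_labels, r=3.0):
    """Fraction of pixels of a superpixel whose current best label agrees
    with the fitted homography H (3x3): project p with H, then compare the
    Euclidean distance to q = p + w*(p) against the threshold r."""
    n = len(pixels)
    hom = np.hstack([pixels, np.ones((n, 1))]) @ H.T   # homogeneous projection
    proj = hom[:, :2] / hom[:, 2:3]                    # p' in I2
    q = pixels + best_labels                           # target of best label
    inlier = np.linalg.norm(proj - q, axis=1) < r      # delta(p) per pixel
    return inlier.mean()                               # Re(S_i)

# toy check: a pure-translation homography and labels that match it exactly
H = np.array([[1., 0., 5.], [0., 1., 0.], [0., 0., 1.]])
pts = np.array([[0., 0.], [1., 2.], [3., 1.]])
labs = np.full((3, 2), [5., 0.])
Re = reliability(H, pts, labs)
```

A superpixel would then be declared reliable when `Re` exceeds the threshold ζ.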
The remaining superpixels constitute the set U = S \ R of superpixels whose fitted homographies are unreliable.

Label set modification
After determining all superpixels to be reliable (R) or unreliable (U), we modify the candidate label sets by substituting new labels. Substituting a label w means replacing the worst label in a pixel's current candidate label set by the newly proposed label w. Here, analogously to the definition of the currently best label, we select the current worst label by maximizing b_p^t(w_i^p). We propose new labels differently for pixels of reliable and unreliable superpixels.

In the first case, if a pixel p belongs to a reliable superpixel S_i, we directly use the homography H_i fitted in the homography check to generate a new label using Eq. (10), since we consider the reliable homography to be a good estimate of the transformation of p from I_1 to I_2:

w = H_i p − p    (10)

If p is a pixel of a superpixel S_i whose fitted homography H_i is unreliable, we cannot use H_i directly to generate a new label. Instead, we utilize other superpixels whose homographies are reliable to help generate new labels. To that end, we construct an undirected graph whose nodes are the superpixels and whose edges connect superpixels with shared boundaries, as shown in Fig. 4. The weight of each edge is defined as the color similarity between the connected superpixels. Following Ref. [2], we create a normalized color histogram for each superpixel, and compute the χ2 distance between histograms of pairs of adjacent superpixels as the color similarity. Using the graph structure, we define the similarity between any two superpixels via the shortest path connecting them on the graph, which can easily be computed using Dijkstra's algorithm. Then we generate M new labels based on this similarity: we search for the M superpixels in R most similar to S_i ∈ U, and project p using the M corresponding homographies H_i^j, j = 1, …, M, to generate the M new labels w^j:

w^j = H_i^j p − p,  j = 1, …, M    (11)

We show this process in Fig. 5. For the unreliable superpixel S_i (yellow in Fig. 5(b)), we consider the M superpixels (blue) in R most similar to S_i. Note that we do not use the neighboring superpixels directly to propose new labels for S_i, as some neighboring superpixels may not belong to the same object as S_i when S_i is near the boundary of an object. Moreover, unlike for reliable superpixels, where we propose one new candidate for each pixel, we propose M new candidates for each pixel of an unreliable superpixel, to improve the chance that we propagate the correct homography to it. Since in both cases we use the same homography to generate new labels for all pixels in a superpixel, these labels are consistent between neighboring pixels, so the smoothness term may be reduced dramatically even if the labels are incorrect. Therefore, in practice, to avoid this issue, during each iteration we uniformly sample 30% of the outlier pixels of reliable superpixels and 30% of the pixels of unreliable superpixels for modification.
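Finding the M reliable superpixels most similar to an unreliable one reduces to a Dijkstra search over the superpixel adjacency graph, stopping after M reliable nodes have been settled. A minimal stdlib sketch, with our own names and a toy three-superpixel graph (edge weights play the role of the χ2 histogram distances):

```python
import heapq

def most_similar_reliable(adj, reliable, src, m=2):
    """Dijkstra from superpixel `src` over the adjacency graph `adj`
    (dict: node -> list of (neighbour, chi2_distance)); return the m
    reliable superpixels with the shortest path length to `src`, i.e.
    the ones whose homographies we borrow to propose new labels."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    found = []
    while heap and len(found) < m:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                      # stale heap entry
        if u != src and u in reliable:
            found.append(u)               # settled in increasing distance
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return found

adj = {0: [(1, 1.0), (2, 4.0)],
       1: [(0, 1.0), (2, 1.0)],
       2: [(0, 4.0), (1, 1.0)]}
nearest = most_similar_reliable(adj, reliable={1, 2}, src=0, m=2)
```

Each returned superpixel contributes one homography, and hence one new candidate label per pixel, via Eq. (11).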

Occlusion handling
Since we do not consider occlusion explicitly, the computed displacement vectors of occluded pixels may be incorrect. Therefore, we remove outliers from our result using forward-backward consistency checking, i.e., we compute forward displacement vectors from I_1 to I_2 and backward vectors from I_2 to I_1, and discard inconsistent ones. Then we use a state-of-the-art interpolation scheme [46] to interpolate the discarded regions.
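A forward-backward check can be sketched as follows: a pixel is kept only if following its forward vector and then the backward vector at the landing point returns (nearly) to the start. This is a generic sketch with nearest-pixel lookup and a tolerance `eps` of our choosing, not the paper's exact criterion.

```python
import numpy as np

def fb_consistency_mask(w_fwd, w_bwd, eps=1.0):
    """Per-pixel consistency mask for I1: w_fwd, w_bwd are (H, W, 2)
    displacement fields (u, v). The round trip w_fwd(p) + w_bwd(p + w_fwd(p))
    should be ~0 for consistent pixels; occlusions typically fail the test."""
    h, w = w_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.clip(np.round(xs + w_fwd[..., 0]).astype(int), 0, w - 1)
    ty = np.clip(np.round(ys + w_fwd[..., 1]).astype(int), 0, h - 1)
    round_trip = w_fwd + w_bwd[ty, tx]
    return np.linalg.norm(round_trip, axis=-1) < eps

# toy 1x3 field: uniform shift right, inverse shift back,
# with the backward vector broken at the landing point of pixel 0
w_fwd = np.zeros((1, 3, 2)); w_fwd[..., 0] = 1.0
w_bwd = np.zeros((1, 3, 2)); w_bwd[..., 0] = -1.0
w_bwd[0, 1] = 0.0
mask = fb_consistency_mask(w_fwd, w_bwd)
```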

Interpolation
With the computed displacement vectors w_1 for I_1 and w_2 for I_2, we can smoothly interpolate any intermediate image I_t at time t ∈ (0, 1) between I_1 and I_2 using a patch-based reconstruction scheme [47]. For any pixel p in I_1, its motion vector to I_t is t · w_1(p); thus we can map each pixel p in I_1 to its corresponding position in I_t, and similarly for I_2. After obtaining the intermediate image I_t^1 warped from I_1 and I_t^2 warped from I_2, we blend them using the multi-band blending method [48] to get the final interpolation result I_t.
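The warping step can be illustrated with a crude forward splatting routine. This is only a stand-in for the patch-based scheme of Ref. [47]: nearest-pixel splatting with a last-write-wins rule, holes left as zeros, and the function name is ours.

```python
import numpy as np

def forward_warp(img, flow, t):
    """Splat each pixel of img (H, W) to p + t * flow(p), flow of shape
    (H, W, 2) holding (u, v). Pixels landing outside the frame are dropped;
    unfilled target pixels stay zero (holes)."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.round(xs + t * flow[..., 0]).astype(int)
    ty = np.round(ys + t * flow[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    out[ty[ok], tx[ok]] = img[ok]
    return out

# toy 1x3 image moved 2 px right, evaluated at t = 0.5 (i.e. 1 px right)
img = np.array([[1., 2., 3.]])
flow = np.zeros((1, 3, 2)); flow[..., 0] = 2.0
half = forward_warp(img, flow, t=0.5)
```

In the full pipeline, I_1 is warped with t · w_1, I_2 with (1 − t) · w_2, and the two warped images are blended.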

Experiments
In this section, we first analyze the performance of our approach experimentally and validate our claims, in Section 4.1. Then we evaluate our method by comparing it to prior work, in Section 4.2.

Validation for label set modification
Since we use a very small candidate label set for each pixel, the initial label set may not include the correct label at all. Therefore, if we perform optimization over constant label sets, it is easy to fall into local minima. However, our strategy of label set modification can help avoid local minima without enlarging the label sets. To validate this claim, we first perform experiments on image pairs with ground truth optical flow. We show two cases from the MPI Sintel dataset [49], with and without large displacements respectively.
To evaluate a pixel's candidate label set, we select the label nearest to the ground truth label from the label set. If the endpoint error (EPE) between the selected label and the ground truth label is less than γ pixels, where γ is a threshold, the pixel's candidate label set is considered to be a good label set. Pixels without good label sets tend to get stuck in local minima more often than pixels with good label sets. Therefore, we expect more pixels to have good label sets after label set modification. To show the quality of all pixels' label sets clearly, we draw a pixel in black if its label set is good, and otherwise white: see Fig. 6 and Fig. 7. We compare the ratio of pixels with good label sets before optimization to the ratio after 10 iterations of optimization, to evaluate the effectiveness of our label set modification strategy, using a threshold γ = 5. The results show that our modification strategy effectively improves the ratio of pixels with good label sets. For an image pair with small displacement (see Fig. 6), 93.1% of pixels' initial label sets are good, while 99.1% of pixels' modified label sets are good. For a more challenging image pair with large displacement (see Fig. 7), label set modification increases the ratio from 53.6% to 78.5%.
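The good-label-set criterion used above is simple to state in code. A minimal sketch, with our own function name and toy candidates:

```python
import numpy as np

def good_label_set(candidates, gt, gamma=5.0):
    """A pixel's candidate label set (L, 2) is 'good' if the candidate
    nearest to the ground-truth label gt (2,) has endpoint error (EPE)
    below gamma pixels."""
    epe = np.linalg.norm(candidates - gt, axis=1)
    return bool(epe.min() < gamma)

cands = np.array([[0., 0.], [10., 0.], [3., 4.]])
ok = good_label_set(cands, np.array([6., 4.]))   # candidate (3, 4) has EPE 3
```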
We further validate the effectiveness of our strategy by comparing the energy convergence with and without label set modification. We performed experiments on image pairs with large pixel displacements (≈ 200 px) and small pixel displacements (< 10 px). Figure 8 shows the change of energy over iterations. We can see that in both cases, the energy decreases dramatically when employing our dynamic label set framework.
We also compare the results visually and quantitatively. Figure 9 shows a visual comparison between the images interpolated from a wide-baseline image pair with and without the modification process. We can see that there are more artifacts in the result obtained without our label set modification strategy. The quantitative comparison on the Middlebury dataset [23] is shown in Table 1 and Table 2. All these results demonstrate the effectiveness of our label set modification strategy in introducing more correct labels into the candidate label sets.

Validation of homography check
In Section 3.4, we use a homography check to divide superpixels into reliable and unreliable ones, in order to guide the process of label set modification. Here we experimentally validate the effects of the homography check.
We performed an extra set of experiments in which we do not conduct the homography check, i.e., we consider all fitted homographies to be reliable. As in Section 4.1.1, we first compare the energy change over iterations. Figure 8(b) shows that, for cases whose pixel displacements are small, there is not much difference in performance between the methods with and without the homography check. The reason is that for these relatively easier cases, there are sufficient inlier pixels in each superpixel to fit a reliable homography, because there are sufficient pixels whose initial candidate label sets are good enough (as shown in Fig. 6). However, the homography check is more useful for wide-baseline image pairs. As we can see in Fig. 8(a), in more challenging cases whose pixel displacements are much larger, the energy is lower when we conduct the homography check. Moreover, comparing the interpolated images in Fig. 11, we can easily see that with the homography check, our method generates many fewer artifacts such as distortion and holes.
We also compare the performance quantitatively on the Middlebury dataset [23]; see Table 1 and Table 2. Conducting the homography check improves the accuracy of both the estimated motion fields and the interpolated images.

Comparison to prior work
In this section, we first compare our method with prior work by evaluating images interpolated for wide-baseline image pairs from Refs. [50] and [37], qualitatively showing the effectiveness of our method in handling large displacements. In addition, we quantitatively compare our method with other algorithms by evaluating the estimated motion fields and interpolated images on the Middlebury benchmark database [23].
We show that our method also achieves good performance on image pairs containing small motions, demonstrating the robustness of our method.

Qualitative evaluation
Nie et al. [2] proposed a wide-baseline image interpolation algorithm, which is the state of the art for our problem. The second column of Fig. 10 shows the results of Ref. [2], while the last column shows our results. We can see that the method of Ref. [2] generates more artifacts such as distortion and blurring than ours. Our method handles these cases much better, due to the spatial smoothness constraint which we enforce explicitly during optimization.
Since optical flow methods can also be used to interpolate images between image pairs, we also compare our method with Maskflownet [9], the state-of-the-art optical flow method based on deep learning, and with some variational model methods similar to ours. In our experiments, we computed optical flow between image pairs using these optical flow methods and interpolated the intermediate images using the interpolation method in Section 3.6. The first column of Fig. 10 shows the results of Maskflownet. We use the pre-trained model trained on Flying Chairs [51], Flying Things3D [52], and MPI Sintel [49], provided by the authors of Ref. [9], to infer optical flow. As shown in Fig. 10, Maskflownet generates more artifacts than our method when interpolating between wide-baseline images. The performance of Maskflownet reduces dramatically when the displacement between the image pair is too large, as shown in the third row of Fig. 10; our method can handle these wide-baseline cases very well. One possible reason is the lack of suitable training data for such cases. Our method takes only two images as input, which makes it more flexible.

Fig. 10 Comparison between Refs. [2], [9], and our method. There is less distortion in our results.
We also compared our method with two variational model optical flow methods, DiscreteFlow [5] and SPM-BP [21], which are similar to our optimization scheme. DiscreteFlow is a representative large displacement optical flow method, which considers large-displacement optical flow from a discrete point of view. It proposes a diverse candidate label set, which is quite large, for each pixel, and performs optimization on this constant label set. Since their candidate label set is much larger than ours, optimization has to be performed on a sampled image grid and the final flow field obtained by interpolation, while our method performs optimization directly on the full image grid. Our method outperforms DiscreteFlow visually: Fig. 12 shows a comparison. The second column shows the results of DiscreteFlow while the third column shows our results. We can see that our approach produces fewer artifacts such as distortion.
Like our method, PMBP [20] uses the idea of dynamic label set updates, but it utilizes PatchMatch to propose new labels. SPM-BP takes advantage of efficient edge-aware cost filtering to speed up PMBP and improve its performance. The first column of Fig. 12 shows results from SPM-BP. We can see that our method performs much better, thanks to our homography-guided label proposal strategy, which is more effective than SPM-BP's approach based on the idea of PatchMatch [34].

Quantitative evaluation
We now quantitatively compare our method with other work by evaluating results on two different kinds of datasets. Our method is designed for wide-baseline image interpolation. However, the baseline between pairs of images in commonly used optical flow datasets, such as KITTI [3] and MPI Sintel [49], is not wide enough [2], so we use wide-baseline synthetic image pairs photo-realistically rendered from virtual scenes to quantitatively evaluate our method. MVS-Synth [53] is a photo-realistic synthetic dataset that provides ground-truth depth maps and camera parameters for each rendered RGB image. Therefore, we can generate ground-truth motion fields between image pairs using the provided ground-truth geometry. We compare our method with previous works using wide-baseline image pairs rendered from 20 different scenes, where the average ground-truth pixel displacement is about 300 pixels. We give the average end-point error (EPE) of the motion fields estimated by different methods in Table 3, which shows that our method quantitatively outperforms these previous methods. The Middlebury dataset [23] is a widely used dataset for evaluating traditional optical flow methods. Since it provides ground truth for the intermediate images, we also make comparisons on it, although the average ground-truth pixel displacement is only about 10 pixels. In Table 1, we list the peak signal-to-noise ratio (PSNR) between the interpolated images and the ground truth for different methods. We also compute the average EPE of the motion fields estimated by different algorithms on image pairs with ground-truth motion fields, as shown in Table 2. Our method quantitatively outperforms these previous algorithms.

Fig. 12 Comparison with Refs. [5] and [21]. There are fewer artifacts in our results.
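As a concrete illustration, the two metrics used above (average EPE and PSNR) can be computed as in the following minimal NumPy sketch; the function names and the `peak` default are our own conventions, not part of the paper's evaluation code.

```python
import numpy as np

def average_epe(flow_est, flow_gt):
    """Average end-point error: mean Euclidean distance between
    estimated and ground-truth motion vectors (H x W x 2 arrays)."""
    diff = flow_est.astype(np.float64) - flow_gt.astype(np.float64)
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())

def psnr(img, img_gt, peak=255.0):
    """Peak signal-to-noise ratio between an interpolated image and
    the ground-truth intermediate image (8-bit range by default)."""
    mse = np.mean((img.astype(np.float64) - img_gt.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Both metrics average over all pixels; EPE is reported in pixels, PSNR in decibels.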

Conclusions
We have proposed a novel image interpolation method based on a motion estimation algorithm using homography-guided optimization. We combine the advantages of both global optimization and a local parametric transformation model. Optimization is performed over very small candidate label sets, which are iteratively modified to avoid local minima, using piecewise consistency priors with superpixels as the bridge. We show experimentally that the proposed method improves the accuracy of both estimated motion fields and interpolated images.
Our method also has limitations. First, our strategy for proposing new labels, based on homography fitting and propagation, uses superpixels as a fundamental structure; therefore, our method's performance relies on the quality of superpixel segmentation. In addition, corresponding areas in image pairs depicting different scenes may not be related by a homography, so our approach does not handle matching between different scenes very well; addressing this is also a target of our future work.
The data term and smoothness term are defined as
E_d(w(p)) = min(‖D_1(p) − D_2(p + w(p))‖_1, τ_d)    (4)
E_s(w(p), w(q)) = min(‖w(p) − w(q)‖_1, τ_s)
where D_1 and D_2 are the feature maps of the original input images I_1 and I_2, and τ_d and τ_s are the truncation thresholds of the data term and the smoothness term respectively.
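A pointwise NumPy sketch of these two truncated-L1 terms follows; the threshold values and function names are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def data_term(D1, D2, p, w_p, tau_d=10.0):
    """E_d (Eq. 4): truncated L1 distance between the feature vector of
    pixel p in I1 and that of its match p + w(p) in I2.
    D1, D2: H x W x C feature maps; p and w_p: (row, col) tuples."""
    q = (p[0] + int(w_p[0]), p[1] + int(w_p[1]))  # matched pixel in I2
    return min(float(np.abs(D1[p] - D2[q]).sum()), tau_d)

def smoothness_term(w_p, w_q, tau_s=2.0):
    """E_s: truncated L1 distance between the motion vectors of two
    neighbouring pixels p and q."""
    diff = np.asarray(w_p, dtype=float) - np.asarray(w_q, dtype=float)
    return min(float(np.abs(diff).sum()), tau_s)
```

Truncation caps the penalty at τ_d and τ_s, so a single occluded pixel or a genuine motion boundary cannot dominate the energy.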

Fig. 4 The superpixel graph. Red lines: boundaries of superpixels. Yellow points (graph nodes) and black lines (graph edges) illustrate the graph structure.

Fig. 5 Generating new labels for unreliable superpixels. (a) Original input image I1. (b) Searching for similar superpixels in R. Dark regions: unreliable superpixels. Yellow: unreliable superpixel to be processed. Blue: superpixels in R most similar to the yellow one.
Each pixel p in I_1 is mapped to its new location p + t · w_1(p) in I_t to render the intermediate image. Likewise, we can also render the intermediate image using I_2.
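The forward-warping step above can be sketched as follows. This is a minimal nearest-neighbour splatting version under our own assumptions (the paper's renderer may differ); hole filling and blending with the warp from I_2 are omitted.

```python
import numpy as np

def forward_warp(I1, w1, t):
    """Splat each pixel p of I1 to p + t * w1(p) to render the
    intermediate image I_t at time t in [0, 1].
    I1: H x W (x C) image; w1: H x W x 2 motion field in (x, y) order.
    Returns the splatted image and a mask of filled pixels."""
    H, W = I1.shape[:2]
    It = np.zeros_like(I1)
    filled = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            nx = int(round(x + t * w1[y, x, 0]))  # target column
            ny = int(round(y + t * w1[y, x, 1]))  # target row
            if 0 <= nx < W and 0 <= ny < H:
                It[ny, nx] = I1[y, x]
                filled[ny, nx] = True
    return It, filled
```

Pixels where `filled` is False are holes (disocclusions), which can be filled from the symmetric warp of I_2.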

Fig. 6 Effect of our label set modification strategy for an image pair with small displacements. Pixels having a good label set are black, and others are white. (c) and (d) visualize label set quality before and after optimization: the percentage of black pixels increases from 93.1% to 99.1%.

Fig. 7 Effect of our label set modification strategy for an image pair with large displacements. Pixels having a good label set are black, and others are white. (c) and (d) visualize label set quality before and after optimization: the percentage of black pixels increases from 53.6% to 78.5%.

Fig. 8 Energy change during optimization. (a) Image pair with large displacements. (b) Image pair with small displacements.

Fig. 9 Comparison between (a) an image interpolated using the baseline method with a constant label set, and (b) one generated by our method with label set modification.

Fig. 11 Comparison of images interpolated by our approach (a) without and (b) with the homography check.

Table 2 Motion error (EPE) for the Middlebury benchmark

Table 3 Motion error (EPE) for the MVS-Synth dataset