A retargeting method for stereoscopic 3D video

We propose a disparity-constrained retargeting method for stereoscopic 3D video, which simultaneously resizes a binocular video to a new aspect ratio and remaps the depth to the perceptual comfort zone. First, we model distortion energies to prevent important video contents from deforming. Then, to maintain depth mapping stability, we model disparity variation energies to constrain the disparity range in both the spatial and temporal domains. The last component of our method is a non-uniform, pixel-wise warp to the target resolution based on these energy models. Using this method, we can process the original stereoscopic video to generate new, high-perceptual-quality versions at different display resolutions. For evaluation, we conduct a user study; we also discuss the performance of our method.


Introduction
Stereoscopic 3D media has been available for a long time. With the development of mobile internet and social network technologies, stereoscopic videos are displayed on various devices with different display sizes. This brings the need for stereoscopic video retargeting techniques, to adapt videos to new aspect ratios.
In the traditional 2D broadcast world, fixed-window cropping and scaling are general methods to fill top and bottom (or left and right) video areas. The same process is applied to each frame independently to achieve a new aspect ratio. However, due to the drawback of omitting some parts of the content, many content-aware methods have been proposed [1]. These preserve the region-of-interest from cropping and distortion. With the wide application of stereoscopic 3D techniques, research into content-aware retargeting methods has been carried out in the 3D image processing field. Existing methods for 2D media have been improved and applied to process 3D media.
However, unlike 2D media, the final visual experience is determined by both the image quality and stereo perception. Viewers see one image (frame) with the left eye and the other with the right eye. Panum's area in the brain merges the two images to form a single image with relative depth perception [2]. The distance between a pair of corresponding points is called the disparity, and is an important depth cue [3]. Retargeting methods for 3D media must include methods for appropriate disparity processing.
Due to improper display configurations and image processing methods, viewers watching 3D videos often fail to get the stereo visual experience that they deserve. Sometimes, viewers even feel visual fatigue or eye strain [4]. There are two main reasons for poor visual experience. The first is the distortion of the image contents in 2D. Simple processing methods (e.g., scaling, cropping, uniform stretching) cannot produce good video results in many cases because they lack some information from the original video [1]. The second reason is the arbitrary modification of the disparity in the stereoscopic field. A good stereo effect can be obtained by modifying the disparity properly according to the new view configuration. However, continual arbitrary disparity changes always cause viewers to feel fatigue [5]. Mismatched view configurations, e.g., screen distance, interocular distance, etc., can also cause discomfort. Thus, disparity values should be adjusted carefully in various scenes. Furthermore, compared to 3D still images, 3D video is more complicated because it involves time. Disparity constraints should be introduced not only in the spatial domain but also in the temporal domain. Stereoscopic 3D video retargeting is thus a more challenging problem. Key parts of the method must not only avoid content distortion, but also remap disparity to a proper range for each frame.
In this paper, we present a technique that extends existing warping-based 2D media retargeting methods to stereoscopic video. The objective of our work is to perform content-aware retargeting of the content of video frames while adjusting the corresponding disparity values for the stereoscopic video content. We first build energy models for the importance of the contents of the left and right frames. We try to preserve the more important parts in the left and right pictures with minimal distortion. Then we consider disparity constraints based on coherence in the spatial and temporal domains, building appropriate energy models. Unlike previous methods, the proposed method considers both disparity values and disparity variation. It also considers objects' disparity relationships in video frames. Overall, we transform the 3D retargeting problem into a tradeoff between the various negative factors affecting 3D visual experience. In different application scenarios, we may alter the weights of the energy components and minimize the sum of the distortion energies to obtain optimal results.
The main contributions of this paper can be summarized as follows:
(1) We propose a new retargeting framework for stereoscopic 3D videos. The retargeting results can be adjusted according to various viewing scenarios.
(2) We maintain disparity coherence in the temporal domain by constraining both the inter-frame disparity changes and the intra-frame disparity relation variation.
(3) In the spatial domain, we use a linear depth remapping algorithm to manipulate the disparity range to lie in a comfort zone.

Related work
Previous literature has proposed many methods for retargeting a single 2D image while preserving its important content. Early methods intelligently crop away the surrounding content [6]. The seam carving method resizes images by removing or adding pixels judiciously [7,8]. The warping method maps the original image to the target resolution according to some constraints [9-11], and is more robust for complicated pictures. In this paper, we improve the shape preservation model in Ref. [11].
Many of these single image resizing methods have been extended to 3D image (frame) resizing. Both Refs. [12] and [13] are based on the seam carving method, which sometimes impairs the image contents or changes their shape. Resizing systems based on the warping method are proposed in Refs. [14-16], while Refs. [17,18] edit stereoscopic pairs by using a warping method. However, content representing structures such as lines is not preserved well. In Ref. [19], a cropping and warping method is used to improve the stereoscopic 3D experience. In Ref. [20], a layer-based resizing method is proposed, but it seems difficult to extend this method to stereo videos because of the uncertainty and inconvenience of layer segmentation.
For stereo videos, Ref. [21] focuses on disparity manipulation and proposes four disparity mapping operators to adjust disparity. However, depth relations of the adjacent features are not constrained. Moreover, it does not support video resizing while adjusting depth, so some important problems in the resolution adaptation field, e.g., consistency of left and right pictures, are not considered by this method. A depth mapping method for stereo videos is proposed in Ref. [22]. To minimize distortion of stereoscopic image content, many energy models have been proposed to control the depth mapping process. However, the mesh edge preservation model is not a similarity transformation constraint, and it cannot ensure that shapes of image contents are maintained. Furthermore, coherence of rates of change of disparity is not constrained at key points, which can cause depth jitter. In Ref. [23], although the proposed system is fully automatic, the videos are cropped under 3D viewing constraints, risking loss of information. Like Ref. [3], the authors of Ref. [24] state some principles in the field of stereoscopic media processing, which give us some inspiration when designing our disparity constraint models. Reference [25] conducts a series of eye-tracking experiments and tries to minimize the adaptation time for sudden temporal depth changes. Reference [26] focuses on the visual discomfort caused by disparity changes and uses a depth perception model to determine the time taken by the human visual system (HVS) to adapt to the changes.

The proposed method
We first detect the important parts in the key frames. In our experiments we use the saliency algorithm proposed in Ref. [27]. In order to incorporate high-level information, we also allow the viewer to optionally specify a region of interest. We then determine feature correspondences between the left and right frames for the disparity constraint. We extract SURF (speeded up robust features) using the method presented in Ref. [28] and perform a matching process between the two sets of key points. By estimating the fundamental matrix using the RANSAC method [29], we obtain high-quality matched key point pairs. Then, in order to control the disparity change in the temporal domain, we track the key point pairs across video key frames using the algorithm proposed in Ref. [30]. We construct quad meshes in the left and right frames. To control the degree of impairment of the image quality and the stereo vision experience, we build energy models for various distortions. We minimize the total energy to preserve the significant information and control the disparity in the spatial and temporal domains. By computing the coordinates of the quads' vertices, we transform every quad to generate a new picture with the target resolution and comfortable depth perception. An overview of our algorithm is given in Fig. 1. To demonstrate the function of our algorithm, we increase the horizontal disparity while reducing the width of the image in the last two subfigures on the right. Details of the algorithm are described below.

Mesh deformation energy
Before modeling the energy, we compute the significance map for each picture. We consider three low-level visual features and integrate them to compute the significance $S$ at each pixel, where $F_S$ is the saliency map generated by the algorithm in Ref. [27], $F_G$ is the gradient map of the image, $F_L$ is the line segment map generated by the algorithm in Ref. [31], and $S$ is the final significance map as shown in Fig. 1. If viewers specify the ROI (region of interest), the significant areas can also be propagated automatically into neighboring frames [32].

Grid deformation energy
As in Ref. [11], we attempt to ensure that important quads undergo a similarity transformation and compute the energy for every grid. We denote the i-th grid's energy as $E_g(i)$ (Eq. (2)), where $e'_k$ is the deformed version of the original k-th edge vector $e_k$. We compute the scale factor $s_f$ using

$\partial E_g(i) / \partial s_f = 0$ (3)

Unlike Ref. [11], we express the scale factor in terms of the eight unknown coordinate parameters of each quad's four vertices and substitute it into Eq. (2). Thus, instead of computing the scale factor value of every quad iteratively to approach the optimal solution, we obtain the optimal result quickly. We define the grid deformation energy for both frames by accumulating the per-quad energies of the two views, weighted by significance, where $N$ is the number of quads, $E_{gL}$ is the energy of the i-th grid in the left view, $E_{gR}$ is the energy of the i-th grid in the right view, and $S_L(i)$ and $S_R(i)$ are the average significance values of all pixels in the i-th quad in the left and right frames respectively.
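The closed-form scale factor can be made explicit with a short derivation. This is a sketch only: it assumes that the per-quad energy of Eq. (2) takes the usual similarity-preserving form over the four edge vectors of a quad, and the notation $e'_k$ for a deformed edge is likewise an assumption.

```latex
\begin{align}
E_g(i) &= \sum_{k=1}^{4} \left\lVert e'_k - s_f\, e_k \right\rVert^2
  && \text{(assumed form of Eq. (2))} \\
\frac{\partial E_g(i)}{\partial s_f}
  &= -2 \sum_{k=1}^{4} e_k^{\top}\left( e'_k - s_f\, e_k \right) = 0 \\
s_f &= \frac{\sum_{k=1}^{4} e_k^{\top} e'_k}{\sum_{k=1}^{4} \lVert e_k \rVert^2}
\end{align}
```

Under this assumption, the original edges $e_k$ are known constants, so $s_f$ is linear in the unknown vertex coordinates. After substitution, each residual $e'_k - s_f e_k$ is still linear in those coordinates, and the energy stays quadratic, which is what allows a direct rather than iterative solution.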

Edge distortion energy
Preservation of the important content, as well as the disparity constraints mentioned later, will sometimes seriously deform the grids in unimportant regions. We prevent this by reducing the bending of the mesh edges. We define a bending energy for every edge in a frame, where $e_i$ is the i-th edge vector in the mesh and $e'_i$ is its deformed version. Using the same principle as in Section 3.1.1, we describe the scale factor $s_e$ in terms of the coordinates of the unknown deformed edge vertices. The edge distortion energy then accumulates these bending energies over the $M$ edges in a frame. With this model, we smooth the edges of adjacent quads.
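A minimal sketch of a per-edge bending term may help here. It assumes the energy has the form ||e' - s_e e||^2 with s_e chosen in closed form as in Section 3.1.1; the function name `edge_bending_energy` and this exact form are illustrative assumptions, not the authors' code. Under this assumption the term measures only the component of the deformed edge perpendicular to the original edge, so a pure change of length costs nothing while bending is penalized.

```python
def edge_bending_energy(e, e_def):
    """Return (energy, s_e) for one edge.

    e      -- original edge vector (x, y)
    e_def  -- deformed edge vector (x, y)
    Assumed form: ||e_def - s_e * e||^2 with s_e = (e . e_def) / (e . e).
    """
    ex, ey = e
    dx, dy = e_def
    norm_sq = ex * ex + ey * ey
    if norm_sq == 0.0:
        raise ValueError("original edge has zero length")
    # Closed-form scale factor: projection of the deformed edge onto the original.
    s_e = (ex * dx + ey * dy) / norm_sq
    # Residual left after removing the scaled original edge.
    rx = dx - s_e * ex
    ry = dy - s_e * ey
    return rx * rx + ry * ry, s_e


def edge_distortion_energy(edges, deformed_edges):
    """Sum the per-edge bending energies over all edges of one frame."""
    return sum(edge_bending_energy(e, d)[0]
               for e, d in zip(edges, deformed_edges))
```

For example, shrinking a horizontal edge from (20, 0) to (10, 0) gives zero energy, whereas tilting it to (10, 5) gives an energy of 25, the squared perpendicular offset.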

Spatial domain
The aim of the disparity mapping energy is to constrain the disparity change to ensure depth perception. We first design the disparity constraints in the spatial domain, to control the new disparity in both the horizontal and vertical directions.
After energy-based optimization, we perform a 2D spatial projective transformation on every quad in the frames. This gives the new locations of the corresponding feature points using the vertex coordinates of the quads containing them. Thus, based on elementary geometry, we can describe the new disparity using a function $f$ of the eight unknown x-coordinates of the quads' vertices in the warped left and right frames. The detailed expression for the function $f$ is omitted for the sake of brevity.
Here, $x_L$ and $x_R$ denote the vertices' x-coordinates in the left and right frames respectively, $i$ is the row index of the upper left vertex of a quad containing a feature point, and $j$ is its column index. We now define the horizontal disparity mapping energy, in which $D'_h(i)$ is the output horizontal disparity at the i-th pair of key points with original disparity $D_h(i)$, $K$ is the number of key point pairs, and $s_d$ is the scale factor by which we can adjust the disparity linearly according to the viewing configuration. In Fig. 1, we increase the disparity by a factor of 1.5. As mentioned in Ref. [3], vertical disparity always causes 3D fatigue for viewers. However, for points in the frames, the nonlinear transformation may cause location changes in the y-direction. Using the same principle, we can describe the vertical disparity and its mapping energy, where $D'_v(i)$ is the output vertical disparity at the i-th pair of key points. Unlike in the case of the horizontal disparity, the objective is now to reduce the sum of the vertical disparity mapping energies to zero. Vertical disparity is one of the most important causes of dissatisfaction with 3D perception. As Fig. 2 shows, vertical disparities are removed with this constraint.
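The two spatial terms can be illustrated with a short sketch. It assumes squared-error forms, i.e., the horizontal term penalizes the deviation of each output disparity from s_d times the original, and the vertical term penalizes any nonzero output vertical disparity. The sign convention (right minus left) and all function names are assumptions made for illustration.

```python
def disparities(left_pts, right_pts):
    """Horizontal and vertical disparities of matched key point pairs.

    Assumed convention: disparity = right coordinate - left coordinate.
    """
    d_h = [xr - xl for (xl, _), (xr, _) in zip(left_pts, right_pts)]
    d_v = [yr - yl for (_, yl), (_, yr) in zip(left_pts, right_pts)]
    return d_h, d_v


def horizontal_energy(d_h_out, d_h_orig, s_d):
    """Assumed form: sum over key point pairs of (D'_h - s_d * D_h)^2."""
    return sum((out - s_d * orig) ** 2
               for out, orig in zip(d_h_out, d_h_orig))


def vertical_energy(d_v_out):
    """Assumed form: sum of squared output vertical disparities (target is zero)."""
    return sum(v * v for v in d_v_out)


# Original and retargeted key point locations (made-up numbers).
orig_left = [(100.0, 50.0), (200.0, 80.0)]
orig_right = [(110.0, 50.0), (196.0, 80.0)]
out_left = [(70.0, 50.0), (140.0, 81.0)]
out_right = [(85.0, 50.0), (134.0, 80.0)]

d_h, _ = disparities(orig_left, orig_right)          # [10.0, -4.0]
d_h_out, d_v_out = disparities(out_left, out_right)  # [15.0, -6.0], [0.0, -1.0]
e_h = horizontal_energy(d_h_out, d_h, s_d=1.5)
e_v = vertical_energy(d_v_out)
```

In this made-up example the output horizontal disparities are exactly 1.5 times the originals, so the horizontal term vanishes, while the one-pixel vertical offset at the second key point contributes an energy of 1 to the vertical term.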

Temporal domain
We now consider how to constrain disparity consistency in the temporal domain. Many factors affect the detection of important contents in an image and cause different mesh deformations for the same objects in adjacent frames. Performing retargeting on each frame separately will lead to depth jitter in certain local regions. Frequent disparity variation, i.e., depth jitter, easily causes visual fatigue even within the zone of comfortable viewing [5]. To resolve this problem, we collect the depth relations between each sequential pair of key frames by tracking the key points [30] and estimate their inter-frame disparity variation. In many cases, such as accelerating motion, the disparity at a key point may differ between adjacent key frames. This disparity difference must be preserved in the retargeted video. Equally important, the depth variation trend at the key points in the retargeted video must be consistent with that in the original video. To ensure that the depth changes are consistent, we preserve the disparity variation rate at every key point by defining a disparity jitter energy, where $Z$ is the number of key points tracked successfully, $D'_T(i)$ is the output disparity value at time $T$ at the i-th key point, $D_T(i)$ is the corresponding original disparity value, and $D'_{T+1}(i)$ is the output disparity value at the same key point in the next key frame. (Frames are assumed to be one unit of time apart.) Note that at time $T+1$, the original disparity variation rate and the output disparity $D'_T(i)$ at time $T$ are known constants. Secondly, we aim to preserve depth layer relations, i.e., intra-frame disparity variation. At a global level, any difference in depth between an object and its surroundings should be preserved in proportion. To simplify the problem, our goal is to keep the disparity differences at the key points changing in proportion to those between the original two neighboring frames. The inter-frame disparity variation constraint can only ensure that the disparity differences between objects are unchanged; it cannot ensure that these differences change proportionally in adjacent frames. To find nearby key points, we use a search window whose radius is measured in neighboring quads. In our experiments, the radius is 2 quads.
We define the disparity difference energy over these neighborhoods, where $Z_j$ is the number of nearby key points for the i-th key point, and $\Delta D_T(j)$ is the disparity difference between the i-th key point and the j-th nearby key point at time $T$.
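The inter-frame jitter term of this subsection can be sketched as follows. The sketch assumes that the "disparity variation rate" is the disparity change per unit time between consecutive key frames and that the energy is the squared mismatch between the output and original rates; both the form and the function names are assumptions, since the exact expression is not reproduced here. Only the inter-frame term is covered.

```python
def jitter_targets(d_out_prev, d_orig_prev, d_orig_next):
    """Output disparities at T+1 that would reproduce the original variation.

    At time T+1 the previous output disparities and the original variation
    are known constants, so each target is a fixed number.
    """
    return [out_prev + (orig_next - orig_prev)
            for out_prev, orig_prev, orig_next
            in zip(d_out_prev, d_orig_prev, d_orig_next)]


def jitter_energy(d_out_next, d_out_prev, d_orig_next, d_orig_prev):
    """Assumed form: sum over tracked key points of
    ((D'_{T+1} - D'_T) - (D_{T+1} - D_T))^2, frames one time unit apart."""
    energy = 0.0
    for out_next, out_prev, orig_next, orig_prev in zip(
            d_out_next, d_out_prev, d_orig_next, d_orig_prev):
        out_rate = out_next - out_prev      # output variation per unit time
        orig_rate = orig_next - orig_prev   # original variation (known)
        energy += (out_rate - orig_rate) ** 2
    return energy


# Two tracked key points (made-up disparities, in pixels).
orig_T, orig_T1 = [10.0, -4.0], [12.0, -4.0]
out_T = [6.0, -2.0]
stable = jitter_energy([8.0, -2.0], out_T, orig_T1, orig_T)
jittery = jitter_energy([9.0, -2.5], out_T, orig_T1, orig_T)
```

Because the targets are constants at time T+1, a term of this kind is linear in the unknown output disparities inside the square, so it fits the same least-squares framework as the spatial terms.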

Energy optimization
The combined warp energy is the weighted sum of all the aforementioned terms (Eq. (14)), where $\omega_g$, $\omega_e$, $\omega_h$, $\omega_v$, $\omega_t$, and $\omega_d$ are the weights of the energy components, which can be adjusted to suit needs.
The boundary constraint ensures that we cannot change the coordinates of the vertices at the four corners. The vertices on the boundaries are only allowed to move along the boundary edge. The minimization of the total energy constitutes a least-squares problem. By solving a system of linear equations [33], we obtain the optimal results (Fig. 3). The weight values in Eq. (14), in order, are 1, 1, 5, 1000, 500, and 500, reflecting the greater emphasis that we place on removing vertical disparity and on the consistency of depth variation in the temporal domain.
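To illustrate how a weighted quadratic energy with fixed boundary vertices reduces to a linear system, here is a deliberately simplified 1D sketch. It is not the energy of Eq. (14): it assumes a single row of vertices, a toy term s_k (x_{k+1} - x_k - l_k)^2 per edge, and plain Gaussian elimination in place of the solver of Ref. [33]. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def retarget_row(lengths, significance, target_width):
    """Minimize sum_k s_k (x_{k+1} - x_k - l_k)^2 with x_0 = 0, x_n = target_width.

    The two end vertices are fixed (boundary constraint); the interior
    vertices x_1..x_{n-1} are the unknowns of the normal equations.
    """
    n = len(lengths)
    unknowns = n - 1
    a = [[0.0] * unknowns for _ in range(unknowns)]
    b = [0.0] * unknowns
    for k in range(n):
        # Residual r_k = x_{k+1} - x_k - l_k, written as sum(c_i * x_i) + const.
        terms = []
        const = -lengths[k]
        if k + 1 == n:
            const += target_width        # x_n is fixed
        else:
            terms.append((k, 1.0))       # x_{k+1} is unknown number k
        if k > 0:
            terms.append((k - 1, -1.0))  # x_k is unknown number k-1 (x_0 = 0)
        w = significance[k]
        for i, ci in terms:
            b[i] -= w * ci * const
            for j, cj in terms:
                a[i][j] += w * ci * cj
    interior = solve_linear_system(a, b)
    return [0.0] + interior + [float(target_width)]


# Shrink three 20-pixel edges to a total width of 48; the middle edge is important.
xs = retarget_row([20.0, 20.0, 20.0], [1.0, 4.0, 1.0], 48.0)
```

In this toy setting the shrinkage is distributed in inverse proportion to the significance, so the important middle edge keeps most of its length while the outer edges absorb most of the change. Larger weights play the same role in the full energy, e.g., the weight of 1000 on the vertical disparity term.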

Experiments
Our method is efficient. Running on a PC with an Intel quad-core 3.1 GHz CPU and 4 GB RAM, our method takes about 1.5 s for every stereoscopic frame at a resolution of 720 × 416 with a quad size of 20 × 20. This timing does not use any parallelization on either a multi-core CPU or a GPU.

Without spatial disparity control
In Fig. 4, we retargeted one frame from resolution 362 × 248 to 248 × 248 twice. The spatial disparity constraints were applied in one case and not in the other. Note that • indicates a key point in the left frame and + is the location of the corresponding point in the right frame. One can readily see the inconsistent disparity values and vertical disparities in the frame retargeted without spatial disparity control.

Without temporal disparity control
In Fig. 5, we retargeted the sequence in Fig. 3 without the temporal disparity control. In the retargeting process, we tracked 90 key points in the sequence. Then we computed the disparity variation rate $d_t/d_{t+1}$ for every corresponding point, and the difference between the rate values before and after retargeting. The smaller the difference, the better the disparity variation is preserved. The average difference is 4.58%. In the same way, we obtained an average difference of 2.19% when the temporal disparity constraints were applied. This shows that the disparity variation is preserved better when we constrain the disparity in the temporal domain.
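The measurement described above can be sketched as follows. The rate d_t/d_{t+1} is taken from the text; how the per-point differences are aggregated is an assumption here (mean absolute difference of the rates, reported as a percentage), as are the function name and the choice to skip key points whose next-frame disparity is zero.

```python
def mean_rate_difference(orig_t, orig_next, out_t, out_next):
    """Mean absolute difference, in percent, between the disparity variation
    rates d_t / d_{t+1} before and after retargeting.

    Key points with a zero next-frame disparity are skipped, because the
    rate is undefined there (an assumption of this sketch).
    """
    diffs = []
    for o_t, o_n, r_t, r_n in zip(orig_t, orig_next, out_t, out_next):
        if o_n == 0 or r_n == 0:
            continue
        orig_rate = o_t / o_n   # rate in the original video
        out_rate = r_t / r_n    # rate in the retargeted video
        diffs.append(abs(out_rate - orig_rate))
    if not diffs:
        raise ValueError("no key point with a defined variation rate")
    return 100.0 * sum(diffs) / len(diffs)


# Two tracked key points (made-up disparities, in pixels).
orig_t, orig_next = [10.0, 8.0], [10.0, 10.0]
preserved = mean_rate_difference(orig_t, orig_next, [5.0, 4.0], [5.0, 5.0])
drifted = mean_rate_difference(orig_t, orig_next, [5.0, 4.0], [5.0, 4.0])
```

With these made-up values, the first retargeted sequence reproduces both original rates, while the second changes the rate at one of the two key points from 0.8 to 1.0.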

Assessment
Lacking an objective assessment method for stereo video, and executable code for other methods, we cannot definitively state that our method is better than others. But by making a comparison to the video processing results given in previous work, we can demonstrate the effectiveness of our method. In Fig. 6, we compare the performance of our method to that of a previous video retargeting method [21]. The upper two images are two sequential original frames. The objective is to improve the visual experience by reducing excessive disparity. Figure 6 shows red-cyan pictures, clearly indicating the disparity range. The method in Ref. [21] does not take account of temporal disparity control and causes disparity variation for the white car, while the disparity is stable in the original video. With our method, however, disparity variation is constrained. In the two frames, the disparity values in the blue circle are reduced and kept similar.
We further assessed the perceptual quality of our method by performing a subjective assessment based on a user study. We adopted the pair comparison method with 4 videos from "Mobile 3DTV" and produced 3 test videos for each one [34,35]. In short, as judged by 14 observers with normal stereopsis, we obtained good results: 86% correct discrimination for depth change and 81% acceptance of the picture quality.

Limitations
Via experimentation and subjective assessment, we may summarize the limitations of our method. Firstly, for videos containing many structured objects in the pictures, our method cannot produce good results because there is not enough space left for warping, which is a common problem for warping-based methods. Secondly, too large a disparity adjustment will impair the quality of the video. Layer-based resolution adaptation methods or modification based on monocular depth cues might be helpful in overcoming this problem in some cases.

Conclusions
In this paper, we have proposed a retargeting method to resize stereoscopic 3D video and adjust the disparity of video content to lie in the comfort zone. We build energy models for picture deformation and disparity mapping. To preserve important image contents, we constrain grid deformation and edge distortion. To maintain depths within the comfort zone, we constrain the disparity in the spatial and temporal domains. The spatial disparity constraints apply to horizontal and vertical disparity mapping. The temporal disparity constraints consider inter-frame disparity variation and intra-frame disparity variation. Finally, by adjusting the weight combination, our method can be applied to various view configurations. The objective of our work is to perform content-aware retargeting of the pictures of video frames while adjusting the disparity values for the contents in stereoscopic video. Although there are some limitations to our method, a user study and a comparison to the latest related work have demonstrated its capabilities. We hope that our work promotes future research in stereoscopic 3D media retargeting.

Fig. 1 Algorithm overview. Left to right: original left frame, output left frame, significance map of the left frame, warped mesh of the left frame, disparity map between the original left and right frames (• are key points in the left frame and + are those in the right frame), disparity map between the output left and right frames.

Fig. 2 Left to right: original left frame, original anaglyph frame, resizing without vertical disparity constraint, resizing with vertical disparity constraint.

Fig. 4 Retargeted frame with and without spatial disparity control. Left to right: original left frame, original anaglyph frame, retargeted anaglyph frame with spatial disparity constraints, retargeted anaglyph frame without spatial disparity constraints.

Fig. 5 Retargeted frame with and without temporal disparity control. Left to right: original anaglyph frame, retargeted anaglyph frame without temporal disparity constraints, difference in disparity variation rate without temporal disparity constraints, retargeted anaglyph frame with temporal disparity constraints, difference in disparity variation rate with temporal disparity constraints.

Fig. 6 Comparison of retargeting results. Top two images: two sequential original frames. Middle two images: results computed by the method in Ref. [21]. Bottom two images: results computed by our method. The objective is to reduce the disparity to lie within the comfort zone. In the blue circle, the disparity changes noticeably when using Ref. [21]. Using our method, the disparity variation is more consistent with the original video.