As we remap the depth range of each pair of stereoscopic video frames, we not only change the disparity of each matched keypoint pair but also preserve image features, such as the relative depth distances between neighboring features, lines, and planar surfaces. Our system automatically incorporates the detected image features, together with any user-specified high-level information, into the depth mapping process.
Depth Preservation of Neighboring Features
The 3D depth field that a viewer perceives from a stereoscopic 3D image is mainly conveyed by the relative depths among neighboring objects/features in the 3D image. Hence, we need to preserve the relative depth distances among neighboring features in order to avoid cardboard effects after depth mapping (Ward et al. 2011). This will also help preserve the 3D scene structure. We achieve this with the following constraint:
$$\begin{aligned} E_{rel-z}=\sum _{i} \sum _{j \in N_{i}} \Vert (s_{i}^{\prime }-s_{j}^{\prime })-(\hat{s}_{i}-\hat{s}_{j})\Vert ^{2}, \end{aligned}$$
(11)
where \(N_i\) is the set of neighboring keypoints of \(i\), \(\hat{s}_{i}-\hat{s}_{j}\) is the ideal disparity difference between keypoints \(i\) and \(j\) after depth mapping, and \(s_{i}^{\prime }-s_{j}^{\prime }\) is the actual disparity difference between keypoints \(i\) and \(j\) after depth mapping and optimization. In our implementation, we set the neighboring threshold to one-eighth of the image width, so that all features within this distance of keypoint \(i\) are considered neighboring features of \(i\). The energy term \(E_{rel-z}\) penalizes changes in the relative depth distances among neighboring keypoints. Our results show that this constraint helps produce smoother object depths and prevents cardboard effects.
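As a minimal sketch of how Eq. 11 could be evaluated, the following Python function sums the squared mismatch over all neighboring keypoint pairs. The function name, the argument layout, and the use of Euclidean image distance for the neighboring threshold are assumptions made for illustration; the text only fixes the threshold at one-eighth of the image width.

```python
import math


def rel_depth_energy(points, s_target, s_actual, image_width):
    """E_rel-z (Eq. 11): mismatch of disparity differences between neighbors.

    points   -- list of (x, y) keypoint positions (hypothetical layout)
    s_target -- ideal disparities (s-hat) after depth mapping
    s_actual -- disparities (s') after depth mapping and optimization
    """
    threshold = image_width / 8.0  # neighboring threshold stated in the text
    energy = 0.0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            # Assumption: "distance" means Euclidean distance in the image.
            if math.hypot(xi - xj, yi - yj) > threshold:
                continue
            diff = (s_actual[i] - s_actual[j]) - (s_target[i] - s_target[j])
            energy += diff ** 2
    return energy
```

Adding the same constant to every actual disparity leaves this energy at zero, because only differences between neighbors are penalized.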
Mesh Edge Preservation
In low texture regions, objects may be stretched or squeezed after depth mapping due to the lack of matched keypoints in these regions. Preserving object shape in these regions is therefore important. We achieve this by preserving the length of mesh edges.
Let \(x_{i,j}\) denote the horizontal coordinate of the mesh vertex in row \(i\) and column \(j\). We introduce the following energy term for preserving the horizontal length of mesh edges:
$$\begin{aligned} E_{length}=\sum _{i,j} \Vert (x^{\prime }_{i,j+1}-x^{\prime }_{i,j})-(x_{i,j+1}-x_{i,j}) \Vert ^{2}. \end{aligned}$$
(12)
We also try to keep vertical mesh edges straight, so that vertices in the same column remain aligned, as follows:
$$\begin{aligned} E_{align}=\sum _{i,j} \Vert 2x^{\prime }_{i,j}-x^{\prime }_{i+1,j}-x^{\prime }_{i-1,j} \Vert ^{2}, \end{aligned}$$
(13)
where \(x_{i+1,j}\) and \(x_{i-1,j}\) are the horizontal coordinates of the mesh vertices directly above and below \(x_{i,j}\) in column \(j\), and \(x^{\prime }_{i+1,j}\), \(x^{\prime }_{i-1,j}\) and \(x^{\prime }_{i,j}\) are the corresponding horizontal coordinates after depth mapping.
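The two mesh terms in Eqs. 12 and 13 can be sketched in a few lines of Python. The mesh is assumed here to be stored as a list of rows of horizontal vertex coordinates, and the boundary rows are skipped in the alignment term because they lack one vertical neighbor; both choices are illustrative assumptions rather than details given in the text.

```python
def length_energy(x_new, x_old):
    """E_length (Eq. 12): change in horizontal edge length over the mesh."""
    energy = 0.0
    for row_new, row_old in zip(x_new, x_old):
        for j in range(len(row_new) - 1):
            d = (row_new[j + 1] - row_new[j]) - (row_old[j + 1] - row_old[j])
            energy += d ** 2
    return energy


def align_energy(x_new):
    """E_align (Eq. 13): second difference down each column, interior rows."""
    energy = 0.0
    for i in range(1, len(x_new) - 1):
        for j in range(len(x_new[i])):
            d = 2 * x_new[i][j] - x_new[i + 1][j] - x_new[i - 1][j]
            energy += d ** 2
    return energy
```

A uniform horizontal shift of the whole mesh gives zero for both terms, whereas moving a single interior vertex sideways is penalized by both.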
Line Preservation
Straight lines appearing in a movie often cross multiple quads constructed by our depth mapping algorithm. Consider a line \(l\) and denote the sequence of points at which it crosses mesh edges (both horizontal and vertical) by \((x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{n},y_{n})\). After depth mapping, their coordinates become \((x^{\prime }_{1},y^{\prime }_{1}),(x^{\prime }_{2},y^{\prime }_{2}),\ldots ,(x^{\prime }_{n},y^{\prime }_{n})\). The vertical coordinate of each pixel does not change after depth mapping, so \(y^{\prime }_{i}=y_{i}\), and the line stays straight when consecutive segments keep the same slope, i.e., \(\frac{x^{\prime }_{i-1}-x^{\prime }_{i}}{y_{i-1}-y_{i}}=\frac{x^{\prime }_{i}-x^{\prime }_{i+1}}{y_{i}-y_{i+1}}\). We therefore introduce the following energy term to prevent lines from bending:
$$\begin{aligned} E_{line}=\sum _{l} \sum _{i \in l} \Vert (x^{\prime }_{i-1}-x^{\prime }_{i})-\frac{y_{i-1}-y_{i}}{y_{i}-y_{i+1}}(x^{\prime }_{i}-x^{\prime }_{i+1}) \Vert ^{2}. \end{aligned}$$
(14)
Generally speaking, since points on a line may have different disparity values, a line projected to the left and to the right images may be rotated in opposite directions after depth mapping. Hence, we should allow lines to rotate.
This is different from image resizing, where line rotation should not be allowed. However, if a line is vertical and all points on it have the same disparity, its orientation should be maintained after depth mapping. For example, if a pillar is perpendicular to the ground and all points on it have the same disparity, it should remain perpendicular to the ground after depth mapping.
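The following Python sketch evaluates Eq. 14 for a set of lines, each given as an ordered list of crossing points \((x^{\prime }_{i}, y_{i})\). The data layout is hypothetical, and skipping triples whose last two points share the same vertical coordinate (where the ratio in Eq. 14 is undefined) is an assumption of this sketch, not something the text specifies.

```python
def line_energy(lines):
    """E_line (Eq. 14): penalize bending of lines after depth mapping.

    lines -- list of lines; each line is a list of (x_new, y) crossing points
             ordered along the line (y is unchanged by depth mapping).
    """
    energy = 0.0
    for pts in lines:
        for i in range(1, len(pts) - 1):
            (x0, y0), (x1, y1), (x2, y2) = pts[i - 1], pts[i], pts[i + 1]
            if y1 == y2:
                # Assumption: skip degenerate triples with an undefined ratio.
                continue
            ratio = (y0 - y1) / (y1 - y2)
            d = (x0 - x1) - ratio * (x1 - x2)
            energy += d ** 2
    return energy
```

Because the term compares only consecutive slopes, a line that rotates as a whole but stays straight has zero energy, which is consistent with allowing lines to rotate.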
Plane Preservation
As our method is based on adjusting the disparity of matched keypoints and on image warping, planes may be distorted after depth mapping. The main reason is that keypoints that originally lie on a 3D plane may no longer lie on the same plane after depth mapping. We prove this in the Appendix.
We address this problem by fitting planes in the original 3D image and then preserving them in the depth mapping process. Let \((x_{l}, y)\) and \((x_{r}, y)\) be a pair of matched keypoints on a plane. If we fix our viewpoint at \((\frac{w}{2\beta }, \frac{h}{2\beta }, 0)\), where \(h\) and \(w\) are the height and width of a 3D image, the coordinates in 3D space of the point corresponding to the matched keypoint pair are:
$$\begin{aligned} \begin{aligned} X&= \frac{e}{e\beta -(x_{r}-x_{l})}\left(\frac{x_{r}+x_{l}-w}{2}\right), \\ Y&= \frac{e}{e\beta -(x_{r}-x_{l})}\left(y-\frac{h}{2}\right), \\ Z&= \frac{et\beta }{e\beta -(x_{r}-x_{l})}. \end{aligned} \end{aligned}$$
(15)
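Eq. 15 translates directly into code. In the sketch below, the function name is hypothetical, and \(e\), \(\beta \) and \(t\) are simply passed in as numeric parameters, since their definitions lie outside this section. The explicit check for a vanishing denominator is an added safeguard, not part of the formula.

```python
def keypoint_to_3d(x_l, x_r, y, w, h, e, beta, t):
    """3D position of a matched keypoint pair (x_l, y), (x_r, y) per Eq. 15."""
    denom = e * beta - (x_r - x_l)
    if denom == 0.0:
        # Safeguard (assumption): the formula is undefined for this disparity.
        raise ValueError("disparity x_r - x_l equals e * beta")
    scale = e / denom
    X = scale * (x_r + x_l - w) / 2.0
    Y = scale * (y - h / 2.0)
    Z = scale * t * beta
    return X, Y, Z
```

For a pair with zero disparity at the image center, the formula gives \(X=Y=0\) and \(Z=t\).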
We extract 3D planes from the original 3D image and then identify matched keypoints that are on the same planes as follows:
1. We triangulate the keypoints on the original left (or right) image and compute the normal of each triangle.
2. If the normals of some adjacent triangles are similar, we combine these triangles to form a small plane and update its normal.
3. We further cluster small adjacent planes into larger ones and update their normals iteratively until no two plane clusters can be combined together.
4. Finally, if a plane cluster contains at least a certain number of keypoints (30 in our implementation), we output this as a plane.
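The four steps above can be sketched as a region-merging procedure over triangles. This is an illustration under several assumptions that the text does not fix: the triangulation is supplied by the caller as index triples, two normals count as "similar" when they differ by less than a hypothetical `max_angle_deg` of 5 degrees, and a cluster normal is taken as the normalized sum of its member triangle normals. All function and parameter names are made up for this sketch.

```python
import math


def _unit(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(c * c for c in v))
    return v if length == 0.0 else tuple(c / length for c in v)


def _triangle_normal(p, q, r):
    """Unit normal of triangle pqr, flipped so that its Z component is >= 0."""
    u = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
    v = (r[0] - p[0], r[1] - p[1], r[2] - p[2])
    n = _unit((u[1] * v[2] - u[2] * v[1],
               u[2] * v[0] - u[0] * v[2],
               u[0] * v[1] - u[1] * v[0]))
    return n if n[2] >= 0.0 else tuple(-c for c in n)


def extract_planes(points3d, triangles, max_angle_deg=5.0, min_keypoints=30):
    """Merge adjacent triangles with similar normals; return keypoint lists."""
    cos_limit = math.cos(math.radians(max_angle_deg))

    # Step 1: one normal per triangle (stored as a running sum per cluster).
    normal_sum = {t: _triangle_normal(*(points3d[v] for v in tri))
                  for t, tri in enumerate(triangles)}

    # Two triangles are adjacent when they share an edge.
    edge_to_tris = {}
    for t, (a, b, c) in enumerate(triangles):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_tris.setdefault((min(u, v), max(u, v)), []).append(t)
    adjacent = sorted(tuple(ts) for ts in edge_to_tris.values() if len(ts) == 2)

    parent = list(range(len(triangles)))

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Steps 2 and 3: merge adjacent clusters until no pair can be combined.
    merged = True
    while merged:
        merged = False
        for s, t in adjacent:
            rs, rt = find(s), find(t)
            if rs == rt:
                continue
            ns, nt = _unit(normal_sum[rs]), _unit(normal_sum[rt])
            if sum(x * y for x, y in zip(ns, nt)) >= cos_limit:
                parent[rt] = rs
                # Update the cluster normal with the absorbed cluster's normals.
                normal_sum[rs] = tuple(x + y for x, y in
                                       zip(normal_sum[rs], normal_sum[rt]))
                merged = True

    # Step 4: keep clusters that contain enough keypoints.
    clusters = {}
    for t, tri in enumerate(triangles):
        clusters.setdefault(find(t), set()).update(tri)
    return [sorted(k) for k in clusters.values() if len(k) >= min_keypoints]
```

The 3D points could come from Eq. 15, and the triangulation from any 2D triangulation of the keypoints in one view. Keypoints on the seam between two planes are reported in both clusters in this sketch.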
After we have obtained the objective coordinates of the keypoints in each frame, we fit a plane to each set of keypoints that originally lie on the same 3D plane. We use \(D=a_{l}x+b_{l}y+c_{l}\) to represent a plane in the left image and \(D=a_{r}x+b_{r}y+c_{r}\) to represent its corresponding plane in the right image (see Appendix), where \(D\) is the disparity value. The parameters \(a_{l},b_{l},c_{l},a_{r},b_{r}\), and \(c_{r}\) are solved for by a least-squares method.
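A least-squares fit of \(D=ax+by+c\) can be sketched with the normal equations and Cramer's rule, in plain Python so that the example has no dependencies. The function names and the `(x, y, D)` sample layout are assumptions for illustration; in practice a library routine such as `numpy.linalg.lstsq` would be a more usual and numerically more robust choice.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_disparity_plane(samples):
    """Least-squares fit of D = a*x + b*y + c to a list of (x, y, D) samples."""
    sxx = sxy = syy = sx = sy = sxd = syd = sd = 0.0
    for x, y, d in samples:
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        sxd += x * d
        syd += y * d
        sd += d
    n = float(len(samples))
    # Normal equations (A^T A) p = A^T d for design-matrix rows [x, y, 1].
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = _det3(m)
    if abs(det) < 1e-12 * max(1.0, sxx * syy * n):
        raise ValueError("samples are (nearly) collinear; plane is not unique")
    params = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]  # Cramer's rule: replace one column by rhs
        params.append(_det3(mc) / det)
    return tuple(params)  # (a, b, c)
```

For the left image the samples would be built from \((x_{l}, y, D)\), and for the right image from \((x_{r}, y, D)\), giving one parameter triple per view.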
To ensure that a matched keypoint pair, \((\hat{x}_{i,l},y_{i})\) and \((\hat{x}_{i,r},y_{i})\), lies on the target plane, which is mapped separately to the left and right images, we define the following energy terms:
$$\begin{aligned} \begin{aligned} E_{lp,i}&= \Vert a_{l}\tilde{x}_{i,l}+b_{l} y_{i}+c_{l}-(\tilde{x}_{i,l}-\tilde{x}_{i,r})\Vert ^{2},\\ E_{rp,i}&= \Vert a_{r}\tilde{x}_{i,r}+b_{r} y_{i}+c_{r}-(\tilde{x}_{i,l}-\tilde{x}_{i,r})\Vert ^{2}, \end{aligned} \end{aligned}$$
(16)
where \((\tilde{x}_{i,l},y_{i})\) and \((\tilde{x}_{i,r},y_{i})\) are the target positions of \((\hat{x}_{i,l},y_{i})\) and \((\hat{x}_{i,r},y_{i})\), respectively, after optimization.
In practice, \(a_{l},a_{r}\) and \(b_{l},b_{r}\) are always smaller than \(0.01\). If we simply solve Eq. 16 together with the constraint \(\tilde{x}_{i,r}-\tilde{x}_{i,l}=\hat{x}_{i,l}-\hat{x}_{i,r}\), the objective horizontal coordinates of the matched keypoint pair may be shifted far away from their original positions because of these small coefficients. Hence, we introduce another energy term, which prevents keypoints from being shifted too far away from their original positions:
$$\begin{aligned} E_{cp,i}=\Vert \tilde{x}_{i,l}-\hat{x}_{i,l} \Vert ^{2} + \Vert \tilde{x}_{i,r}-\hat{x}_{i,r} \Vert ^{2}. \end{aligned}$$
(17)
By minimizing the following energy term and assigning the computed values of \((\tilde{x}_{i,l},y_{i})\) and \((\tilde{x}_{i,r},y_{i})\) to \((\hat{x}_{i,l},y_{i})\) and \((\hat{x}_{i,r},y_{i})\), we may update the objective horizontal coordinates of a set of matched keypoint pairs that fall on the target plane:
$$\begin{aligned} (\tilde{x}_{l}, \tilde{x}_{r}) = \mathop {\text{ argmin}}_{\tilde{x}_{l}, \tilde{x}_{r}} \sum _{i \in h} (E_{lp,i}+E_{rp,i}+\omega E_{cp,i}), \end{aligned}$$
(18)
where \(i\) is a keypoint on plane \(h\) and \(\omega \) is empirically set to \(1.0\times 10^{-5}\) in our experiments. We can then define our plane preservation energy term as
$$\begin{aligned} E_{plane}=\sum _{h} \sum _{i \in h} (\Vert x^{\prime }_{i,l}-\hat{x}_{i,l} \Vert ^{2}+\Vert x^{\prime }_{i,r}-\hat{x}_{i,r} \Vert ^{2}). \end{aligned}$$
(19)
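Once the plane parameters are fixed, the objective in Eq. 18 is quadratic and separates per keypoint, so each pair \((\tilde{x}_{i,l},\tilde{x}_{i,r})\) can be obtained from a 2x2 linear system. The sketch below is one possible way to do this, derived by setting the gradient of \(E_{lp,i}+E_{rp,i}+\omega E_{cp,i}\) to zero; the function names and the tuple layout of the plane parameters are hypothetical.

```python
def plane_objective(xl, xr, y, left, right, xl_hat, xr_hat, omega):
    """E_lp + E_rp + omega * E_cp for one keypoint pair (Eqs. 16 and 17)."""
    a_l, b_l, c_l = left
    a_r, b_r, c_r = right
    d = xl - xr  # disparity convention used in Eq. 16
    e_lp = (a_l * xl + b_l * y + c_l - d) ** 2
    e_rp = (a_r * xr + b_r * y + c_r - d) ** 2
    e_cp = (xl - xl_hat) ** 2 + (xr - xr_hat) ** 2
    return e_lp + e_rp + omega * e_cp


def snap_to_plane(xl_hat, xr_hat, y, left, right, omega=1.0e-5):
    """Minimize the per-keypoint objective of Eq. 18 in closed form."""
    a_l, b_l, c_l = left
    a_r, b_r, c_r = right
    # Residuals: r1 = p*xl + xr + k1 and r2 = -xl + q*xr + k2.
    p, q = a_l - 1.0, a_r + 1.0
    k1 = b_l * y + c_l
    k2 = b_r * y + c_r
    # Normal equations M [xl, xr]^T = rhs from a zero gradient.
    m11 = p * p + 1.0 + omega
    m12 = p - q
    m22 = 1.0 + q * q + omega
    rhs1 = -(p * k1 - k2) + omega * xl_hat
    rhs2 = -(k1 + q * k2) + omega * xr_hat
    det = m11 * m22 - m12 * m12  # positive whenever omega > 0
    xl = (rhs1 * m22 - m12 * rhs2) / det
    xr = (m11 * rhs2 - m12 * rhs1) / det
    return xl, xr
```

The determinant is small when the plane coefficients and \(\omega \) are small, which mirrors the remark above that small coefficients can shift keypoints far from their original positions; the \(\omega E_{cp,i}\) term is what keeps the system well posed.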
In conclusion, for each frame in the depth mapping process, we compute the optimized coordinates of mesh vertices associated with the stereo frame by minimizing the following energy term:
$$\begin{aligned} E_{frame}&= E_z + w_{2} E_{rel-z} + w_{3} E_{length} + w_{4} E_{align} \nonumber \\&+ w_{5} E_{line} + w_{6} E_{plane}. \end{aligned}$$
(20)
In our experiments, we set the parameter values as follows: \(w_{2}=2.0,w_{3}=1.0,w_{4}=1.0,w_{5}=1000\) and \(w_{6}=100\).
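For illustration, the weighted sum in Eq. 20 with the reported weights can be written as follows. The dictionary keys and the function name are hypothetical, and the individual energy values are assumed to have been computed elsewhere.

```python
# Weights w_2 ... w_6 as reported in the text; E_z carries no weight in Eq. 20.
WEIGHTS = {"rel_z": 2.0, "length": 1.0, "align": 1.0,
           "line": 1000.0, "plane": 100.0}


def frame_energy(e_z, terms, weights=WEIGHTS):
    """E_frame (Eq. 20): E_z plus the weighted feature-preservation terms."""
    return e_z + sum(weights[name] * value for name, value in terms.items())
```

With these values, a given amount of line bending or plane deviation costs far more than the same amount of edge-length or alignment error.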