Depth Estimation using Modified Cost Function for Occlusion Handling

The paper presents a novel approach to the occlusion handling problem in depth estimation from three views. A solution based on a modification of the similarity cost function is proposed. During depth estimation via optimization algorithms such as Graph Cut, the similarity metric is constantly updated so that only non-occluded fragments in the side views are considered. At each iteration of the algorithm, non-occluded fragments are detected based on side-view virtual depth maps synthesized from the best currently estimated depth map of the center view. The similarity metric is then updated so that the correspondence search is performed only in non-occluded regions of the side views. Experiments conducted on well-known 3D video test sequences show that depth maps estimated with the proposed approach provide about 1.25 dB improvement in virtual view quality compared with virtual views synthesized from depth maps generated by the state-of-the-art MPEG Depth Estimation Reference Software.


I. INTRODUCTION
3D video systems have recently gained a lot of attention.
Many new 3D video systems have been developed, such as super multiview television and free viewpoint television. In free viewpoint television, a user can freely choose the position of a virtual camera. The requested view of the scene is generated from a dynamic 3D representation of the scene.
The most commonly used 3D representation is MultiVideo and Depth (MVD) [6], composed of multiple videos acquired by a set of cameras and accompanying depth maps for each of the views. Based on the transmitted videos and depth data, any view can be easily generated by employing depth-image-based rendering (DIBR) [7].
Recently, 3D extensions of standards such as AVC [32], [33] and HEVC [31], which allow for efficient transmission of a dynamic 3D scene representation in the MVD format, have been finalized.
Depth information in such systems can be acquired either directly by depth cameras [8], or indirectly by algorithmic depth estimation from recorded videos [9]. Commonly, depth information is obtained by conversion from disparity information [10]. Although in computer vision disparity d is often treated as synonymous with depth (distance z), the two quantities are in fact inversely related. Disparity is a displacement vector between corresponding fragments (pixels, blocks) of two images of the same scene taken from different viewpoints. The two corresponding fragments represent the same fragment of the observed scene, seen from two different viewpoints. Stereo correspondence search is an active research topic in computer vision and one of the basic methods of obtaining disparity information. Many stereo disparity estimation methods are known. A comprehensive study of them can be found in [34] and on the Middlebury webpage [30], which contains an up-to-date benchmark of stereo disparity estimation methods. With the development of multiview systems, stereo correspondence search has been extended to multiview correspondence search [11], [12], [35].
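The inverse relation between disparity d and depth z noted above is commonly written, for a rectified camera pair, as shown below. The symbols f (focal length expressed in pixels) and b (baseline, i.e. the distance between the two cameras) are not defined in this paper and are introduced here only for illustration.

```latex
z = \frac{f \cdot b}{d}
\qquad\Longleftrightarrow\qquad
d = \frac{f \cdot b}{z}
```

Under these assumptions, doubling the distance of an object halves its disparity, which is why closer objects exhibit bigger disparities.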
For the sake of simplicity and accuracy, many algorithms assume that images are taken by a rectified set of cameras [13], [14]. Consequently, corresponding fragments of a given image can be found on the same horizontal line in the remaining images.
Some algorithms use three views (left, central and right) [15], [17], [16], [36] as inputs and produce a disparity map or a depth map for the central view (Fig. 1). When it is not important whether the left or the right view is referred to, the name "side view" is often used instead.
During disparity estimation, for a given fragment of the central view, the algorithm searches the side views for the corresponding fragment, i.e. the one that represents the same portion of the scene.
The correspondence search is done on the basis of a Similarity Metric, which expresses how probable it is that a certain fragment of one image corresponds to a given fragment of the second image. Although the metric is often called similarity, it actually expresses dissimilarity between fragments. Many Similarity Metrics are known from the literature: Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), Rank, Census, Cross Correlation and others [3], [4].
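As an illustration of the two simplest metrics named above, the following minimal sketch computes SAD and SSD between two small blocks. The function names and the sample blocks are hypothetical and are not taken from any particular implementation.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized blocks (lists of rows)."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def ssd(block_a, block_b):
    """Sum of Squared Differences between two equally sized blocks (lists of rows)."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Hypothetical 2x2 luminance blocks: one from the center view, one candidate
center_block = [[10, 12], [11, 13]]
candidate_block = [[10, 15], [9, 13]]

print(sad(center_block, candidate_block))  # 0 + 3 + 2 + 0 = 5
print(ssd(center_block, candidate_block))  # 0 + 9 + 4 + 0 = 13
```

Both values are zero for identical blocks and grow with the difference between them, which is why such metrics in fact measure dissimilarity.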
The correspondence search is often defined as an optimization problem in which, for every fragment of the central image, the best (the most similar) fragment of the side views is selected. This optimization problem may be expressed in terms of an energy function using a Markov Random Field (MRF) and solved with one of the optimization algorithms such as Belief Propagation [19], [20], Dynamic Programming [22], or Graph Cut [21].
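A generic form of such an MRF energy function is sketched below. This is a common textbook formulation given under stated assumptions, not necessarily the exact function used in this paper: d denotes the disparity map, d_p the disparity assigned to pixel p, Cost the data term built from the similarity metric, V a smoothness penalty between neighboring pixels, N the set of neighboring pixel pairs, and lambda a weight of the smoothness term. The symbols V, N and lambda are introduced here for illustration only.

```latex
E(d) = \sum_{p} \mathrm{Cost}(p, d_p)
     + \lambda \sum_{(p, q) \in \mathcal{N}} V(d_p, d_q)
```

The optimization algorithm searches for the disparity map d that minimizes E(d). The first term favors similar fragments, and the second term favors smooth disparity maps.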
Since input videos are captured by multiple cameras at different positions, some parts of the observed scene can be occluded, and thus not visible, in some of the views. Disparity estimation for those fragments of the scene is challenging and requires special care. If the algorithm does not properly take possible occlusions within the scene into account, the estimated disparity can be wrong, i.e. it can indicate fragments that do not truly correspond.
In this paper a novel approach to occlusion handling designed to work in three-view disparity estimation algorithms is proposed.

II. OCCLUSION PROBLEM IN DISPARITY ESTIMATION
Given three images of the same size, center $I_C$, left $I_L$ and right $I_R$, we search, for every pixel P of the center view (at coordinates (x, y)), for the displacement t that minimizes a cost function expressing the similarity between pixel P (or a small fragment around it, such as a block) and the corresponding pixel P' (or a small fragment around P') displaced by t in the side views (at position (x + t, y) in the left view and (x − t, y) in the right view). This displacement is then the disparity of the given pixel P of the center view.
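The search described above can be sketched for a single pixel of one scanline as follows. This is a simplified, local winner-takes-all illustration under stated assumptions: the cost is a pixel-wise absolute difference summed over both side views, the function names and sample data are hypothetical, and the global optimization used by actual estimation algorithms is not modeled.

```python
def sum_cost(center, left, right, x, t):
    """Absolute difference to the left pixel (x + t) plus the right pixel (x - t)."""
    return abs(center[x] - left[x + t]) + abs(center[x] - right[x - t])


def best_displacement(center, left, right, x, max_t):
    """Return the displacement t in [0, max_t] with the lowest cost at column x."""
    width = len(center)
    # Keep only displacements whose side-view positions fall inside the image
    candidates = [t for t in range(max_t + 1) if x - t >= 0 and x + t < width]
    return min(candidates, key=lambda t: sum_cost(center, left, right, x, t))


# One scanline per view; the bright pixel (value 90) is displaced by 2 columns
center = [10, 10, 10, 90, 10, 10, 10]
left = [10, 10, 10, 10, 10, 90, 10]
right = [10, 90, 10, 10, 10, 10, 10]

print(best_displacement(center, left, right, x=3, max_t=3))  # 2
```

For the bright pixel at x = 3, only t = 2 matches in both side views, so this displacement is selected as its disparity.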
In disparity estimation from three views (see Fig. 2), a given point of the scene visible from the center view can be visible from both side views (point A), from only one of them (left or right, point B), or from neither of them (point C).
If a given fragment of the scene visible from the center view is not visible from one or both of the side views, we say that the fragment is occluded in that side view (it is not visible from that particular side view).
The simplest method for detecting occluded fragments is cross-checking [23]. Cross-checking tests the consistency of the disparity values estimated for pixels of the center view with those estimated for pixels of the left and right views. If the disparity values estimated in each view differ for a corresponding triple of pixels from the center, left and right views, the given pixels are assumed to be occluded. Next, the disparity values of occluded pixels are extrapolated from neighboring pixels that are not occluded. Cross-checking requires disparity maps for all three views. Estimating three disparity maps is not always possible and, even when it is possible, it is resource- and time-consuming.
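The consistency test behind cross-checking can be sketched as follows for one scanline and one side view. This is a minimal illustration, assuming integer disparities and the convention that a center pixel at column x with disparity d corresponds to column x + d in the left view. The function name and the tolerance parameter are hypothetical, and the check against the right view (column x − d) would be analogous.

```python
def cross_check_left(d_center, d_left, tolerance=0):
    """Return a list of booleans: True where center and left disparities agree."""
    width = len(d_center)
    consistent = []
    for x, d in enumerate(d_center):
        x_left = x + d  # position of the corresponding pixel in the left view
        if 0 <= x_left < width:
            consistent.append(abs(d_left[x_left] - d) <= tolerance)
        else:
            consistent.append(False)  # correspondence falls outside the image
    return consistent


# Hypothetical disparity scanlines estimated independently for both views
d_center = [1, 1, 3, 1, 1, 1]
d_left = [1, 1, 1, 1, 3, 3]

print(cross_check_left(d_center, d_left))
# [True, True, True, False, False, False]
```

Pixels flagged False would be treated as occluded, and their disparities would then be extrapolated from consistent neighbors.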
Occlusion handling is also commonly performed by adding constraints, such as the ordering constraint or the uniqueness constraint, to the objective function of optimization procedures such as Graph Cut (GC), Dynamic Programming (DP) or Belief Propagation (BP) used to estimate the disparity map.
The ordering constraint [24] imposes the same order of corresponding pixels in all views. If pixel A is to the left of pixel B in the center view, then in the side view pixel A', corresponding to pixel A, must also be to the left of pixel B', corresponding to pixel B. In real scenes the ordering constraint can be violated in the case of a big perspective change or in the case of thin objects. In such cases the ordering constraint can introduce errors in the estimated disparity maps.
The uniqueness constraint [25], [26] imposes a one-to-one correspondence between pixels in the center and side views. If a given pixel A of the central view is assigned a corresponding pixel B in the side view, no other pixel of the central view can be assigned to pixel B. This way a unique pixel-to-pixel correspondence is forced across all of the views.
Many disparity estimation algorithms are known that handle occlusions efficiently [5], [28], [26]. Their main drawback is that the additional constraints (terms) imposed in the optimization procedure increase the complexity, and thus the execution time, of the disparity estimation.
Another approach to occlusion handling is to change the cost term (eq. 2), which is composed of the similarity metric, in the optimization algorithm. As we search for the fragment corresponding to a fragment of the central view in both side views simultaneously, there are many ways of defining the Cost(x, y, t) function.
Commonly [17], [16], it is the sum of the similarity metrics between a fragment in the center view and the corresponding fragments in the left and right views.
Because of occlusions, Tanimoto [15] proposed to pick just the most similar fragment from either the left or the right view. The intuition is that an occluded fragment will yield a less similar match, thus the minimum of the similarity metrics from the left and right views is used.
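The two definitions described above can be written as follows. This is a reconstruction from the textual description only. The symbols S_L(x, y, t) and S_R(x, y, t) are introduced here as assumptions and denote the similarity metric between the fragment of the center view at (x, y) and the fragments at (x + t, y) in the left view and (x − t, y) in the right view, respectively.

```latex
\mathrm{Cost}_{\mathrm{sum}}(x, y, t) = S_L(x, y, t) + S_R(x, y, t)

\mathrm{Cost}_{\mathrm{min}}(x, y, t) = \min\bigl( S_L(x, y, t),\; S_R(x, y, t) \bigr)
```

The sum lets an occluded side view contaminate the cost, whereas the minimum discards one side view even when both are visible and could be used together.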

III. PROPOSED OCCLUSION HANDLING
As stated before, a given fragment of a scene visible from the center view can be occluded in one or both side views (left and/or right) (Fig. 2). In such a case, searching for a correspondence of a given pixel of the center view in that particular side view is pointless, as the given fragment of the scene is not visible from it. Considering the correspondence with an occluded fragment of an image could cause errors in the estimated disparity.
Therefore, the correspondence search should be performed only in those side views in which the considered fragment of the center view is not occluded. The cost function should be constructed so that it considers only the similarity metrics from non-occluded views. If a given fragment is visible in both side views, the cost function should be the average of both similarity metrics, in order to reduce the influence of noise, which is present in all views. We propose to define the cost function so that it considers only the similarity metrics of fragments from non-occluded views (eq. 5), where $NotOcc_L(x, y, t)$ and $NotOcc_R(x, y, t)$ express whether a given pixel of the center view is not occluded in the left and right views, respectively. Depending on the occlusions in the views, the sum $NotOcc_L(x, y, t) + NotOcc_R(x, y, t)$ in the denominator of eq. 5 is 2 if the pixel is occluded in neither side view, 1 if it is occluded in one of the side views (either left or right), and 0 if it is occluded in both side views. If a given pixel is occluded in both side views, eq. 5 loses its meaning, so in that case a constant penalty value is used as the cost value.
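Based on the description above, eq. 5 can be reconstructed as the following visibility-weighted average. This is a sketch inferred from the text, not a verbatim copy of the original equation. S_L and S_R are assumed symbols for the similarity metrics computed against the left and right views.

```latex
\mathrm{Cost}(x, y, t) =
\frac{\mathrm{NotOcc}_L(x, y, t)\, S_L(x, y, t) + \mathrm{NotOcc}_R(x, y, t)\, S_R(x, y, t)}
     {\mathrm{NotOcc}_L(x, y, t) + \mathrm{NotOcc}_R(x, y, t)}
```

With both flags equal to 1 this is the average of the two metrics, and with a single flag equal to 1 it reduces to the metric of the visible view. With both flags equal to 0 the denominator vanishes, which is why a constant penalty is needed in that case.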
Cost(x, y, t) = const
But why is a given fragment of a scene (object A) not visible in a side view? Because in that side view the fragment is occluded by some other part of the scene (object B). Object B blocks the light rays from object A, so in the side view the closer object B is visible instead of the farther object A.
Consider an example with a farther object (point A) and a closer object (point B). The distance to the camera z is inversely proportional to disparity, so a fragment of an image representing the closer object (point B) has a bigger disparity than a fragment representing the farther object (point A).
For a given pixel $A_{Center}$ of the center view at coordinates (x, y) and a considered displacement t, the corresponding pixel $A_{Left}$ in the left view should be at coordinates (x + t, y). So, to check whether fragment A of the scene is occluded in the left view, we have to check the disparity (distance) assigned to the considered corresponding pixel $A_{Left}$ in the left view. If the disparity $d_{Left}(x + t, y)$ already assigned to $A_{Left}$ is bigger than the considered displacement t, then pixel $A_{Left}$ probably does not belong to the same object A but rather to some other, closer object B that occludes object A in the left view.
Based on this consideration, we can create a function assessing whether, for a pixel at coordinates (x, y) and displacement t, the corresponding pixel is occluded or not in the left and right views.
$NotOcc(x, y, t)$ equal to 1 means that the corresponding pixel in the side view at the given displacement is probably not occluded.
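The occlusion test and the resulting cost can be sketched together for one scanline as follows. This is a minimal illustration under stated assumptions: the side-view pixel is treated as not occluded when its already assigned disparity is not bigger than t, the function names are hypothetical, and the penalty value of 255.0 is an arbitrary placeholder for the constant, whose actual value is not given here.

```python
def not_occluded(side_disparity, x_side, t):
    """Return 1 if the side-view pixel at column x_side is probably not occluded.

    The pixel is treated as occluded when the disparity already assigned to it
    is bigger than the considered displacement t (a closer object is there).
    """
    return 1 if side_disparity[x_side] <= t else 0


def modified_cost(s_left, s_right, vis_left, vis_right, penalty=255.0):
    """Average the similarity metrics of the non-occluded side views only."""
    visible = vis_left + vis_right
    if visible == 0:
        return penalty  # occluded in both side views: constant penalty
    return (vis_left * s_left + vis_right * s_right) / visible


# Hypothetical side-view disparity scanlines; a close object sits at left column 4
d_left = [0, 0, 0, 0, 5, 0, 0]
d_right = [0, 0, 0, 0, 0, 0, 0]

x, t = 3, 1
vis_left = not_occluded(d_left, x + t, t)    # left pixel at x + t: 5 > 1, occluded
vis_right = not_occluded(d_right, x - t, t)  # right pixel at x - t: 0 <= 1, visible

print(modified_cost(80.0, 4.0, vis_left, vis_right))  # 4.0
```

In this example the poor match in the occluded left view (80.0) is ignored and only the right-view metric contributes to the cost.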

IV. APPLICATION OF PROPOSED IDEA
The proposed idea is general: it does not impose any particular source of the disparity maps $d_{Left}$ for the left view and $d_{Right}$ for the right view. In general, however, disparity maps for the left and right views are unknown before the disparity for the central view is estimated.
Commonly, disparity maps are estimated iteratively with the use of algorithms such as Belief Propagation or Graph Cut. At each iteration, such an algorithm maintains the best disparity map estimated so far for the center view. This disparity map is further refined in the next iteration of the algorithm.
For our occlusion detection, we propose to use disparity maps of the side views created from the disparity map of the center view through Depth-Image-Based Rendering (DIBR). After each iteration of the disparity estimation algorithm, we create disparity maps of the side views ($d_{Left}$ and $d_{Right}$) from the best disparity map estimated so far for the center view. This way, if the estimation algorithm has already assigned some disparity $d_{Center}(B)$ to some pixel $B_{Center}$, then pixel $A_{Center}$ cannot have a disparity such that its corresponding pixel $A_{Left}$ (Fig. 3) is at the same position as the corresponding pixel $B_{Left}$ of pixel $B_{Center}$. In other words, fragment B of the scene, represented by pixel $B_{Center}$ in the center view, should occlude fragment A of the scene (represented by pixel $A_{Center}$ in the center view) as seen from the left view.
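The synthesis of a side-view disparity map from the center-view disparity map can be sketched as a simple forward warping of one scanline. This is a simplified illustration, not the DIBR procedure of any particular software: it assumes integer disparities, maps a center pixel at column x to column x + d in the left view, resolves conflicts in favor of the bigger disparity (the closer object), and marks unfilled positions with a hypothetical hole value of -1.

```python
def synthesize_left_disparity(d_center, hole=-1):
    """Forward-warp a center-view disparity scanline into the left view."""
    width = len(d_center)
    d_left = [hole] * width
    for x, d in enumerate(d_center):
        x_left = x + d  # a center pixel at x lands at x + d in the left view
        if 0 <= x_left < width:
            # When two pixels land at the same position, the closer one
            # (bigger disparity) occludes the farther one
            d_left[x_left] = max(d_left[x_left], d)
    return d_left


# Hypothetical current estimate: a foreground pixel (disparity 3) at column 2
d_center = [1, 1, 3, 1, 1, 1, 1]

print(synthesize_left_disparity(d_center))
# [-1, 1, 1, -1, 1, 3, 1]
```

At column 5 of the synthesized map the foreground disparity 3 wins over the background disparity 1, so a background pixel considered at displacement 1 would be detected as occluded there. Holes keep the value -1, which never exceeds a non-negative displacement, so in this sketch they are treated as not occluding.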

V. EXPERIMENTS
We have implemented our idea in Depth Estimation Reference Software (DERS) [18] version 5.0, developed by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) during work on 3D video compression standardization. DERS is a state-of-the-art disparity estimation technique designed with 3D video applications in mind. It uses Graph Cut as the optimization algorithm, along with many other techniques that improve and/or speed up disparity estimation from three input videos.
The proposed approach was tested on four 3D video test sequences recommended by the MPEG committee (Fig. 5). In applications such as free viewpoint television, disparity maps are used mainly for the purpose of view synthesis. Therefore, we have evaluated the proposed method indirectly, by assessing the quality of the synthesized views. Disparity maps for two views A and B (Fig. 4) have been estimated with the proposed method and with the original, unmodified DERS software. Based on views A and B and their estimated disparity maps, view V, positioned between views A and B, was synthesized. The exact view numbers used for each test sequence are provided in Table I. The quality of the estimated disparity maps for views A and B is measured as the quality of the rendered view V, expressed as the PSNR of luminance in comparison with view V captured by a real camera placed at the same spatial position (see Fig. 4).
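The quality measure described above can be sketched as follows. This is a minimal illustration assuming 8-bit luminance samples (peak value 255) stored as lists of rows. The function name and the sample data are hypothetical.

```python
import math


def psnr_luma(reference, synthesized, peak=255):
    """PSNR in dB between two luminance images given as lists of rows."""
    squared_errors = [(r - s) ** 2
                      for row_r, row_s in zip(reference, synthesized)
                      for r, s in zip(row_r, row_s)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 2x2 luminance patches: captured view V and synthesized view V
captured = [[100, 100], [100, 100]]
rendered = [[100, 102], [98, 100]]

print(round(psnr_luma(captured, rendered), 2))  # mean squared error is 2
```

A higher PSNR of the synthesized view indicates that the disparity maps used for the synthesis were more accurate.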
This methodology is compliant with the experimental methodology developed and approved by the MPEG committee of the International Organization for Standardization and is used by other research institutes targeting high-quality 3D television, e.g. for autostereoscopic displays.
In the course of the evaluation, disparity maps were estimated for every frame of the sequences (mostly 250 frames per view). This allowed us to evaluate our algorithm on a wide range of different images. The disparity estimation was done with pixel, half-pixel and quarter-pixel precision. Also, a wide range of regularization terms used in the Graph Cut algorithm has been evaluated. In DERS the regularization is controlled by the so-called smoothing coefficient. In the experiments, the range of 1 to 4 was explored.
We have also evaluated the estimated disparity maps directly. For that purpose, we have modified the DERS algorithm to directly output raw disparity maps in the format required by the Middlebury evaluation webpage [30]. Because both the proposed method and the DERS algorithm are designed to work with three input images, we have extended the standard stereo pair with a third image, as specified in Table II.

VI. RESULTS
The comparison of the quality of disparity maps estimated with the proposed method and with the original DERS can be found in Fig. 7a-7d. As can be noticed, the smoothing coefficient can have a significant impact on the quality of disparity maps estimated by DERS. It can be expected that in a real-world scenario this parameter will be controlled automatically to provide the best results. Therefore, in the summary in Table III, we present only the best-performing cases. Depending on the case, the proposed occlusion handling brings a gain of 0.02-2.50 dB in luminance PSNR of the synthesized view relative to the original, unmodified DERS. On average, the proposal provides an improvement of 1.26 dB for pixel-precise disparity estimation, 1.23 dB for half-pixel-precise disparity estimation, and 1.18 dB for quarter-pixel-precise disparity estimation.
The application of the proposed occlusion handling to the Middlebury images results in an improvement of 0.2 in the bad-pixel measure (Table IV). Please keep in mind that the Middlebury datasets contain very few occlusions.
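A bad-pixel measure of the kind used in stereo benchmarks can be sketched as the percentage of pixels whose disparity error exceeds a threshold. This is a minimal illustration: the threshold of 1.0 and the function name are assumptions, and region masks (e.g. evaluating only non-occluded pixels) are not modeled.

```python
def bad_pixel_percentage(estimated, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds threshold."""
    total = 0
    bad = 0
    for row_e, row_g in zip(estimated, ground_truth):
        for e, g in zip(row_e, row_g):
            total += 1
            if abs(e - g) > threshold:
                bad += 1
    return 100.0 * bad / total


# Hypothetical one-row disparity maps
estimated = [[10, 12, 20, 20]]
ground_truth = [[10, 10, 20, 25]]

print(bad_pixel_percentage(estimated, ground_truth))  # 50.0
```

Two of the four pixels differ from the ground truth by more than the threshold, hence the result of 50 percent.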

VII. CONCLUSION
We have presented a novel approach to occlusion handling in disparity estimation, based on a modification of the similarity cost function. The proposed approach has been tested in the three-view disparity estimation scenario. For occlusion detection, synthesized disparity maps of the left and right views have been used.
For well-known multiview video test sequences, the experimental results show that the proposed approach provides a virtual view quality improvement of 1.25 dB in luminance PSNR over the state-of-the-art technique implemented in MPEG Depth Estimation Reference Software (DERS). Moreover, direct quality evaluation of the estimated disparity reveals that the proposed approach reduces the number of bad pixels by 1.26 p.p.