Mixed reality depth contour occlusion using binocular similarity matching and three-dimensional contour optimisation

Mixed reality applications often require virtual objects that are partly occluded by real objects. However, previous research and commercial products have limitations in terms of performance and efficiency. To address these challenges, we propose a novel depth contour occlusion (DCO) algorithm. The proposed method is based on the sensitivity of contour occlusion and a binocular stereoscopic vision device. In this method, a sparse depth map obtained from a two-stage adaptive filter area stereo matching algorithm is combined with the depth contour map of the objects extracted by a digital image stabilisation optical flow method. We also propose a quadratic optimisation model with three constraints to generate an accurate dense depth map along the depth contour for high-quality real-virtual occlusion. The whole process is accelerated by the GPU. To evaluate the effectiveness of the algorithm, we present a statistical analysis of the time consumed by each stage of the DCO algorithm. To verify the reliability of the real-virtual occlusion effect, we conduct an experimental analysis on single-sided, enclosed, and complex occlusions, and we compare the proposed method with an occlusion method without quadratic optimisation. With our GPU implementation for real-time DCO, the evaluation indicates that the presented DCO algorithm enhances both the real-time performance and the visual quality of real-virtual occlusion.


Introduction
As an extension of augmented reality (AR) and virtual reality (VR), mixed reality (MR) (Milgram et al. 1995) will play an increasingly important role in various fields in the future. Real-virtual occlusion is one of the basic requirements to break the boundary between virtuality and reality. In contrast, current MR technology merely superimposes virtual objects on the real scene by applying a three-dimensional (3D) registration and tracking method, and real-virtual occlusion remains a largely unsolved problem. Some commercial products considerably simplify the complexity of the depth algorithm and improve the real-time performance of the real-virtual interaction algorithm; however, the high cost of professional devices prevents such products from becoming popular. Another adverse factor is that even efficient real-virtual occlusion algorithms that do not rely on professional devices cannot guarantee both accuracy and efficiency. Consider Google ARCore and Apple ARKit as examples: ARKit exploits machine learning algorithms to segment the silhouette of each character in a video sequence frame and subsequently renders the background, characters, and virtual objects according to the depth information, so that a character can be positioned in front of the virtual objects.
ARCore demonstrates its core functions through calibration, tracking, and consistent illumination. More recently, ARCore has also been able to handle some real-virtual occlusion through its Depth API.
However, many real-virtual occlusion challenges remain to be solved. Ideally, real-virtual occlusion in MR should satisfy the following requirements: (1) real time, because MR is a real-time interactive scene; (2) contour sensitivity, because the occlusion contour of the occluded virtual object requires greater accuracy than the depth itself: the depth information only provides the object's distance from the viewpoint, whereas the contour must closely fit the contour of real-world objects; (3) portability, so that the proposed real-virtual occlusion algorithm can be implemented in a variety of environments.
In this paper, we propose a novel depth contour occlusion (DCO) algorithm. In particular, it combines the sparse depth map obtained from a two-stage adaptive filter area stereo matching algorithm, the 3D depth contour of the objects, and a quadratic optimisation model with three constraints to generate an accurate dense depth map along the depth contour. Finally, we realised the DCO algorithm to solve the real-virtual occlusion problem in MR.
The technical contributions of this study can be summarised as follows.
(1) We propose a novel binocular stereo matching algorithm, based on a two-stage adaptive filter area and a binocular camera, to generate a sparse depth map.
(2) We combine a bidirectional optical flow field algorithm and a quadratic optimisation model to generate accurate depth contours for high-quality real-virtual occlusion.
(3) We improve the efficiency of the real-virtual occlusion contour with GPU-based acceleration.

This paper is organised as follows. Section 2 presents an overview of the related work. Section 3 provides an overview of the proposed method.
Section 4 introduces a two-stage adaptive absolute difference (AD)-census stereo matching method to generate a sparse depth map. Section 5 details the depth contour extraction based on a digital image stabilisation (DIS) optical flow method. For the optimisation of depth information extraction, Section 6 proposes three constraints to adjust the strength of the depth, smoothness, and stability terms during the quadratic optimisation.
Section 7 introduces the implementation of the DCO algorithm and evaluates the effectiveness of the algorithm and the reliability of the real-virtual occlusion effect. Finally, Section 8 concludes this paper and provides the scope of future work.

Related work
We give a brief overview of the current treatment methods for the real-virtual occlusion in AR/MR, which are mainly divided into four categories, partly according to the survey paper (Xin and Peng 2018): (1) pre-modelling method, (2) 3D reconstruction method, (3) contour-based approaches, and (4) deep learning approaches.
2.1 Pre-modelling method

Breen et al. (1996) and Klein et al. (2004) proposed methods that realise real-virtual occlusion based on pre-modelling. They established corresponding models of real objects in advance to complete the real-virtual interaction, achieving relatively accurate real-virtual occlusion and collision between virtual and real objects. Fischer et al. (2004) deployed medical devices and improved the modelling method for medical volume datasets by extracting their visual hull volume. The resultant visual hull iso-surface, which is significantly simplified, is used for real-time static occlusion handling in their AR system. Although these methods maintain high precision and satisfactory real-time performance, they require pre-modelling, which cannot meet versatility requirements and does not apply to common complex environments.
2.2 Three-dimensional model reconstruction

Fuchs et al. (1994) were the first to attempt to solve the problem of real-virtual occlusion, dealing with the occlusion problem in a video see-through AR system. However, their method relied on a large-scale data capture device; the 3D reconstruction speed and accuracy were therefore too low to meet the requirements of real-time interaction. Wloka et al. (1995) presented a stereoscopic video image matching algorithm, which leverages the change of the vertical coordinate of each pixel to calculate the depth-of-field information for the occlusion processing of a video see-through AR system. However, owing to the insufficient development of stereo matching algorithms and the inadequate hardware of the time, the occlusion effect was not ideal and the algorithm did not meet real-time requirements (three frames per second). Ni et al. (2006) combined a fast sum of absolute differences (SAD) algorithm to build a depth detection system; the SAD algorithm could achieve a real-time processing speed of 30 frames per second. However, problems remained, such as a high matching error rate in low-texture areas and low contour accuracy of the occlusion. Thereafter, research on occlusion based on 3D reconstruction has focused on improving the accuracy and efficiency of occlusion, for example, by performing partial 3D reconstruction of only the areas that may be occluded to improve the processing speed, by using offline 3D reconstruction (Tian et al. 2015) and preloaded static scenes, or by using the semi-global matching (SGM) algorithm to improve the reconstruction accuracy (Guo et al. 2018).
The advantage of the 3D reconstruction algorithm is that it can deal with almost any real environment without collecting information from the real scene in advance (Zheng 2016); in other words, it addresses the real-virtual occlusion problem at its origin. However, the disadvantage of this type of algorithm is that it requires a significant amount of computation, and its accuracy is not as good as that of pre-modelling algorithms in some special situations, such as static scenes.

Berger et al. (1997) were the first to propose a method based on object contour recognition to determine the occlusion relationship, realising a method that can quickly calculate the occlusion mask without 3D reconstruction. Although this method significantly improves the efficiency and accuracy of occlusion processing, its effect heavily depends on the quality of the contour mapping: if the foreground and background colours are similar, the recognition error is large and the occlusion mask cannot be generated accurately.

Contour-based approaches
Feng (2007) modelled the virtual scene hierarchically, realising multilevel real-virtual occlusion, including occlusion for nonrigid bodies.
However, the occlusion effect of characters and the accuracy of occlusion contours need improvement; moreover, the occlusion is not flexible.
A depth-based approach by Schmidt et al. (2002)

Cost calculation
The cost calculation is a pixel-by-pixel similarity measure between the left and right images over the parallax range.
We consider fusing the AD and census measures (Zabih and Woodfill 1994), and aggregate the fused cost over the adaptive filter area N_p (Section 4.2.2). After calculating the aggregated cost of point p under each disparity, a winner-takes-all strategy is adopted, and the disparity with the smallest matching cost is selected as the initial disparity:

$d_p = \arg\min_{d \in [d_{\min}, d_{\max}]} \bar{C}(p, d).$ (3)
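As a concrete illustration of the cost calculation and winner-takes-all selection, the following NumPy sketch fuses an AD term with a census (Hamming-distance) term and picks the minimum-cost disparity per pixel. It is only a minimal stand-in for the paper's method: windows are fixed squares rather than adaptive cross-support areas, the texture-dependent weight is a constant `alpha`, and all function names and parameter values are ours, not the paper's.

```python
import numpy as np

def census_transform(img, win=3):
    """Census descriptor: bit pattern of 'centre > neighbour' tests in a window."""
    r = win // 2
    desc = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            desc = (desc << np.uint64(1)) | (img > shifted).astype(np.uint64)
    return desc

def ad_census_cost(left, right, d, alpha=0.5, lam_ad=10.0, lam_census=30.0):
    """Fused AD + census matching cost for one disparity hypothesis d (q = p - d)."""
    right_shifted = np.roll(right, d, axis=1)
    c_ad = np.abs(left.astype(np.float32) - right_shifted.astype(np.float32))
    cl, cr = census_transform(left), census_transform(right_shifted)
    # Hamming distance between the two census descriptors
    c_census = np.array([bin(int(x)).count("1") for x in (cl ^ cr).ravel()],
                        dtype=np.float32).reshape(left.shape)
    return (alpha * (1 - np.exp(-c_ad / lam_ad))
            + (1 - alpha) * (1 - np.exp(-c_census / lam_census)))

def wta_disparity(left, right, d_max):
    """Winner-takes-all: the disparity with the smallest cost wins per pixel."""
    costs = np.stack([ad_census_cost(left, right, d) for d in range(d_max)])
    return np.argmin(costs, axis=0)
```

On a synthetic pair where the right image is the left image shifted by three pixels, the recovered disparity is 3 over most of the image.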

Parallax optimisation
In general, there are many mismatched areas in the initial parallax map, such as occlusion areas, image edge distortion, misjudged outlier points, and noise interference. In our study, the parallax information in the neighbourhood of each pixel is used for the statistical optimisation of that pixel. Our optimisation selects the parallax value with the highest statistical frequency in the neighbourhood of pixel p as the parallax value of point p, namely:

$d_p^{*} = \arg\max_{d} h(d, N_p),$ (4)

where $h(d, N_p)$ is the statistical frequency of parallax d in the neighbourhood $N_p$ of point p.
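The statistical optimisation of Eq. (4), applied iteratively, amounts to a neighbourhood mode filter on the disparity map. A minimal sketch follows; the neighbourhood radius and iteration count are assumed parameters, and the per-pixel Python loop stands in for the paper's GPU implementation.

```python
import numpy as np

def mode_filter_disparity(disp, radius=2, iters=2):
    """Replace each disparity by the most frequent value in its
    (2*radius+1)^2 neighbourhood; iterating removes outliers (Eq. 4)."""
    h, w = disp.shape
    out = disp.copy()
    for _ in range(iters):
        src = out.copy()
        for y in range(h):
            for x in range(w):
                y0, y1 = max(0, y - radius), min(h, y + radius + 1)
                x0, x1 = max(0, x - radius), min(w, x + radius + 1)
                patch = src[y0:y1, x0:x1].ravel()
                vals, counts = np.unique(patch, return_counts=True)
                out[y, x] = vals[np.argmax(counts)]
    return out
```

A single outlier surrounded by a consistent disparity is voted away by its neighbourhood.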
The statistically optimal parallax can be obtained using Eq. (4), which can be applied over multiple iterations to eliminate as many mismatches as possible. In this way, outlier points in the parallax map whose parallax values clearly lie at the margin are removed.

In the proposed DCO algorithm, the accuracy of the occlusion contour depends on the accuracy of the depth contour. Therefore, this section extracts the depth contour of the 3D boundary of the object; that is, it finds the contour edges where the object may be occluded, filters the contour information of the whole image, and retains the 3D contour information that may be occluded.
In order to filter out planar texture without 3D features and retain the depth profile on the 3D edges, the depth profile filter needs to be calculated first; to obtain it, we compute the gradient amplitude of the optical flow field. After obtaining the optical flow field I, we convert the optical flow field data to polar coordinates, as shown in Eq. (6):

$(r_p, \theta_p) = \left(\sqrt{u_p^2 + v_p^2},\ \arctan\frac{v_p}{u_p}\right), \quad p \in N_{flow},$ (6)

where $N_{flow}$ is a plane coordinate space with a horizontal upper limit of $u_{size}$ and a vertical upper limit of $v_{size}$. Because the scalar $r_p$ reflects the speed of the pixel at point p, $r_p$ is used to calculate the motion change rate of adjacent pixels along the x-coordinate u and the y-coordinate v, as follows:

$\Delta r_u(p) = |r_{p+(1,0)} - r_p|, \quad \Delta r_v(p) = |r_{p+(0,1)} - r_p|.$ (7)

The amplitude of the pixel motion change rate in the horizontal and vertical directions in the optical flow field can be calculated by Eqs. (6) and (7), and the maximum change of r in the two directions is taken as the final gradient amplitude. Hence, it constitutes the complete gradient amplitude matrix M:

$M(p) = \max(\Delta r_u(p), \Delta r_v(p)).$

After visualising the matrix, the effect is illustrated in the second row of Fig. 4. The result indicates that the object parts that may have a depth contour are extracted.
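The gradient amplitude computation above can be sketched in a few lines of NumPy; forward differences stand in for the adjacent-pixel change rate, and the edge-replication boundary handling is our simplification.

```python
import numpy as np

def gradient_amplitude(flow):
    """flow: (H, W, 2) optical flow (u, v). Convert to the polar magnitude r,
    then take, per pixel, the larger change of r towards the next pixel in the
    horizontal and vertical directions as the gradient amplitude M."""
    r = np.hypot(flow[..., 0], flow[..., 1])             # polar radius |flow|
    dr_u = np.abs(np.diff(r, axis=1, append=r[:, -1:]))  # horizontal change
    dr_v = np.abs(np.diff(r, axis=0, append=r[-1:, :]))  # vertical change
    return np.maximum(dr_u, dr_v)                        # amplitude matrix M
```

A flow field that is zero on one side and constant on the other yields a high amplitude exactly along the motion boundary, which is the behaviour the depth contour filter relies on.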

Bidirectional amplitude fusion
To eliminate the unreliable part of the data and retain the reliable part, we propose a fused gradient amplitude method. It can be observed that the data of Ipast and Ifuture are exactly complementary across the three input frames; in other words, an occluded part in one matrix corresponds to a more reliable part in the other.

To extract the reliable data, we start from the optical flow data, establishing a mathematical model that captures this internal regularity. First, consider a certain spatial point in the previous frame: if it is not occluded in the future frame, the optical flow data around its projected plane coordinates are relatively reliable. Conversely, if a spatial point of the past frame is occluded in the future frame, the data there are generally unreliable.
After expanding and simplifying, we obtain:

$r(p_i) = \langle I(p_i), \mathbf{e}_p \rangle,$

where $\mathbf{e}_p$ is the unit vector corresponding to the optical flow vector of the pixel at point p. If the object in Fig. 5 moves to the right and the area where point p is located is no longer occluded, the pixel motion intensity at point $p_1$ in the positive direction of the motion vector will be higher than that at point $p_0$ in the reverse direction of the motion vector. Meanwhile, point p′ is in the occluded area; consequently, the pixel motion intensity at point $p'_1$ in the positive direction of the corresponding motion vector is lower than that at point $p'_0$ in the reverse direction. Based on these observations, suppose that points p and p′ are located at the same coordinates in Mpast and Mfuture, respectively, and let:

$r_{past} = r(p_1) - r(p_0), \quad r_{future} = r(p'_1) - r(p'_0).$ (11)

Irrespective of the direction in which point p moves, its relationship with r in the occluded area is constant; hence, the acquisition constraint for each element of the bidirectional amplitude fusion matrix can be obtained. It can be observed that a depth contour filter with high confidence is thereby extracted. However, the high-confidence area in the filter is often not large enough to cover the entire depth contour.
We use box filtering for Mfuse to expand the confidence range to ensure that it covers the entire depth contour filter, as shown in the third row of Fig. 4 (right).
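The bidirectional fusion of Eq. (12) followed by the confidence-expanding box filter can be sketched as follows; `uniform_filter` from SciPy stands in for the box filter, and the kernel size is an assumed parameter.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fuse_and_expand(m_past, m_future, r_past, r_future, ksize=5):
    """Eq. (12): at each pixel keep the amplitude from the flow field whose
    projected motion intensity r is higher (the unoccluded, reliable side),
    then box-filter so the confident response covers the whole contour."""
    m_fuse = np.where(r_past > r_future, m_past, m_future)
    return m_fuse, uniform_filter(m_fuse, size=ksize, mode="nearest")
```

Where the past-field intensity dominates, the past amplitude is kept; elsewhere the future amplitude is used, and the box filter then dilates the confident region.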

Depth contour extraction
In our method, we deploy the depth contour filter in the dual threshold detection stage to filter the texture that is not at the 3D contour and retain the depth contour.

Image contour extraction
As the most intuitive and easy-to-extract structural information, image texture edges are the basis for obtaining high-level image information.
However, the Canny edges (Canny 1986) obtained from these data alone are rough; for this reason, we perform non-maximum suppression (Neubeck and Gool 2006) to improve the contour extraction.
After non-maximum suppression, the processed grey gradient data must be converted into image contours. Here, dual thresholds are exploited to detect and connect the edge contours. First, we set a strong threshold Thigh and a weak threshold Tlow: points whose gradient intensity is less than Tlow are discarded, and points greater than Thigh are marked as contours. After this coarse filtering, some points remain between Thigh and Tlow; these may belong either to the contour or to the non-contour part. Here, eight-connected regions are exploited: the points that are connected to pixels above Thigh are kept and marked as edge points, and the rest are marked as non-edge points.
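The dual-threshold linking step can be sketched with 8-connected component labelling; `scipy.ndimage.label` stands in for the connected-region search, and the threshold values below are illustrative.

```python
import numpy as np
from scipy.ndimage import label

def hysteresis(grad, t_low, t_high):
    """Dual-threshold edge linking: weak points (between t_low and t_high)
    survive only if their 8-connected component contains a strong point."""
    strong = grad >= t_high
    weak = grad >= t_low                      # candidates (includes strong)
    lbl, n = label(weak, structure=np.ones((3, 3)))   # 8-connectivity
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(lbl[strong])] = True       # components touching a strong seed
    return keep[lbl] & weak
```

A weak segment attached to a strong seed is kept as an edge, while an isolated weak point of the same intensity is discarded.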

Non-depth texture contour filtering
To extract the final depth contour, the texture contours that are not on a 3D edge must be filtered out using the depth contour filter, and only the contours with depth discontinuities retained.

The first constraint is the depth constraint, which keeps the optimised depth close to the measured sparse depth:

$E_{depth} = \sum_p w_{sparse}(p)\,\big(D(p) - D_{sparse}(p)\big)^2,$

where $D_{sparse}(p)$ is the depth information of the discrete depth map at point p, and the value of $w_{sparse}$ satisfies:

$w_{sparse}(p) = \begin{cases} 1, & D_{sparse}(p)\ \text{exists}, \\ 0, & \text{otherwise}. \end{cases}$

In summary, if a point is in the missing-information area of the sparse depth map, it makes no cost contribution; otherwise, the constraint is valid and included in the cost.
The second constraint needs to consider the discontinuity of the depth information at depth edges and the smoothness of the depth information at non-depth edges. This study uses a smoothness constraint to set the depth contour information that participates in the optimisation and to generate more accurate 3D depth contour information; the smoothness constraint is expressed as follows:

$E_{smooth} = \sum_p \sum_{q \in N_4(p)} w_{pq}\,\big(D(p) - D(q)\big)^2,$

where point q is a point in the 4-neighbourhood of point p. The weight $w_{pq}$ is derived as follows: the Mintensity involved in the Canny contour extraction is normalised to the interval [0, 1] to obtain the matrix MI, and $s_p$ and $s_q$ are then defined from Mfuse, MI, and Bdp, where Bdp is the binary image of the depth contour.

If either point p or point q is on the depth contour, the depth value at point p is not smoothed; hence, the depth-discontinuous area is aligned with the depth contour. Otherwise, neither of the points is on the depth contour (or both belong to it), and the depth information of the two points is smoothed. In that case, if Mfuse or MI is low, indicating that the two points are in a weak texture area or in a texture area away from depth edges, the value of $w_{pq}$ is larger and the smoothness constraint is stronger.
The third constraint is the stability constraint. In the quadratic optimisation, the sparse depth information of the previous frame is included in the calculation to make the propagation of depth information more stable. The stability constraint is as follows:

$E_{stable} = \sum_p w_{stable}(p)\,\big(D(p) - D_{pre}(p)\big)^2,$

where $D_{pre}(p)$ is the depth information at point p from the previous frame, and $w_{stable}$ satisfies:

$w_{stable}(p) = \begin{cases} 1, & D_{pre}(p)\ \text{exists}, \\ 0, & \text{otherwise}. \end{cases}$

Similar to the first constraint, if there is no depth information at point p in the previous frame, the point makes no contribution; otherwise, it provides a constraint. After determining the constraints, we can define the quadratic optimisation formula:

$E = \lambda_d E_{depth} + \lambda_s E_{smooth} + \lambda_{s2} E_{stable},$

where $\lambda_d$ is the depth constraint balance coefficient, $\lambda_s$ is the smoothness constraint balance coefficient, and $\lambda_{s2}$ is the stability constraint balance coefficient.

From Table 1, it can be noticed that for scenes with sparse textures, the efficiency of the algorithm is lower than for scenes with dense textures. This happens because the cost aggregation time is directly related to the size of the adaptive filter area: the sparser the texture, the larger the adaptive filter area, so constructing the adaptive filter area and performing the matching require more time. In general, the efficiency was significantly improved when the CUDA platform was adopted for the depth map optimisation stage. The results are demonstrated in Fig. 9. From the experimental comparison, it can be inferred that the DCO real-virtual occlusion additionally obtains correct occlusion results in more complicated occlusion situations, whereas real-virtual occlusion that deploys only two-stage adaptive AD-census stereo matching produces not only inaccurate contours but also false occlusions. This is because, in the AD-census-only method, the outlier points are not eliminated during depth map construction. Thereby, compared with the proposed DCO method, the depth data generated by the AD-census-only method contain anomalies, which cause abnormal occlusion results.
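Since all three constraints are quadratic in D, minimising E reduces to one sparse, symmetric positive-definite linear system. The sketch below assembles and solves the normal equations with SciPy; as a simplification, the per-edge smoothness weight is taken from pixel p only (the paper's $w_{pq}$ combines both endpoints), the dense Python loop replaces the GPU solver, and all names and default coefficients are ours.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def propagate_depth(d_sparse, w_sparse, d_pre, w_pre, w_smooth,
                    lam_d=1.0, lam_s=0.5, lam_s2=0.25):
    """Minimise  lam_d*E_depth + lam_s*E_smooth + lam_s2*E_stable
    by solving the normal equations A D = b of the quadratic energy."""
    h, w = d_sparse.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    A = lil_matrix((n, n))
    b = np.zeros(n)
    # depth and stability terms only touch the diagonal
    A.setdiag(lam_d * w_sparse.ravel() + lam_s2 * w_pre.ravel())
    b += lam_d * (w_sparse * d_sparse).ravel() + lam_s2 * (w_pre * d_pre).ravel()
    # smoothness term over right and down 4-neighbour edges (Laplacian-like)
    for (dy, dx) in [(0, 1), (1, 0)]:
        p = idx[: h - dy, : w - dx].ravel()
        q = idx[dy:, dx:].ravel()
        wpq = lam_s * w_smooth[: h - dy, : w - dx].ravel()
        for i, j, wk in zip(p, q, wpq):
            A[i, i] += wk; A[j, j] += wk
            A[i, j] -= wk; A[j, i] -= wk
    return spsolve(A.tocsr(), b).reshape(h, w)
```

With only two pixels pinned by the sparse depth and uniform smoothness weights, the solution propagates the pinned value across the whole grid, which is exactly the sparse-to-dense behaviour the optimisation is designed for.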

Conclusion
In this paper, we proposed a DCO algorithm to handle real-virtual occlusion. In our implementation, if the depth value of a virtual object pixel is greater than the depth value of the corresponding image pixel, it is discarded during the rendering of the virtual object pixels. In addition, the left and right images of the binocular fisheye camera are calibrated and stereo-rectified to achieve the epipolar constraint of binocular matching: the un-distortion and rectification transformation maps are computed with the OpenCV built-in de-distortion functions in the initialisation phase, and CUDA acceleration is applied at the image cropping and remapping stage of each frame.
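The per-pixel depth test used when rendering the virtual object can be sketched as a simple compositing step; this NumPy version is an illustration of the discard rule only (a real renderer would perform it in the fragment stage), and the function name is ours.

```python
import numpy as np

def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel depth test: a virtual pixel whose depth is greater than the
    real scene's depth at that pixel is occluded by the real object and
    discarded; zero virtual depth means 'no virtual fragment here'."""
    virt_visible = (virt_depth > 0) & (virt_depth < real_depth)
    out = real_rgb.copy()
    out[virt_visible] = virt_rgb[virt_visible]
    return out
```

Virtual fragments nearer than the real surface overwrite the image; fragments behind it leave the real pixels untouched, producing the occlusion effect.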

Fig. 1
Fig. 1 Overall architecture of the depth contour occlusion algorithm

Fig. 2
Fig. 2 Process of generating the smooth adaptive cross-support window

We use a smooth adaptive cross-support window method, called AD-census adaptive measurement (Li and Wang 2020), to construct the adaptive function by exploiting the shortest arm length of the texture-based cross-support window. The AD-census adaptive measurement function for calculating the matching cost of point p and point q is:

$C(p, d) = \alpha\left(1 - e^{-C_{AD}(p,d)/\lambda_{AD}}\right) + (1 - \alpha)\left(1 - e^{-C_{census}(p,d)/\lambda_{census}}\right),$ (1)

where $\lambda_{AD}$ and $\lambda_{census}$ are the regularisation parameters for the two basic measures, and p is the point to be matched in the left image of the binocular visual image. The pixel with parallax d in the horizontal direction in the right image is q = p − d. Moreover, α is the weight adjusting the contribution of the two measures, which is based on the shortest arm length $L_{min}$, the edge control parameter $\gamma_L$, and the correction parameter ε. In other words, when the shortest arm length becomes longer, the texture of the current region becomes weaker and the weight α smaller: the cost contribution of the AD measure is reduced, whereas that of the census measure is increased, achieving adaptive cost calculation according to the texture.

4.2.2 Cost aggregation

In the cost aggregation stage, the adaptive filtering area is leveraged for cost aggregation. Assuming that p is a pixel in the left view and $D_p$ is the pixel set in the corresponding adaptive filter area, the matching cost of pixel p within the parallax range d ∈ [dmin, dmax] can finally be expressed as

$\bar{C}(p, d) = \frac{1}{N} \sum_{q \in D_p} C(q, d),$ (2)

where N is the number of pixels in the adaptive filter area.
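Eq. (2) averages the per-pixel cost over the support region of every disparity slice. The sketch below uses a fixed square window as a stand-in for the adaptive filter area, purely to illustrate the aggregation step; the function name and window size are ours.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def aggregate_cost(cost_volume, win=5):
    """Average each disparity slice of a (D, H, W) cost volume over a
    win x win support region -- a fixed-window stand-in for Eq. (2)."""
    return np.stack([uniform_filter(c, size=win, mode="nearest")
                     for c in cost_volume])
```

A slice with constant cost stays constant after aggregation, while noisy slices are smoothed towards their local mean.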

4.4 Construction of sparse depth map

This subsection describes the conversion of the parallax information into depth information. Because the original image was reduced to a quarter of its original size in the earlier calculation of parallax, all the parallax data must be doubled before deploying the parameters provided by the binocular module for depth calculation. Fig. 3 depicts some results of sparse depth map construction on a standard dataset (Middlebury 2014).
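The disparity-to-depth conversion follows the standard triangulation relation Z = f·B/d; a minimal sketch, with the doubling of the quarter-resolution disparities built in and the focal length and baseline as assumed calibration inputs:

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline_m, scale=2.0):
    """Z = f * B / d. Matching ran on images reduced to a quarter of the
    original size, so raw disparities are doubled (scale=2) before
    triangulation; zero disparity is kept as zero (unknown) depth."""
    d = np.asarray(disp, dtype=np.float32) * scale
    depth = np.zeros_like(d)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```

For instance, a raw disparity of 4 with a 700-pixel focal length and a 0.12 m baseline yields a depth of 700 × 0.12 / 8 = 10.5 m.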

Fig. 3
Fig. 3 Some sparse depth map results on the Middlebury 2014 dataset

The gradient amplitude needs to be calculated to extract the occlusion area when the three-dimensional object moves. Then, according to the confidence analysis of the optical flow field data, the gradient amplitude regions with the highest confidence in the two opposite flow fields are combined and normalised to extract the three-dimensional depth profile. Fig. 4 illustrates the whole process of the DIS-based depth contour filter extraction. The following subsections introduce the bidirectional optical flow field and gradient amplitude calculation methods, the bidirectional amplitude fusion, and the depth contour extraction methods, respectively.

Fig. 4
Fig. 4 Depth contour filter extraction based on DIS optical flow method

Fig. 5
Fig. 5 Confidence analysis of optical flow data

As depicted in Fig. 5, an object moves to the right in the next frame relative to the previous frame. Now, consider points p and p′: point p is located near the left edge of the object, and point p′ near the right edge. Taking the two points as midpoints and the pixel motion direction as the positive direction, we create the 2D vectors $\overrightarrow{p_0 p_1}$ and $\overrightarrow{p'_0 p'_1}$. Subsequently, we find the projection values $r_0$, $r_1$, $r'_0$, and $r'_1$ of the motion vectors corresponding to points $p_0$, $p_1$, $p'_0$, and $p'_1$, respectively:

$r(p_i) = \operatorname{proj}\left(I(p_i), \overrightarrow{p_0 p_1}\right).$
$M_{fuse}(p) = \begin{cases} M_{past}(p), & r_{past} > r_{future}, \\ M_{future}(p), & r_{future} > r_{past}. \end{cases}$ (12)

Essentially, on the one hand, the generation of the depth contour follows the law of complementary motion between the intermediate frame and the front and back frames; on the other hand, it depends on the relationship between the high-confidence area of the optical flow data and the occlusion area. The visualisation of matrix Mfuse after the aforementioned bidirectional amplitude fusion is shown in the third row of Fig. 4 (left).

Fig. 6
Fig. 6 Results of the contours generated by Canny and the depth contours after texture filtering

Fig. 7
Fig. 7 Comparison results of depth information extraction

Fig. 9
Fig. 9 Comparison results of single-sided occlusion

The proposed method was based on the sensitivity of contour occlusion and a binocular stereoscopic vision device. First, we reduced the size of the input images to a quarter of the original and implemented a two-stage adaptive filter area stereo matching algorithm, AD-census stereo matching, to establish a sparse depth map; the purpose of this stage was to improve efficiency and obtain more accurate depth information. Second, the DIS optical flow method was used to extract the depth contour of the real object. Third, we proposed three constraints to adjust the strength of the depth, smoothness, and stability terms during the quadratic optimisation that propagates the sparse depth map to a dense depth map. Finally, we implemented the DCO algorithm and evaluated the effectiveness and reliability of the real-virtual occlusion effect. Through experimental comparisons, we showed that our method has satisfactory stability, real-time performance, and effectiveness. Taken together, these results suggest that the DCO algorithm could be further tested and developed in more MR applications. Future work will address the limitations of the proposed method in the following ways: (1) we may combine the SLAM algorithm and depth-point re-projection to build a more efficient computing model; (2) artificial intelligence algorithms can be leveraged to optimise depth information extraction and depth contour extraction, with a deep neural network model applied. Since deep learning methods need to be trained and tested on a dataset of sufficient size, more data should first be collected with the binocular stereoscopic vision device.


Table 1
Time consumption comparison of stereo matching process (ms)

Table 2
Time consumption of depth contour extraction algorithm (ms)

Table 3
Time consumption of each stage of the DCO algorithm (ms)

The virtual object used in the experiment was a cube, simulating the comparison of single-sided occlusion by human hands.