Robust Local Light Field Synthesis via Occlusion-aware Sampling and Deep Visual Feature Fusion

Novel view synthesis has attracted tremendous research attention recently for its applications in virtual reality and immersive telepresence. Rendering a locally immersive light field (LF) based on arbitrary large baseline RGB references is a challenging problem that lacks efficient solutions with existing novel view synthesis techniques. In this work, we aim at faithfully rendering local immersive novel views/LF images based on large baseline LF captures and a single RGB image in the target view. To fully exploit the precious information from source LF captures, we propose a novel occlusion-aware source sampler (OSS) module which efficiently transfers the pixels of source views to the target view′s frustum in an occlusion-aware manner. An attention-based deep visual fusion module is proposed to fuse the revealed occluded background content with a preliminary LF into a final refined LF. The proposed source sampling and fusion mechanism not only provides information for occluded regions from varying observation angles, but also effectively enhances the visual rendering quality. Experimental results show that our proposed method is able to render high-quality LF images/novel views with sparse RGB references and outperforms state-of-the-art LF rendering and novel view synthesis methods.


Introduction
One critical assumption for successful novel view synthesis is accurate depth/disparity estimation, which is also a fundamental application of local light field (LF) imaging. In this paper, we propose a novel LF synthesis algorithm in a global multi-view stereo framework that can take large baseline input reference LFs to estimate accurate depth in the novel/target view. The estimated depth in the novel/target view is then used for warping the target view image into a novel LF.
In order to predict an accurate disparity in the target view, we fully exploit the advantages of LF for accurate disparity estimation. We first calculate a disparity probability volume (DPV) from the slope of lines in epipolar-plane images (EPIs) [1] in each source LF. Then, we fuse these source DPVs into the target view′s camera frustum. However, the source DPVs are at different scales and cannot be fused directly, so we use the DPV rescaling and fusion methods in [2] to align them before fusion. After the fusion process, a novel DPV in the target view is generated, from which an accurate disparity map can be estimated. The rationale for fusing DPVs rather than directly using depth projections is that DPVs contain the probability of a spatial point being occupied, which is much more indicative than single-plane depth when fused across multiple views.
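As a toy illustration of why fusing probability volumes works better than fusing single-plane depth maps, the sketch below averages two per-view DPVs and extracts a probability-weighted disparity. This is a NumPy stand-in for the rescaling-and-fusion pipeline of [2]; the plane-averaging and soft-argmax choices here are our simplification, not the paper's exact operators.

```python
import numpy as np

def fuse_dpvs(dpvs):
    """Fuse per-view disparity probability volumes (DPVs), each of shape
    (D, H, W) with probabilities over D disparity planes, by averaging the
    per-plane probabilities and re-normalizing."""
    fused = np.mean(np.stack(dpvs), axis=0)
    fused /= fused.sum(axis=0, keepdims=True)
    return fused

def disparity_from_dpv(dpv, planes):
    """Probability-weighted (soft-argmax) disparity map from a DPV."""
    return np.tensordot(planes, dpv, axes=(0, 0))  # shape (H, W)

# toy example: two views, 4 disparity planes, a 1x1 image
planes = np.array([0.0, 1.0, 2.0, 3.0])
view_a = np.array([0.1, 0.7, 0.1, 0.1]).reshape(4, 1, 1)  # confident at d=1
view_b = np.array([0.1, 0.6, 0.2, 0.1]).reshape(4, 1, 1)  # agrees on d=1
fused = fuse_dpvs([view_a, view_b])
disp = disparity_from_dpv(fused, planes)
```

Because each view carries a full distribution over planes, agreement between views sharpens the fused estimate instead of averaging away conflicting single depth values.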
Then, a preliminary LF is synthesized by backward warping pixels of the target view image according to the estimated disparity map. To fully exploit the known valuable color information from multiple input views, occlusion-aware source sampler (OSS) and deep visual fusion (DVF) modules are proposed. The OSS module takes plane sweep volumes (PSVs) [3] as input and locates depth planes with maximum confidence for the image patches in a PSV. A global background image is composed by the inverse-over composition of pixels from further depth layers of the PSV. The global background image is then warped into a background LF, which is fused with the preliminary LF to enhance visual quality by eliminating image noise and recovering the occluded contents. The fused LF is then fed into a spatial-angular regularisation module to improve spatial-angular consistency and visual quality. As shown in the evaluation results, our method outperforms other state-of-the-art novel view synthesis methods on the Stanford Lytro multi-view light field dataset (MVLF) [4], which contains challenging large baseline and discrete views in each scene. The proposed OSS and DVF modules can greatly improve the visual quality by suppressing noise and revealing occluded contents.
Our contributions are summarised as follows: 1) The OSS module is proposed to sample visual clues from source LFs in the depth planes of a PSV with minimum pixel-matching errors.
2) The DVF module is proposed to fuse the deep visual features of a background LF and a preliminary LF.

Related work
Novel view synthesis techniques usually use depth-based warping operations [5−9] that warp image pixels to produce novel views. Therefore, accurate depth estimation is crucial for accurate warping, and occlusions further complicate the rendering. Learning-based LF/novel view synthesis methods can be classified into three categories according to their input images′ sampling patterns: sparse angular inputs, single RGB input, and small baseline multi-view inputs.

LF synthesis based on sparse angular references
Sparse-input LF synthesis takes a sparse set of sub-aperture images (SAIs) captured within a target LF′s aperture to synthesize novel neighboring SAIs by interpolation or extrapolation. Kalantari et al. [5] introduced the first learning-based LF synthesis solution. Their method takes SAIs in four corners as input to synthesize a 4D LF using two sequential convolution neural networks to estimate disparity and color. However, explicit scene geometry is not a necessary condition for LF synthesis: Zhang et al. [10] proposed a phase-based LF synthesis method from a micro-baseline stereo pair. Yeung et al. [9] reconstructed densely-sampled SAIs from sparsely-sampled SAIs using spatial-angular alternative convolutions to exploit dense spatial and angular clues. The sparse inputs within the LF′s aperture usually require fixed input sub-aperture positions, e.g., four corner views in [5]. So, FlexLF [11] was proposed for LF synthesis with sparse input SAIs in varying aperture positions. The angular correlations among SAIs are revealed by building a cost volume to calculate pixel intensity matching errors. After predicting depth from pixel intensity matching errors, depth discontinuities can help locate edges. Liu et al. [12] proposed an edge-aware painting network to complement the preliminary LF for the LF angular super-resolution task.

Novel view synthesis based on single image as input
Single-input novel view synthesis takes a single RGB image as input to synthesize novel views. In the context of LF, Srinivasan et al. [6] made the first attempt to synthesize an LF from a single image by utilizing the image-based rendering (IBR) technique. However, the IBR methods are constrained to Lambertian surfaces and are unable to handle occlusions effectively. Given the high similarity between sub-aperture views, Ruan et al. [13] used a Wasserstein generative adversarial network (GAN) with a gradient penalty to synthesize complete LF images. Couillaud and Ziou [14] synthesized an LF from a single RGB image and depth map using optical geometry and light ray radiometry. Outside the context of LF, single image view synthesis (SynSin) [15] represents a scene by forming feature point clouds that are rotated and rendered at a novel angle. Shih et al. [16] separated a scene into different floating islands (objects with depth discontinuities around the edges), and the occluded regions around the edges of the floating islands are painted to avoid showing blankness when rendered at novel angles.

Novel view synthesis based on multiview references
Deep learning is an attractive tool for learning scene representations. Volume representation is highly differentiable and can model complex shapes. The multi-plane image (MPI) is a volume-based approach with discrete depth planes that improve efficiency. A recent strand of learning-based research generates MPIs for view synthesis in forward-facing scenes, either with a single image as input [17] or a set of images as input [18−21]. Each input view is expanded into a layered representation that can render a high-quality local LF. Mildenhall et al. [18] proposed local light field fusion (LLFF), which can synthesize dense paths of novel views by blending adjacent layered representations together. In addition to the layered scene representation, the neural radiance fields (NeRF) proposed by Mildenhall et al. [22] learn a continuous volumetric scene function and encode the inward-facing scene into a fully connected deep network. Moreover, Dai et al. [23] transform point clouds into voxels, and the relative positions among voxelized points can be encoded as descriptors, which are learned and updated by gradients back-propagated from multi-plane rendering.
We summarise the existing view synthesis methods in terms of their input sampling requirements and rendering capability in Table 1. Compared with other methods, ours is flexible in dealing with large baseline sparse inputs with various capturing angles rather than requiring fixed or optimal sampling patterns in conventional novel view synthesis methods.

Proposed method
Synthesizing novel views over a wide baseline is challenging and is important for virtual reality systems [28−30]. Accurate depth estimation is one of the most critical assumptions for image warping in novel view synthesis. In order to generate accurate depth in a target view, depth from multiple reference views can be transferred by projections/warping according to camera extrinsic and intrinsic parameters. In our framework shown in Fig. 1, we take LF images as input and estimate the DPV in each source LF. Then, the DPVs are warped to the target view for accurate disparity/depth estimation. A preliminary LF is first synthesized by backward warping pixels of the target view image according to the estimated disparity map. The OSS and deep visual fusion (DVF) modules are proposed to fetch known valuable color information from multiple input views to complement the final rendering. Then, the spatial-angular regularisation module is adopted to improve spatial-angular consistency and visual quality. The fused DPV in the target view is converted to a disparity map by the probability-weighted compositing methods used in [2, 31−33]. We adopt the multi-scale residual fusion module in [2] that combines visual features from the target view image to restore the fine details and surface smoothness of the target view′s disparity map. The refined disparity map in the target view is denoted as D.

DPV estimation
The DPV of an LF can be estimated by comparing the pixel intensities around a given point along different candidate EPI lines, e.g., those shown in Fig. 2. The pixel intensity variance along the slope of an EPI line for a candidate depth d at a query point x is given as (1) in [34]:

σ(x, d) = (1/N) Σu (Iu(x + f·u/d) − Ī(x, d))²

where u is the index in the angular domain, N is the total number of angular views, and f is the focal length. As shown in Fig. 2, the candidate line that produces the least variance of pixel intensity is taken as the best response, and its slope is proportional to the query point′s depth [35].

Fig. 1 The overall pipeline of our proposed method. The disparity probability volumes from source LFs are transferred and fused in a target camera. Then, a raw disparity map is generated by the 3D cost volume regularisation and the RGB-D fusion process proposed by [2]. A preliminary LF is synthesized by backward warping pixels of the target view image. The OSS module is proposed to sample and fetch RGB colors from varying source viewpoints to recover the background. Then the preliminary LF is fused with the recovered background, weighted by fusion attentions produced by the DVF module. The fused LF is further refined by a final spatial-angular regularization module that renders the final outputs.
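The minimum-variance line search described above can be sketched as follows. This is a NumPy toy with a hypothetical slope parameterization and a synthetic single-line EPI; the paper's exact sampling in (1) may differ.

```python
import numpy as np

def epi_variance_cost(epi, x0, slopes):
    """For a query pixel column x0 in an EPI of shape (N, W) (N angular
    views by W spatial samples), compute the intensity variance along each
    candidate slope. The slope with minimum variance corresponds to the
    query point's depth (hypothetical slope parameterization)."""
    n_views, width = epi.shape
    u = np.arange(n_views) - n_views // 2  # angular offsets around center
    costs = []
    for s in slopes:
        xs = np.clip(np.round(x0 + s * u).astype(int), 0, width - 1)
        samples = epi[np.arange(n_views), xs]  # intensities along the line
        costs.append(samples.var())
    return np.array(costs)

# toy EPI: a line of constant intensity with slope 1 through x0 = 4
n, w = 5, 9
epi = np.zeros((n, w))
u = np.arange(n) - n // 2
epi[np.arange(n), 4 + u] = 1.0
costs = epi_variance_cost(epi, 4, slopes=[0.0, 1.0, 2.0])
best = int(np.argmin(costs))  # index of the zero-variance slope
```

The slope with zero variance is selected as the best response, mirroring the argmin over candidate depths in the DPV construction.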

Depth estimation
We leverage the observation from [2] that multi-view disparity probability values are very informative for accurate disparity estimation when fused together in a target viewpoint. The advantage of warping the DPV over directly warping depth values is that the fusion of multi-view depth probabilities generates more accurate and cross-view consistent depth estimations. This is the same general approach taken by other cost-volume-based depth estimation methods [32,36]. In order to warp these DPVs, the scale consistency of the DPVs under multi-view projections is essential. Thus, we adopt the DPV estimation method in [34] and the DPV rescaling and fusion pipeline proposed by [2], which involves rescaling the DPV to a multi-view scale-consistent volume, projecting it into the target camera′s viewing frustum, and fusing the projected volumes into the target view′s DPV.

Preliminary light field synthesis
Our local immersive novel view/LF synthesis starts by synthesizing a preliminary LF Lpre by backward warping pixels from the target view image to novel sub-aperture positions. The preliminary LF is then improved and refined by the proposed OSS and DVF modules to reveal occlusions and eliminate noise.

Generating disparity field
Based on the accurate disparity map D, a cross-view consistent disparity field is estimated. By processing the disparity map with angular up-sampling layers followed by a pseudo-4D spatial-angular separable convolution network (spatial-angular regularization) [7,9,37], the spatial and angular consistency among local rendering instances is implicitly regularised. The spatial-angular regularization has the following two advantages [37]: 1) It is memory-efficient compared with 4D convolutions; 2) It follows the principle of separable filtering in digital signal processing by performing separable 2D spatial and angular convolutions. Specifically, the disparity map is first processed by a channel up-sampling convolution layer that increases its channel number from 1 to the number of angular dimensions C, followed by a ReLU activation function. The up-sampled feature is then reshaped for 2D angular convolutions, and the angular-filtered disparity is reshaped again for 2D spatial convolutions:

Dspa = ReLU(Convspa(Dspa)). (7)

The above 2D spatial and angular convolutions are repeated six times, and each layer′s network parameters are learned separately. The final spatial-angular regularized output is processed by a residual convolutional layer Convres that decreases the number of channels from C to 2 (disparity along the x and y dimensions, respectively).
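A minimal sketch of one pseudo-4D spatial-angular pass is given below. A fixed 3×3 box filter stands in for the learned 2D convolutions, and channels/ReLU are omitted for brevity, so this only demonstrates the reshape-then-filter structure of separable spatial-angular processing, not the trained network.

```python
import numpy as np

def box2d(x):
    """3x3 box filter over the last two axes (zero padding), standing in
    for a learned 2D convolution in this sketch."""
    out = np.zeros_like(x)
    padded = np.pad(x, [(0, 0)] * (x.ndim - 2) + [(1, 1), (1, 1)])
    for dy in range(3):
        for dx in range(3):
            out += padded[..., dy:dy + x.shape[-2], dx:dx + x.shape[-1]]
    return out / 9.0

def spatial_angular_pass(lf):
    """One pseudo-4D spatial-angular pass over an LF of shape (U, V, H, W):
    filter spatially per angular view, then reshape so the angular plane
    becomes the trailing axes and filter angularly per pixel."""
    spa = box2d(lf)                         # 2D spatial filtering
    ang = np.transpose(spa, (2, 3, 0, 1))   # reshape to (H, W, U, V)
    ang = box2d(ang)                        # 2D angular filtering
    return np.transpose(ang, (2, 3, 0, 1))  # back to (U, V, H, W)

lf = np.ones((3, 3, 4, 4))  # toy 3x3 angular grid of 4x4 views
out = spatial_angular_pass(lf)
```

Filtering two 2D planes in alternation keeps memory linear in the filter size, in contrast to a full 4D convolution over (U, V, H, W).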

Disparity based pixel warping
The preliminary LF Lpre is synthesized by backward warping pixels from the input target view image according to the disparity field, where Iv represents the synthesized v-th sub-aperture view image and Dv denotes the disparity map in the v-th sub-aperture view of the target preliminary LF Lpre. Although the disparity field preserves geometry and intensity consistencies among angular views, it is impossible to predict occluded contents from the target view image alone. The image quality based on a single capture is also limited without reference to other source captures. In addition to the scene geometry, visual features from the source LFs are important for improving the rendering quality of the target LF for two reasons. First, occluded regions cannot be correctly rendered based only on the target central view; with references from different observation angles, these occluded visual contents can be located and transferred to the off-center views in the target LF by the proposed OSS module, as illustrated in Fig. 3. Second, a single image capture can be visually noisy; with aligned references from the source captures, the visual quality of the target LF can be greatly improved via the DVF module.
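The backward warping step can be sketched as follows. This NumPy toy uses nearest-neighbor sampling and a hypothetical sign convention for the angular offsets (du, dv); a real implementation would use sub-pixel interpolation.

```python
import numpy as np

def backward_warp(target_img, disparity, du, dv):
    """Backward-warp the target view image into the sub-aperture view at
    angular offset (du, dv): each novel-view pixel samples the target image
    at a position shifted by disparity * offset (nearest-neighbor sampling
    for simplicity; hypothetical sign convention)."""
    h, w = target_img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + du * disparity).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + dv * disparity).astype(int), 0, h - 1)
    return target_img[src_y, src_x]

img = np.arange(16, dtype=float).reshape(4, 4)
disp = np.ones((4, 4))  # uniform disparity of one pixel
view = backward_warp(img, disp, du=1, dv=0)  # shift one view to the right
```

Because every output pixel pulls from the single target image, content hidden behind foreground objects in that image can never be recovered by this step alone, which motivates the OSS module below.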

Plane-sweep volume generation
The PSV was first introduced by [3] to determine pixel correspondences and 3D locations across multiple images. To build a PSV, we first transfer the central views of the source LFs to the target view′s camera frustum. In theory, the source image is swept through the volume of the space along the principal axis of the source camera. In practice, the source image is warped to the target view camera′s frustum via homography warping according to (10):

Hd = Kt (R − τ nᵀ/d) Ks⁻¹

where K and (R, τ) represent the camera intrinsic and extrinsic parameters, respectively; d represents the depth of the sweeping plane; n denotes the unit vector along the principal axis of the camera frustum; and the subscripts t and s denote the indices of the target and source cameras.
For a better illustration, we separately draw the warped image planes from the source view and the target view in Figs. 3(a) and 3(b), respectively. Due to the different angles of the source and target cameras′ principal axes, the aligning direction of the swept image planes warped from the source view differs from that of the target view. Hence, the image planes warped from the source view into the target view′s frustum are oblique in Fig. 3(a). The image planes in the target view sweep along the target view camera′s principal axis, so the target view′s image planes are vertical in Fig. 3(b).
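The per-plane warp used to build the PSV can be sketched with the standard plane-induced homography; the sign conventions and symbol names below may differ from the paper's (10).

```python
import numpy as np

def plane_sweep_homography(K_t, K_s, R, t, n, d):
    """Homography mapping source-image pixels onto the sweeping plane at
    depth d (unit normal n) as seen by the target camera, using the
    standard plane-induced form H = K_t (R - t n^T / d) K_s^{-1}."""
    return K_t @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_s)

K = np.array([[100.0, 0, 50], [0, 100.0, 50], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])   # pure horizontal baseline
n = np.array([0.0, 0.0, 1.0])   # plane normal along the principal axis
H = plane_sweep_homography(K, K, R, t, n, d=2.0)

# a pixel maps with a horizontal shift proportional to baseline / depth
p = np.array([50.0, 50.0, 1.0])
q = H @ p
q /= q[2]  # dehomogenize
```

For this toy setup the principal point shifts by f·tx/d = 100·0.1/2 = 5 pixels, which is exactly the parallax a point on that depth plane would exhibit.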

The cost volume estimation
We decompose the PSV into two parts: Ps, warped from the source view, and Pt, warped from the target view. To transfer pixels from the source view to the target view, finding the pixel correspondences between Ps and Pt is important. So, a pixel matching cost volume Vcost is calculated from the difference of pixel intensities between Ps and Pt in their sweeping planes:

Vcost(x, y, d) = |Ps(x, y, d) − Pt(x, y, d)|

where d is the index and P is the total number of the sweeping depth planes in the PSV. The best matching depth layer for each pixel in Ps and Pt is found by calculating the minimum pixel intensity error in Vcost:

S(x, y) = argmin_d Vcost(x, y, d)

where S stores, for each pixel (x, y), the index of the depth plane with minimum matching errors between Ps and Pt.
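A minimal sketch of this cost volume and best-depth selection is given below, using an absolute intensity difference as the matching cost; the paper's exact error metric may differ.

```python
import numpy as np

def best_depth_index(psv_src, psv_tgt):
    """Per-pixel index map S of the sweeping plane with minimum intensity
    matching error between the source-warped and target PSVs, both of
    shape (P, H, W)."""
    v_cost = np.abs(psv_src - psv_tgt)  # pixel matching cost volume
    return np.argmin(v_cost, axis=0)    # S(x, y) = argmin_d V_cost

# toy PSV with P = 3 planes over a 2x2 image
psv_tgt = np.zeros((3, 2, 2))
psv_src = np.ones((3, 2, 2))
psv_src[1] = 0.0                        # plane 1 matches everywhere
S = best_depth_index(psv_src, psv_tgt)
```

The index map S is exactly what the OSS module later updates when it reassigns edge pixels to better-matching (typically further) depth planes.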

Occlusion-aware source sampling module
We propose an OSS module to extract and transfer the occluded pixels (denoted green in Fig. 3) from the source view to the target view. The calculation of the attention mask for edge pixels is given in (13), where a gradient function computes the gradient of the disparity map D and compares it against a threshold on the gradient magnitude: if the gradient is larger than the given threshold (0.05 in our experiments), the pixel is taken as being around the edges.

Fig. 3 The OSS module transfers occluded background pixel clues to the target view from the source view′s best matched depth plane. The target object is homography warped from the source view′s depth plane to the target view. The occluded contents around the object′s boundary (depth discontinuities) in the target view are to be replaced by pixels from the source view′s best matched depth plane. The final novel view is synthesized via inverse-over compositing.
To transfer the edge pixels indicated by the attention mask, we find their corresponding pixels in each depth layer of the source-warped and target PSVs via the matching cost volume, and the depth-index map S is updated accordingly. The updating process can be represented as S(x, y) = d.
Using the updated index map S, a global background can be composed by the inverse-over operation [38] on the pixels of all depth planes in the PSV that contain transferred pixels from the source view. The inverse-over operation helps generate a global background rather than a local one, because pixels from nearer layers can be overwritten by revealed occluded contents from layers further away.

To implement the inverse-over operation, the compositing algorithm starts from the furthest depth plane and proceeds to the nearest one, and the composed output at the nearest/first depth plane is the final composed background image. More specifically, when compositing pixels in a nearer depth plane, a newer intermediate background image is composed by fusing selected pixels from the older background image and the current plane. The selection process and conditions are shown in (16): a pixel is taken from the current plane only if it has minimum matching errors in that depth plane and also meets the other constraints. The finally composed background contains the furthest edge pixels with minimum matching errors. The OSS module aims at preserving as many occluded visual features as possible to enable perspective rendering of the target view.
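The furthest-to-nearest compositing loop can be sketched as follows; the selection conditions of (16) are simplified here to a boolean validity mask, which stands in for the minimum-matching-error and edge constraints.

```python
import numpy as np

def inverse_over(planes, valid):
    """Compose a global background by sweeping from the furthest plane to
    the nearest: a nearer plane's pixel only fills the running background
    where it is still empty, so revealed content from far layers survives.
    `planes` is (P, H, W) ordered near-to-far; `valid` marks pixels
    selected by the (simplified) OSS constraints."""
    p, h, w = planes.shape
    background = np.zeros((h, w))
    filled = np.zeros((h, w), dtype=bool)
    for d in range(p - 1, -1, -1):        # furthest plane first
        take = valid[d] & ~filled
        background[take] = planes[d][take]
        filled |= take
    return background

planes = np.stack([np.full((2, 2), 1.0),   # nearest plane
                   np.full((2, 2), 2.0)])  # furthest plane
valid = np.ones((2, 2, 2), dtype=bool)
bg = inverse_over(planes, valid)           # far content wins everywhere
```

This ordering is the opposite of the usual front-to-back "over" compositing, which is why the result is a background image rather than the foreground view.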

Deep visual fusion module
Subsequently, we have the global background sampled from the source LF captures, which is first warped into a background LF using the same method as in Section 3.3. The background LF is spatial-angular regularized and then fused with the preliminary LF, which is also spatial-angular regularized. The DVF module learns a fusion attention between the contents of the background LF and the preliminary LF. The attention mechanism [39−41] has also been proven able to extract representative features from ambiguous regions. Finally, the fused light field features go through another spatial-angular regularisation module, which has the same network structure as the spatial-angular regularization described above, to implicitly regularise the structure of the LF contents.
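A heavily simplified sketch of attention-weighted fusion is given below: scalar projection weights stand in for the DVF module's learned layers, and a sigmoid produces a per-pixel attention map in [0, 1] that blends the two feature maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dvf_fuse(feat_bg, feat_pre, w_bg, w_pre, b):
    """Fuse background-LF and preliminary-LF features with a per-pixel
    attention map A in [0, 1]; the scalar weights w_bg, w_pre, b stand in
    for the DVF module's learned projection layers."""
    attention = sigmoid(w_bg * feat_bg + w_pre * feat_pre + b)
    return attention * feat_bg + (1.0 - attention) * feat_pre

bg = np.full((2, 2), 2.0)
pre = np.full((2, 2), 4.0)
fused = dvf_fuse(bg, pre, w_bg=0.0, w_pre=0.0, b=0.0)  # A = 0.5 everywhere
```

In the trained module the attention would be high where the background LF reveals occluded content and low where the preliminary LF is already reliable.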

Training setup
The proposed framework has been implemented with PyTorch. The disparity estimation model and the LF synthesis model were trained separately in two stages. In the first stage, the disparity probability volumes are pre-calculated and fused in a target camera′s frustum for efficient training. The training of the disparity estimation model initializes the learning rate to 1E-2, which decays by a factor of 0.1 from the second epoch; it takes 128 epochs and 50 hours to finish on two NVIDIA Tesla V100S GPUs. In the second stage, the OSS and DVF modules are trained with a learning rate initialized to 1E-5 that decays by a factor of 0.5 from the second epoch. The patch size is set to 128, and the number of depth planes P is set to 128 across the disparity range of [−4, 4]. All network parameters are initialized from a normal distribution, and the momentum term of the Adam optimizer [42] is set to 0.5. The training takes 148 epochs and 20 hours on one NVIDIA Tesla V100S GPU.
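The learning-rate schedules above can be sketched as a simple step function; "decays from the second epoch" is read here as a single decay step applied from epoch 2 onward, which is our interpretation of the text rather than the paper's exact schedule.

```python
def lr_at_epoch(base_lr, decay, epoch, decay_from=2):
    """Step learning-rate schedule: keep base_lr for the first epoch, then
    multiply by `decay` from `decay_from` onward (one interpretation of
    "decays by 0.1 since the second epoch"; the exact schedule may differ)."""
    return base_lr * (decay if epoch >= decay_from else 1.0)

# stage 1 (disparity model): 1E-2 decayed by 0.1
stage1 = [lr_at_epoch(1e-2, 0.1, e) for e in range(1, 5)]
# stage 2 (OSS + DVF modules): 1E-5 decayed by 0.5
stage2 = [lr_at_epoch(1e-5, 0.5, e) for e in range(1, 5)]
```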

The dataset
The Stanford Lytro multi-view light field dataset (MVLF) [4] was used for training and evaluating models. Each scene contains 3 to 5 LF captures but provides neither camera parameters nor good ground-truth disparity maps. Hence, we estimated the camera parameters K, R, and τ with COLMAP [43]. The ground-truth disparity maps were estimated by the state-of-the-art LF disparity estimation method introduced in [34]. The proposed pipeline relies on the accuracy of the large-baseline disparity estimation method from [2] that involves volume rescaling, homography warping, and fusion. Due to the limitations of computing memory and the resolution of the disparity estimated from the slope of the EPI line, the number of planes of the DPV is limited. As introduced in LLFF [18], the number of planes in an MPI determines the extrapolation boundary, which also applies to the homography-warped DPV. Outdoor scenes with large-baseline views require more planes in the DPV than indoor scenes so that pixels of the source views can be accurately allocated into equal-disparity-distant planes of the DPV. Therefore, both the disparity estimation algorithm in [2] and our source sampler fail in outdoor scenes that extend out to infinity. After filtering out the outdoor scenes by a threshold on disparity estimation error, 133 scenes are finally retained, most of which are indoor scenes, as expected. We randomly selected 123 scenes for training and ten scenes for model evaluation and comparison. For each rendering instance, we selected two LF captures from each scene and took the central view of the target LF as the target view image.

Mean square errors (MSE) between the predicted disparity maps and the ground-truth disparity maps were used to supervise the training of the disparity estimation model, as given by (19). Mean absolute errors (MAE) between the ground truth LF Lg and the final rendered output LF Lt, the preliminary rendering output based on central view warping, and the features of the DVF module were calculated to supervise the learning of the network parameters according to (20). The weight coefficients for the losses of the OSS and DVF modules are set to 0.2 and 0.1, respectively, in our experiments.

Experimental results

Evaluation of view synthesis quality
The proposed method is evaluated against state-of-the-art novel view synthesis methods, including local light field fusion (LLFF) [18], learning-based view synthesis (LBVS) [5], and single image view synthesis (SynSin) [15]. Qualitative comparisons are shown in Fig. 4. LBVS is trained using the MVLF dataset, and the evaluations of SynSin and LLFF use pre-trained models from their official repositories. Forty-nine novel virtual viewpoints are arranged in planes parallel to a target camera′s focal plane, so a set of novel view positions is arranged in a 7 × 7 array neighbouring the target view. The number of planes P in the PSV is 128. Metrics of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated to evaluate the quantitative quality of view synthesis. As can be seen from Table 2, our method produces higher PSNR results than the others and shows competitive SSIM results. Note that LBVS uses four corner SAIs of the target LF as input, which is a much less challenging scenario for LF view synthesis in terms of angular variations. Our experiments demonstrate that, even though LBVS performs a much simpler task than ours, our method still produces similar, and most of the time better, results than LBVS, which further validates the efficiency of our model. LLFF generates and fuses neighbouring multi-plane images (MPIs) to render novel views, which can adapt to large-baseline parallax inputs; however, it cannot handle well the camera rotations that are prevalent in the MVLF dataset, which directly affects the MPI fusion process and degrades LLFF′s image rendering quality. One of the approaches most closely related to ours is SRF [26], which is designed for large-baseline spherical-surrounding views. However, our inputs are configured as two source LFs whose tens of SAIs lie within a micro-baseline; these views are too close to each other, making them almost uninformative for the multi-view correspondence searching method used in SRF. The correspondences among such SAIs can only be effectively established by searching for the minimal pixel intensity variance along the slope of EPI lines, as shown in Fig. 2. Thus, SRF would have degraded performance on the LF dataset, so we did not compare with SRF due to the unfair input configuration.

Ablation study
We carry out experiments to validate the contributions of the OSS and DVF modules. Table 3 shows quantitative ablations of LF synthesis without the OSS and DVF modules. The full model (ours) in Table 3 has the best novel view synthesis quality. In the experiments w/o the OSS module, the source pixel sampling is disabled, and so is the synthesis of the background LF Lbg. The performance drop proves that our source pixel sampling approach is important for completing the final rendering results with revealed occluded contents. Fig. 5 shows that occluded contents around depth discontinuities have been successfully recovered. Fig. 6 shows qualitative ablations of the OSS module.

In the experiments w/o the DVF module, the fusion attention A is removed. We find that the performance drops without the DVF module. The degraded performance proves that the attention mechanism in the DVF module is important for accurately fusing the background LF Lbg and the preliminary LF Lpre. We visually compare the output from direct rendering based on the backward warping of central views in Fig. 7(b) and the final output LF image in Fig. 7(c): Fig. 7(c) has much less noise than Fig. 7(b). Therefore, the visualization in Fig. 7 proves that the DVF module can further suppress noise in output images; the effectiveness of attention-guided CNNs for image denoising was also validated in a previous study [44]. Fig. 8 further proves that the DVF module is important in complementing the background LF.

Limitation
Our method is configured as a multi-view LF framework that adopts methodologies from multi-view stereo techniques, such as homography warping and pixel-intensity-based cost volume estimation. Therefore, our method inherits limitations common to other multi-view stereo algorithms [32,33,45].
First, our method struggles in low-texture regions. Compared to other cost volume estimation methods in [32,33,45] that use convolution neural networks to extract high-level features, our pixel-consistency based cost volume estimation in the OSS module can be less robust in low-texture regions.
Second, our method cannot handle large-baseline images of outdoor scenes well. Our method uses the DPV, which has a limited number of planes, to transfer disparity from the source views to the target view, and this limited number of disparity planes provides insufficient resolution for large-baseline outdoor scenes with a large depth range. Additionally, the performance of the OSS module is also affected by this insufficient number of planes, which is limited by computing memory. Therefore, our method does not perform well on the scenes flowers and leaves.
Additionally, the noise suppression brought by the DVF module may further result in the loss of high-frequency details in output images.

Conclusions
Rendering a local immersive LF based on arbitrary large baseline references is a challenging task. Our method takes large baseline LF captures as input and can synthesize immersive novel views at a novel target viewpoint. Conventional view synthesis methods require a small baseline or hundreds of dense input views, while ours only requires two LF captures, which is convenient with existing commercial LF cameras. Furthermore, the OSS and DVF modules are proposed to fuse sampled occluded source features into a final refined LF. Such source sampling and fusion mechanisms not only help provide occlusion information from varying observation angles, but also effectively enhance the visual quality by suppressing sensor noise. Experimental results show that our proposed method is able to render high-quality LF images with sparse LF references and significantly outperforms the other state-of-the-art LF rendering and novel view synthesis methods.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article′s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article′s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.