Image De-occlusion via Event-enhanced Multi-modal Fusion Hybrid Network

Seeing through dense occlusions and reconstructing scene images is an important but challenging task. Traditional frame-based image de-occlusion methods may lead to fatal errors when facing extremely dense occlusions due to the lack of valid information available from the limited input occluded frames. Event cameras are bio-inspired vision sensors that record the brightness changes at each pixel asynchronously with high temporal resolution. However, synthesizing images solely from event streams is ill-posed since only the brightness changes are recorded in the event stream, and the initial brightness is unknown. In this paper, we propose an event-enhanced multi-modal fusion hybrid network for image de-occlusion, which uses event streams to provide complete scene information and frames to provide color and texture information. An event stream encoder based on the spiking neural network (SNN) is proposed to encode and denoise the event stream efficiently. A comparison loss is proposed to generate clearer results. Experimental results on a large-scale event-based and frame-based image de-occlusion dataset demonstrate that our proposed method achieves state-of-the-art performance.


Introduction
Seeing through dense occlusions and obtaining clear images of the scene behind them is a challenging task. Due to the existence of dense occlusions, the valid visual information of the scene available to a single traditional frame-based camera is limited. Image de-occlusion methods, e.g., synthetic aperture imaging [1, 2], aim to reconstruct clear scene images without occlusions using visual information acquired from multiple viewpoints, e.g., images from a camera array, which is important for many computer vision tasks, e.g., obstacle avoidance [3], tracking [4, 5], and object detection [6].
Traditional frame-based image de-occlusion methods [1, 2, 7-9] take the light-field images captured by a camera array as input and reconstruct the clear scene image without occlusions. These methods rest on the basic assumption that complementary visual information can be obtained by the camera array since images acquired from different viewpoints are occluded differently. Therefore, fusing these light-field images can yield a clear scene image. However, although a camera array with sufficiently many cameras can theoretically see through the occlusion and obtain complete scene information, it is often difficult to acquire enough complementary visual information in practice, especially in the case of extremely dense occlusion. Meanwhile, acquiring a large number of images leads to tremendous and unnecessary information redundancy. Therefore, it is necessary to explore new visual data acquisition and processing methods without redundancy to achieve efficient image de-occlusion.
Event cameras have developed significantly in recent years, bringing a new visual paradigm. Event cameras, e.g., dynamic vision sensors [10], are bio-inspired vision sensors that asynchronously respond to brightness intensity changes at each pixel and output data in the form of event streams. Different from traditional frame-based cameras, each pixel of the event camera works asynchronously, and the logarithm of the brightness intensity of the pixel is recorded when an event is triggered at that pixel. A new event is triggered whenever the change in the logarithm of the brightness intensity compared to this recorded value exceeds a certain threshold. Let I(x, y, t) denote the brightness intensity at pixel (x, y) and timestamp t. An event is triggered when |Δt(I(x, y, t))| > C. The event is denoted as e = (x, y, t, p), where p ∈ {+1, −1} is the polarity of the event, indicating whether the brightness intensity is increasing or decreasing. Due to this event triggering principle, event cameras have high temporal resolution (on the order of μs), high dynamic range (about 140 dB versus 60 dB for frame-based cameras), low power consumption (about 10 mW), and are free of motion blur. These advantages make event cameras widely used in object recognition [11, 12], high frame rate video generation [13-16], optical flow estimation [17-19], and 3D reconstruction [20, 21].
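To make this triggering model concrete, the following sketch simulates it on a sequence of intensity frames. It is a minimal illustration: the threshold value, the per-pixel reference update rule, and all function and variable names are assumptions rather than the behaviour of an actual DAVIS346 sensor.

```python
import numpy as np

def generate_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Toy per-pixel event simulator following the triggering rule above.

    frames: (N, H, W) array of brightness intensities I(x, y, t).
    An event (x, y, t, p) is emitted whenever the log-intensity change since
    the last event at that pixel exceeds the contrast threshold C. The
    threshold value and names here are illustrative, not sensor calibration.
    """
    log_ref = np.log(frames[0] + eps)          # last recorded log intensity per pixel
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame + eps)
        delta = log_cur - log_ref
        ys, xs = np.nonzero(np.abs(delta) > threshold)
        for x, y in zip(xs, ys):
            p = 1 if delta[y, x] > 0 else -1   # polarity: brightness increasing or decreasing
            events.append((x, y, t, p))
            log_ref[y, x] = log_cur[y, x]      # update the recorded value at fired pixels
    return events
```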
Since the event camera has high temporal resolution and is free of motion blur, a moving event camera can completely record the visual information of the scene behind occlusions. Based on this concept, Zhang et al. [22] proposed the first event-based image de-occlusion method. High temporal resolution event streams are captured by an event camera moving on a straight slide, providing complete visual information of the scene, and a hybrid model is proposed to extract event stream features and reconstruct clear images without occlusions. However, it should be noted that the event stream only records the changes of the brightness intensity and not the brightness intensity value itself, i.e., information such as color is not recorded. Therefore, using only event streams to synthesize images, i.e., predicting the brightness intensity value at each pixel solely from the recorded brightness intensity changes, is an underdetermined task and may lead to fatal errors.
Although the event camera has high temporal resolution and can record the visual information of the scene completely, it also brings new challenges to data processing. On the one hand, the event stream is in the form of a tuple list and cannot be processed directly using convolutional neural networks. Therefore, it is necessary to explore efficient event stream encoding methods. On the other hand, the captured raw event stream contains a large amount of noise, which may dramatically affect the quality of the reconstructed image. To tackle these problems, we leverage the spiking neural network (SNN) to encode the asynchronous event stream, which has been proven to be effective in previous works [22, 23]. SNNs are bio-inspired artificial neural networks (ANNs) whose spiking neurons use spike trains to transfer information. Different from conventional ANNs, whose neurons use continuous-valued and differentiable activation functions, spiking neurons take spike trains as input, and each input spike induces a change in the hidden state of the neuron. An output spike is triggered whenever the hidden state exceeds a certain threshold. Therefore, SNNs can naturally encode asynchronous event streams. Meanwhile, due to this spike triggering principle, discrete noise events may not induce a large enough change in the hidden state of the spiking neurons to trigger an output spike, so discrete noise events are effectively suppressed.
To tackle the challenges above, in this paper we propose an event-based and frame-based multi-modal fusion hybrid model that uses both the event stream and occluded frames as inputs to synthesize clear scene images without occlusions. Compared with existing frame-based de-occlusion methods, high temporal resolution event streams are leveraged to compensate for the lack of valid visual information that frame-based methods suffer from in the case of dense occlusion. Compared with the existing event-based method, occluded frames are leveraged to provide low-level visual information, e.g., color and texture, to solve the problem that event-only image reconstruction is underdetermined. Furthermore, we leverage an SNN to efficiently encode the event stream and suppress the noise in the raw event stream. Meanwhile, a comparison loss is proposed to achieve better image de-occlusion performance. To demonstrate the effectiveness of our proposed method, we collect a real large-scale event-based and frame-based image de-occlusion dataset and test our proposed method on it. The experimental results show that our proposed method achieves state-of-the-art performance. All the source code and the dataset can be found at https://github.com/lisiqi19971013/Event_Enhanced_DeOcc.
Our contributions can be summarized as follows: 1) An event-enhanced multi-modal fusion framework is proposed for the event-based and frame-based image de-occlusion task, which exploits the complementary advantages of occluded frames and event streams to achieve efficient image de-occlusion under dense occlusion.
2) An SNN-based event stream encoder is leveraged for efficient event stream encoding and denoising. In addition, a comparison loss is proposed to generate clearer scene images without occlusions.
3) Qualitative and quantitative experimental results on a large-scale event-based and frame-based image de-occlusion dataset demonstrate that our proposed method achieves state-of-the-art performance.

Related work

Frame-based image de-occlusion
Recent decades have witnessed great advances in image de-occlusion methods [1, 2, 7-9, 24]. Inspired by the fact that the output of a camera array is equivalent to a camera with a large aperture lens, Vaish et al. [1] propose a framework based on plane and parallax to focus the occluded frames on a target plane and achieve image de-occlusion. They further explore alternative cost functions, e.g., cost functions based on color medians and entropy, to achieve robust image de-occlusion [2]. Xiao et al. [7] leverage a pixel-wise clustering method to select better views and generate better results. Pei et al. [8] treat image de-occlusion as a labeling task and propose an energy minimization method to solve it. Recently, Wang et al. [9] propose the first deep learning framework for the image de-occlusion task, with an encoder-decoder architecture that makes full use of the spatial and angular information. However, in the case of extremely dense occlusions, the performance of these frame-based image de-occlusion methods may be insufficient since the valid visual information available to the frame-based camera array is limited.

Event-based image de-occlusion
Since the moving event camera can record complete scene information with a high temporal resolution, Zhang et al. [22] propose the first event-based image de-occlusion method. The raw event stream captured by an event camera moving linearly is first refocused based on prior information of the acquisition system, including the depth of the scene, the moving speed of the event camera, and the intrinsic matrix of the event camera. Then, the refocused event stream is forwarded to a hybrid network to generate clear scene images without occlusions. However, this method only takes the event stream as input. Synthesizing images solely from event streams is an ill-posed problem since only the changes in brightness intensity are provided while the initial brightness intensity is unknown, which may lead to fatal errors.

Spiking neural network
Spiking neural networks (SNNs) are bio-inspired artificial neural networks (ANNs) that use bionic spiking neurons as their computing units. The spiking neuron uses discrete spike trains to transmit information. The hidden state of the spiking neuron is described by an internal variable called the membrane potential. Each input spike generates a postsynaptic potential (PSP), which increases the membrane potential. Whenever the membrane potential exceeds a certain threshold, an output spike is triggered, and the membrane potential decreases due to a self-suppression mechanism called the refractory response. The membrane potential of a spiking neuron is the sum of all PSPs and refractory responses. Several spiking neuron models have been proposed in neuroscience, e.g., the Izhikevich model [25], the Hodgkin-Huxley model [26], and the spike response model [27].

Method
In this section, we introduce our proposed event-enhanced multi-modal fusion hybrid network for image de-occlusion. In Section 3.1, we first introduce the spiking neuron model and the spiking neural network used in our model. Then, in Section 3.2, we introduce the network architecture of our proposed method. Finally, the loss functions used in our method are introduced in Section 3.3.

Spiking neuron and spiking neural network
In our proposed method, we use the spike response model (SRM) [27] as the spiking neuron model to compose the SNNs. Here, we formulate the neuronal dynamics of our spiking neuron. Let s_i(t), i = 1, ..., N, and s_o(t) be the input spike trains and the output spike train of a spiking neuron described by the SRM, containing N and 1 spike trains, respectively. Each input spike train generates a PSP, which is defined as the convolution of s_i(t) and the spike response kernel ε(t), multiplied by a synaptic weight w_i, where i denotes the i-th spike train. The internal variable of the spiking neuron, i.e., the membrane potential, is denoted as u(t). As soon as u(t) exceeds the threshold ϑ, an output spike is triggered, and u(t) is inhibited due to the refractory response. Similarly, the refractory response is defined as the convolution of s_o(t) and the refractory response kernel ν(t). The membrane potential is the sum of all PSPs and the refractory responses:
$$ u(t) = \sum_{i=1}^{N} w_i \left(\varepsilon * s_i\right)(t) + \left(\nu * s_o\right)(t) $$
where s_i(t) = Σ_j δ(t − t_i^j) and t_i^j is the timestamp of the j-th input spike in the i-th input spike train, and s_o(t) = Σ_k δ(t − t_o^k) and t_o^k is the timestamp of the k-th output spike in the output spike train.

For a better description, we illustrate a spiking neuron and the corresponding neuron dynamics in Fig. 1. As shown in Fig. 1(a), the spiking neuron takes 3 spike trains as input, and each input spike train contains 2 input spikes. The corresponding synaptic weight of each input spike train is depicted by the height of the spike. The PSP generated by each input spike is shown as the curve in the corresponding color in Fig. 1(b), and the membrane potential of the neuron is shown as the black curve in Fig. 1(b). When the membrane potential exceeds the threshold, an output spike is triggered, shown as the red arrow, and the membrane potential declines due to the refractory response, shown as the red curve. In practice, the spike response kernel ε(t) and the refractory response kernel ν(t) are causal kernels gated by the Heaviside step function H(t) and parameterized by the time constants τ_s and τ_r of the spike response kernel and the refractory response kernel, respectively.

Fig. 1 A spiking neuron (a) and the neuron dynamics (b) described by the spike response model.
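The neuron dynamics above can be sketched in discrete time as follows. This is a minimal illustration: the exponential kernel shapes, time constants, and threshold stand in for ε(t), ν(t), and ϑ and are assumptions, not the exact formulation of [27] or of our implementation.

```python
import numpy as np

def simulate_srm_neuron(spike_trains, weights, threshold=1.0,
                        tau_s=5.0, tau_r=5.0, dt=1.0, T=100):
    """Discrete-time sketch of a spike response model (SRM) neuron.

    spike_trains: (N, T) binary array of input spikes; weights: (N,) synaptic
    weights. The kernel shapes below are illustrative stand-ins for the spike
    response kernel eps(t) and the refractory kernel nu(t) in the text.
    """
    t = np.arange(T) * dt
    eps = (t / tau_s) * np.exp(1.0 - t / tau_s)      # assumed spike response kernel
    nu = -2.0 * threshold * np.exp(-t / tau_r)       # assumed (negative) refractory kernel

    # PSPs: convolve each input spike train with eps and weight it.
    psp = sum(w * np.convolve(s, eps)[:T] for w, s in zip(weights, spike_trains))

    u = np.zeros(T)      # membrane potential
    out = np.zeros(T)    # output spike train
    refr = np.zeros(T)   # accumulated refractory responses
    for k in range(T):
        u[k] = psp[k] + refr[k]
        if u[k] > threshold:             # fire and add a refractory response
            out[k] = 1.0
            refr[k:] += nu[:T - k]
    return u, out
```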
Further, consider a feed-forward spiking neural network with M layers, where the l-th layer contains N_l spiking neurons. Then, the forward propagation of this SNN can be defined as
$$ u^{(l+1)}(t) = W^{(l)} \left(\varepsilon * s^{(l)}\right)(t) + \left(\nu * s^{(l+1)}\right)(t), \qquad s^{(l+1)}(t) = f_s\left(u^{(l+1)}(t)\right) $$
where W^{(l)} is the synaptic weight matrix of the l-th layer, s^{(l)}(t) denotes the spike trains of the l-th layer, and f_s(·) is the spiking function that emits a spike whenever the corresponding membrane potential exceeds the threshold ϑ.

Network architecture
Fig. 2 shows the network architecture of our proposed event-enhanced multi-modal fusion hybrid network for image de-occlusion. The input event stream is first encoded and denoised by an SNN-based event stream encoder and then forwarded into a CNN-based encoder for further feature extraction. Meanwhile, the input occluded frames are concatenated with the corresponding event surfaces generated from the asynchronous event stream by a handcrafted representation module, and are forwarded into the frame encoder to extract image features. The extracted multi-modal features are fused by fusion layers and forwarded into the joint decoder to generate the final clear image.

Fig. 2 The framework of our proposed method. Taking both the event stream and occluded frames as input, our proposed method uses the event stream to provide complete scene information and the occluded frames to provide color and texture information for efficient image de-occlusion.

SNN-based event stream encoder. Our proposed SNN-based event stream encoder contains 3 convolutional layers and directly takes the raw event stream as input. The input layer encodes the event stream using 16 convolutional kernels, where the two input channels correspond to the positive and negative events, respectively. The hidden layer leverages 16 kernels to further extract features, and the output layer contains 32 kernels to generate the encoded event stream. To better preserve visual information, the input event stream is concatenated with the output event streams of the first and second layers, respectively. The membrane potentials of the output neurons are averaged along the temporal dimension to generate the output coarse encoded event voxel.
Compared to [22], our proposed method leverages the SRM as the spiking neuron to further consider the refractory responses caused by the output spikes, which is closer to real biological neurons and could achieve better performance. Meanwhile, Zhang et al. [22] take the number of output spikes as the output encoded feature of the SNN-based event stream encoder, which is discrete and inefficient. In contrast, we use the membrane potential of the output neurons as the encoded feature, which contains more complete information.
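The difference between the two readouts can be summarized in a few lines. The (channels, time, height, width) tensor layout and the function name below are assumptions made for illustration.

```python
import torch

def readout_features(membrane_potential, spikes):
    """Compare the two readouts discussed above on a (C, T, H, W) layout.

    membrane_potential and spikes are tensors from the last SNN layer; the
    layout and names here are illustrative assumptions.
    """
    # Spike-count readout (as in [22]): discrete, loses sub-threshold information.
    spike_count_voxel = spikes.sum(dim=1)

    # Membrane-potential readout (ours): average the continuous membrane
    # potential along the temporal dimension to form the coarse event voxel.
    potential_voxel = membrane_potential.mean(dim=1)
    return spike_count_voxel, potential_voxel
```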
Event encoder. After the event stream is denoised and coarsely encoded by the SNN-based event stream encoder, the encoded event voxel is forwarded into a CNN-based event encoder for multi-scale feature extraction. The input event voxel is first forwarded into a conditionally parameterized convolutional layer [28] with 64 output channels, which can learn specialized convolutional kernels for each sample and enhance feature extraction, and then forwarded into 7 down-sampling layers with output channels of 128, 128, 256, 256, 512, 512, 512, respectively. In each down-sampling layer, the input feature is down-sampled by a convolutional layer with a kernel size of 4 and a stride of 2. With our proposed event encoder, multi-scale event features are extracted.

Frame encoder. Since the occluded frames can provide visual information, e.g., color and texture, we select 11 occluded frames symmetrically for each scene as the input of the frame encoder. Considering that the occluded frames contain a large proportion of useless occlusion information and it is difficult for the model to filter out the invalid parts, we convert the event stream to event surfaces to provide boundary information. For each occluded frame, the event surface is calculated as the event count map within a time bin of length Δt around the timestamp of the frame. For the event stream E = {e_k = (x_k, y_k, t_k, p_k)}, the event surface for the i-th frame can be defined as
$$ S_i(x, y) = \sum_{k} \mathbb{1}\left(x_k = x,\ y_k = y,\ |t_k - T_i| \le \frac{\Delta t}{2}\right) $$
where T_i is the timestamp of the i-th frame. In practice, the length of the time bin Δt is set to the exposure time of the occluded frame. The occluded frames and the event surfaces are concatenated and forwarded into the frame encoder, which has the same structure as the event encoder except for the number of input channels, but does not share weights. Using the proposed frame encoder, multi-scale image features can be obtained from the occluded frames and the event surfaces.
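A minimal sketch of the event surface computation defined above is given below; the function signature and the choice to count both polarities together are assumptions for illustration.

```python
import numpy as np

def event_surface(events, frame_timestamp, exposure_time, height, width):
    """Event count map around one frame, following the definition above.

    events: iterable of (x, y, t, p) tuples; frame_timestamp: the frame's
    timestamp T_i; exposure_time: the time-bin length Delta t. Function and
    argument names are illustrative, not the paper's implementation.
    """
    surface = np.zeros((height, width), dtype=np.float32)
    half_bin = exposure_time / 2.0
    for x, y, t, p in events:
        if abs(t - frame_timestamp) <= half_bin:   # event falls inside the time bin
            surface[y, x] += 1.0                   # count events regardless of polarity
    return surface
```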
Event and frame joint decoder. After the multi-scale features are respectively extracted from the event stream and the occluded frames, a fusion module is leveraged to fuse the multi-modal features. Since the event stream only records the brightness changes at each pixel and is thus extremely sparse, while the image is dense and contains visual information such as color and texture, there is a large modality gap between the event stream and the image. Therefore, directly concatenating the features of the two modalities does not fuse them sufficiently and may lead to performance degradation. To deal with this problem, we design a feature fusion layer to fuse the features of different modalities. The features of the same scale from the event stream and the occluded frames are concatenated and convolved by 2 convolutional layers with a kernel size of 3. Then the fused feature is added to the input features via a shortcut connection to obtain the output multi-modal fused feature. After the multi-scale fused features are obtained, they are forwarded into the event and frame joint decoder to generate the final output image without occlusions. The joint decoder contains 7 up-sampling layers with output channels of 512, 512, 256, 256, 128, 128, 64, respectively, and a conditionally parameterized convolutional layer. Inspired by the success of convolutional layers with multi-scale receptive fields in vision tasks such as semantic segmentation [29], in each up-sampling layer the input feature is first up-sampled using bilinear interpolation and then forwarded into an atrous spatial pyramid pooling (ASPP) module [30] to obtain multi-scale features. Meanwhile, the fused feature of the corresponding scale is added via a skip connection. Using our proposed joint decoder, a clear image without occlusions can be synthesized from the multi-modal features.
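A minimal PyTorch sketch of the feature fusion layer described above (concatenation, two 3 × 3 convolutions, and a shortcut connection) is given below; the activation function, channel arithmetic, and the exact form of the shortcut are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class FeatureFusionLayer(nn.Module):
    """Fuses same-scale event and frame features, as described above.

    The channel arithmetic and the exact shortcut form are assumptions made
    for this sketch, not the paper's reference implementation.
    """
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, event_feat, frame_feat):
        fused = self.fuse(torch.cat([event_feat, frame_feat], dim=1))
        # Shortcut connection: add the fused feature back onto the inputs.
        return fused + event_feat + frame_feat


# Usage sketch on one scale with 128-channel features.
layer = FeatureFusionLayer(128)
out = layer(torch.randn(1, 128, 32, 32), torch.randn(1, 128, 32, 32))
```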

Loss function
In this section, we introduce the loss functions used in our proposed method. First, a pixel-wise Manhattan distance is computed as the pixel loss to maintain the low-level vision features, e.g., color and texture, formulated as
$$ L_{pixel} = \frac{1}{H \times W \times C} \sum_{x, y} \left| \hat{I}(x, y) - I(x, y) \right| $$
where Î(x, y) and I(x, y) are the pixel values of the predicted image and the ground truth image at coordinate (x, y), respectively, and the predicted image is in the shape of H × W × C. To further maintain the high-level vision features, the learned perceptual image patch similarity (LPIPS) loss [31] is leveraged in our method to generate images with better visual effects. It is calculated as the Euclidean distance between the multi-scale features extracted from the output image and the ground truth image, respectively, by a fixed pre-trained network together with a learnable linear layer, formulated as
$$ L_{lpips} = \sum_{i} \frac{1}{H_i \times W_i} \sum_{x, y} \left\| w_i \odot \left( \phi_i(\hat{I})_{x, y} - \phi_i(I)_{x, y} \right) \right\|_2^2 $$
where φ_i is the feature extracted by the i-th layer of the pre-trained network, which is in the shape of H_i × W_i × C_i, Î and I are the output and ground truth images, respectively, and w_i is the i-th learnable linear layer. In practice, we use the visual geometry group (VGG) network [32] pre-trained on the ImageNet dataset [33] as the fixed network and leverage the features extracted by convolution layers i = 2, 4, 7, 10, 13 to calculate L_lpips. Inspired by the contrastive learning mechanism, we propose a comparison loss to improve the image de-occlusion performance. The main purpose of our proposed comparison loss is to pull the output closer to the positive sample and push the output further away from the negative sample. For the image de-occlusion task, we set the clear image without occlusions as the positive sample I^+ and an input densely occluded frame as the negative sample I^-. Specifically, the comparison loss is defined as the quotient of the perceptual losses [34] calculated between the output frame and the positive sample and the negative sample, respectively, formulated as
$$ L_{cmp} = \frac{\sum_{i} \lambda_i \left\| \phi_i(\hat{I}) - \phi_i(I^{+}) \right\|_2^2}{\sum_{i} \lambda_i \left\| \phi_i(\hat{I}) - \phi_i(I^{-}) \right\|_2^2} $$
where λ_i is the weight of the i-th feature. In practice, the negative sample is set as the middle input occluded frame, and the perceptual losses are calculated using the features obtained from the convolution layers of the pre-trained VGG network. Finally, the total loss function can be defined as
$$ L = \alpha L_{pixel} + \beta L_{lpips} + \gamma L_{cmp} $$
where α, β, and γ are the hyper-parameters to balance the loss functions.
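The comparison loss can be sketched with a pre-trained VGG feature extractor as follows. The VGG variant, the mapping of the listed layer indices onto torchvision's layer numbering, the squared-distance form, and the unit per-layer weights are all assumptions made for illustration.

```python
import torch
import torchvision.models as models

# Pre-trained VGG feature extractor; which VGG variant the paper uses and how
# the listed layer indices map onto torchvision's numbering are assumed here.
vgg_features = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def multi_scale_features(img, layer_ids=(2, 4, 7, 10, 13)):
    """Collect activations from the listed convolution layers."""
    feats, x = [], img
    for idx, layer in enumerate(vgg_features):
        x = layer(x)
        if idx in layer_ids:
            feats.append(x)
    return feats

def comparison_loss(pred, positive, negative, weights=None):
    """L_cmp: perceptual distance to the positive sample divided by the
    perceptual distance to the negative sample (the middle occluded frame).
    The per-layer weights default to 1 in this sketch."""
    fp, fpos, fneg = map(multi_scale_features, (pred, positive, negative))
    weights = weights or [1.0] * len(fp)
    num = sum(w * torch.mean((a - b) ** 2) for w, a, b in zip(weights, fp, fpos))
    den = sum(w * torch.mean((a - b) ** 2) for w, a, b in zip(weights, fp, fneg))
    return num / (den + 1e-8)
```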
With our proposed event-enhanced multi-modal fusion hybrid network, valid visual information from the event stream and the occluded frames can be fused efficiently, and a clear scene image without occlusions can be obtained.

Experiments
To demonstrate the effectiveness of our proposed method, we conduct experiments on a large-scale event-based and frame-based image de-occlusion dataset, which is introduced in Section 4.1 together with the experimental settings. The quantitative and qualitative analyses are presented in Sections 4.2 and 4.3, respectively. Finally, in Section 4.4, we conduct ablation experiments to demonstrate the utility of each proposed module.

Dataset and experimental setting
Dataset. To demonstrate the utility of our proposed method, we collect a large-scale event-based and frame-based image de-occlusion dataset named Occlusion-400 and conduct experiments on it. The dataset contains 400 samples captured by a DAVIS346 color camera moving linearly on a straight slide indoors and outdoors, as shown in Fig. 3(a). The indoor part of the dataset contains 244 samples collected in the laboratory using fences and baffles as occlusions, as shown in Figs. 3(b) and 3(c). The outdoor part contains 156 samples occluded with fences, recording mainly remote scenes, e.g., buildings and cars. For each sample, the occluded frames, the event stream, and the ground truth scene image without occlusions are recorded. In practice, the DAVIS346 color camera is placed on a straight slide with a length of 50 cm and moves at 4.3 cm/s. The occlusion is placed between the slide and the scene, parallel to the slide and perpendicular to the optical axis of the camera. Some example samples are shown in Fig. 3, including the event streams, the occluded frames, and the ground truth clear scene images without occlusions. In our experiments, 358 samples are randomly selected as the training set, and the remaining samples form the test set.

Experimental setting. Since the spiking function of the spiking neuron is non-differentiable, SNNs cannot be trained directly by the traditional backpropagation algorithm. To deal with this problem, Shrestha and Orchard [35] propose a backpropagation algorithm named SLAYER, which propagates errors to the previous layer based on a temporal credit assignment policy, making SNNs trainable using gradient descent like ANNs. Our proposed method is implemented based on PyTorch and the SNN library SLAYER. The model is trained for 1200 epochs with a batch size of 16. The optimization method is Adam [36], and a cosine annealing schedule with warm restarts [37] is used to adjust the learning rate. The hyper-parameters of the total loss function are set as α = 1, β = 0.1, γ = 0.05.
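A minimal sketch of the training configuration described above is given below; the learning rate and the restart period of the scheduler are assumptions, since only the optimizer, the schedule type, the number of epochs, and the batch size are specified.

```python
import torch

# Placeholder network so the snippet runs standalone; in practice this is the
# event-enhanced multi-modal fusion hybrid network described in Section 3.
model = torch.nn.Linear(8, 8)

# The text specifies Adam, a cosine annealing schedule with warm restarts,
# 1200 epochs, and a batch size of 16; the learning rate and restart period
# T_0 below are assumptions.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=100)

for epoch in range(1200):
    # ... iterate over batches of size 16: forward pass, compute
    # L = alpha * L_pixel + beta * L_lpips + gamma * L_cmp,
    # loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()
```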

Quantitative analysis
Table 1 shows the quantitative experimental results on the Occlusion-400 dataset. Following the setting in previous works [9, 22], peak signal-to-noise ratio (PSNR, higher is better) and structural similarity (SSIM, higher is better) are selected as the evaluation metrics. Our proposed method is compared to the state-of-the-art frame-based image de-occlusion method DeOccNet [9] and the event-based de-occlusion method E-SAI [22]. All methods are trained on our Occlusion-400 dataset for a fair comparison. From Table 1, we can observe that our proposed method achieves state-of-the-art performance on the Occlusion-400 dataset. Specifically, our proposed method achieves improvements of 0.97 dB in PSNR and 0.015 in SSIM compared to the state-of-the-art frame-based image de-occlusion method DeOccNet [9], and achieves significant improvements of 5.43 dB in PSNR and 0.1 in SSIM compared to the event-only method E-SAI [22]. Compared to the conventional method [1], our proposed method achieves even greater advantages, i.e., 8.96 dB in PSNR and 0.212 in SSIM.

We also evaluate the performance of each method under different scenarios, i.e., indoors and outdoors. From Table 1, we can observe that in scenarios with relatively sparse occlusion, i.e., outdoor scenarios, frame-based methods hold a larger advantage, since generating images using only event streams is underdetermined. In contrast, in densely occluded indoor scenarios, the advantage of the frame-only method is slight, and greater benefits are obtained by using event streams, which provide complete scene information behind the occlusions. Compared to the event-only method E-SAI, our proposed method leverages the occluded frames as input to provide low-level visual information, e.g., color and texture, which yields greater advantages in relatively sparsely occluded outdoor scenarios. Compared to the frame-only method DeOccNet, we use event streams to provide complete scene information and achieve better performance.
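For reference, the two evaluation metrics can be computed per image as follows; using scikit-image here is an illustrative choice rather than the evaluation code used for Table 1.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """PSNR and SSIM (both higher is better) for one reconstructed image.

    pred and gt are uint8 RGB arrays of the same shape; scikit-image is an
    illustrative choice, not necessarily the paper's evaluation code.
    """
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```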

Qualitative analysis
Fig. 4 shows the visualization results on the Occlusion-400 dataset. The occluded frames and the results generated by Vaish et al. [1], E-SAI [22], DeOccNet [9], and our proposed method are visualized from left to right. The ground truth clear scene images without occlusions are also provided for reference. From Fig. 4, we can observe that our proposed method generates clearer scene images without occlusions, and more precise details can be synthesized, e.g., the grass in the last row.

Compared to the frame-based image de-occlusion method DeOccNet, our proposed method generates results with higher precision under dense occlusions, e.g., the digital number in the red box in row 2, which can be clearly reconstructed by our proposed method, while the frame-based method acquires limited valid visual information due to the dense occlusion, resulting in poor performance. Another typical example is the hole shown in the red box in row 5. Due to the lack of valid visual information, some tiny details, e.g., holes, may not be recorded in the limited input occluded frames, which causes the frame-based image de-occlusion method to fail to reconstruct such tiny details, as shown in row 5 of Fig. 4(d).
Compared to the event-based method E-SAI, our proposed method has greater advantages and can generate significantly clearer results, e.g., the bicycle in the red box in row 6 and the doorbell in the blue box in row 3. Although the event-based method can reconstruct images under dense occlusions thanks to the complete visual information provided by the event stream, its output images are blurred and lack details, e.g., textures, since event-only reconstruction is an ill-posed problem.

Ablation study
To demonstrate the utility of our proposed modules, we further conduct ablation experiments on the Occlusion-400 dataset. We test the performance of our model without the proposed comparison loss (denoted as Ours (w/o L_cmp)), our model with the SNN event stream encoder removed (denoted as Ours (w/o SNN)), and our basic model (denoted as Ours (basic)), i.e., with both the SNN encoder and the comparison loss removed; the quantitative results are shown in Table 2. From Table 2, we can observe that removing our proposed comparison loss leads to a performance decline of 0.39 dB in PSNR and 0.011 in SSIM, which shows the effectiveness of the comparison loss. To assess the contribution of our proposed SNN event stream encoder, we use the handcrafted event representation method proposed in [38] to convert the event stream into a grid-based representation with 32 channels to better preserve the information. This event voxel is then forwarded into our proposed model with the SNN event stream encoder removed. The quantitative results show that removing our proposed SNN event stream encoder leads to a performance decline of 0.53 dB in PSNR and 0.021 in SSIM. When the SNN encoder is removed and the event stream is stacked by the handcrafted event representation method, the event stream may not be effectively denoised and the temporal information contained in the event stream may be lost, which explains this performance decline. We further visualize the results of our ablation experiments in Fig. 5. From Fig. 5, we can observe that artifacts appear in the output image when the comparison loss is removed, as shown in row 3 of Fig. 5(c), which shows the usefulness of our proposed comparison loss. Meanwhile, we can further observe that removing our SNN event stream encoder leads to blurring in the output results. As shown in the red box in row 1 of Fig. 5(d), the edges of the chessboard generated by our model with the SNN encoder removed are blurry, which is due to the lack of efficient event stream denoising.

Table 3 shows the results of our proposed method solely taking events or frames as input. From Table 3, we can observe that removing the input frames (denoted as Ours (w/o Frame)) leads to a performance decline of 5.35 dB in PSNR and 0.129 in SSIM. This is because solely synthesizing images from event streams is an ill-posed problem: only the changes in brightness intensity are provided and the initial brightness intensity is unknown, which leads to fatal errors. From Table 3, we can also observe that removing the input events (denoted as Ours (w/o Event)) leads to a performance decline of 0.7 dB in PSNR and 0.026 in SSIM. This is because the event stream provides complete visual information of the occluded scene. These ablation experiments demonstrate the validity of both modalities.
To further validate the detailed architecture of our proposed method, we conduct ablation experiments, and the results are shown in Table 3. To demonstrate the effectiveness of our proposed SNN-based event stream encoder, we use the SNN-based encoder proposed in [22] to replace our proposed event stream encoder while keeping the rest of the network structure unchanged. The model performance under this condition is denoted as Ours (Change SNN). From Table 3, we can observe that our proposed SNN-based encoder achieves performance improvements of 0.39 dB in PSNR and 0.003 in SSIM, which demonstrates its effectiveness. To demonstrate the validity of the input event surface, we conduct ablation experiments on our Occlusion-400 dataset. From Table 3, we can observe that removing the input event surface (denoted as Ours (w/o Event Surface)) leads to a performance decline of 0.57 dB in PSNR and 0.008 in SSIM. To demonstrate the effectiveness of our proposed feature fusion layer, we conduct ablation experiments in which the feature fusion layers are removed and the features from the event encoder and the frame encoder are directly concatenated and forwarded into the joint decoder. The performance of our model in this case is denoted as Ours (w/o Feat. Fusion). From Table 3, we can observe that removing the feature fusion layers results in a performance decline of 0.54 dB in PSNR and 0.009 in SSIM. For our proposed multi-scale up-sampling layer, we conduct ablation experiments in which the ASPP module is replaced by a convolutional layer with a kernel size of 3. The performance of our model in this case, denoted as Ours (w/o M.s. Up-Samp.), is shown in Table 3: removing the ASPP module results in a performance decline of 0.5 dB in PSNR and 0.01 in SSIM.

Conclusions
In this paper, we propose an event-enhanced multi-modal fusion hybrid network for image de-occlusion. Our proposed method takes both occluded images and the event stream as input. It achieves effective image de-occlusion by leveraging the event stream to provide complete visual information and the frames to provide color and texture information. An SNN-based event stream encoder is leveraged to encode and denoise the event stream effectively, and a comparison loss is proposed to generate clear scene images without occlusions. We collect a real large-scale event-based and frame-based image de-occlusion dataset, and experimental results on it demonstrate that our proposed method achieves state-of-the-art performance.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.