Occlusion relationship reasoning with a feature separation and interaction network

Occlusion relationship reasoning aims to locate where an object occludes others and to estimate the depth order of these objects in three-dimensional (3D) space from a two-dimensional (2D) image. The former sub-task demands both the accurate location and the semantic indication of the objects, while the latter sub-task needs the depth order among the objects. Although several insightful studies have been proposed, a key characteristic of occlusion relationship reasoning, i.e., the specialty and complementarity between occlusion boundary detection and occlusion orientation estimation, is rarely discussed. To exploit these properties, in this paper, we integrate them into a unified end-to-end trainable network, namely the feature separation and interaction network (FSINet). It contains a shared encoder-decoder structure to learn the complementary property between the two sub-tasks, and two separated paths to learn the specialized properties of the two sub-tasks. Concretely, the occlusion boundary path contains an image-level cue extractor to capture rich location information of the boundary, a detail-perceived semantic feature extractor to acquire refined semantic features of objects, and a contextual correlation extractor to model long-range constraints between them. In addition, a dual-flow cross detector has been customized to alleviate false-positive and false-negative boundaries. For the occlusion orientation estimation path, a scene context learner has been designed to capture the depth order cue around the boundary, and two stripe convolutions are built to judge the depth order between objects. The shared decoder supplies the feature interaction, which plays a key role in exploiting the complementarity of the two paths. Extensive experimental results on the PIOD and BSDS ownership datasets reveal the superior performance of FSINet over state-of-the-art alternatives. Additionally, abundant ablation studies are offered to demonstrate the effectiveness of our design.


Introduction
Reasoning the occlusion relationship of objects from a single image is an important topic in computer vision [1-5], which also acts as a crucial component in other (higher-level) vision tasks, such as object detection [6, 7], instance segmentation [8], light field depth estimation [9], depth ordering [10], tiny obstacle discovery [11], object proposal extraction [12] and scene de-occlusion [13]. From the perspective of an observer, occlusion reflects the depth discontinuity in the scene. It appears at the intersection of things (e.g., people and cars) or between things and stuff (e.g., sky and sea). As illustrated in Fig. 1(a), the cows occlude the grass. Following the definition in [2], 1-pixel-wide object boundaries, which reveal the accurate location and semantic indication of the objects, are employed to indicate where occlusion occurs. The orientation of each boundary point is utilized to represent the depth order. Therefore, the orientation value makes sense only at boundary pixels.
Over the past years, a number of influential studies [1, 3, 14-17] have concentrated on modeling hand-crafted features to reason the occlusion relationship. These approaches frequently over-segment the image into abundant regions and employ edge cues or region cues to judge the local figure/ground (FG) relationship, followed by global optimization to estimate the global depth order and alleviate local conflicts. Hoiem et al. [1] utilized boundary cues, region cues, surface layout cues and depth-based cues to estimate the local FG, and employed a conditional random field (CRF) model for global FG reasoning. Amer et al. [17] adopted convexity to compute the local FG and the hierarchical relationships between regions to aggregate the local FG relationships, and also proposed a novel global optimization method. These approaches can generally retain decent pixel-level details to locate boundaries but often exploit the objects' semantic cues poorly. In other words, they can hardly recover the complete boundary information of objects or fully judge the depth order between the objects.
Recently, with the emergence of powerful deep learning techniques, several deep learning based approaches [2, 4] have been developed that greatly improve the performance of this task, especially in terms of learning semantic features. Wang and Yuille [2] treated the task as two completely separated sub-tasks, i.e., occlusion boundary detection and occlusion orientation estimation, and proposed two separated deep convolutional networks to exploit different image cues for the two sub-tasks, respectively. In contrast, Wang et al. [4] established a unified end-to-end multi-task network that shares deep features to simultaneously predict both the occlusion boundary and the occlusion orientation. Although these methods achieve significant improvement by exploiting semantic features, they give limited consideration to the complementarity and specialty of the two sub-tasks. In addition, the depth order cue, which is a fundamental element in occlusion reasoning, is rarely studied.
In this work, we customize a feature separation and interaction network, FSINet for short, to fully exploit the specialty and complementarity of the two sub-tasks in occlusion relationship reasoning. FSINet contains a shared encoder, two separated paths for occlusion boundary detection and occlusion orientation estimation, and a shared decoder. Specifically, the occlusion boundary detection path contains three modules. First, an image-level cue extractor (ICE) is designed to supply rich location information of the boundary. Second, a detail-perceived semantic feature extractor (DSFE) is proposed to embed the low-level details into the semantic features, producing a refined semantic feature map. Third, a contextual correlation extractor (CCE) is proposed to capture the long-range constraints between semantic features of the boundary. We fuse these three cues to obtain features containing sufficient local details and a semantic indication of the objects, as visualized in Fig. 1(b). In addition, a dual-flow cross detector (DCD) is proposed to detect the occlusion boundary, which achieves a better trade-off in limiting false positives and false negatives by a pair of mutually promoting flows. Accordingly, compared to the existing approaches (Fig. 1(e) and Fig. 1(f)), our method alleviates the false-positive and false-negative occlusion boundaries efficiently (Fig. 1(d)). In the occlusion orientation estimation path, a scene context learner (SCL) is designed to capture rich multi-scale depth order cues around the boundary for occlusion relationship reasoning; the learned feature map is visualized in Fig. 1(c). In addition, two stripe convolutions are adopted to exploit the depth order cue to distinguish the foreground and background areas. The shared decoder is designed to establish the feature interaction between these two paths, making each path capable of boosting the other. Extensive experimental results demonstrate that FSINet achieves state-of-the-art performance on both the PIOD [2] and BSDS ownership [3] datasets.
In summary, the key contributions of this work are as follows (also see Fig. 2): 1) We propose a feature separation and interaction network, which integrates the specialty and complementarity of the two sub-tasks in occlusion relationship reasoning into a unified end-to-end trainable network. 2) Three modules, including ICE, DSFE and CCE, are designed to supply an accurate location and semantic indication of the occlusion boundary, and a new detector named DCD is designed to achieve a decent trade-off between limiting false-positive and false-negative detections. 3) The depth order cues are exploited by SCL and employed to distinguish the foreground and background areas via two stripe convolutions. Note that, through the feature interaction, the depth order cues show a significant impact in boosting occlusion boundary detection. The major extensions over the conference version [18] are as follows: (1) We design a DSFE module to embed the low-level details into the semantic features step-by-step, resulting in a refined semantic feature map. (2) We design a DCD module for occlusion boundary detection, which efficiently alleviates false-positive and false-negative detection. (3) We optimize the occlusion orientation estimation path to increase computational efficiency. Note that a recent work, MT-ORL [19], is a variant of our conference version [18], which primarily focuses on an orthogonal occlusion representation for occlusion relationships. The extensions and contributions of this paper are quite different from those of MT-ORL.

Background
In this section, we briefly review the concepts closely related to the proposed occlusion relationship reasoning network.
Encoder-decoder structures have been successfully applied in semantic segmentation [20], object detection [21] and other computer vision tasks [22]. Such a structure is composed of an encoder network and a decoder network: the former gradually extracts multi-layer features through several convolution and down-sampling layers, and based on these features, the latter makes task-specific predictions. Although our method utilizes a similar structure, our shared decoder is designed to supply the feature interaction between the two sub-tasks.
Spatial features represent pixel-level details before sequential down-sampling operations. Because they preserve resolution and represent low-level details, spatial features are usually employed in several low-level visual tasks, such as depth prediction [23] and boundary detection [24-26]. They are also helpful in several high-level visual tasks for refining predictions around the boundary, such as semantic segmentation [27]. In this paper, the proposed method extracts abundant low-level cues directly from the original image via ICE to obtain an accurate occlusion boundary.
Multi-level features are widely used in multiple visual tasks [28]. These features are extracted from different stages of a backbone and represent the scene in various phases. A series of operations, i.e., normalization, concatenation and dimension reduction, is employed to acquire feasible features. Different from simply concatenating multi-level features, the proposed method extracts multi-level features to embed the low-level details into the high-level features, which further yields an enclosed boundary with an accurate location.
Context features model the long-distance dependencies in later layers of a convolutional neural network (CNN). Context is important in both scene understanding and perception [29-32]. For instance, Mostajabi et al. [33] used context features to promote feedforward semantic labeling of superpixels and acquired good results. Liu et al. [34] designed an architecture based on the fully convolutional network (FCN) and proposed a global context for semantic segmentation. Afterward, Chen et al. [35] employed atrous spatial pyramid pooling to extract context features and encode multi-scale information. In addition, several more advanced contextual modules were proposed. Fu et al. [36] established a dual attention module to capture position attention and channel attention to improve context features. Yuan et al. [37] employed object regions to learn the contextual relation between pixels and regions. In this paper, the proposed network extracts two different context features for the two sub-tasks, namely the long-range correlations between the semantic features of boundary points and the depth order cues around the boundaries.

Problem formulation
Following [2, 38], the goal of occlusion relationship reasoning can be expressed as follows: for an input image I, an occlusion boundary (1-pixel-wide) map Ê and an occlusion orientation map Ô are jointly predicted to represent the occlusion relationship. Each of them has the same resolution as I. Note that Qiu et al. [39] recently presented a novel pixel-pair occlusion relationship, in which the occlusion boundary is defined as 2-pixel-wide (occluder and occludee). Since our work mainly focuses on exploiting the specialty and complementarity between occlusion boundary detection and occlusion orientation estimation, we follow the common definition of the occlusion relationship in [2, 38] to fairly demonstrate the effectiveness of our design.
Occlusion boundary: The occlusion boundary represents the location where occlusion occurs and is generally depicted as the object boundary. For image I, the predicted boundary probability map is denoted as E, and e_p is the probability value of pixel p in map E, where e_p ∈ [0, 1]. Correspondingly, the ground truth of the boundary is denoted as Ē, which is a binary boundary map. The boundary ground truth at pixel p is denoted as ē_p, where ē_p ∈ {0, 1}.
Occlusion orientation: The occlusion orientation indicates the depth order between objects by employing pixel-wise tangent values. The tangent values are only defined at the boundary pixels and follow a left-hand rule [2] (i.e., the foreground area is on the left side of the background area), as demonstrated in Fig. 1(a). Following the mathematical definition above, for an image I, the predicted orientation map is denoted as O, and o_p is the orientation value of pixel p in map O, where o_p ∈ (-π, π]. The ground truth of the orientation is denoted as Ō, and the orientation ground truth at pixel p is denoted as ō_p, where ō_p ∈ (-π, π].

Occlusion relationship: During the testing phase, we first conduct a non-maximum suppression (NMS) operation on E to obtain thinned boundaries: Ê = NMS(E). Then, all pixels of Ê are binarized using a threshold τ to form a mask of the occlusion boundary: M = {[ê_p > τ] | p ∈ Ê}, where ê_p denotes the probability value of pixel p in map Ê, and [·] denotes the indicator function, which is 1 when the predicted boundary probability is larger than τ. We reserve only the orientation values of the boundary pixels: Ô = M ⊗ O, where ⊗ denotes the element-wise product. Finally, similar to DOC [2], we adjust the orientation map Ô to ensure that neighboring boundary pixels have a similar tangent direction, yielding the occlusion relationship map.
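To make the post-processing concrete, the following minimal NumPy sketch implements the three steps above. The thinning routine nms_thin is a stand-in for whatever edge NMS implementation is used (the text specifies only "NMS"), and τ = 0.5 is an arbitrary illustrative threshold.

```python
import numpy as np

def occlusion_relationship(E, O, nms_thin, tau=0.5):
    """Minimal sketch of the test-time post-processing described above.

    E: boundary probability map, shape (H, W), values in [0, 1]
    O: orientation map, shape (H, W), values in (-pi, pi]
    nms_thin: an external edge-thinning (NMS) routine; only its existence
              is assumed here, not any particular implementation
    tau: boundary threshold (0.5 is an arbitrary illustrative value)
    """
    E_hat = nms_thin(E)                 # E_hat = NMS(E)
    M = (E_hat > tau).astype(O.dtype)   # occlusion-boundary mask [e_p > tau]
    O_hat = M * O                       # O_hat = M (x) O, element-wise product
    return E_hat, M, O_hat
```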

Feature separation and interaction network
This section details the proposed feature separation and interaction network, named FSINet; its pipeline is illustrated in Fig. 3. Given an input red-green-blue (RGB) image I, a fine-tuned ResNet [40] is employed to extract deep features. Let f_i (i ∈ {2, 3, 4, 5}) denote the output feature of the i-th stage of the ResNet. The feature of the last stage, f_5, is convolved by two extra 3 × 3 convolution blocks to reduce the channel number to 256, and the resulting feature is denoted as f′_5. Subsequently, a decoder composed of two residual blocks and an upsampling operation is designed; each residual block comprises two branches, one with a 3 × 3 convolution and the other with two 1 × 1 convolutions and a 3 × 3 convolution. The output of the decoder is denoted as F_d. Then, we design two separated paths: the occlusion boundary detection path to locate the pixels where occlusion occurs, i.e., boundary pixels, and the occlusion orientation estimation path to determine the occlusion relationship between objects. Note that the decoder plays a key role in our design: it supplies the feature interaction between the two paths introduced in this section. Benefiting from this interaction, the depth order cues learned in the occlusion orientation path show a significant impact on occlusion boundary detection. Such a claim is also experimentally proven in Sect. 5.3.1.
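For illustration, a minimal PyTorch sketch of the shared decoder described above follows. The channel width (256) and the exact placement of batch normalization and ReLU are assumptions; the text only specifies the two-branch residual structure and the upsampling.

```python
import torch.nn as nn

class DecoderResBlock(nn.Module):
    """One shared-decoder residual block: one branch with a single 3x3
    convolution, the other with two 1x1 convolutions around a 3x3
    convolution, summed. BN/ReLU placement and widths are assumptions."""
    def __init__(self, ch=256):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                      nn.BatchNorm2d(ch))
        self.branch_b = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch_a(x) + self.branch_b(x))

class SharedDecoder(nn.Module):
    """Two residual blocks followed by upsampling, yielding F_d."""
    def __init__(self, ch=256):
        super().__init__()
        self.blocks = nn.Sequential(DecoderResBlock(ch), DecoderResBlock(ch))
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)

    def forward(self, f5p):                 # f5p: the 256-channel feature f'_5
        return self.up(self.blocks(f5p))    # F_d
```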

Occlusion boundary detection path
The occlusion boundary path aims to accurately locate the pixels where occlusion occurs. We propose three feature extractors to obtain sufficient occlusion boundary features: (1) ICE supplies rich location information of the boundary. (2) DSFE embeds the low-level details into semantic features step-by-step, which produces a refined semantic feature map. (3) CCE measures the long-range constraints between the semantic features of the boundary. In addition, a dual-flow cross detector named DCD is proposed to detect occlusion boundaries by fusing these features.
Image-level cue extractor: Motivated by the conventional occlusion reasoning approach [41], which computes the occlusion boundary based on the over-segmented image, we propose ICE to capture pixel-level location information of the boundary. As illustrated in Fig. 4(a), ICE contains two flows: one flow suppresses non-boundary pixels by utilizing three 3 × 3 convolution blocks, and the other flow directly employs a 3 × 3 convolution block to preserve the detailed location of the occlusion boundary. Subsequently, the features of the two flows are summed element-wisely to combine the advantages of both flows. Each convolution block consists of a 3 × 3 convolution, batch normalization, and ReLU activation. Defining C_3(·) as the 3 × 3 convolution block, the feature obtained by ICE is formulated as:

F_I = C_3(C_3(C_3(I))) + C_3(I).  (1)

Because ICE avoids resolution-reducing operations, such as pooling and down-sampling, it obtains location information at the full image resolution. Moreover, ICE can be interpreted as a filter operation that retains the boundary pixels with high response, since the first flow avoids introducing massive color and texture from the original image, as demonstrated in Fig. 4(d).
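A minimal PyTorch sketch of ICE under Eq. (1) follows; the channel width (16) is an assumption for illustration, and conv_block realizes the C_k(·) convolution blocks (convolution, batch normalization, ReLU) defined above.

```python
import torch.nn as nn

def conv_block(cin, cout, k=3):
    """C_k(.): a k x k convolution block (convolution, BN, ReLU)."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class ICE(nn.Module):
    """Eq. (1): a three-block suppression flow plus a one-block detail flow,
    fused by element-wise summation. Channel width (16) is an assumption."""
    def __init__(self, ch=16):
        super().__init__()
        self.flow1 = nn.Sequential(conv_block(3, ch), conv_block(ch, ch),
                                   conv_block(ch, ch))  # suppresses non-boundary pixels
        self.flow2 = conv_block(3, ch)                  # preserves detailed location

    def forward(self, img):                  # img: full-resolution RGB image I
        return self.flow1(img) + self.flow2(img)  # F_I of Eq. (1)
```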
Detail-perceived semantic feature extractor: The high-level features contain sufficient semantic information but lose details, since several pooling and down-sampling operations are performed in the backbone. Directly detecting the occlusion boundary based on such features may result in performance degradation. To address this issue, we design a bottom-up structure to embed the low-level details into the semantic features step-by-step. Between adjacent stages, the common features of the occlusion boundary are preserved, and the non-boundary noises from the lower layers are suppressed. In particular, we first use a 3 × 3 convolution block to refine features f_i and f_{i+1} and reduce the channel number of both features to 32. Then, both resulting feature maps are upsampled to 1/4 resolution of the input image and concatenated, and another 3 × 3 convolution block is employed to fuse the two features and fix the channel number to 32, resulting in a refined feature map. Assuming that Y_2 = C_3(f_2) denotes the refined feature map of stage 2, the subsequent refined feature maps are formulated as:

Y_{i+1} = C_3(C(up(Y_i), up(C_3(f_{i+1})))), i ∈ {2, 3, 4},  (2)

where up(·) is the upsampling operation and C(·) is the concatenation operation. Finally, the last refined feature Y_5 is upsampled to the same size as the original image. For two adjacent layers, because f_i contains fewer non-boundary pixels than f_{i-1}, their fusion removes these non-boundary pixels and accurately locates the boundary pixels. In this way, the DSFE eventually passes the location information in low-level cues to high-level cues to obtain more accurate semantic indicators of the occlusion boundary and to suppress the non-boundary noise from the lower layers. As displayed in Y_2 of Fig. 4(e), the initial low-level cue is full of redundant responses at non-boundary pixels. By gradual fusion, these responses are gradually alleviated, and the boundary pixels are located accurately, as shown in Y_5 of Fig. 4(e).
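The following sketch mirrors Eq. (2), reusing the conv_block helper from the ICE sketch above; the backbone channel widths assume a ResNet50 (consistent with Sect. 5), and the interpolation details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSFE(nn.Module):
    """Step-by-step fusion of Eq. (2): each stage is refined to 32 channels,
    upsampled to 1/4 input resolution, and fused with the running map."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), ch=32):
        super().__init__()
        self.refine = nn.ModuleList(conv_block(c, ch) for c in in_chs)
        self.fuse = nn.ModuleList(conv_block(2 * ch, ch) for _ in in_chs[1:])

    def forward(self, feats):                # feats = (f2, f3, f4, f5)
        size = feats[0].shape[-2:]           # 1/4 resolution of the input image
        y = self.refine[0](feats[0])         # Y_2 = C_3(f_2)
        for i, f in enumerate(feats[1:]):
            fi = F.interpolate(self.refine[i + 1](f), size=size,
                               mode='bilinear', align_corners=False)
            y = self.fuse[i](torch.cat([y, fi], dim=1))  # Eq. (2): Y_{i+1}
        return y                             # Y_5 (upsampled to image size later)
```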
Contextual correlation extractor: To obtain the closed boundary of an object, we design CCE to model the long-range correlations between the semantic features. Specifically, taking the feature f_5 as input, CCE first utilizes a dilated convolution with a kernel size of 3 × 3 and a dilation rate of 2. Then, a 3 × 3 convolution block is employed to reduce the grid artifacts caused by the dilated convolution, and a 1 × 1 convolution block is employed to reduce the channel number to 16. Finally, the generated feature is upsampled to the size of the original image. Denoting the dilated convolution with a kernel size of 3 × 3 and rate n as D_n(·), and the 1 × 1 convolution block as C_1(·), the extracted contextual cue can be formulated as:

F_C = up(C_1(C_3(D_2(f_5)))).  (3)

Since there are always occlusion boundary pixels around the true boundary pixels, the intensities of these pixels are increased by the long-range contextual correlation, which improves the consistency of the occlusion boundary. As illustrated in Fig. 4(f), the feature of CCE encodes the occlusion boundary that separates the object and the background with different responses.
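A corresponding sketch of CCE under Eq. (3), again reusing the conv_block helper from the ICE sketch; the input and hidden widths are assumptions.

```python
import torch.nn as nn

class CCE(nn.Module):
    """Eq. (3): a rate-2 dilated 3x3 convolution, a 3x3 block to reduce
    grid artifacts, and a 1x1 block down to 16 channels; upsampling to
    image size is performed outside this module. Widths are assumptions."""
    def __init__(self, cin=256, mid=64):
        super().__init__()
        self.dilated = nn.Sequential(
            nn.Conv2d(cin, mid, 3, padding=2, dilation=2),  # D_2(.)
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.smooth = conv_block(mid, mid)       # reduces grid artifacts
        self.reduce = conv_block(mid, 16, k=1)   # 1x1 block to 16 channels

    def forward(self, f5):
        return self.reduce(self.smooth(self.dilated(f5)))
```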
Dual-flow cross detector and boundary loss: DCD is proposed to eliminate the false-positive detection of the occlusion boundary by utilizing two mutually promoting flows, as displayed in Fig. 5. Specifically, given the feature as input, a 3 × 3 convolution block is first employed to fuse the concatenated feature and reduce the feature channels from 64 to 16. Then, we design two separated flows. The first flow employs three 3 × 3 convolution blocks to reduce the channels consecutively, as depicted by the orange arrow in Fig. 5. The second flow also consists of three 3 × 3 convolution blocks, as depicted by the blue arrow in Fig. 5. To promote each flow by the other, the loss of the second flow must be backpropagated to the first flow. For this reason, the output of the first block in the first flow is added to the corresponding feature in the second flow, and the summed feature is the input of the second block in the second flow; the second block follows the same way. The output features of both flows are concatenated and followed by a 1 × 1 convolution to fuse them. Eventually, a sigmoid layer is utilized to fit the boundary.
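The following sketch expresses the dual-flow structure, reusing conv_block from the ICE sketch. The per-block channel widths and the exact output heads are assumptions, and the assignment of the supervisions (see the losses below) to the three outputs follows our reading of the text.

```python
import torch
import torch.nn as nn

class DCD(nn.Module):
    """Dual-flow cross detector: two three-block flows with cross
    connections from flow 1 into flow 2; their outputs are concatenated
    and fused by a 1x1 convolution before a sigmoid."""
    def __init__(self, cin=64, ch=16):
        super().__init__()
        self.stem = conv_block(cin, ch)   # 3x3 block: 64 -> 16 channels
        self.f1 = nn.ModuleList(conv_block(ch, ch) for _ in range(3))
        self.f2 = nn.ModuleList(conv_block(ch, ch) for _ in range(3))
        self.head1 = nn.Conv2d(ch, 1, 1)  # first-flow head (CEL supervision)
        self.head2 = nn.Conv2d(ch, 1, 1)  # summed-feature head (AL supervision)
        self.fuse = nn.Conv2d(2 * ch, 1, 1)

    def forward(self, x):
        x = self.stem(x)
        a1 = self.f1[0](x)
        b1 = self.f2[0](x)
        a2 = self.f1[1](a1)
        b2 = self.f2[1](b1 + a1)          # cross-connection 1
        a3 = self.f1[2](a2)
        b3 = self.f2[2](b2 + a2)          # cross-connection 2
        e1 = torch.sigmoid(self.head1(a3))        # first-flow boundary map
        e2 = torch.sigmoid(self.head2(a3 + b3))   # summed-feature boundary map
        e = torch.sigmoid(self.fuse(torch.cat([a3, b3], dim=1)))  # final map
        return e, e1, e2
```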
When training the network, we employ two loss functions to supervise the two flows. First, taking the first flow output as input, a 1 × 1 convolution and a sigmoid layer are employed to generate the probability map of the boundary E, which is supervised by the cross-entropy loss (CEL) function. Following Sect. 3, the predicted occlusion boundary at pixel p is denoted as e_p = E(p), and the ground truth corresponds to ē_p = Ē(p). The CEL is formulated as:

l_CEL(p) = -ln(e_p), if ē_p = 1; -ln(1 - e_p), otherwise.  (4)

Second, given the summed feature of both flows, another 1 × 1 convolution and a sigmoid layer are employed to generate a boundary map, which is supervised by the attention-based loss (AL) function [4] with hyper-parameters β and γ (Eq. (5); see [4] for its exact form). The total boundary loss is:

L_b = Σ_{j=1}^{M} (λ_1 l_1(I_j) + λ_2 l_2(I_j) + λ_3 l_f(I_j)),  (6)

where l_1, l_2 and l_f denote the loss values of the first flow, the second flow and the final output for image I_j, each summed over its pixels p; M is the mini-batch size; I_j is the j-th image in a batch; and λ_1, λ_2 and λ_3 are fixed weight parameters.
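A hedged sketch of this supervision follows. The attention loss of [4] is left as a pluggable callable rather than reproduced, and the mapping of CEL/AL to the three outputs follows our reading of the text.

```python
import torch

def boundary_loss(e1, e2, e, gt, attention_loss, lambdas=(1.0, 0.3, 2.1)):
    """Hedged sketch of Eqs. (4) and (6) for one image.

    e1: first-flow boundary probability map (CEL supervision)
    e2: summed-feature boundary probability map (AL supervision)
    e:  final fused boundary probability map (AL supervision)
    gt: binary ground-truth boundary map
    attention_loss: placeholder callable for the AL of [4] (with beta,
                    gamma); its exact form is not reproduced here.
    """
    eps = 1e-6
    # Eq. (4): cross-entropy loss, summed over pixels
    l1 = -(gt * torch.log(e1 + eps)
           + (1 - gt) * torch.log(1 - e1 + eps)).sum()
    l2 = attention_loss(e2, gt)
    lf = attention_loss(e, gt)
    lam1, lam2, lam3 = lambdas
    return lam1 * l1 + lam2 * l2 + lam3 * lf  # Eq. (6) term for image I_j
```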
Due to the low proportion of occlusion boundary pixels in an image, the value of CEL is generally very large, which suppresses the false-positive detection of the boundary but easily misses weak boundary pixels. In contrast, AL reduces the sensitivity of the loss value to negative samples to perceive weak boundary pixels, but it easily causes false-positive detection. Compared to the conventional approaches [4, 18] that only use AL, DCD passes the supervision of both losses across the two flows, so that CEL suppresses the false positives produced by the AL flow while AL restores the weak boundaries missed by CEL.

Occlusion orientation estimation path
To precisely estimate the foreground and background separated by the occlusion boundary, it is important to capture sufficient scene cues near the boundary, namely the depth order cue, which encodes sufficient context to distinguish the foreground and background areas. Thus, we attempt to provide a sufficient receptive field for the perception of context information. For this reason, we design the scene context learner to perceive surrounding objects near the boundary with multiple receptive fields, and stripe convolutions to determine the occlusion relationship by making full use of the depth order cue around the occlusion boundary.
Scene context learner: The scene context learner, named SCL, is designed to capture the depth order cue over various ranges, as illustrated in Fig. 6. First, taking the feature f_5 as input, we employ multiple dilated convolutions to perceive the foreground and background objects around the boundary as completely as possible. With various dilation rates, the learner perceives the scene cue at different scales from the foreground and background areas, which is beneficial for deducing which region is in front. Second, a 1 × 1 convolution is used to integrate the scene cue between various channels and promote cross-channel depth order cue aggregation at the same location; compared to dilated convolution, this 1 × 1 convolution retains the local cues near the boundary. Third, another 1 × 1 convolution is applied to normalize the values near the boundary, where the depth order cue is further enhanced and other irrelevant cues are suppressed. Fourth, all the branches are concatenated, followed by a 1 × 1 convolution block and a 3 × 3 convolution block that fuse all the branches. Furthermore, the obtained feature is upsampled by a factor of four, bringing it to 1/4 resolution of the original image. Consequently, the SCL captures the multi-scale cues of the foreground and background objects. Assuming that the dilated convolution with a kernel size of 3 × 3 and rate n is denoted as D_n(·), the depth order cue, denoted as F_S^o, is formulated as:

F_S^o = C_3(C_1(C(P_1, P_2, ..., P_N))),  (7)

where P_n denotes the output of the n-th parallel branch (computed with D_n(·) in the dilated branches), C(·) denotes the concatenation operation, and C_1(·) denotes the 1 × 1 convolution block.
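A sketch of SCL follows, reusing conv_block from the ICE sketch. The dilation rates (2, 4, 8) and channel widths are illustrative assumptions, as the text does not list them.

```python
import torch
import torch.nn as nn

class SCL(nn.Module):
    """Parallel dilated 3x3 branches plus a 1x1 branch for local cues,
    each followed by a normalizing 1x1 block, then concatenation and
    1x1 + 3x3 fusion as in Eq. (7)."""
    def __init__(self, cin=256, ch=64, rates=(2, 4, 8)):
        super().__init__()
        branches = [nn.Sequential(conv_block(cin, ch, k=1),
                                  conv_block(ch, ch, k=1))]  # local 1x1 branch
        for r in rates:                                      # D_n branches
            branches.append(nn.Sequential(
                nn.Conv2d(cin, ch, 3, padding=r, dilation=r),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                conv_block(ch, ch, k=1)))  # 1x1 block: de-gridding, channel mix
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Sequential(conv_block(len(branches) * ch, ch, k=1),
                                  conv_block(ch, ch))        # 1x1 then 3x3

    def forward(self, f5):
        ps = [b(f5) for b in self.branches]       # P_n of Eq. (7)
        return self.fuse(torch.cat(ps, dim=1))    # F_S^o (before 4x upsampling)
```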
Different from ASPP: Notably, our SCL is inspired by atrous spatial pyramid pooling (ASPP) [35] but has several differences. First, we add an element-wise convolution branch parallel to the dilated convolutions, which additionally gains local cues of the specific region and compensates for the deficiency that dilated convolution is not sensitive to nearby information. Second, after each branch, the 1 × 1 convolution blocks remove the gridding artifacts [42] caused by the dilated convolution, and also adjust channel numbers and explore the relevance between channels. Third, two further 1 × 1 convolution blocks and 3 × 3 convolution blocks are utilized to fuse the features from all branches.
Stripe convolution based estimator: The SCL extracts the depth order cue; based on this cue, an estimator is necessary to distinguish the foreground and background areas. The previous estimator [4] consists of five 3 × 3 convolutions and a 1 × 1 convolution, and suffers from features extracted at local pixel patches by these small convolution kernels: small kernels only have a small receptive field and can hardly perceive the whole object from the depth order cue. Thus, a large convolution kernel is essential to utilize the decoder output together with the depth order cue, i.e., C(F_S^o, F_d), which has 1/4 the resolution of the input image.
Nevertheless, large convolution kernels are computationally demanding and memory-consuming. Instead, two stripe convolutions are proposed, which are orthogonal to each other. Compared to the 3 × 3 convolution, which captures only nine pixels around the center (as illustrated in Fig. 7(a)), the vertical and horizontal stripe convolutions have 7 × 3 and 3 × 7 receptive fields (as demonstrated in Fig. 7(b)). Specifically, for a boundary pixel with arbitrary orientation, its tangent direction can be decomposed into vertical and horizontal directions, and the depth order cue along the two orthogonal directions contributes in varying amounts to expressing the orientation; thus, the depth order between two objects is recognized. After the two parallel stripe convolutions, the resulting features are element-wisely summed and upsampled by a factor of four to the original size, followed by three 3 × 3 convolution blocks to estimate the occlusion orientation. Consequently, the design mentioned above achieves three advantages.
1) The large receptive field aggregates contextual information of objects to determine the depth order without large memory consumption. 2) Although the slope of the boundary is not exactly perpendicular or parallel to the ground, one of the stripe convolutions can still successfully perceive the foreground and background objects. 3) Compared with OFNet [18], the resolution of the input feature is reduced by 3/4, and meanwhile the size of the stripe convolution is reduced from 11 × 3 to 7 × 3, which gives the module a larger effective receptive field with a smaller number of parameters.

Orientation loss: Following [4], we use the following orientation loss to supervise the orientation path. The predicted orientation value of pixel p is denoted as o_p, and its ground truth corresponds to ō_p. The orientation loss is formulated as:

L_o = Σ_{j=1}^{M} Σ_{p∈I_j} SL(o_p, ō_p),  (8)

where SL denotes the smooth L_1 loss (SL) [4], I_j denotes the j-th image, p is the p-th pixel in an image, and M is the mini-batch size. The loss formulation of the whole network can be stated as:

L = L_b + λ L_o,  (9)

where λ is the loss proportion of the two paths.
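Returning to the stripe convolution based estimator described above, a sketch follows, reusing conv_block from the ICE sketch. The channel widths and the plain single-channel regression head are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class StripeEstimator(nn.Module):
    """Two orthogonal stripe convolutions (7x3 and 3x7) over the
    concatenated decoder/SCL feature, summed element-wise, upsampled 4x,
    then three 3x3 blocks regressing the orientation."""
    def __init__(self, cin=128, ch=64):
        super().__init__()
        self.vert = nn.Conv2d(cin, ch, (7, 3), padding=(3, 1))  # 7x3 stripe
        self.horz = nn.Conv2d(cin, ch, (3, 7), padding=(1, 3))  # 3x7 stripe
        self.head = nn.Sequential(conv_block(ch, ch), conv_block(ch, ch),
                                  nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):   # x = C(F_S^o, F_d) at 1/4 input resolution
        y = self.vert(x) + self.horz(x)         # element-wise sum of stripes
        y = F.interpolate(y, scale_factor=4, mode='bilinear',
                          align_corners=False)  # back to image resolution
        return self.head(y)                     # orientation map O
```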

Experiments
In this section, we present extensive experiments to evaluate the proposed FSINet.

Implementation details
Datasets: We validate our approach on two datasets, namely the PIOD dataset [2] and the BSDS ownership dataset [3]. The PIOD dataset contains 9,175 training images and 925 testing images. The BSDS ownership dataset includes 100 training images and 100 testing images of natural scenes. In both datasets, each image is annotated with a ground truth occlusion boundary map and its corresponding orientation map. Both datasets are employed to compare our method with other methods; owing to its abundance of images, the PIOD dataset is used to conduct the ablation studies.
Model training: Our network is implemented in Caffe and trained on an NVIDIA GeForce RTX 2080Ti GPU. Following [2, 4, 18], we augment the PIOD data by horizontally flipping each image (two samples per image), and additionally augment the BSDS ownership data by rotating each image to eight different angles. To save training time and improve generalization, all images are randomly cropped to 320 × 320 during training. During testing, we operate on each input image at its original size. The backbone is initialized with a pre-trained model, and all added convolution layers are initialized with MSRA [43]. We use stochastic gradient descent (SGD) to optimize the network with the following hyper-parameters: mini-batch size per GPU (3), iter size (2), momentum (0.9), weight decay (0.0002), β (4.0) and γ (0.5) in Eq. (5), λ_1 (1.0), λ_2 (0.3) and λ_3 (2.1) in Eq. (6), and λ (1.0) in Eq. (9). The learning rate is initially 1 × 10^-5 and is scheduled by multiplying it by 0.1 after every 20,000 iterations for the PIOD dataset and every 2,000 iterations for the BSDS ownership dataset. The number of training iterations is 30,000 for the PIOD dataset and 5,000 for the BSDS ownership dataset.
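For readers re-implementing the schedule in a modern framework, a hedged PyTorch equivalent of the solver settings above is sketched below. The original implementation is in Caffe; model denotes the FSINet module (an assumption), and the iter size of 2 corresponds to accumulating gradients over two mini-batches.

```python
import torch

# model: the FSINet module (assumed to be defined elsewhere).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5,
                            momentum=0.9, weight_decay=2e-4)
# Multiply the learning rate by 0.1 every 20,000 iterations (PIOD);
# use step_size=2000 for the BSDS ownership dataset.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000,
                                            gamma=0.1)
# An iter size of 2 with a per-GPU mini-batch of 3 corresponds to
# accumulating gradients over two mini-batches before optimizer.step().
```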
Evaluation criteria: Following [4], we adopt two criteria: the precision and recall of the detected boundary pixels (EPR) and the precision and recall of the estimated orientation (OPR). EPR is measured by three standard metrics: the F-score with a fixed boundary threshold for all images (ODS), the F-score with the best threshold of each image (OIS), and the average precision (AP). OPR follows the same method of calculation. Note that the orientation recall and precision are only calculated at the correctly detected boundary pixels. The predicted boundary map after NMS is utilized to calculate all the metrics.
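As a rough illustration of how ODS and OIS summarize the precision-recall behaviour, consider the following simplified NumPy sketch; note that the standard benchmark aggregates matched pixel counts across images rather than averaging per-image precisions, so this is only an approximation of the protocol.

```python
import numpy as np

def f_score(p, r, eps=1e-12):
    """F = 2PR / (P + R)."""
    return 2 * p * r / (p + r + eps)

def ods_ois(precisions, recalls):
    """Simplified sketch of the ODS/OIS summaries. `precisions`/`recalls`
    have shape (num_images, num_thresholds); boundary-pixel matching
    after NMS is assumed to be done elsewhere."""
    P, R = np.asarray(precisions), np.asarray(recalls)
    ods = f_score(P.mean(axis=0), R.mean(axis=0)).max()  # one global threshold
    ois = f_score(P, R).max(axis=1).mean()               # best threshold per image
    return ods, ois
```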

Evaluation results
In this subsection, we demonstrate the quantitative and qualitative performance of our proposed method.
First, we compare the performance of our model using different backbones. As displayed in Table 1, our method achieves state-of-the-art performance with different backbones [40, 45, 46]. By using a more advanced backbone, i.e., ResNeXt50 [45], the performance of FSINet improves steadily. To ensure the fairness of the comparison, in the following experiments we use the same backbone as OFNet, namely ResNet50 [40].
The EPR results are presented in Table 1, Fig. 8(a) and Fig. 8(c). On the PIOD dataset, while MT-ORL [19] achieves the best performance, it utilizes a staggering 187.05M parameters, exceeding our method (36.014M for FSINet) by more than five times; it is unfair to conduct a comparison under such a significant difference in parameter count. Our approach outperforms all other methods with a similar number of parameters, which proves the effectiveness of the separated feature learning of the two sub-tasks and the feature interaction. It even surpasses OFNet, which uses a similar structure, by 1.1% ODS, 1.2% OIS, and 0.6% AP in occlusion boundary detection; this gain is mainly attributable to the newly proposed DSFE and DCD, the main improvements of the boundary path. On the BSDS ownership dataset, our approach achieves the second-highest OIS and AP as well, following the heavy-parameter model MT-ORL. Although our method is slightly inferior to OFNet, it still outperforms the other methods with a similar parameter count for recall lower than 0.7, as illustrated in Fig. 8(c).
The OPR results are presented in Table 1 and Fig. 8(b) and (d); our approach outperforms all other state-of-the-art methods with a similar parameter count. Specifically, our method performs second best on the PIOD dataset. It surpasses DOOBNet by 3.1% ODS, 3.1% OIS, and 5.5% AP, which illustrates the effectiveness of all modules in our network. It also surpasses our early version by 1.5% ODS, 1.5% OIS, and 0.9% AP, a greater improvement than that in occlusion boundary detection; the reason is that the optimization of the orientation path boosts the orientation estimation. On the BSDS ownership dataset, the proposed FSINet obtains gains of 3.6% ODS, 5.0% OIS, and 7.5% AP over DOOBNet, and gains of 0.8% ODS, 1.3% OIS, and 1.4% AP over OFNet. Observably, although our method achieves EPR performance similar to OFNet, it is fully ahead of OFNet in OPR performance, which indicates the effectiveness of the optimized orientation path for occlusion relationship reasoning.

Qualitative performance
Figure 9 Qualitative comparisons. 1st-2nd columns: input images and the corresponding ground truth; 3rd-5th columns: visualization results of DOOBNet, OFNet, and our method; 6th-7th columns: boundary maps and orientation maps of ours. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "yellow": correctly labeled boundaries but mislabeled occlusion; "green": false-negative boundaries; "cyan": false-positive boundaries (best viewed in color)

Figure 9 presents the qualitative results on the two datasets. The first and second columns show the original RGB images from the datasets and the corresponding ground truth. The third and fourth columns show the resulting occlusion relationships predicted by DOOBNet [4] and OFNet [18]. The fifth to seventh columns show the occlusion relationship results, the obtained boundary map, and the obtained orientation map of the proposed FSINet, respectively. In the visualization of the occlusion relationship, the right side of the arrow direction is the background area, and the left side is the foreground area. Specifically, the "red" pixels with arrows represent correctly labeled occlusion boundaries, "yellow" marks correctly labeled boundaries but mislabeled occlusion, "green" marks false-negative boundaries, and "cyan" marks false-positive boundaries.
The first two rows illustrate the reduction of false-positive occlusion boundary pixels. In the first row, four cattle with similar colors occlude each other, making it difficult to accurately locate the boundaries between every two cattle. Intuitively, DOOBNet and OFNet generate many false-positive boundary pixels around the true occlusion boundary, whereas our method avoids this issue. In the second row, the sparrow has an appearance similar to the ground; DOOBNet and OFNet suffer from false-positive boundary pixels and a fragmented boundary, while our method completely eliminates the false positives in the background and obtains a more consistent boundary.
The third and fourth rows illustrate that our method further reduces the false-negative occlusion boundary pixels. In the third row, the appearance of the cruise ship is very similar to the mountain in the background and hard to distinguish. Obviously, the other methods cannot discover the entire boundary of this ship; owing to the refined semantic feature provided by the DSFE, our method preserves the whole boundary of this ship more accurately. In the fourth row, the boat occupies a large area of the image, so its boundary is very long. Due to the limited receptive field of their contextual features, DOOBNet and OFNet miss many boundary pixels, while our method intuitively outperforms them in capturing the complete occlusion boundary.
The last row depicts the comparison of the occlusion orientation estimation. The airplane has a color similar to the distant mountain, so the occlusion relationship is difficult to determine. DOOBNet suffers from incorrect occlusion relationships; OFNet handles this issue better, and our method achieves the fewest false-positive and false-negative boundary pixels.

Analysis and discussion
In this subsection, we investigate the effect of each component in our proposed method step by step. In the ablation experiments, our method is evaluated on the PIOD dataset [2]. In addition, we compare the performance of our method and OFNet in detail to illustrate the effectiveness of our new designs.

Effectiveness of each path and feature interaction
To evaluate the effectiveness of each path, we separate the whole network into two individual networks: the boundary path network and the orientation path network. By training each path network, we compare the performance of our method and OFNet on the corresponding path separately; such decomposition learning also demonstrates the advantage of joint learning. In addition, we remove the shared decoder to reveal the usefulness of the feature interaction.
Table 2 displays the comparisons. First, under decomposition learning, FSINet surpasses OFNet by 1.2% ODS, 1.4% OIS, and 9.0% AP in occlusion boundary detection, which indicates the effectiveness of our boundary path. With the same setting, FSINet outperforms OFNet by 1.8% ODS, 1.8% OIS, and 5.9% AP in occlusion orientation estimation, which is entirely due to the proposed optimization of the orientation path. Second, compared to decomposition learning, our joint learning design elevates the performance of the two sub-tasks simultaneously. It is noteworthy that the performance of occlusion boundary detection is improved, which verifies our assumption that there exist specialty and complementarity between occlusion boundary detection and occlusion orientation estimation. FSINet exploits the complementarity between the two sub-tasks by constructing the feature interaction, and thus improves the performance of both. Furthermore, when the decoder is removed, all performances are reduced. In terms of the OIS performance of EPR and OPR, the variant without the decoder is even worse than the variant using decomposition learning, which shows that the decoder plays a significant role in the feature interaction.
Figure 10 demonstrates the qualitative comparison. The first two columns show the original RGB images and the corresponding ground truth, and the remaining columns compare the variants, including FSINet with and without the decoder. In the fifth row, a beige sofa is surrounded by a messy background; when the decoder is removed, many false-positive boundary pixels and incorrectly estimated orientation values are generated, which verifies the importance of the feature interaction.
The same phenomenon can be found in the last row, which shows a battle plane. FSINet outperforms the variant without the decoder owing to the reduction in false-positive occlusion boundary pixels and incorrectly estimated orientation values.

Effectiveness of each module
To verify the effectiveness of the modules in our approach, each module is removed to construct an independent variant for evaluation, as summarized in Table 3.
In the occlusion boundary detection path, if ICE is removed, the accuracy in all metrics is reduced because the occlusion boundary becomes difficult to locate accurately, leading to a decrease in the accuracy of occlusion relationship reasoning. If the DSFE is removed, the performances of EPR and OPR are all degraded; DSFE embeds the low-level cues into the high-level cue gradually, which obtains a refined semantic feature while keeping the resolution as high as possible. If the CCE is removed, the occlusion boundary fails to be detected consistently, which decreases the accuracy by a large margin: the AP of EPR drops by 5.2%, and the AP of OPR drops by 5.0%. The reason is that the CCE captures the long-distance correlation between pixels of semantic features, which plays an important role in obtaining continuous object boundaries. To further evaluate the DSFE, we compare it with directly concatenating side-outputs (used by OFNet), as illustrated in Table 4. The proposed DSFE leads to more than 0.5% improvement in all metrics, because it fuses the multi-level features gradually and avoids introducing the boundary noise contained in the low-level features. In the occlusion orientation path, if SCL is removed, the ability to reason the occlusion relationship decreases sharply: the ODS of OPR drops by 1.6%, the OIS by 1.7%, and the AP by 1.0%. The intrinsic reason is that the depth order cue acquired by the SCL plays an important role in occlusion relationship reasoning. Note that not only the occlusion orientation but also the accuracy of occlusion boundary detection is reduced by a large margin. Intrinsically, the reason is that the feature interaction makes the modules of one path affect the other path: when the SCL is removed, the loss of the orientation path greatly affects the decoder's feature and thus damages the occlusion boundary path. When the stripe convolutions are removed, occlusion boundary detection and orientation estimation also suffer performance degradation, which proves the effectiveness of the stripe convolutions in our network.

Effectiveness of DCD
To evaluate the proposed boundary detection module, we compare DCD with the conventional detector, i.e., a structure consisting of four 3 × 3 convolution blocks and a sigmoid layer supervised by AL, used in DOOBNet [4] and OFNet [18]; the results are displayed in Table 5. Intuitively, if the DCD is replaced by the conventional detector, the accuracy of occlusion boundary detection is degraded by 0.6% ODS, 0.6% OIS, and 0.1% AP. The reason is that the cross-entropy loss (CEL) function enhances the punishment of the false-positive boundary pixels generated by the AL flow, and AL restores the correct boundaries missed by CEL. Thus, compared with the previous detector that only uses AL, the DCD suppresses the false-positive boundaries. Meanwhile, when using the conventional detector, the accuracy of orientation estimation is also degraded, because the improvement of occlusion boundary detection promotes orientation estimation.
In addition, we design several variants of DCD to illustrate the effectiveness of its design. Since each flow of DCD has two cross-connections (the sum operations in Fig. 5) and one loss cross-connection, we first remove all cross-connections in DCD, only preserving the sum operation of the two flows. Then, we use only the cross-connection of the loss. Third, we remove only the cross-connection of the loss and preserve the other cross-connections. Table 6 summarizes the performance of the different DCD variants. It can be seen that the loss cross-connection is more important than the other cross-connections for DCD to achieve a better boundary performance. This result can be attributed to the fact that the loss cross-connection is closer to the supervision than the others and thus directly affects the parameter update of the two flows. In addition, we change the number of layers of each flow in the DCD; the results are listed in Table 7. Intuitively, the performance does not change much when the layer number is larger than 2. Thus, we argue that three 3 × 3 layers are the optimal setting for DCD.

Different settings of stripe convolution
To obtain the best setting of the stripe convolution, we compare different sizes, including 3 × 3, 3 × 5, 3 × 7, 3 × 9 and 3 × 11, as presented in Table 8. Consequently, the 3 × 7 convolution achieves the best performance in orientation estimation. Furthermore, Table 9 compares the stripe convolution of FSINet with that of OFNet [18]: compared to OFNet, we use a smaller stripe convolution on a lower-resolution feature, which yields a larger effective receptive field with fewer parameters.

Conclusion
In this paper, a novel feature separation and interaction network, named FSINet, is proposed. We integrate the specialty and complementarity of the two sub-tasks of occlusion relationship reasoning into this unified network. For occlusion boundary detection, three modules, i.e., ICE, DSFE and CCE, are proposed to capture the accurate location and semantic indication of the occlusion boundary, and the detection module, i.e., DCD, is proposed to achieve a trade-off in limiting false positives and false negatives. For occlusion orientation estimation, the SCL is proposed to capture the depth order cue among objects; this cue is employed to distinguish the foreground and background areas by two stripe convolutions. In addition, a shared decoder is designed to conduct the feature interaction between the two sub-tasks. Extensive experimental results on the PIOD and BSDS ownership datasets demonstrate that our network performs favorably against the state-of-the-art methods. Although this paper only focuses on pixel-level occlusion relationship reasoning, the same as previous methods, we plan to extend the current work to higher-level (such as region-level and instance-level) occlusion relationship reasoning for broader tasks.

Figure 1
Figure 1 Visualization of occlusion relationship and features. (a) The input image and its ground truth provided by the Pascal instance occlusion dataset (PIOD) [2]; (b) The feature for the boundary detection; (c) The feature for the orientation estimation; (d) Visualization result of our method; (e) Visualization result of DOOBNet; (f) Visualization result of DOC. The occlusion relationship (the red arrows) is represented by orientation θ ∈ (-π, π] (tangent direction of the boundary), using the "left" rule where the left side of the arrow indicates the foreground area. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "yellow": correctly labeled boundaries but mislabeled occlusion; "green": false-negative boundaries; "cyan": false-positive boundaries (best viewed in color). "Fg" represents the foreground area and "Bg" represents the background area


Figure 2
Figure 2 Comparison between the state-of-the-art network architectures and ours. (a) shows two separated deep convolutional networks that exploit different image cues for occlusion boundary detection and occlusion orientation estimation. (b) shows the unified end-to-end multi-task network that shares deep features to simultaneously predict both the occlusion boundary and the occlusion orientation. (c) shows our network, which contains a shared encoder, two separated paths for occlusion boundary detection and occlusion orientation estimation, and a shared decoder

Figure 3
Figure 3 Illustration of FSINet. (a) is the input image. (b) is the output of the network. The length of a block expresses the map resolution, and the thickness of a block indicates the channel number

Figure 4
Figure 4 All modules used to extract features in the occlusion boundary path, and the visualization of the extracted features. (a) Image-level cue extractor (ICE); (b) Detail-perceived semantic feature extractor (DSFE); (c) Contextual correlation extractor (CCE); (d) The output feature of ICE; (e) The feature of each level in DSFE; (f) The output feature of CCE

Figure 5
Figure 5 The structure of the dual-flow cross detector. (a) is the input RGB image. (b) visualizes the output feature of the last convolution of the conventional detector

Figure 6
Figure 6 The structure of scene context learner

Figure 7
Figure 7 Schematic illustration of how orientation information disseminates in the feature learning phase. (a) Plain convolution; (b) Stripe convolution

Figure 8
Figure 8 (a) The precision and recall of the detected boundary pixel (EPR) result on the PIOD dataset; (b) The precision and recall of the estimated orientation (OPR) result on the PIOD dataset; (c) EPR result on the BSDS dataset; (d) OPR result on the BSDS dataset

Table 2
EPR and OPR results of the boundary and orientation paths under decomposition learning and joint learning, and of FSINet without the shared decoder. EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation

Figure 10
Figure 10 Qualitative comparisons of different structures. The first two rows show the comparison between OFNet and FSINet when using decomposition learning. The third and fourth rows show the comparison between decomposition learning and joint learning. The last two rows show the comparison between FSINet and the FSINet variant without the decoder

Table 1
EPR and OPR results on the two datasets. Bold and underlined numbers indicate the 1st- and 2nd-best performances, respectively. EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation

Table 3
Experimental results of our model without low-level cues, without high-level cues for boundary, without high-level cues for orientation, without stripe convolution, and our full model. The experiments are conducted on the PIOD dataset (the same below). EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation

Table 4
Comparison between the DSFE and simply concatenating side-outputs. EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation

Table 5
Comparison between the DCD and the conventional boundary detector. EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation

Table 6
EPR performance of models using different DCD variants. EPR refers to the precision and recall of the detected boundary pixel

Table 7
EPR performance of models using different numbers of layers in DCD. EPR refers to the precision and recall of the detected boundary pixel

Table 8
Experimental results of using stripe convolutions with different aspect ratios. EPR refers to the precision and recall of the detected boundary pixel and OPR refers to the precision and recall of the estimated orientation