Neighborhood co-occurrence modeling in 3D point cloud segmentation

A significant performance boost has been achieved in point cloud semantic segmentation through the use of encoder-decoder architectures and novel convolution operations for point clouds. However, co-occurrence relationships within a local region, which can directly influence segmentation results, are usually ignored by current works. In this paper, we propose a neighborhood co-occurrence matrix (NCM) to model local co-occurrence relationships in a point cloud. We generate a target NCM and a prediction NCM from semantic labels and a prediction map respectively. Kullback-Leibler (KL) divergence is then used to maximize the similarity between the target and prediction NCMs so that the network learns the co-occurrence relationship. Moreover, for large scenes where the NCMs of a sampled point cloud and the whole scene differ greatly, we introduce a reverse form of KL divergence which can better handle this difference when supervising the prediction NCMs. We integrate our method into an existing backbone and conduct comprehensive experiments on three datasets: Semantic3D for outdoor space segmentation, and S3DIS and ScanNet v2 for indoor scene segmentation. Results indicate that our method significantly improves upon the backbone and outperforms many leading competitors.


Introduction
With advances in scanning devices, large amounts of 3D data have been produced and widely used in augmented and virtual reality, 3D games, and robotics. As a basic form of 3D data, the point cloud is very popular and can be easily converted into meshes or voxels [1]. Semantic segmentation of point clouds is an essential 3D scene comprehension task, yet remains challenging due to the inherent irregularity of point clouds [2].
PointNet [3] was the first neural network to directly process point clouds for 3D segmentation. Its authors proposed applying shared multi-layer perceptrons (MLPs) to point clouds to learn point-wise features and used max/mean pooling to aggregate global features. These were concatenated with point-wise features before a few MLPs were used for final semantic segmentation. Later, PointConv [4] and KPConv [5] used novel 3D convolution operations to extract informative point features and achieved good performance in 3D scene segmentation. An encoder-decoder framework is usually used to gradually extract global features and fuse them with local features to predict the semantic labels. While global contextual information and local region information are both used for point-wise labeling, local co-occurrence relationships are usually ignored or used only implicitly.
Despite the rich variety of object categories in the real world, there are strong relations between the categories of neighboring objects (i.e., only specific category pairs occur together). We note that this neighborhood co-occurrence relationship can be used during semantic prediction to rule out neighboring pairs that cannot co-occur in a local region. For example, whiteboards are always adjacent to walls, and chairs are usually close to tables, as shown in Fig. 1. Such category pairs can co-occur in a local region of a real scene while other pairs (e.g., whiteboards and tables) cannot. However, this idea is usually ignored in point cloud segmentation.
Patch-level co-occurrence relationships have been exploited and used to optimize semantic labeling results, but optimization problems had to be solved in the testing stage [6].
In order to obtain segmentation results which follow real-world co-occurrence relationships without extra computation during inference, we propose a neighborhood co-occurrence matrix (NCM) to model this relation. The NCM is a two-dimensional matrix: each row represents one category of center points, and each column represents one category for the neighbors of center points. Element (i, j) of the NCM gives the probability that the semantic label of the center point is i and the semantic label of a neighboring point is j. By this definition, the whole NCM is a joint distribution over the categories of center points and neighbors.
In our method, the network can directly learn the local co-occurrence relationship from the neighborhood co-occurrence matrix. In the training stage, we randomly sample points from the original point cloud as center points, for simplicity and to avoid bias, and collect the category labels of these center points and their K nearest neighbors. These category labels are used to generate the target NCM. Meanwhile, the prediction maps of the center points and their neighbors are collected to generate the prediction NCM. To learn the local co-occurrence relationship and make the prediction NCM approximate the target NCM, we introduce Kullback-Leibler (KL) divergence to estimate the distance between these two distributions.
For large-scale scenes, especially outdoor scenes, the neighborhood co-occurrence matrix of a sampled point cloud may not fully reflect the real-world neighbor relationships, and this discrepancy can be treated as noise in the target NCM. Therefore, we introduce the reverse form of KL divergence into NCM learning to handle the scale difference between sampled point clouds and whole scenes. In this way, the network learns to make predictions that better accord with local co-occurrence relationships in the real world.
Our major contributions can be summarized as follows:
• a neighborhood co-occurrence matrix to model local co-occurrence of point-wise semantic labels, utilizing KL divergence to minimize the difference between the prediction NCM and target NCM;
• introduction of the reverse form of KL divergence into NCM learning, to handle the difference between the target NCM generated from sampled data and the real-world NCM for large-scale scenes;
• integration of our method into an existing backbone, with experiments on three challenging benchmark datasets demonstrating significantly improved performance over the backbone for the point cloud segmentation task, outperforming many state-of-the-art competitors.


Related work

Point cloud segmentation
[9] utilized a random sampling strategy which is more efficient for large-scale point cloud segmentation. JSNet [10] and JSIS [11] treated instance segmentation and semantic segmentation as joint tasks for improved semantic segmentation. JSENet [12] and BAGEM [13] introduced boundary information into the semantic segmentation task for better contours in the prediction map. Fusion-Aware Conv [14] extracted semantic features both spatially and temporally from RGBD scans for online scene segmentation. On the other hand, patch-based methods were proposed to cluster patches with similar features to segment point clouds [15]. However, these methods usually ignore the category-based co-occurrence relationship in point cloud segmentation. Compared to the aforementioned methods, we propose a neighborhood co-occurrence matrix (NCM) to model the neighborhood co-occurrence relationship. Two forms of KL divergence are then introduced to minimize the difference between the prediction NCM and target NCM, leading to predictions that are more consistent with realistic neighborhood co-occurrence relationships.

Neighborhood context learning
Neighborhood contexts in the vicinity of an object have proved useful for 2D semantic segmentation [16].
RMI [17] utilized region mutual information to model the local relationship between neighboring pixels, and achieved high consistency in the final predictions for image segmentation. Conditional random fields (CRF) were introduced into point cloud segmentation to model the relationships between neighboring labels [18], leading to better segmentation. However, CRF is a post-processing method and extra computations are required during inferencing. In point cloud segmentation, 3P-RNN [19] utilized RNNs to explore long-range spatial context. HPEIN [20] extracted features of edges between neighboring points to implicitly model the neighborhood relation. Region similarity loss was proposed to propagate distinguishing features of center points to neighbors with the same categories in a local neighborhood [2].
Compared to these methods, our method focuses on explicitly learning the neighborhood category co-occurrence relationship for point cloud segmentation.

Co-occurrence modeling
Given the target features, CFNet [21] predicted the probability of co-occurring features and used them as weights to fuse co-occurrent contexts. A global co-occurrence constraint was introduced by Ref. [22] to eliminate configurations that violate common sense or physical law. However, these methods failed to exploit the semantic label co-occurrence relationship in a neighborhood. Segment-based and patch-based contextual relationships were exploited to optimize the label assignment problem for semantic labeling during inferencing [6,23]. However, extra time is needed then to obtain the segmentation results. Co-occurrence matrices have usually been used to describe the co-occurrence of words in natural language processing [24].
Unlike previous methods, we design a neighborhood co-occurrence matrix to directly model the local category co-occurrence relationship to eliminate impossible neighboring pairs in point cloud segmentation. Additionally, our method can train the network in an end-to-end manner, and it does not require extra time during inferencing.

Method
In this section, we first introduce the overall architecture of our method in Section 3.1. Then, we describe the proposed neighborhood co-occurrence matrix (NCM) used to model local co-occurrence relationships in Section 3.2. Finally, we describe how the target NCM supervises the output prediction and makes the network learn local co-occurrence relationships, as well as the reverse form of KL divergence, in Section 3.3. Figure 2 shows the overall framework of our method. First, we use a common encoder-decoder network to extract features and make category predictions. Then, ground truth semantic labels are directly used to supervise the segmentation results through cross entropy loss. Meanwhile, we generate the prediction neighborhood co-occurrence matrix (prediction NCM) from the prediction maps and generate the target NCM from point-wise semantic labels. KL divergence is then used to supervise the prediction NCM and make it approximate the target NCM. In this way, anomalous co-occurring neighboring pairs are penalized and the co-occurrence relationships in our prediction become more reasonable.

Neighborhood co-occurrence matrix
Co-occurrence relationships in the local neighborhood can directly influence the results of point cloud semantic segmentation. For instance, a whiteboard usually co-occurs with a wall in a local region and is unlikely to be adjacent to other categories such as the ceiling or floor (see Fig. 1). Based upon this observation, we want our final semantic predictions to accord with real-world neighborhood co-occurrence relationships.
While co-occurrence relationships are seldom explicitly exploited in segmentation tasks, they are usually modeled by a co-occurrence matrix in natural language processing to find co-occurring words within a sentence. Inspired by the design of co-occurrence matrices for words, we propose a neighborhood co-occurrence matrix to model the relationship of neighboring co-occurring categories in local regions of point clouds. Here, we attempt to exploit the category relationship between a randomly selected point (referred to as the center point) and its neighbors. For a semantic segmentation task where we need to categorize each point as one of C classes, our designed NCM will be a C × C matrix. Each row of NCM represents a category for center points and each column represents a category of their neighboring points. Specifically, the ij-th element indicates the probability that the center point belongs to the i-th class and a neighboring point belongs to the j-th class.
In order to effectively utilize computational resources and storage, we only sample a fixed ratio of center points from the original point cloud, namely N' = αN, as shown in Fig. 2. To generate the target NCM, we first collect the one-hot labels of these center points, denoted A_c ∈ R^{N'×C}. Then, for each center point, we search for its K nearest neighbors and collect their corresponding one-hot labels, denoted A_n ∈ R^{N'×K×C}. The target NCM M ∈ R^{C×C} is then given by

M[i, j] = (1/(N'K)) Σ_{n=1}^{N'} Σ_{k=1}^{K} A_c[n, i] A_n[n, k, j]    (1)

so M is a normalized probability density function which models the target co-occurrence relationship in local regions of real 3D scenes.

Fig. 2 Architecture of our proposed method. An encoder-decoder network is used to produce the prediction map. Then, the prediction NCM and target NCM are generated from the prediction map and ground truth respectively. The KL divergence of these two distributions is minimized to learn local co-occurrence relationships in the real world.
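The target NCM construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function and variable names are ours:

```python
import numpy as np

def target_ncm(center_labels, neighbor_labels, num_classes):
    """Target neighborhood co-occurrence matrix from ground-truth labels.

    center_labels:   (N,) integer class ids of the sampled center points
    neighbor_labels: (N, K) integer class ids of their K nearest neighbors
    """
    A_c = np.eye(num_classes)[center_labels]      # (N, C) one-hot labels
    A_n = np.eye(num_classes)[neighbor_labels]    # (N, K, C) one-hot labels
    counts = np.einsum('ni,nkj->ij', A_c, A_n)    # co-occurrence counts per (i, j)
    return counts / counts.sum()                  # normalize to a joint distribution

# Toy scene with C = 2 classes: centers labeled [0, 1], each with K = 2 neighbors.
M = target_ncm(np.array([0, 1]), np.array([[0, 1], [1, 1]]), 2)
```

Because each (center, neighbor) pair contributes exactly one count, dividing by the total count is the same as the 1/(N'K) normalization, so M sums to 1.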

Learning neighborhood co-occurrence relationship
In order to learn the real-world neighborhood co-occurrence relationship, we directly utilize the target NCM M to supervise the prediction NCM generated from the output prediction maps. The prediction NCM approaches the target NCM so that the network learns a more reasonable co-occurrence relationship in the local region.
Unlike generating the target NCM, we directly utilize the prediction maps of these N' center points, Â_c ∈ R^{N'×C}, where Â_c[n, i] represents the predicted probability that the n-th center point belongs to the i-th category. As for the target NCM, we also aggregate the prediction maps of the center points' neighbors, Â_n ∈ R^{N'×K×C}, where K is the number of neighbors. The prediction NCM is calculated as

M̂[i, j] = (1/(N'K)) Σ_{n=1}^{N'} Σ_{k=1}^{K} Â_c[n, i] Â_n[n, k, j]    (2)

so that Σ_{i,j} M̂[i, j] = 1, according to the definition of a probability distribution. In order to make the prediction NCM approach the target NCM, KL divergence, which measures the distance between two probability distributions, is introduced to narrow the difference between the target NCM and the prediction NCM. A high KL divergence indicates a large difference between the two distributions. A common formulation of KL divergence is

KL(p ∥ q) = ∫ p(x) log(p(x)/q(x)) dx

where p and q are two distributions over a variable x. In our method, integration over continuous x is replaced by summation over the discrete pairs (i, j).
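The prediction NCM mirrors the target NCM, replacing one-hot labels with soft class probabilities. A sketch (our own naming, not the paper's code):

```python
import numpy as np

def prediction_ncm(probs_c, probs_n):
    """Prediction NCM from predicted class probabilities.

    probs_c: (N, C) predicted class probabilities of the center points
    probs_n: (N, K, C) predicted probabilities of their K neighbors
    """
    M_hat = np.einsum('ni,nkj->ij', probs_c, probs_n)  # soft co-occurrence mass
    return M_hat / M_hat.sum()                         # sums to 1: joint distribution

# With perfectly uniform predictions, the prediction NCM is itself uniform.
probs_c = np.full((4, 3), 1 / 3)      # N = 4 centers, C = 3 classes
probs_n = np.full((4, 2, 3), 1 / 3)   # K = 2 neighbors per center
M_hat = prediction_ncm(probs_c, probs_n)
```

Since each row of probabilities sums to 1, the raw mass totals N'K, so the explicit normalization matches the 1/(N'K) factor.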
According to the choice of p and q, we have two forms of KL divergence loss.
In the common form of KL divergence, we simply set M to be p and M̂ to be q. The KL divergence of the two NCM distributions can then be expanded as

KL(M ∥ M̂) = Σ_{i,j} M[i, j] log M[i, j] − Σ_{i,j} M[i, j] log M̂[i, j]

The first term is not differentiable with respect to the hidden parameters of the network because the category labels of points are fixed, so we only need to optimize the second term to minimize the KL divergence. Thus, our loss for the NCM is

L_co = −Σ_{i,j} M[i, j] log(M̂[i, j] + ε)    (6)

where ε is a small quantity to prevent invalid numerical operations.
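The forward-KL loss reduces to a cross entropy between the two matrices. A sketch, with the epsilon guard on the logarithm (names are ours):

```python
import numpy as np

def ncm_loss_forward(M, M_hat, eps=1e-8):
    """Forward-KL NCM loss: -sum_ij M[i, j] * log(M_hat[i, j] + eps).

    Only the cross-entropy term is kept: the entropy of the fixed target
    NCM M does not depend on the network parameters.
    """
    return -np.sum(M * np.log(M_hat + eps))

uniform = np.full((2, 2), 0.25)
loss = ncm_loss_forward(uniform, uniform)  # minimized when M_hat matches M
```

Any mismatched prediction NCM yields a strictly larger loss, which is what drives the prediction toward the target co-occurrence statistics.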
In order to handle the difference between the target NCM generated from the sampled point cloud and the real-world NCM, especially for large-scale scenes, we treat this difference as a kind of noise and introduce the reverse form of KL divergence, which conversely sets M̂ to be p and M to be q; we call this the reverse KL divergence. The loss for the neighborhood co-occurrence matrix is then reformulated as

L_co = Σ_{i,j} M̂[i, j] log(M̂[i, j] + ε) − Σ_{i,j} M̂[i, j] log(M[i, j] + ε)    (7)

where both terms are differentiable with respect to the hidden parameters in this case.
Compared to Eq. (6), this form of KL divergence is more complex because both terms are differentiable. The first term maximizes the entropy of the prediction NCM, acting as a constraint that imposes a prior toward a uniform distribution. The second term is the reverse cross entropy, which has been shown to be more tolerant to noise in labels [25]. The differences in performance between these two forms of loss and an analysis of the KL divergence are discussed in detail in Section 4.3.
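The reverse-KL loss of Eq. (7) can be sketched analogously (again our own naming):

```python
import numpy as np

def ncm_loss_reverse(M, M_hat, eps=1e-8):
    """Reverse-KL NCM loss, Eq. (7).

    neg_entropy:       minimizing it maximizes the entropy of the
                       prediction NCM (prior toward a uniform distribution)
    rev_cross_entropy: reverse cross entropy, more tolerant to noise in a
                       target NCM built from a sampled point cloud
    """
    neg_entropy = np.sum(M_hat * np.log(M_hat + eps))
    rev_cross_entropy = -np.sum(M_hat * np.log(M + eps))
    return neg_entropy + rev_cross_entropy

uniform = np.full((2, 2), 0.25)
matched = ncm_loss_reverse(uniform, uniform)  # zero when the NCMs coincide
```

As with any KL divergence, the loss vanishes only when the two distributions coincide and is positive otherwise (up to the epsilon guard).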
The total loss for the network consists of two parts, L_seg and L_co:

L = L_seg + λ L_co    (8)

where L_seg represents the cross entropy loss for point-wise segmentation. In our implementation, α is set to 0.3, and 8 nearest neighbors are collected to generate the target NCM and prediction NCM. ε and λ are set to 10^{−8} and 1 respectively, giving good results.
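Putting the pieces together, the combined objective of Eq. (8) can be sketched as below. The cross entropy helper is our own illustration of a standard point-wise L_seg; the default λ = 1 and ε = 10^-8 follow the values reported above:

```python
import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-8):
    """Point-wise segmentation loss L_seg (standard cross entropy)."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -np.mean(np.log(picked + eps))

def total_loss(l_seg, l_co, lam=1.0):
    """Combined objective L = L_seg + lambda * L_co, with lambda = 1."""
    return l_seg + lam * l_co

# Two points, two classes, both predicted mostly correctly.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
l_seg = cross_entropy_loss(probs, np.array([0, 1]))
loss = total_loss(l_seg, l_co=0.5)   # l_co here is a placeholder NCM loss value
```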

Experiments
Our experiments consist of five parts. First, we test our method on the large-scale outdoor semantic segmentation task Semantic3D reduced-8 [26] in Section 4.1. Then, we evaluate the performance of our method on the indoor scene semantic segmentation benchmarks S3DIS [27] and ScanNet v2 [28] in Section 4.2. Next, we conduct studies to analyze the two forms of KL divergence in Section 4.3. Later, we analyze the influence of the number of neighbors and the sampling density on segmentation performance in Section 4.4. Finally, we visualize the prediction NCMs and target NCMs for some real scenes in Section 4.5.

Dataset
We evaluate the effectiveness of our method for outdoor space semantic segmentation on the Semantic3D task [26]. This dataset contains 15 large-scale outdoor areas for training and another 15 for testing. There are more than 4 billion points in total, all divided into 8 categories. For easier evaluation, Semantic3D also provides a segmentation task with fewer points in the test set: Semantic3D reduced-8. Only labels for the training data are available; predictions on the test set must be submitted to the online server for evaluation.

Implementation
We utilize KPConv deform [5] as our backbone and embed our method into it. In the training stage, we randomly sample spheres of 3 m in radius from the outdoor scenes and feed them into the network for training following Refs. [5,29]. Eqs. (1) and (2) are used to generate the target NCM and prediction NCM. Eq. (8) is used to optimize the whole network, and Eq. (7) is used as the KL divergence loss for NCM. In the test stage, we utilize the trained model to predict the results on all points in the test set and submit the results to the Semantic3D server [26] for evaluation. The momentum optimizer is utilized to train the network, and the batch size is set to 10 on a single GTX 1080Ti GPU.

Results
We report the mean IoU (mIoU) over categories and per-category IoUs for the Semantic3D reduced-8 task in Table 1. Our method achieves 76.6% mIoU on this task, outperforming many existing methods. It also brings a 3.5% mIoU improvement over the backbone, demonstrating its effectiveness. We also provide the category-wise IoUs; IoUs for KPConv deform are not listed because they are not available in the original paper or on the benchmark.

Dataset
We evaluate the performance of our method for indoor semantic segmentation on the S3DIS [27] and ScanNet v2 [28] datasets. S3DIS contains 271 rooms in 6 large indoor areas from three different buildings. About 273 million points are collected and annotated in this dataset; all points are categorized into 13 classes. Following previous work [5,13,36], we take rooms in Area-5 as the test set and samples from the other areas as the training set.
ScanNet v2 contains 1513 cluttered indoor scenes with annotations. 1201 scenes are used for training and 312 scenes are used for validation. All annotated points are categorized into 20 classes or unlabeled. Additionally, another 100 scenes are published without label annotations as the test set.

Implementation
We apply our method to KPConv deform [5] and take it as our baseline. Following KPConv deform, we randomly sample spheres with a 2 m radius from rooms in the training set and feed them into the network for training. Again, Eqs. (1) and (2) are used to generate the target NCM and prediction NCM, and Eq. (8) is used to optimize the whole network; here, however, Eq. (6) is used as the KL divergence loss for the NCM. In the testing stage, spheres are sampled regularly so that all points are included in at least one sphere. The network is trained with a momentum optimizer, with batch size 5 for S3DIS and 10 for ScanNet v2, on a single GTX 1080Ti GPU.

Results
We report the results of our method and many state-of-the-art competitors on S3DIS Area-5 in Table 2; mean IoU (mIoU) is used as the metric to evaluate segmentation performance. Our method achieves 68.29% mIoU (1.19% higher than the backbone) and outperforms many existing methods. We also list the IoU scores for different categories in this table. No method performs well on the beam category because beams in S3DIS Area-5 are tilted while beams in the other areas are horizontal. Our method improves the IoU for most categories except the sofa class. This is because NCM learning penalizes pairs that seldom appear, making the network less likely to assign a point to a minor class; hence we do not observe a score improvement for the sofa category, which has the fewest points. Additionally, we visualize the improvement over our backbone (KPConv deform) in Fig. 4, with yellow dash-dotted circles indicating obvious improvements. For ScanNet v2, we report the results in Table 3; mean IoU over categories is again used to estimate performance. On this dataset, our method achieves 69.0% mIoU, 0.6% higher than our backbone, reaching state-of-the-art performance on this benchmark. Category-wise scores are also shown in this table. Again, our method improves performance for most categories but degrades it for categories with few points, such as sofa and refrigerator. The reason is the same as for S3DIS: NCM learning penalizes pairs that appear less frequently, degrading performance on minor categories.

Choice of KL divergence
In this section, we conduct experiments to compare the difference in performance between the usual and reverse forms of KL divergence for both indoor scene and outdoor area semantic segmentation.
As shown in Section 3.3, there are two forms of KL divergence for NCM learning. Although both forms lead the prediction NCM to approximate the target NCM, their gradients are quite different. Thus, we conduct a study to analyze the difference between these two forms of KL divergence on the Semantic3D reduced-8 and S3DIS Area-5 tasks. The results are reported in Table 4. They show that the two forms of KL divergence achieve similar improvements on the S3DIS Area-5 task. However, reverse KL divergence achieves a 0.8% higher mIoU on the Semantic3D reduced-8 task. This results from differences between the target NCM generated from the sampled data and the real-world NCM: the scale of a whole scene is much greater than that of the sampled point cloud in the Semantic3D dataset. The second term in Eq. (7), −Σ_{i,j} M̂[i, j] log(M[i, j] + ε), is the reverse cross entropy, which is more tolerant to such discrepancy noise [25], thus providing better performance. Furthermore, the first term, Σ_{i,j} M̂[i, j] log(M̂[i, j] + ε), is the negative entropy of the prediction NCM, so minimizing it maximizes that entropy. This gives a preference for a uniform distribution over the prediction NCM, alleviating the imbalance between the numbers of points in each category during NCM learning.

Hyper-parameter analysis
Here, we first conduct experiments to analyze the influence of the number of neighbors used for NCM generation. Then, we study how the sampling density impacts segmentation performance, using the Semantic3D reduced-8 task and reverse KL divergence.

Number of neighbors in NCM
In this section, we conduct a study on changing the number of neighbors in the neighborhood co-occurrence matrix (NCM), which determines the size of the neighborhood. We set the number of neighbors to 4, 8, and 12, with all other settings unchanged from our original method. The experimental results, reported in Table 5, show that a neighborhood of 8 nearest neighbors brings the largest improvement to point cloud segmentation.

Sampling density in NCM
We also conduct experiments to study the influence of sampling density on segmentation performance. We control the sampling density by changing the hyper-parameter α, setting it to 30%, 10%, and 3% in turn, with all other settings remaining the same. Results are reported in Table 6. A higher sampling density yields better segmentation results because more samples in the NCM lead to more stable co-occurrence relationship learning.

Visualization of NCM
To show how the improvement in segmentation performance is reflected in the NCM, we visualize the target NCMs and the prediction NCMs of our baseline and our method for some S3DIS scenes in Fig. 5. Our method removes many impossible pairs, as reflected in the NCM, and improves segmentation performance. For instance, no column-clutter pairs occur in these scenes; our method's prediction NCM correctly reflects this, and a corresponding segmentation improvement is achieved.

Conclusions
In this paper, we propose a neighborhood co-occurrence matrix to model the local category co-occurrence relationship and introduce it into the point cloud segmentation task. KL divergence is used to maximize the similarity of the target NCM and prediction NCM. For better learning of the local co-occurrence relationship in large-scale areas, we introduce the reverse form of KL divergence into NCM learning, which is more robust to the difference between the NCM of a sampled point cloud and that of a whole scene. Our method achieves state-of-the-art performance on Semantic3D for outdoor space segmentation as well as on S3DIS and ScanNet v2 for indoor scene segmentation. Finally, we compare and analyze the difference in performance between the two forms of KL divergence used in our method, and conduct experiments to analyze the influence of the number of neighbors and the sampling density in NCM generation.