BLNet: Bidirectional learning network for point clouds

The key challenge in processing point clouds lies in the inherent lack of ordering and irregularity of the 3D points. By relying on per-point multi-layer perceptrons (MLPs), most existing point-based approaches address only the first issue and ignore the second. Directly convolving kernels with irregular points results in loss of shape information. This paper introduces a novel point-based bidirectional learning network (BLNet) to analyze irregular 3D points. BLNet optimizes the learning of 3D points through two iterative operations: feature-guided point shifting and feature learning from shifted points, so as to minimize intra-class variances, leading to a more regular distribution. On the other hand, explicitly modeling point positions leads to a new feature encoding with increased structure-awareness. Then, an attention pooling unit selectively combines important features. This bidirectional learning alternately regularizes the point cloud and learns its geometric features, with the two procedures iteratively promoting each other for more effective feature learning. Experiments show that BLNet learns deep point features robustly and efficiently, and outperforms the prior state-of-the-art on multiple challenging tasks.


Introduction
3D point cloud understanding is critical in many real-world vision applications such as autonomous driving, robotics, and augmented reality. A key challenge in effectively learning point cloud features is that point clouds captured by depth cameras or LiDAR sensors are often unordered and irregular, so many effective deep learning architectures, e.g., those in Refs. [1][2][3], are not directly applicable.
To tackle this, many approaches convert irregular point clouds into regular data formats such as multi-view images [4][5][6] or 3D voxels [7][8][9][10][11]. However, such conversion processes lose geometric detail and have large memory requirements. Alternatively, some recent studies focus on directly processing point clouds. A seminal approach, PointNet [12], individually learns per-point features using shared MLPs and gathers a global representation with max-pooling. Although effective, this design ignores the local structures that constitute the semantics of the whole object. To solve this problem, many subsequent approaches [13][14][15][16][17][18] partition the point cloud into nested subsets, and then build a hierarchical framework to learn contextual representations from local to global. Nevertheless, these methods operate directly on raw point clouds, whose spatial irregularity limits the methods' inductive learning ability.
3D acquisition typically produces irregular and non-uniformly distributed raw point clouds. Figure 1(b) provides an example of irregular points sampled from a square. Suppose we have shared MLPs G together with their learnable weights W. If we apply these convolutions to the points in Fig. 1, the convolutional output is f_x = G([p_1, p_2, p_3, p_4]_x, W), where x ∈ {a, b}. The shared point-wise MLPs used for encoding points ensure permutation-invariance and address the lack of ordering. However, due to the irregular sampling in (b), we usually get f_a ≠ f_b. Therefore, local features extracted from noisy or irregularly sampled points are often unstable, causing loss of shape information. We observe that: (i) as the sampling of (b) becomes more regular, its learned feature converges to f_a and becomes more stable; on the other hand, (ii) if a more accurate feature description (e.g., f_a) is available, deforming the shape (e.g., shape (b)) according to this feature allows the point cloud to adjust its points towards a more regular distribution. Thus, making sampled points more regular and obtaining more accurate features are both important issues to address.
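This feature instability can be reproduced with a toy stand-in for the shared point-wise MLP; the linear map, bias, and sample coordinates below are illustrative assumptions, not the paper's network:

```python
# Toy stand-in for a shared per-point MLP (a fixed linear map followed by
# max-pooling): it is permutation-invariant, yet the same square sampled
# irregularly still yields a different feature. Purely illustrative.
def encode(points, w=(1.0, 2.0, 3.0)):
    """Apply the same linear map to every 2D point (with bias term), then max-pool."""
    return max(sum(wi * xi for wi, xi in zip(w, p + (1.0,))) for p in points)

square_regular   = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
square_irregular = [(0.0, 0.1), (0.1, 0.9), (1.0, 0.0), (0.8, 0.9)]  # same square, uneven samples

assert encode(square_regular) == encode(list(reversed(square_regular)))  # order-invariant
assert encode(square_regular) != encode(square_irregular)                # sampling-sensitive
```

Reordering the points leaves the pooled feature unchanged, while irregular sampling of the same square shifts it, mirroring f_a ≠ f_b.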
Thus, we formulate BLNet, the first work to apply bidirectional learning to point clouds, analyzing irregular 3D points through bidirectional interaction between points and features. The key to BLNet is to address two issues iteratively: feature-guided point shifting and feature learning from shifted points. On the one hand, taking the task loss as feedback, we use a position feedback module with adaptive 3D displacements to automatically adjust the positions of points. By minimizing intra-class variances, the 3D points are regularized towards a distribution that fits the network well. On the other hand, we use a new feature modeling module, which explicitly encodes point positions with increased structure-awareness. We further use attention pooling to selectively focus upon and combine important features. This bidirectional learning approach alternately regularizes the point cloud and learns its geometric features; the two procedures iteratively support each other to provide more effective feature learning. Extensive experiments verify the superiority of our BLNet on multiple challenging datasets including ModelNet40 [8], ShapeNet Parts [19], S3DIS [20], and ScanNet [9]. Moreover, we present ablation experiments and visual results to provide a better understanding of BLNet.

Deep learning on regular domain
To leverage the impressive success of traditional convolution on regular data representations (e.g., images), many approaches transform irregular point clouds into regular data formats such as multi-view images and 3D voxels. In the former case, view-based methods [4][5][6] render multiple images of the point cloud from different viewpoints and apply standard CNNs to the rendered images. In the latter, voxel-based methods [7][8][9][10][11] structure the point cloud using regular 3D voxels, after which 3D CNNs can be applied directly, just as for images. However, these regular formats require projection or voxelization, which results in a quantifiable loss of geometric information. By contrast, our BLNet directly processes point clouds and does not rely on any such transformations.

Bidirectional learning
Bidirectional learning has been shown effective, with enhanced performance over uni-directional learning in multiple tasks such as language translation [31,32], image generation [33], and image translation [34]. It utilizes additional top-down (target-to-source) training to reduce the uni-directional dependency between source and target. However, such methods typically train the two directions separately and fuse them afterwards. In contrast, our BLNet is the first work to provide bidirectional interaction between points and features on point clouds. It alternately combines two learning directions, consequently forming a complete network for training.

Locality
Most recent point cloud learning frameworks [13,14,18] are trained to extract representations based on local features, which has been shown more effective than earlier work [12] that learns global descriptions. Thus, we design our network and modules to work locally.

Bidirectional learning
The position feedback module performs featureguided point shifting towards a more regular distribution, while the feature modeling module explores discriminative features from shifted points. To realize our bidirectional learning pipeline, we integrate a position feedback module and a feature modeling module into one bidirectional convolution operator, namely, BidConv. Stacking multiple BidConv operations enables the two modules to execute alternately and assist each other, as illustrated in Fig. 2.

BLNet architecture
As shown in Fig. 2, we have devised a hierarchical framework, BLNet, which can be applied to multiple tasks including point cloud classification and segmentation. In both tasks, we use four BidConv units to learn dimension-increasing features with progressive downsampling. For classification, the final global representation is followed by fully connected layers. For segmentation, high-resolution point-wise predictions are required; this is realized by deconvolution. We still utilize BidConv to recover resolution, progressively upsampling the compact features obtained from the encoder back to the original resolution. Higher-resolution points are forwarded from the corresponding earlier convolution layers, following the coarse-to-fine design of U-Net. Inspired by Ref. [35], features at the same resolution are skip-connected to preserve earlier information. Both classification and segmentation models can be trained end-to-end.

Feature modeling
In order to learn more accurate features, and use them to guide the shifting of points, we have developed a new feature modeling module. This module includes a position encoding unit and an attention pooling unit, which can more discriminatively capture and combine features.

Position encoding
Given a point cloud P together with corresponding point features (e.g., raw RGB, or intermediate learned features), this unit aims to explicitly encode the spatial layout of 3D points, which plays a crucial role in shape analysis. Existing approaches [13,15,16] typically concatenate position information with point features, and then transform the concatenated results for feature learning. However, these approaches are suboptimal at capturing meaningful geometric patches. In contrast, we perform an explicit encoding of point positions first and then combine the output with point features for further enhancement. This enables each 3D point to observe its local geometry, eventually enriching the entire network with increased structure-awareness.

(Caption, Fig. 2: Point downsampling (using FPS [13]) and upsampling are also included in convolution, depending on its use. N1 > N2 > N3 > N4 indicates point downsampling in each convolution, and C1 < C2 < C3 < C4 denotes dimension-increasing feature channels at each point. The first BidConv does not include a position feedback module, as no features extracted from points exist at the beginning of the network.)

In particular, this unit consists of the following steps.

Localization. For the i-th point, to increase its receptive field, we index its neighboring points with a dilated K-nearest neighbor (KNN) algorithm. Specifically, we sample K points at equal intervals from the top K × r neighboring points, where r denotes the dilation rate.
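The dilated KNN indexing described above can be sketched as follows; the brute-force distance search is purely for clarity, not an efficiency claim:

```python
import math

# Sketch of the dilated KNN localization step: gather the top K*r nearest
# neighbors of a center point, then keep every r-th one, enlarging the
# receptive field without increasing K.
def dilated_knn(points, center_idx, k, r):
    """Return indices of K neighbors sampled at interval r from the K*r nearest."""
    center = points[center_idx]
    by_dist = sorted(
        (i for i in range(len(points)) if i != center_idx),
        key=lambda i: math.dist(points[i], center),
    )
    return by_dist[: k * r : r]   # equal-interval sampling, dilation rate r
```

With r = 1 this reduces to plain KNN; larger r trades density for spatial extent.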
Position encoding. For each of the K neighboring points {p_i^1, ..., p_i^k, ..., p_i^K} of the center point p_i, we explicitly encode their relative position as Eq. (1):

r_i^k = G_r(p_i^k − p_i, W_r)    (1)

where p_i and p_i^k are global (x, y, z) coordinates of points, and p_i^k − p_i gives relative coordinates, providing translation invariance. G_r() is a shared function with learnable weights W_r. This function can be implemented with any differentiable architecture; in this work we use point-wise shared MLPs to address the lack of ordering of 3D points. Note that all functions in this paper are implemented by per-point MLPs, and we omit restating this in the remainder of the paper.
Feature enhancement. The prior semantic information contained in per-point features can further enhance the distinctiveness of the learned position features. For each neighboring point p_i^k, we concatenate its position features r_i^k with the corresponding point features f_i^k, and then use a shared function G_f() to combine them. Thus, we obtain the enhanced feature vector f̂_i^k with the following formulation:

f̂_i^k = G_f(r_i^k ⊕ f_i^k, W_f)

where ⊕ is the concatenation operation.
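A minimal sketch of the position encoding and feature enhancement steps above, with the learned shared functions G_r and G_f stood in by toy fixed linear maps (an assumption for illustration):

```python
# Toy sketch: relative-position encoding followed by concatenation with point
# features. The fixed weight matrices replace the learned MLPs G_r and G_f.
def shared_mlp(vec, weights):
    """Toy shared function: one linear layer applied identically to any point."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def encode_neighbor(p_center, p_k, f_k, w_r, w_f):
    rel = [a - b for a, b in zip(p_k, p_center)]   # p_i^k - p_i (translation-invariant)
    r_k = shared_mlp(rel, w_r)                     # position encoding r_i^k
    return shared_mlp(r_k + list(f_k), w_f)        # concatenate, then combine via G_f
```

Because only relative coordinates enter the encoding, translating the whole neighborhood leaves the output unchanged.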

Attention pooling
This unit is used to aggregate the set of neighboring point features. Different features in a local region have varied impacts on the local representation. The leading strategy for integrating neighboring features in existing work [13,16] is max/mean pooling; however, this frequently discards useful information. In contrast, our new attention pooling unit selectively focuses on the most relevant features. In particular, this unit includes the following steps.

Attention scoring. Given the set of local features {f̂_i^1, ..., f̂_i^K}, our shared function G_s() learns a unique attention score for each channel of point features. To make weight coefficients comparable across channels, this function consists of shared MLPs followed by a channel-level softmax. It is formulated as

s_i^k = Softmax(G_s(R_1(f_i, f̂_i^k), W_s))

where the pairwise function R_1 indicates a high-level relationship between the centroid point and any particular neighbor. Here we define R_1 as

R_1(f_i, f̂_i^k) = |f_i − f̂_i^k|

which measures the feature difference between point pairs and assigns higher attention scores to more similar neighbors.
Weighted summation. We consider the learned attention scores as a soft mask that selectively focuses on important features. The local representation f̃_i is then obtained by summing the weighted features:

f̃_i = ∑_{k=1}^{K} s_i^k · f̂_i^k
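The attention scoring and weighted summation steps above can be sketched as follows; the score MLP G_s is taken as the identity, and similarity is turned into a score by negating the per-channel feature difference before the softmax (both assumptions standing in for the learned mapping):

```python
import math

# Sketch of attention pooling: per-channel scores from feature differences,
# a channel-level softmax over the K neighbors, then a weighted sum.
def attention_pool(f_center, neighbor_feats):
    """Aggregate K neighbor features with softmax attention, channel by channel."""
    pooled = []
    for c in range(len(f_center)):
        # R_1-style feature difference; negated so similar neighbors score higher
        diffs = [-abs(f_center[c] - f[c]) for f in neighbor_feats]
        exps = [math.exp(d) for d in diffs]
        total = sum(exps)
        scores = [e / total for e in exps]      # channel-level softmax (soft mask)
        pooled.append(sum(s * f[c] for s, f in zip(scores, neighbor_feats)))
    return pooled
```

Unlike max pooling, every neighbor contributes, but dissimilar outliers are softly suppressed rather than dominating or being dropped.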

Position feedback
Given the extracted features, this module aims to perform feature-guided point shifting. To achieve this, we regress an adaptive 3D displacement for each point by considering its feature, as well as position.
Taking the task loss (cross-entropy loss) as feedback, these displacements can learn to adjust the positions of points. By minimizing intra-class variances (i.e., minimizing feature difference of intra-class points), local points are shifted towards a certain distribution that fits the network well. With respect to the original irregular points, this distribution is more regular and leads to more effective feature learning. In detail, this module comprises the following steps.

3D displacement
Using dilated KNN, we index the K neighboring points {p_i^1, ..., p_i^K} of the center point p_i. For each one, we define a dual relation between spatial and semantic levels to generate a point-wise 3D displacement, formulated as

Δp_i^k = G_d(R_spa(p_i, p_i^k) ⊕ R_sem(f̃_i, f̂_i^k), W_d)

where R_spa(p_i, p_i^k) = (p_i − p_i^k) indicates a spatial relation between point p_i^k and its center, and R_sem(f̃_i, f̂_i^k) = |f̃_i − f̂_i^k| is defined using a formulation similar to that for R_1. The first term R_spa ensures that the corresponding point features are always aware of their relative spatial locations, and the second term R_sem helps to learn the shift distance according to feature difference. For each 3D displacement, the former provides direction and the latter magnitude; together they guide the irregular neighbors towards meaningful patches under the feedback of the task loss.
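The direction/magnitude split can be illustrated with a toy version of the dual relation; the learned mapping G_d is replaced here by a fixed scale, which is an assumption for illustration only:

```python
# Toy sketch of the dual spatial/semantic relation behind each point-wise
# 3D displacement: the spatial term supplies a direction, the semantic term
# a magnitude. A fixed `scale` stands in for the learned function G_d.
def displacement(p_center, p_k, f_center, f_k, scale=0.1):
    """Combine R_spa (relative position) and R_sem (feature difference)."""
    r_spa = [a - b for a, b in zip(p_center, p_k)]          # direction toward center
    r_sem = sum(abs(a - b) for a, b in zip(f_center, f_k))  # feature-difference magnitude
    return [scale * r_sem * d for d in r_spa]
```

A neighbor whose feature already matches the center (zero semantic difference) is not moved at all, while dissimilar neighbors are pulled along the spatial relation.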

Position updating
The learned adaptive displacements can be viewed as semantics-driven factors. With them, the points are automatically shifted, generally towards a more regular distribution. Formally, each neighboring point is updated as Eq. (6):

p̂_i^k = p_i^k + α · N(Δp_i^k)    (6)

where α is a scale coefficient, N(·) is a normalization function mapping values to the range [−1, 1], and p̂_i^k are the shifted global coordinates. As the network is fragile at the beginning of training, α is initialized to 0 and gradually assigned an appropriate weight to adapt to local structures. As shown in Fig. 2, each position feedback module utilizes features extracted from the previous feature modeling module to perform feature-guided point shifting. With iterative downsampling and upsampling, these two modules need to run at different resolutions. We integrate two consecutive modules operating at the same resolution into one operation, BidConv, as shown in Fig. 3. We apply KNN only once after each sampling. Stacking multiple BidConv units enables the two modules to execute alternately and assist each other, thereby forming an effective learning network.
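The update step can be sketched as follows; tanh is assumed as one valid choice of the normalization N(·), since the text only states that N maps into [−1, 1]:

```python
import math

# Sketch of the position update: each neighbor moves by its normalized
# displacement, scaled by alpha. With alpha initialized to 0, points are
# left untouched early in training, matching the warm-up described above.
def update_position(p_k, delta, alpha):
    """Shift point p_k by alpha * N(delta), with tanh as the [-1, 1] normalization."""
    normed = [math.tanh(d) for d in delta]
    return [p + alpha * n for p, n in zip(p_k, normed)]
```

Because tanh saturates, even a large regressed displacement moves a point by at most α per coordinate, keeping the shifting stable.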

Datasets
We evaluated our method on four datasets and multiple tasks ranging from shape classification (ModelNet40 [8]) and part segmentation (ShapeNet Parts [19]) to semantic segmentation (S3DIS [20] and ScanNet [9]). The experimental setting for each dataset is as follows:
• ModelNet40 has 12,311 3D mesh models in 40 object categories. We followed the official split, with 9843 models for training and 2468 for testing.
• ShapeNet Parts has 16,881 CAD models in 16 object categories. Each point is annotated with one of 50 part classes, and each point cloud consists of 2-5 parts. We followed the official split, with 14,006 objects for training and 2874 for testing.
• ScanNet has 1513 reconstructed indoor scenes in 21 semantic categories. We split this dataset into 1201 scenes for training and 312 for testing.
• S3DIS has 271 rooms from three different buildings, in 13 semantic categories. Following the experimental settings in Refs. [12,36], we used the two dominant evaluation protocols: 6-fold cross validation and testing on Area 5.

Experiment
We evaluated our network on classifying point clouds sampled from ModelNet40 [8]. Using a widely-used sampling density from the existing literature, we uniformly sampled 1024 points from each 3D mesh model and normalized them to the unit sphere. We used overall accuracy (OA) as the evaluation metric.
To reduce over-fitting, we employed the dropout technique [44] with 80% ratio in the penultimate FC layers.

Results
For fair comparison, we present overall shape accuracy as well as input settings in Table 1. BLNet clearly surpasses all previous approaches for various input settings. Specifically, when only using xyz information, BLNet achieves an OA of 93.5% and outperforms a set of representative methods such as PointNet++ [13] (90.7%), PointCNN [14] (92.2%), RSCNN (at a single scale, all that is available at

Experiment
Part segmentation is a challenging task for fine-grained shape analysis. We evaluated our network on the ShapeNet Parts [19] benchmark, randomly sampling 2048 points as input. We report mean class IoU (mcIoU) as the evaluation metric.

Experiments
This task takes 3D point clouds as input and assigns one semantic class label to each point. We evaluated our network on two datasets: S3DIS [20] and ScanNet [9]. Following PointCNN [14], we first split the points by room and sliced each room into 1.5 m × 1.5 m blocks, with 0.3 m padding on each side. We sampled 2048 points from each block during training, and used all points for evaluation during testing. For S3DIS, we adopted two widely-used evaluation metrics: overall accuracy (OA) and mean class IoU (mIoU). For ScanNet, to make a fair comparison with other approaches, we did not use RGB information, and converted the segmentation results on the test data into semantic voxel labels for evaluation. Here we report voxel-wise overall accuracy (OA) as the evaluation metric.

Ablation and other studies
To validate the contribution of each module to our framework, we conducted ablation studies with results in Table 4. These experiments used S3DIS [20] with Area-5 cross validation, with mIoU used as the standard metric. We also considered robustness, model size, and latency.

Remove position encoding
Position encoding enables each 3D point to observe its local geometry. After removing this unit, we directly transform point features using per-point MLPs and feed the output features into the subsequent attention pooling. As shown in Table 4, removing position encoding causes a significant performance drop. This is because the spatial distributions of points play a crucial role in 3D shape analysis, and our position encoding unit effectively increases structure-awareness by explicitly encoding relative point positions.

Replace attention pooling with max/mean pooling
Attention pooling learns to selectively focus on important features and then combine them. As a comparison, we replaced this unit with the widely-used max/mean pooling. As Table 4 shows, our attention pooling considerably improves upon max/mean pooling. This demonstrates that attention pooling is able to retain important features and gather a discriminative representation.

Remove position feedback
The position feedback module aims to adaptively shift 3D points towards a more regular distribution. After removing this module, we directly feed the original irregular points into the subsequent feature modeling module. As shown in Table 4, removing position feedback considerably harms performance. This indicates that shifted distributions fit the network better than the original ones and promote feature learning. To further verify the effectiveness of position feedback, we visualize t-SNE embeddings of shape features for different categories without and with this module in Fig. 6. Note that since differences in the distributions of input points result in different features (see the Introduction), we visualize features to better view the effect of different input distributions without and with position feedback. In Fig. 6 (left), features extracted from irregular points (without position feedback) are mixed and less distinguishable from each other, showing that directly using raw point clouds causes more shape information loss. In contrast, in Fig. 6 (right), features extracted from shifted points (with position feedback) are better partitioned. This shows that the shifted distributions induced by adaptive 3D displacements lead to more intra-class consistency and inter-class distinctiveness than the original irregular distributions.

Robustness to noise
We next demonstrate the robustness of our BLNet with respect to two representative baselines, PointNet++ [13] and PointCNN [14]. When applying random noise (increasing irregularity) in the range [−0.01, 0.01] to each point, the mIoUs of BLNet, PointCNN, and PointNet++ on S3DIS Area-5 [20] decrease by 1.2%, 3.4%, and 4.5%, respectively: BLNet is more robust to random noise. This is because the position feedback module generates adaptive 3D displacements to shift each point, which can be effective in resisting random noise.
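The noise protocol above can be sketched as follows; the helper name and fixed seeding are illustrative, not from the paper:

```python
import random

# Sketch of the robustness test: perturb every coordinate with i.i.d. uniform
# noise in [-eps, eps], with eps = 0.01 as in the experiment described above.
def jitter(points, eps=0.01, seed=0):
    """Return a copy of `points` with uniform noise added to each coordinate."""
    rng = random.Random(seed)
    return [tuple(c + rng.uniform(-eps, eps) for c in p) for p in points]
```

Evaluating a trained model on jittered inputs and comparing mIoU against the clean-input score gives the degradation figures quoted above.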

Model size and latency
Figure 7 compares different approaches in terms of accuracy versus parameter memory requirements, as well as accuracy versus latency. Compared to modern competitive methods [12-14, 16, 22, 42], BLNet achieves a significantly better accuracy versus complexity trade-off, demonstrating its effectiveness and efficiency. Note that the BLNet model with the highest accuracy (middle) is our original version; we increase or decrease its model size simply by scaling the feature channels in Fig. 2.

Conclusions
This is the first work to apply bidirectional learning to point clouds, with bidirectional interaction between points and features. Specifically, we propose a novel point-based bidirectional learning network (BLNet) to analyze irregular 3D points. BLNet optimizes learning for 3D points through two iterative methods: feature-guided point shifting and feature learning from shifted points. The position feedback module utilizes adaptive 3D displacements to automatically shift points, leading to a more regular distribution.
A new feature modeling module explicitly encodes point positions with increased structure-awareness, and a powerful attention pooling in this module selectively combines important features. These two modules alternately regularize the point cloud and learn its geometric features, and iteratively assist each other for more effective feature learning. Extensive experiments have verified the superiority of our BLNet on various challenging benchmarks.