1 Introduction

3D object detection on point clouds aims to localize and recognize 3D objects from a set of 3D points. It is a fundamental task of 3D scene understanding and is widely adopted in real-world applications such as autonomous driving, intelligent traffic systems and robotics. Compared with 2D detection on images (Girshick, 2015; Ren et al., 2015; Liu et al., 2016; Redmon et al., 2016; Lin et al., 2017, 2018), the sparsity and irregularity of point clouds make it challenging to directly apply 2D detection techniques to 3D detection on point clouds.

To tackle these challenges, most existing 3D detection methods (Chen et al., 2017; Zhou and Tuzel, 2018; Yang et al., 2018b; Lang et al., 2019; Yan et al., 2018) transform the points into regular voxels that can be processed with conventional 2D/3D convolutional neural networks and well-studied 2D detection heads. However, the voxelization operation inevitably introduces quantization errors, which degrade localization accuracy. In contrast, the point-based methods (Qi et al., 2018; Shi et al., 2019; Wang and Jia, 2019) naturally preserve accurate point locations during feature extraction but are generally computationally intensive when handling large-scale point clouds. There are also some approaches (Chen et al., 2019b; Li et al., 2021) that simply combine the two strategies by adopting voxel-based methods for feature extraction and 3D proposal generation in the first stage, and then introducing the raw point representation in the second stage to compensate for the quantization errors during fine-grained proposal refinement. However, this simple stacked combination ignores the deep fusion of their basic operators (e.g., sparse convolution (Graham et al., 2018) and set abstraction (Qi et al., 2017b)) and cannot fully exploit the feature intertwining of both strategies to take the best of both worlds.

Therefore, we propose a unified framework, namely, Point-Voxel Region-based Convolutional Neural Networks (PV-RCNNs), to take the best of both voxel and point representations by deeply integrating the feature learning strategies from both of them. The principle lies in the fact that the voxel-based strategy can more efficiently encode multi-scale features and generate high-quality 3D proposals from large-scale point clouds, while the point-based strategy can preserve accurate location information with flexible receptive fields for fine-grained proposal refinement. We demonstrate that our proposed point-voxel intertwining framework can effectively improve the 3D detection performance by deeply fusing the feature learning of both point and voxel representations.

Firstly, we introduce our initial 3D detection framework, PV-RCNN, which is a two-stage 3D detector on point clouds. It consists of two novel steps for point-voxel feature aggregation. The first step is voxel-to-keypoint scene encoding, where a 3D voxel CNN with sparse convolutions is adopted for feature learning and proposal generation. The multi-scale voxel features are then summarized into a small set of keypoints by point-based set abstraction, where the keypoints with accurate point locations are sampled from the raw points by farthest point sampling. The second step is keypoint-to-grid RoI feature abstraction, where we propose the RoI-grid pooling module to aggregate the above keypoint features back to regular RoI grids of each proposal. It encodes multi-scale contextual information to form regular grid features for proposal refinement. These two steps establish feature intertwining between point-based set abstraction and voxel-based sparse convolution, which is experimentally shown to improve the model's representation ability as well as the detection performance.

Secondly, we propose an advanced two-stage detection network, PV-RCNN++, on top of PV-RCNN, for achieving more accurate, efficient and practical 3D object detection. The improvements of PV-RCNN++ lie in two aspects. The first aspect is a novel sectorized proposal-centric keypoint sampling strategy, where we concentrate the limited number of keypoints in and around the 3D proposals to encode more effective scene features. Meanwhile, by considering the radial distribution of LiDAR points, we propose to conduct point sampling in parallel across different sectors, which accelerates the keypoint sampling process while ensuring a uniform distribution of keypoints. Our proposed keypoint sampling strategy is much faster and more effective than vanilla farthest point sampling, which has quadratic complexity. The efficiency of the whole framework is thus greatly improved, which is particularly important for large-scale 3D scenes with millions of points. The second aspect is a novel local feature aggregation module, VectorPool aggregation, for more effective and efficient local feature encoding on point clouds. We argue that the relative point locations in a local region are robust, effective and discriminative features for describing local geometry. We propose to split the 3D local space into regular and compact sub-voxels, whose features are sequentially concatenated to form a hyper feature vector. The sub-voxel features at different locations are encoded with separate kernel weights to generate position-sensitive local features. In this way, different local location information is encoded in different feature channels of the hyper feature vector. Compared with set abstraction, our VectorPool aggregation can efficiently handle a very large number of center points thanks to the compact local feature representation. Equipped with VectorPool aggregation in both the voxel-based backbone and the RoI-grid pooling module, our PV-RCNN++ is more memory-friendly and faster than previous counterparts with comparable or even better performance, which helps in establishing a practical 3D detector for resource-limited devices.

In a nutshell, our contributions are three-fold: 1) Our PV-RCNN adopts two novel strategies, voxel-to-keypoint scene encoding and keypoint-to-grid RoI feature abstraction, to deeply integrate the advantages of both point-based and voxel-based feature learning strategies. 2) Our PV-RCNN++ takes a step toward a more practical 3D detection system with better performance, less resource consumption and faster running speed. This is enabled by our proposed sectorized proposal-centric keypoint sampling, which obtains more representative keypoints at a faster speed, and is also powered by our novel VectorPool aggregation, which performs local aggregation on a large number of center points with less resource consumption and more effective representation. 3) Our proposed 3D detectors surpass all published methods with remarkable margins on the challenging large-scale Waymo Open Dataset. In particular, our PV-RCNN++ achieves state-of-the-art results with 10 FPS inference speed for a \(150m \times 150m\) detection range. The source code is available at https://github.com/open-mmlab/OpenPCDet.

2 Related Work

3D Object Detection with 2D Images Image-based 3D detection aims to estimate 3D bounding boxes from a monocular image or stereo images. Mono3D (Chen et al., 2016) generates 3D proposals with a ground-plane assumption, which are scored by exploiting semantic knowledge from images. The following works (Mousavian et al., 2017; Li et al., 2019a) incorporate the relations between 2D and 3D boxes as geometric constraints. M3D-RPN (Brazil and Liu, 2019) introduces a 3D region proposal network with depth-aware convolutions. (Chabot et al., 2017; Murthy et al., 2017; Manhardt et al., 2019) predict 3D boxes based on a wire-frame template obtained from CAD models. RTM3D (Li et al., 2020) performs coarse keypoint detection to localize 3D objects. On the stereo side, Stereo R-CNN (Li et al., 2019b; Qian et al., 2020) capitalizes on a stereo RPN to associate proposals from left and right images. DSGN (Chen et al., 2020) introduces a differentiable 3D volume to learn depth information and semantic cues in an end-to-end optimized pipeline. Pseudo-LiDAR approaches (Wang et al., 2019a; Qian et al., 2020; You et al., 2020) propose to convert image pixels to artificial point clouds, on which LiDAR-based detectors can operate for 3D box estimation. These image-based 3D detection methods suffer from inaccurate depth estimation and can only generate coarse 3D bounding boxes.

Recently, in addition to image-based 3D detection from monocular or stereo images, comprehensive scene understanding with surrounding cameras has drawn a lot of attention, where the well-known bird’s-eye-view (BEV) representation is generally adopted for better feature fusion across multiple surrounding images. LSS (Philion and Fidler, 2020) and CaDDN (Reading et al., 2021) predict depth distributions to “lift” the 2D image features to a BEV feature map for 3D detection. Their follow-up works (Huang et al., 2021; Huang and Huang, 2022; Li et al., 2022a; Xie et al., 2022) learn a depth-based implicit projection to project image features to BEV space. Some other works also explore transformer structures to project image features from the perspective view to BEV space via cross attention, such as DETR3D (Wang et al., 2022), PETR (Liu et al., 2022a, b), BEVFormer (Li et al., 2022b), PolarFormer (Jiang et al., 2022), etc. Although these works greatly improve the performance of image-based 3D detection by projecting multi-view images to a unified BEV space, inaccurate depth estimation is still the main challenge for image-based 3D detection.

Representation Learning on Point Clouds Recently, representation learning on point clouds has drawn much attention for improving the performance of 3D classification and segmentation (Qi et al., 2017a, b; Wang et al., 2019b; Huang et al., 2018; Zhao et al., 2019; Li et al., 2018; Su et al., 2018; Wu et al., 2019; Jaritz et al., 2019; Jiang et al., 2019; Thomas et al., 2019; Choy et al., 2019; Liu et al., 2020). In terms of 3D detection, previous methods generally project the points to regular 2D pixels (Chen et al., 2017; Yang et al., 2018b) or 3D voxels (Zhou and Tuzel, 2018; Chen et al., 2019b) so that they can be processed with 2D/3D CNNs. Sparse convolution (Graham et al., 2018) is adopted in (Yan et al., 2018; Shi et al., 2020b) to effectively learn sparse voxel features from point clouds. Qi et al. (2017a, b) propose PointNet to directly learn point features from raw points, where set abstraction enables flexible receptive fields by setting different search radii. (Liu et al., 2019) combines a voxel CNN and a point-wise multi-layer perceptron network for efficient point feature learning. In comparison, our PV-RCNNs take advantage of both voxel-based (i.e., 3D sparse convolution) and point-based (i.e., set abstraction) strategies to enable both high-quality 3D proposal generation with dense BEV detection heads and flexible receptive fields in 3D space for improving 3D detection performance.

3D Object Detection with Point Clouds Most existing 3D detection approaches can be roughly classified into three categories in terms of the strategies used to learn point cloud features, i.e., the voxel-based methods, the point-based methods, and the methods combining both points and voxels.

The voxel-based methods project point clouds to regular grids to tackle the irregular data format problem. MV3D (Chen et al., 2017) projects points to 2D bird’s-eye-view grids and places lots of predefined 3D anchors for generating 3D boxes, and the following works (Ku et al., 2018; Liang et al., 2018, 2019; Vora et al., 2020; Yoo et al., 2020; Huang et al., 2020) develop better strategies for multi-sensor fusion. (Yang et al., 2018b, a; Lang et al., 2019) introduce more efficient frameworks with the bird’s-eye-view representation, while (Ye et al., 2020) proposes to fuse grid features of multiple scales. MVF (Zhou et al., 2020) integrates 2D features from the bird’s-eye view and the perspective view before projecting points into pillar representations (Lang et al., 2019). Some other works (Song and Xiao, 2016; Zhou and Tuzel, 2018) divide the points into 3D voxels to be processed by 3D CNNs. 3D sparse convolution (Graham et al., 2018) is introduced by (Yan et al., 2018) for efficient 3D voxel processing. (Kuang et al., 2020) utilizes multiple detection heads for detecting 3D objects of different scales. In addition, (Wang et al., 2020; Chen et al., 2019a) predict bounding box parameters following an anchor-free paradigm. These grid-based methods are generally efficient for accurate 3D proposal generation, but their receptive fields are constrained by the kernel size of the 2D/3D convolutions.

Fig. 1

The overall architecture of our proposed PV-RCNN. The raw point clouds are first voxelized and fed into the 3D sparse convolution based encoder to learn multi-scale semantic features and generate 3D object proposals. Then the learned voxel-wise feature volumes at multiple neural layers are summarized into a small set of keypoints via the novel voxel set abstraction module. Finally the keypoint features are aggregated to the RoI-grid points to learn proposal-specific features for fine-grained proposal refinement and confidence prediction

The point-based methods directly detect 3D objects from raw points. F-PointNet (Qi et al., 2018) applies PointNet (Qi et al., 2017a, b) for 3D detection on points cropped from 2D image boxes. PointRCNN (Shi et al., 2019) generates 3D proposals directly from raw points by taking only the 3D point cloud as input. (Qi et al., 2019) proposes a Hough voting strategy for feature grouping. 3DSSD (Yang et al., 2020) introduces a hybrid feature- and distance-based farthest point sampling on raw points. These point-based methods are mostly built on the PointNet series, especially set abstraction (Qi et al., 2017b), which enables flexible receptive fields for point cloud feature learning. However, it is challenging to extend these point-based methods to large-scale point clouds since they generally consume much more memory and computation than the above voxel-based methods.

There are also some works that utilize both the point-based and voxel-based representations. STD (Yang et al., 2019) transforms point-wise features to dense voxels for refining the proposals. Fast Point R-CNN (Chen et al., 2019b) fuses deep voxel features with raw points for 3D detection. Part-A2-Net (Shi et al., 2020b) aggregates the point-wise part locations by voxel-based RoI-aware pooling to improve 3D detection performance. However, these methods generally only transform features between the two representations and do not deeply fuse the features produced by the basic operators specific to each representation. In contrast, our PV-RCNN frameworks explore how to deeply aggregate features by learning with both point-based (i.e., set abstraction) and voxel-based (i.e., sparse convolution) feature learning modules to boost 3D detection performance.

3 PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

Most state-of-the-art 3D detectors (Shi et al., 2020b; Yin et al., 2021; Sheng et al., 2021) adopt 3D sparse convolution for learning representative features from irregular points thanks to its efficiency and effectiveness in handling large-scale point clouds. However, 3D sparse convolution networks lose accurate point information due to the indispensable voxelization process. In contrast, the point-based approaches (Qi et al., 2017a, b) naturally preserve accurate point locations and can capture rich context information with flexible receptive fields, where the accurate point locations are essential for estimating accurate 3D bounding boxes.

In this section, we briefly review our initial 3D detection framework, PV-RCNN (Shi et al., 2020a), for 3D object detection from point clouds. It deeply integrates the voxel-based sparse convolution and point-based set abstraction operations to take the best of both worlds.

As shown in Fig. 1, PV-RCNN is a two-stage 3D detection framework. It adopts a 3D voxel CNN with sparse convolution as the backbone for efficient feature encoding and proposal generation (Sec. 3.1), and then generates proposal-aligned features for predicting accurate 3D bounding boxes by intertwining point-voxel features through two novel steps: voxel-to-keypoint scene encoding (Sec. 3.2) and keypoint-to-grid RoI feature abstraction (Sec. 3.3).

3.1 Voxel Feature Encoding and Proposal Generation

In order to handle 3D object detection on the large-scale point clouds, we adopt the 3D voxel CNN with sparse convolution (Graham et al., 2018) as the backbone network to generate initial 3D proposals.

The input points \({\mathcal {P}}\) are first divided into small voxels with a spatial resolution of \(L\times W \times H\), where the features of non-empty voxels are directly calculated by averaging the coordinates of the points inside. The network utilizes a series of 3D sparse convolutions to gradually convert the points into feature volumes with \(1\times \), \(2\times \), \(4\times \), \(8\times \) downsampled sizes. We follow (Yan et al., 2018) to stack the 3D feature volumes along the Z axis to obtain the \(\frac{L}{8}\times \frac{W}{8}\) bird’s-eye-view feature maps, which can be naturally combined with 2D detection heads (Liu et al., 2016; Yin et al., 2021) for high-quality 3D proposal generation.
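To make the Z-axis stacking concrete, below is a minimal PyTorch sketch of collapsing a densified \(8\times \) downsampled 3D feature volume into a 2D bird’s-eye-view feature map by folding the height dimension into the channels; the tensor shapes are illustrative assumptions rather than the exact OpenPCDet implementation.

```python
# Minimal sketch: fold the remaining Z bins of a densified 8x-downsampled
# 3D feature volume into the channel dimension to obtain a BEV feature map.
import torch

def to_bev(feature_volume: torch.Tensor) -> torch.Tensor:
    """feature_volume: (B, C, D, H, W) densified 3D features, D = Z bins.
    Returns a (B, C * D, H, W) bird's-eye-view feature map."""
    batch, channels, depth, height, width = feature_volume.shape
    return feature_volume.view(batch, channels * depth, height, width)

# Example with assumed shapes: a 64-channel volume with 2 Z bins becomes a
# 128-channel BEV map for the 2D detection head.
bev = to_bev(torch.randn(1, 64, 2, 188, 188))
print(bev.shape)  # torch.Size([1, 128, 188, 188])
```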

It is worth noting that the sparse feature volumes at each level can be viewed as a set of sparse voxel-wise feature vectors, and these multi-scale semantic features are considered as the input of our following voxel-to-keypoint scene encoding step.

3.2 Voxel-to-Keypoint Scene Encoding

Given the multi-scale scene features, we propose to summarize these features into a small number of keypoints, which serve as the courier to propagate features from the above 3D voxel CNN to the refinement network.

Keypoint Sampling We simply adopt the farthest point sampling (FPS) algorithm as in (Qi et al., 2017b) to sample a small number of keypoints \({\mathcal {K}}=\left\{ p_i\mid p_i \in {\mathbb {R}}^3\right\} _{i=1}^{n}\) from the raw points \({\mathcal {P}}\), where n is a hyper-parameter (e.g., n=4,096 for the Waymo Open Dataset (Sun et al., 2020)). This strategy encourages the keypoints to be uniformly distributed around non-empty voxels and to be representative of the overall scene.
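For reference, the following is a small NumPy sketch of the farthest point sampling procedure (the actual implementation runs as a CUDA kernel); the toy sizes in the example call are assumptions for illustration.

```python
# Minimal NumPy sketch of farthest point sampling (FPS) used for keypoint sampling.
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """points: (N, 3) point coordinates. Returns indices of n_samples keypoints."""
    n_points = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # squared distance of every point to the current keypoint set
    dist_to_set = np.full(n_points, np.inf)
    farthest = 0  # start from an arbitrary point
    for i in range(n_samples):
        selected[i] = farthest
        # update distances with the newly added keypoint (quadratic overall cost)
        diff = points - points[farthest]
        dist_to_set = np.minimum(dist_to_set, np.einsum('ij,ij->i', diff, diff))
        farthest = int(np.argmax(dist_to_set))
    return selected

# toy example; a real Waymo scene has ~180k raw points and n = 4,096
keypoint_idx = farthest_point_sampling(np.random.rand(20_000, 3), 1_024)
```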

Voxel Set Abstraction Module To aggregate the multi-scale semantic features from the 3D feature volumes to the keypoints, we propose the Voxel Set Abstraction (VSA) module. Set abstraction (Qi et al., 2017b) is adopted for aggregating the voxel-wise feature volumes. The key difference is that the surrounding local points are now regular voxel-wise semantic features from the 3D voxel CNN, instead of neighboring raw points with features learned by a PointNet (Qi et al., 2017a).

Specifically, we denote the number of non-empty voxels in the k-th level of 3D voxel CNN as \(N_k\), and the voxel-wise features and 3D coordinates are denoted as \({\mathcal {F}}^{(l_k)}=\left\{ [f_i^{(l_k)}, v_{i}^{(l_k)}]\mid f_i^{(l_k)}\in {\mathbb {R}}^{C}, v_{i}^{(l_k)} \in {\mathbb {R}}^{3} \right\} _{i=1}^{N_k}\), where C indicates the number of feature dimensions.

For each keypoint \(p_i\in {\mathcal {K}}\), to retrieve the set of neighboring voxel-wise feature vectors, we first identify its neighboring non-empty voxels at the k-th level within a radius \(r_k\) as

$$\begin{aligned} S_i^{(l_k)} =\left\{ \left[ f_j^{(l_k)}, v_j^{(l_k)} - p_i\right] \bigg | \left\| v_j^{(l_k)} - p_i \right\| < r_k \right\} , \end{aligned}$$
(1)

where \([f_j^{(l_k)}, v_{j}^{(l_k)}] \in {\mathcal {F}}^{(l_k)}\), and the local relative position \(v_j^{(l_k)} - p_i\) is concatenated to indicate the relative location of \(f_j^{(l_k)}\) in this local area. The features within neighboring set \(S_i^{(l_k)}\) are then aggregated by a PointNet-block (Qi et al., 2017a) to generate keypoint feature as

$$\begin{aligned} f_i^{(\text {pv}_k)} = \max \left\{ \text {SharedMLP}\left( S_i^{(l_k)}\right) \right\} , \end{aligned}$$
(2)

where \(\text {SharedMLP}(\cdot )\) denotes a shared multi-layer perceptron (MLP) network to encode voxel-wise features and relative locations, and \(\max \{\cdot \}\) conducts permutation invariant feature aggregation to map diverse number of neighboring voxel features to a single keypoint feature \(f_i^{(\text {pv}_k)}\). Here multiple radii are utilized to capture richer contextual information.
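A minimal PyTorch sketch of this voxel set abstraction for a single radius is given below; the real implementation uses CUDA ball-query kernels and caps the number of neighbors per keypoint, and the layer widths here are illustrative assumptions.

```python
# Sketch of Eqs. (1)-(2): ball query around each keypoint over voxel centers,
# concatenation of relative positions, shared MLP, and max-pooling.
import torch
import torch.nn as nn

def voxel_set_abstraction(keypoints, voxel_xyz, voxel_feat, radius, shared_mlp):
    """keypoints: (n, 3); voxel_xyz: (N_k, 3); voxel_feat: (N_k, C).
    Returns (n, C_out) aggregated keypoint features for one radius."""
    out = []
    for p in keypoints:
        mask = torch.norm(voxel_xyz - p, dim=1) < radius          # neighbors within r_k
        if mask.sum() == 0:                                       # no neighbor: zero feature
            out.append(torch.zeros(shared_mlp[-1].out_features))
            continue
        rel = voxel_xyz[mask] - p                                 # relative locations v_j - p_i
        grouped = torch.cat([voxel_feat[mask], rel], dim=1)       # [f_j, v_j - p_i]
        out.append(shared_mlp(grouped).max(dim=0).values)         # permutation-invariant max
    return torch.stack(out)

mlp = nn.Sequential(nn.Linear(64 + 3, 32), nn.ReLU(), nn.Linear(32, 32))
feats = voxel_set_abstraction(torch.rand(16, 3), torch.rand(500, 3),
                              torch.rand(500, 64), 0.8, mlp)      # (16, 32)
```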

The above voxel feature aggregation is performed at the outputs of different levels of 3D voxel CNN, and the aggregated features from different scales are concatenated to obtain the multi-scale semantic feature for keypoint \(p_i\) as

$$\begin{aligned} f_i^{(p)} = \text {Concat}\left( \left\{ f_i^{(\text {pv}_k)}\right\} _{k=1}^{4}, f_i^{(\text {raw})}, f_i^{(\text {bev})} \right) , \end{aligned}$$
(3)

where \(i \in \{1, \dots , n\}\), and \(k \in \{1, \dots , 4\}\) indicates that the keypoint features are aggregated from the four levels of voxel-wise features of the 3D voxel CNN. Note that the keypoint features are further enriched with two extra information sources: the raw point features \(f_i^{(\text {raw})}\) are aggregated as in Eq. (2) to partially make up for the quantization loss of point voxelization, while the 2D bird’s-eye-view features \(f_i^{(\text {bev})}\) are obtained by bilinear interpolation on the \(8\times \) downsampled 2D feature maps to achieve larger receptive fields along the height axis.
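A hedged sketch of retrieving \(f_i^{(\text {bev})}\) by bilinear interpolation is shown below; the point-cloud range, voxel size and BEV axis layout used for coordinate normalization are illustrative assumptions rather than the exact conventions of our implementation.

```python
# Sketch: bilinearly interpolate BEV features at keypoint (x, y) locations.
import torch
import torch.nn.functional as F

def sample_bev_features(bev_map, keypoints_xy, pc_range_min, voxel_size, bev_stride=8):
    """bev_map: (1, C, H, W); keypoints_xy: (n, 2) x/y coordinates in meters."""
    # metric coordinates -> BEV pixel coordinates (assumed axis convention)
    coords = (keypoints_xy - pc_range_min) / (voxel_size * bev_stride)
    h, w = bev_map.shape[2:]
    # normalize to [-1, 1] for grid_sample (first channel indexes the width axis)
    norm = torch.stack([coords[:, 0] / (w - 1), coords[:, 1] / (h - 1)], dim=1) * 2 - 1
    grid = norm.view(1, 1, -1, 2)
    sampled = F.grid_sample(bev_map, grid, align_corners=True)    # (1, C, 1, n)
    return sampled.squeeze(0).squeeze(1).t()                      # (n, C)

bev_feats = sample_bev_features(torch.rand(1, 128, 188, 188),
                                torch.rand(4096, 2) * 150.4 - 75.2,
                                pc_range_min=torch.tensor([-75.2, -75.2]),
                                voxel_size=0.1)
```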

Fig. 2

Illustration of RoI-grid pooling module. Rich context information of each 3D RoI is aggregated by set abstraction operation with multiple receptive fields

Predicted Keypoint Weighting Intuitively, the keypoints belonging to foreground objects should contribute more to the proposal refinement, while the keypoints from background regions should contribute less. Hence, we propose a Predicted Keypoint Weighting (PKW) module to re-weight the keypoint features with extra supervision from point segmentation as

$$\begin{aligned} f_i^{(p)} = \text {MLP}(f_i^{(p)}) \cdot f_i^{(p)}, \end{aligned}$$
(4)

where \(\text {MLP}(\cdot )\) is a three-layer MLP network with a sigmoid function to predict foreground confidence. It is trained with focal loss (Lin et al., 2018) using the default parameters, and the segmentation labels can be directly generated from the 3D box annotations as in (Shi et al., 2019). Note that this PKW module is optional for our framework as it only leads to small gains (see Table 7).
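A minimal sketch of the PKW module in Eq. (4) could look as follows; the hidden layer widths are assumptions, and the focal-loss supervision of the confidence branch is omitted.

```python
# Sketch of Predicted Keypoint Weighting (Eq. 4): re-weight keypoint features by
# a predicted foreground confidence.
import torch
import torch.nn as nn

class PredictedKeypointWeighting(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # three-layer MLP ending in a sigmoid to predict foreground confidence
        self.mlp = nn.Sequential(
            nn.Linear(channels, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, keypoint_features: torch.Tensor) -> torch.Tensor:
        """keypoint_features: (n, C). Returns confidence-weighted features."""
        weight = self.mlp(keypoint_features)          # (n, 1), in (0, 1)
        return keypoint_features * weight

pkw = PredictedKeypointWeighting(channels=640)        # assumed channel count
weighted = pkw(torch.rand(4096, 640))
```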

The keypoint features \({\mathcal {F}}=\left\{ f^{(p)}_i\right\} _{i=1}^n\) not only incorporate multi-scale semantic features from the 3D voxel backbone network, but also naturally preserve accurate location information through their 3D keypoint coordinates \({\mathcal {K}}=\left\{ p_i\right\} _{i=1}^n\), which provides a strong capability of preserving the 3D structural information of the entire scene for the following fine-grained proposal refinement.

3.3 Keypoint-to-Grid RoI Feature Abstraction

Given the aggregated keypoint features and their 3D coordinates, in this step, we propose keypoint-to-grid RoI feature abstraction to generate accurate proposal-aligned features for fine-grained proposal refinement.

RoI-grid Pooling via Set Abstraction We propose the RoI-grid pooling module to aggregate the keypoint features to the RoI-grid points by adopting multi-scale local feature grouping. For each 3D proposal, we uniformly sample \(6\times 6\times 6\) grid points within the 3D proposal box, which are then flattened and denoted as \({\mathcal {G}}=\{g_i\}_{i=1}^{6\times 6\times 6=216}\). To aggregate the features of keypoints to the RoI-grid points, we first identify the neighboring keypoints of a grid point \(g_i\) as

$$\begin{aligned} \varPsi =\left\{ \left[ f_j^{(p)}, p_j - g_i\right] \bigg | \left\| p_j - g_i \right\| < r^{(g)} \right\} , \end{aligned}$$
(5)

where \(p_j \in {\mathcal {K}}\) and \(f_j^{(p)} \in {\mathcal {F}}\). We concatenate \(p_j-g_i\) to indicate the local relative location within the ball of radius \(r^{(g)}\). Then we adopt a process similar to Eq. (2) to summarize the neighboring keypoint feature set \(\varPsi \) and obtain the feature of grid point \(g_i\) as

$$\begin{aligned} f_i^{(g)} = \max \left\{ \text {SharedMLP}\left( \varPsi \right) \right\} . \end{aligned}$$
(6)

Note that we set multiple radii \(r^{(g)}\) and aggregate keypoint features with different receptive fields, which are concatenated together for capturing richer multi-scale contextual information. Next, all RoI-grid features \(\{f_i^{(g)}\}_{i=1}^{216}\) of the same RoI can be vectorized and transformed by a two-layer MLP with 256 feature dimensions to represent the overall features of this proposal box.
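For illustration, the following sketch generates the \(6\times 6\times 6\) RoI-grid points of one proposal; it assumes an axis-aligned box and omits the rotation of the grid points by the box heading for brevity.

```python
# Sketch: uniformly sample 6x6x6 grid points inside a 3D proposal box
# (axis-aligned here; the heading rotation is omitted for brevity).
import torch

def get_roi_grid_points(box_center, box_size, grid_size=6):
    """box_center: (3,); box_size: (3,) (l, w, h). Returns (grid_size**3, 3) points."""
    # sub-cell centers in the canonical [0, 1]^3 box: 1/12, 3/12, ..., 11/12
    idx = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size
    zz, yy, xx = torch.meshgrid(idx, idx, idx, indexing='ij')
    unit_grid = torch.stack([xx, yy, zz], dim=-1).view(-1, 3)     # (216, 3)
    # scale to the box size and shift to the box center
    return (unit_grid - 0.5) * box_size + box_center

grid_points = get_roi_grid_points(torch.tensor([10.0, 2.0, 0.5]),
                                  torch.tensor([4.5, 2.0, 1.6]))
print(grid_points.shape)  # torch.Size([216, 3])
```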

Our proposed RoI-grid pooling operation can aggregate much richer contextual information than the previous RoI-pooling/RoI-align operations (Shi et al., 2019; Yang et al., 2019; Shi et al., 2020b). This is because a single keypoint can contribute to multiple RoI-grid points due to the overlapping neighboring balls of the RoI-grid points, and their receptive fields can even extend beyond the RoI boundaries by capturing the contextual keypoint features outside the 3D RoI. In contrast, the previous state-of-the-art methods either simply average all point-wise features within the proposal as the RoI feature (Shi et al., 2019), or pool many uninformative zeros as the RoI features because of the very sparse point-wise features (Shi et al., 2020b; Yang et al., 2019).

Proposal Refinement Given the above RoI-aligned features, the refinement network learns to predict the box residuals (i.e., center, size and orientation) relative to the 3D proposal box. Two sibling sub-networks are employed for confidence prediction and proposal refinement, each consisting of a two-layer MLP and a linear prediction layer. We follow (Shi et al., 2020b) to conduct the IoU-based confidence prediction. The binary cross-entropy loss is adopted to optimize the IoU branch, while the box residuals are optimized with a smooth-L1 loss.
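A minimal sketch of the two sibling sub-networks is given below; the 256-dim hidden size follows the text, while the per-RoI input feature dimension and the 7-dim box encoding (center, size, orientation) are illustrative assumptions.

```python
# Sketch: two sibling heads for IoU-based confidence and box residual regression.
import torch
import torch.nn as nn

class ProposalRefinementHead(nn.Module):
    def __init__(self, roi_feature_dim: int, box_code_size: int = 7):
        super().__init__()

        def two_layer_mlp():
            return nn.Sequential(
                nn.Linear(roi_feature_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
            )

        # IoU-based confidence branch (trained with binary cross-entropy)
        self.iou_head = nn.Sequential(two_layer_mlp(), nn.Linear(256, 1))
        # box residual branch (trained with smooth-L1 loss)
        self.reg_head = nn.Sequential(two_layer_mlp(), nn.Linear(256, box_code_size))

    def forward(self, roi_features: torch.Tensor):
        """roi_features: (num_rois, roi_feature_dim) per-RoI features from RoI-grid pooling."""
        return self.iou_head(roi_features), self.reg_head(roi_features)

head = ProposalRefinementHead(roi_feature_dim=256)
iou_pred, box_residuals = head(torch.rand(128, 256))
```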

Fig. 3

The overall architecture of our proposed PV-RCNN++ framework. We propose the sectorized proposal-centric keypoint sampling module to concentrate keypoints in the neighborhoods of 3D proposals while also accelerating the process with sectorized farthest point sampling. Moreover, our proposed VectorPool aggregation module is utilized in both the voxel set abstraction module and the RoI-grid pooling module to improve local feature aggregation and save memory/computation resources

4 PV-RCNN++: Faster and Better 3D Detection with PV-RCNN Framework

Although our proposed PV-RCNN 3D detection framework achieves state-of-the-art performance (Shi et al., 2020a), it suffers from the efficiency problem when handling large-scale point clouds. To make PV-RCNN framework more practical for real-world applications, we propose an advanced 3D detection framework, i.e., PV-RCNN++, for more accurate and efficient 3D object detection with less resource consumption.

As shown in Fig. 3, we present two novel modules to improve both the accuracy and efficiency of the PV-RCNN framework. One is the sectorized proposal-centric strategy for much faster and better keypoint sampling, and the other is the VectorPool aggregation module for more effective and efficient local feature aggregation from large-scale point clouds. These two modules replace their counterparts in PV-RCNN and are introduced in Sec. 4.1 and Sec. 4.2, respectively.

4.1 Sectorized Proposal-Centric Sampling for Efficient and Representative Keypoint Sampling

Keypoint sampling is critical for the PV-RCNN framework, as keypoints bridge the point-voxel representations and heavily influence the performance of proposal refinement. However, the previous keypoint sampling algorithm (see Sec. 3.2) has two main drawbacks. (i) Farthest point sampling is time-consuming due to its quadratic complexity, which hinders the training and inference speed of PV-RCNN, especially for keypoint sampling on large-scale point clouds. (ii) It generates a large number of background keypoints that are generally useless for proposal refinement, since only the keypoints around proposals can be retrieved by the RoI-grid pooling module.

To mitigate these drawbacks, we propose Sectorized Proposal-Centric (SPC) keypoint sampling to uniformly sample keypoints from the more concentrated neighboring regions of the proposals, while also being much faster than the vanilla farthest point sampling algorithm. It consists of two novel steps, i.e., proposal-centric filtering and sectorized keypoint sampling, which are illustrated in the following paragraphs.

Fig. 4

Illustration of Sectorized Proposal-Centric (SPC) keypoint sampling. It contains two steps, where the first proposal filtering step concentrates the limited number of keypoints to the neighborhoods of proposals, and the following sectorized-FPS step divides the whole scene into several sectors for accelerating the keypoint sampling process while also keeping the keypoints uniformly distributed

Proposal-Centric Filtering To better concentrate the keypoints on the more important areas and also reduce the complexity of the next sampling process, we first adopt the proposal-centric filtering step.

Specifically, we denote a number of \(N_p\) 3D proposals as \({\mathcal {D}}=\{[c_i, d_i] \mid c_i \in {\mathbb {R}}^3, d_i \in {\mathbb {R}}^3\}_{i=1}^{N_p}\), where \(c_i\) and \(d_i\) are the center and size of each proposal box, respectively. We restrict the keypoint candidates \({\mathcal {P}}'\) to the neighboring point sets of all proposals as

$$\begin{aligned} {\mathcal {P}}' =\left\{ p_i \bigg | \left\| p_i - c_j \right\| < \frac{1}{2} \cdot {\max \left( d_j\right) } + r^{(s)} \right\} , \end{aligned}$$
(7)

where \([c_j, d_j] \in {\mathcal {D}}\), \(p_i \in {\mathcal {P}}\) indicates a raw point, and \(\max (\cdot )\) obtains the maximum edge length of the 3D box. \(r^{(s)}\) is a hyperparameter indicating the maximum extended radius of the proposals. Through this proposal-centric filtering process, the number of candidate keypoints for sampling is greatly reduced from \(|{\mathcal {P}}|\) to \(|{\mathcal {P}}'|\). For instance, on the Waymo Open Dataset (Sun et al., 2020), \(|{\mathcal {P}}|\) is generally about 180k while \(|{\mathcal {P}}'|\) is smaller than 90k in most cases (the exact number depends on the number of proposal boxes in each scene).

Hence, this step not only reduces the time complexity of the follow-up keypoint sampling, but also concentrates the limited number of keypoints to better encode the neighboring regions of the proposals.
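A minimal sketch of the proposal-centric filtering in Eq. (7) is given below; the brute-force distance computation is for illustration only, and the toy sizes in the example call are assumptions.

```python
# Sketch of Eq. (7): keep only the raw points that lie within an extended radius
# of at least one 3D proposal box.
import torch

def proposal_centric_filter(points, box_centers, box_sizes, extra_radius=1.6):
    """points: (N, 3); box_centers: (Np, 3); box_sizes: (Np, 3).
    Returns the filtered candidate keypoint set P'."""
    # per-proposal radius: half the longest box edge plus the extension r^(s)
    radii = 0.5 * box_sizes.max(dim=1).values + extra_radius          # (Np,)
    dist = torch.cdist(points, box_centers)                           # (N, Np)
    keep = (dist < radii.unsqueeze(0)).any(dim=1)
    return points[keep]

filtered = proposal_centric_filter(torch.rand(180_000, 3) * 150 - 75,
                                   torch.rand(100, 3) * 150 - 75,
                                   torch.rand(100, 3) * 4 + 1)
```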

Sectorized Keypoint Sampling To further accelerate keypoint sampling, as shown in Fig. 4, we propose the sectorized keypoint sampling strategy, which takes advantage of the radial distribution of LiDAR points to parallelize the sampling process.

Specifically, we divide the proposal-centric point set \({\mathcal {P}}'\) into s sectors centered at the scene center, and the point set of the k-th sector can be represented as

$$\begin{aligned} S'_k =\left\{ p_i \bigg | \left\lfloor \left( \text {arctan}\left( p_i^{y}, p_i^{x}\right) + \pi \right) \cdot \frac{s}{2\pi } \right\rfloor = k - 1 \right\} , \end{aligned}$$
(8)

where \(k \in \{1, \dots , s\}\), \(p_i=(p_i^x, p_i^y, p_i^z) \in {\mathcal {P}}'\), and \(\text {arctan}(p_i^y, p_i^x) \in (-\pi , \pi ]\) indicates the angle between the positive X axis and the ray ending at \((p_i^x, p_i^y)\) in the bird’s-eye view.

Through this process, we divide the task of sampling n keypoints into s subtasks of sampling local keypoints, where the k-th sector samples \(\left\lfloor \frac{|S'_k|}{|{\mathcal {P}}'|} \times n \right\rfloor \) keypoints from the point set \(S'_k\). These subtasks can be executed in parallel on GPUs, while the scale of keypoint sampling (i.e., its time complexity) is further reduced from \(|{\mathcal {P}}'|\) to \(\max _{k \in \{1, \dots , s\}} |S'_k|\). Note that we adopt farthest point sampling in each subtask, since both the qualitative and quantitative experiments in Sec. 5.3 (see Table 8 and Fig. 6) demonstrate that farthest point sampling generates more uniformly distributed keypoints that better cover the whole region, which is critical for the final detection performance.
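The sketch below illustrates Eq. (8) and the per-sector sampling; in practice the per-sector FPS calls are executed in parallel on the GPU, whereas this illustrative version loops over sectors and plugs in any FPS routine (e.g., the NumPy sketch from Sec. 3.2, assumed to be in scope in the usage line).

```python
# Sketch: assign candidate points to BEV sectors (Eq. 8), then sample a
# proportional keypoint budget from each sector with FPS.
import math
import torch

def sectorized_sampling(points, n_keypoints, num_sectors=6, fps_fn=None):
    """points: (N, 3) candidate keypoints P'. Returns roughly n_keypoints samples."""
    angle = torch.atan2(points[:, 1], points[:, 0])                  # in (-pi, pi]
    sector_id = torch.floor((angle + math.pi) * num_sectors / (2 * math.pi))
    sector_id = sector_id.clamp(max=num_sectors - 1).long()
    sampled = []
    for k in range(num_sectors):
        sector_points = points[sector_id == k]
        n_k = int(len(sector_points) / len(points) * n_keypoints)    # proportional budget
        if n_k > 0 and len(sector_points) > 0:
            idx = fps_fn(sector_points, min(n_k, len(sector_points)))
            sampled.append(sector_points[idx])
    return torch.cat(sampled, dim=0)

# usage: reuse the earlier FPS sketch by converting between tensors and arrays
fps_np = lambda pts, n: torch.from_numpy(farthest_point_sampling(pts.numpy(), n))
keypoints = sectorized_sampling(torch.rand(80_000, 3) * 150 - 75, 4096, fps_fn=fps_np)
```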

It is worth noting that our sector-based group partition can roughly produce a similar number of points in each group by exploiting the radial distribution of the points generated by LiDAR sensors, which is essential for speeding up the keypoint sampling since the overall running time depends on the group with the most points.

Therefore, our proposed keypoint sampling algorithm greatly reduces the scale of keypoint sampling from \(|{\mathcal {P}}|\) to the much smaller \(\max _{k \in \{1, \dots , s\}} |S'_k|\), which not only effectively accelerates the keypoint sampling process, but also increases the capability of keypoint feature representation by concentrating the keypoints to the more important neighboring regions of 3D proposals.

Although the proposed sectorized keypoint sampling is tailored for LiDAR sensors, the main idea behind it, i.e., conducting FPS in spatial groups to speed up the operation, is also effective with other types of sensors. It should be noted that the point groups should be generated by spatial partitioning to keep the overall distribution uniform. As shown in Table 8, randomly dividing the points into groups, even while balancing the number of points between groups, harms the model performance.

Fig. 5

Illustration of VectorPool aggregation for local feature aggregation from point clouds. The local 3D space around a center point is divided into dense sub-voxels, where the inside point-wise features are generated by interpolating from the three nearest neighbors. The features of each sub-voxel are encoded with position-specific kernels to generate position-sensitive features, which are sequentially concatenated to generate the local vector representation that explicitly encodes the spatial structure information. Note that the notations in the figure follow the same definition as in Sec. 4.2, except that we simplify the channel definition of the kernels as: \(C_i=9+C_{\text {in}}\), \(C_o=C_{\text {mid}}\)

4.2 Local Vector Representation for Structure-Preserved Local Feature Learning

The local feature aggregation of point clouds plays an important role in PV-RCNN framework as it is the fundamental operation to deeply integrate the point-voxel features in both voxel set abstraction and RoI-grid pooling modules. However, we observe that set abstraction (see Eqs. (2) and (6)) in PV-RCNN framework can be extremely time- and resource-consuming on large-scale point clouds, since it applies several shared-parameter MLP layers on the point-wise features of each local point separately. Moreover, the max-pooling operation in set abstraction abandons the spatial distribution information of local points and harms the representation capability of locally aggregated features from point clouds.

Therefore, in this section, we propose VectorPool aggregation module for local feature aggregation on the large-scale point clouds, which can better preserve spatial point distribution of local neighborhoods and also costs less memory/computation resources than the commonly-used set abstraction. Our PV-RCNN++ framework adopts it as a basic module to enable more effective and efficient 3D object detection.

Problem Statement The VectorPool aggregation module aims to generate informative local features for N target center points (denoted as \({\mathcal {Q}}=\{q_k \mid q_k \in {\mathbb {R}}^3\}_{k=1}^{N}\)) by learning from M given support points and their features (denoted as \({\mathcal {I}}=\{[h_i, a_i] \mid h_i \in {\mathbb {R}}^{C_{\text {in}}}, a_i \in {\mathbb {R}}^3\}_{i=1}^M\)), where \(C_{\text {in}}\) is the number of input feature channels and we aim to extract a local feature with \(C_{\text {out}}\) channels for each of the N target points in \({\mathcal {Q}}\).

VectorPool Aggregation on Point Clouds In our proposed VectorPool aggregation module, we propose to generate position-sensitive local features by encoding different spatial regions with separate kernel weights and separate feature channels, which are then concatenated as a single vector representation to explicitly represent the spatial structures of local point features.

Specifically, given a target center point \(q_k\), we first identify the support points that are within its cubic neighboring region, which can be represented as

$$\begin{aligned} {\mathcal {Y}}_k =\left\{ \left[ h_j, a_j\right] \bigg | \max (a_j - q_k) < 2\times \delta \right\} , \end{aligned}$$
(9)

where \([h_j, a_j] \in {\mathcal {I}}\), \(\delta \) is the half length of this cubic space, and \(\max (a_j-q_k) \in {\mathbb {R}}\) obtains the maximum axis-aligned value of this 3D distance. Note that we double the half length (i.e., \(2\times \delta \)) of the original cubic space to include more neighboring points for the local feature aggregation of this target point.

To generate position-sensitive features for this local cubic neighborhood centered at \(q_k\), we split its neighboring cubic space into \(n_x\times n_y\times n_z\) small local sub-voxels. Inspired by (Qi et al., 2017b), we utilize the inverse distance weighted strategy to interpolate the features of the \(t^{th}\) sub-voxel by considering its three nearest neighbors from \({\mathcal {Y}}_k\), where \(t\in \{1, \dots , n_x\times n_y\times n_z\}\) indicates the index of each sub-voxel, and we denote its corresponding sub-voxel center as \(v_t\in {\mathbb {R}}^3\). Then we can generate the features of the \(t^{th}\) sub-voxel as

$$\begin{aligned} h^{(v)}_t=\frac{\sum _{i\in {\mathcal {G}}_t}\left( w_i\cdot h_i\right) }{\sum _{i\in {\mathcal {G}}_t} w_i}, ~~w_{i}=(||a_{i} - v_t||)^{-1}, \end{aligned}$$
(10)

where \(\left[ h_i, a_{i}\right] \in {\mathcal {Y}}_k\), and \({\mathcal {G}}_t\) is the index set indicating the three nearest neighbors (i.e., \(|{\mathcal {G}}_t|=3\)) of \(v_t\) in the neighboring set \({\mathcal {Y}}_k\). The result \(h^{(v)}_t\) encodes the local features of the specific \(t^{th}\) sub-voxel in this local cubic space.

There are two alternative strategies for aggregating the features of local sub-voxels: simply averaging the features within each sub-voxel, or randomly choosing one point within each sub-voxel. Both of them produce many empty features for empty sub-voxels, which may degrade the performance. In contrast, our interpolation-based strategy can generate more effective features even for empty local sub-voxels.

The features in different local sub-voxels may represent very different local geometry. Hence, instead of encoding the local features with a shared-parameter MLP as in (Qi et al., 2017b), we propose to encode different local sub-voxels with separate local kernel weights to capture position-sensitive features as

$$\begin{aligned} U_t&= \text {Concat}\left( \left\{ a_i-v_t\right\} _{i\in {\mathcal {G}}_t},~ h^{(v)}_{t} \right) \times W_t, \end{aligned}$$
(11)

where \(\left\{ a_i-v_t\right\} _{i\in {\mathcal {G}}_t}\in {\mathbb {R}}^{(3\times 3=9)}\) indicates the relative positions of its three nearest neighbors, \(\text {Concat}(\cdot )\) is the concatenation operation to fuse the relative position and features. \(W_t \in {\mathbb {R}}^{(9+C_{\text {in}}) \times C_{\text {mid}}}\) is the learnable kernel weights for encoding the specific features of \(t^{th}\) local sub-voxel with feature channel \(C_{\text {mid}}\), and different positions have different learnable kernel weights for encoding position-sensitive local features.

Finally, we directly sort the local sub-voxel features \(U_t\) according to their spatial order along each of the three axes, and their features are sequentially concatenated to generate the final local vector representation as

$$\begin{aligned} {\mathcal {U}}=\text {MLP}\left( \text {Concat} \left( U_1, U_2, \dots , U_{n_x\times n_y\times n_z}\right) \right) , \end{aligned}$$
(12)

where \({\mathcal {U}} \in {\mathbb {R}}^{C_{\text {out}}}\). The inner sequential concatenation encodes structure-preserved local features by simply assigning the features of different locations to their corresponding feature channels, which naturally preserves the spatial structure of local features in the neighboring space centered at \(q_k\). This local vector representation is finally processed by several MLPs to encode the local features into \(C_{\text {out}}\) feature channels for the follow-up processing.
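The following PyTorch sketch summarizes the core of VectorPool aggregation (Eqs. (10)-(12)) for a single target center point; it assumes the cubic neighborhood query of Eq. (9) has already been applied to obtain the support points, and the channel sizes and voxel counts are illustrative assumptions rather than the configuration used in our experiments.

```python
# Sketch: per-sub-voxel interpolation (Eq. 10), position-specific kernels (Eq. 11),
# and sequential concatenation into a local vector representation (Eq. 12).
import torch
import torch.nn as nn

class VectorPoolSketch(nn.Module):
    def __init__(self, c_in=32, c_mid=16, c_out=64, n_voxel=3, half_len=1.2):
        super().__init__()
        self.n_voxel, self.half_len = n_voxel, half_len
        t = n_voxel ** 3
        # separate kernel weights W_t per sub-voxel -> position-sensitive features
        self.kernels = nn.Parameter(torch.randn(t, 9 + c_in, c_mid) * 0.01)
        self.out_mlp = nn.Sequential(nn.Linear(t * c_mid, c_out), nn.ReLU())

    def forward(self, center, support_xyz, support_feat):
        """center: (3,); support_xyz: (M, 3); support_feat: (M, C_in); assumes M >= 3."""
        n, half = self.n_voxel, self.half_len
        # sub-voxel centers of the n x n x n grid around the target center
        offsets = (torch.arange(n, dtype=torch.float32) + 0.5) / n * 2 * half - half
        zz, yy, xx = torch.meshgrid(offsets, offsets, offsets, indexing='ij')
        voxel_centers = center + torch.stack([xx, yy, zz], -1).view(-1, 3)   # (t, 3)
        sub_feats = []
        for t_idx, v_t in enumerate(voxel_centers):
            d = torch.norm(support_xyz - v_t, dim=1)
            w, nn3 = torch.topk(d, k=3, largest=False)                       # 3 nearest neighbors
            w = 1.0 / (w + 1e-8)                                             # inverse distance weights
            h_v = (w.unsqueeze(1) * support_feat[nn3]).sum(0) / w.sum()      # Eq. (10)
            rel = (support_xyz[nn3] - v_t).reshape(-1)                       # (9,) relative positions
            u_t = torch.cat([rel, h_v]) @ self.kernels[t_idx]                # Eq. (11)
            sub_feats.append(u_t)
        # sequential concatenation keeps the spatial order of the sub-voxels (Eq. 12)
        return self.out_mlp(torch.cat(sub_feats))

vp = VectorPoolSketch()
local_feat = vp(torch.zeros(3), torch.rand(64, 3) * 2 - 1, torch.rand(64, 32))
```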

It is worth noting that our VectorPool aggregation module can also be combined with channel reduction technique as in (Sun et al., 2018) to further reduce the computation/memory resources by summarizing the input feature channels before conducting VectorPool aggregation, and we provide the detailed ablation experiments in Sec. 5.3 and Table 10.

Compared with set abstraction, our VectorPool aggregation can greatly reduce the needed computations and memory resources by adopting channel summation and utilizing the proposed local vector representation before MLPs. Moreover, instead of conducting max-pooling on local point-wise features as in set abstraction, our proposed local vector representation can encode the position-sensitive features with different feature channels, to provide more effective representation for local feature learning.

VectorPool Aggregation in PV-RCNN++ Our proposed VectorPool aggregation is integrated into the PV-RCNN++ detection framework to replace set abstraction in both the voxel set abstraction module and the RoI-grid pooling module. Thanks to our VectorPool aggregation operation, the experiments demonstrate that PV-RCNN++ not only consumes much less memory and computation than the PV-RCNN framework, but also achieves better 3D detection performance.

Table 1 Performance comparison on the validation set of Waymo Open Dataset. \(*\): re-implemented by (Zhou et al., 2020). \(\dagger \): re-implemented by ourselves. \(\ddagger \): performance reported in the official open-source codebase of (Yin et al., 2021). “2f”, “3f”, “16f”: the performance is achieved by using multiple point cloud frames

5 Experiments

In this section, we first introduce our experimental setup and implementation details in Sec. 5.1. Then we present the main results of our PV-RCNN/PV-RCNN++ frameworks and compare with state-of-the-art methods in Sec. 5.2. Finally, we conduct extensive ablation experiments and analysis to investigate the individual components of our proposed frameworks in Sec. 5.3.

5.1 Experimental Setup

Datasets and Evaluation Metrics We evaluate our methods on the Waymo Open Dataset (Sun et al., 2020), which is currently the largest dataset with LiDAR point clouds for 3D object detection in autonomous driving scenarios. There are 798 training sequences with around 160k LiDAR samples, 202 validation sequences with 40k LiDAR samples, and 150 testing sequences with 30k LiDAR samples.

The evaluation metrics are calculated by the official evaluation tools, where the mean average precision (mAP) and the mAP weighted by heading (mAPH) are used for evaluation. The 3D IoU threshold is set to 0.7 for vehicle detection and 0.5 for pedestrian/cyclist detection. The comparison is conducted at two difficulty levels, where LEVEL 1 denotes ground-truth objects with at least five inside points and LEVEL 2 denotes ground-truth objects with at least one inside point. As adopted by the official Waymo evaluation server, the mAPH at LEVEL 2 difficulty is the most important evaluation metric for all experiments.

Network Architecture For the PV-RCNN framework, the 3D voxel CNN has four levels (see Fig. 1) with feature dimensions 16, 32, 64, 64, respectively. The two neighboring radii \(r_k\) of each level in the voxel set abstraction module are set as \((0.4\text {m}, 0.8\text {m})\), \((0.8\text {m},1.2\text {m})\), \((1.2\text {m}, 2.4\text {m})\), \((2.4\text {m}, 4.8\text {m})\), and the neighborhood radii of set abstraction for raw points are \((0.4\text {m}, 0.8\text {m})\). For the proposed RoI-grid pooling operation, we uniformly sample \(6\times 6\times 6\) grid points in each 3D proposal, and the two neighboring radii \(r^{(g)}\) of each grid point are \((0.8\text {m}, 1.6\text {m})\).

Table 2 Performance comparison on the test set of Waymo Open Dataset by submitting to the official test evaluation server. \(*\): re-implemented by (Zhou et al., 2020). \(\ddagger \): performance reported in the official open-source codebase of (Yin et al., 2021). “2f”, “3f”: the performance is achieved by using multiple point cloud frames

For the PV-RCNN++ framework, we set the maximum extended radius \(r^{(s)}=1.6m\) for proposal-centric filtering, and each scene is split into 6 sectors for parallel keypoint sampling. Two VectorPool aggregation operations are adopted to the \(4\times \) and \(8\times \) feature volumes of voxel set abstraction module with the half length \(\delta =(1.2m, 2.4m)\) and \(\delta =(2.4m, 4.8m)\) respectively, and both of them have local voxels \(n_x=n_y=n_z=3\). The VectorPool aggregation on raw points is set with \(n_x=n_y=n_z=2\). For RoI-grid pooling, we adopt the same number of RoI-grid points (\(6\times 6\times 6\)) as PV-RCNN, and the utilized VectorPool aggregation has local voxels \(n_x=n_y=n_z=3\), and half length \(\delta =(0.8m, 1.6m)\).

Training and Inference Details Both frameworks are trained from scratch in an end-to-end manner with the ADAM optimizer, a learning rate of 0.01 and cosine annealing learning rate decay. To train the proposal refinement stage, we randomly sample 128 proposals with a 1:1 ratio of positive and negative proposals, where a proposal is considered a positive sample if it has at least 0.55 3D IoU with the ground-truth boxes; otherwise it is treated as a negative sample. Both frameworks are trained with three losses with equal loss weights (i.e., region proposal loss, keypoint segmentation loss and proposal refinement loss), where the region proposal loss is the same as in (Yin et al., 2021) and the proposal refinement loss is the same as in (Shi et al., 2020b).

During training, we adopt the widely used data augmentation strategies for 3D detection, including random scene flipping, global scaling with a scaling factor sampled from [0.95, 1.05], global rotation around the Z axis with an angle sampled from \([-\frac{\pi }{4}, \frac{\pi }{4}]\), and the ground-truth sampling augmentation (Yan et al., 2018) that randomly “pastes” some objects from other scenes into the current training scene to simulate objects in various environments. The detection range is set as \([-75.2, 75.2]m\) for the X and Y axes and \([-2, 4]m\) for the Z axis, while the voxel size is set as (0.1m, 0.1m, 0.15m). More training details can be found in our open-source codebase https://github.com/open-mmlab/OpenPCDet.
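For clarity, a minimal sketch of the global rotation and scaling augmentation is given below; it assumes boxes are encoded as (x, y, z, dx, dy, dz, heading), which is an illustrative convention.

```python
# Sketch: global rotation around the Z axis and global scaling applied jointly
# to the point cloud and the ground-truth boxes.
import numpy as np

def global_rotate_scale(points, boxes, rot_range=np.pi / 4, scale_range=(0.95, 1.05)):
    """points: (N, 3+); boxes: (M, 7) as (x, y, z, dx, dy, dz, heading)."""
    angle = np.random.uniform(-rot_range, rot_range)
    scale = np.random.uniform(*scale_range)
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rot = np.array([[cos_a, -sin_a, 0.0], [sin_a, cos_a, 0.0], [0.0, 0.0, 1.0]])
    points, boxes = points.copy(), boxes.copy()
    points[:, :3] = points[:, :3] @ rot.T * scale
    boxes[:, :3] = boxes[:, :3] @ rot.T * scale       # rotate and scale box centers
    boxes[:, 3:6] *= scale                            # scale box sizes
    boxes[:, 6] += angle                              # update heading angles
    return points, boxes

aug_points, aug_boxes = global_rotate_scale(np.random.rand(1000, 4), np.random.rand(8, 7))
```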

In terms of inference speed, our PV-RCNN++ framework achieves state-of-the-art performance at 10 FPS for a \(150m \times 150m\) detection range on the Waymo Open Dataset (three times faster than PV-RCNN), where a single TITAN RTX GPU card is utilized for profiling.

5.2 Main Results

In this section, we demonstrate the main results of our proposed PV-RCNN/PV-RCNN++ frameworks, and make the comparison with state-of-the-art methods on the large-scale Waymo Open Dataset (Sun et al., 2020). By default, we adopt the center-based RPN head as in (Yin et al., 2021) to generate 3D proposals in the first stage, and we train a single model in each setting for detecting the objects of all three categories.

Comparison with State-of-the-Art Methods As shown in Table 1, for the 3D object detection setting that takes a single-frame point cloud as input, our PV-RCNN++ (i.e., “PV-RCNN++”) outperforms previous state-of-the-art works (Yin et al., 2021; Shi et al., 2020b) on all three categories with remarkable performance gains (+1.88% for vehicle, +2.40% for pedestrian and +1.59% for cyclist in terms of mAPH at LEVEL 2 difficulty). Moreover, following (Sun et al., 2021), by simply concatenating an extra neighboring past frame as input, our framework can also be evaluated in the multi-frame setting. Table 1 (i.e., “PV-RCNN++2f”) demonstrates that the performance of our PV-RCNN++ framework can be further boosted by using 2 frames, which outperforms the previous multi-frame method (Sun et al., 2021) by remarkable margins (+2.60% for vehicle and +5.61% for pedestrian in terms of mAPH at LEVEL 2).

Meanwhile, we also evaluate our frameworks on the test set by submitting to the official test server of the Waymo Open Dataset (Sun et al., 2020). As shown in Table 2, without bells and whistles, in both the single-frame and multi-frame settings, our PV-RCNN++ framework consistently and significantly outperforms the previous state-of-the-art (Yin et al., 2021) on both the vehicle and pedestrian categories: in the single-frame setting we achieve a performance gain of +1.57% for vehicle and +2.00% for pedestrian in terms of mAPH at LEVEL 2 difficulty, and in the multi-frame setting we achieve a performance gain of +2.93% for vehicle detection and +2.03% for pedestrian detection. We also achieve comparable performance for the cyclist category in both the single-frame and multi-frame settings. Note that we do not use any test-time augmentation or model ensemble tricks in the evaluation process. The significant improvements on the large-scale Waymo Open Dataset manifest the effectiveness of our proposed framework.

Table 3 Performance comparison of PV-RCNN and PV-RCNN++ on the validation set of Waymo Open Dataset. We adopt two settings for both frameworks by equipping them with different RPN heads for proposal generation: the anchor-based RPN head as in (Shi et al., 2020b) and the center-based RPN head as in (Yin et al., 2021). Note that PV-RCNN++ adopts the same backbone network (without residual connections) as PV-RCNN for a fair comparison
Table 4 Performance comparison of PV-RCNN and PV-RCNN++ on the test set of the KITTI dataset. The results are evaluated on the moderate difficulty level, the most important level of the KITTI evaluation metric, by submitting to the official KITTI evaluation server

Comparison of PV-RCNN and PV-RCNN++ Table 3 demonstrates that no matter which type of RPN head is adopted, our PV-RCNN++ framework consistently outperforms the previous PV-RCNN framework on all three categories at all difficulty levels. Specifically, for the anchor-based setting, PV-RCNN++ surpasses PV-RCNN with a performance gain of +1.54% for vehicle, +3.33% for pedestrian and +4.24% for cyclist in terms of mAPH at LEVEL 2 difficulty. With the center-based head, PV-RCNN++ also outperforms PV-RCNN with a +0.93% mAPH gain for vehicle, a +1.58% mAPH gain for pedestrian and a +1.83% mAPH gain for cyclist at LEVEL 2 difficulty.

The stable and consistent improvements prove the effectiveness of our proposed sectorized proposal-centric sampling algorithm and VectorPool aggregation module. More importantly, our PV-RCNN++ requires much less computation and GPU memory than the PV-RCNN framework, while also increasing the processing speed from 3.3 FPS to 10 FPS for 3D detection over a large \(150m \times 150m\) area, which further validates the efficiency and effectiveness of PV-RCNN++.

As shown in Table 4, we also provide a performance comparison of PV-RCNN and PV-RCNN++ on the KITTI dataset (Geiger et al., 2012). Compared with the Waymo Open Dataset, the KITTI dataset adopts a different kind of LiDAR sensor and its scenes are about four times smaller. Table 4 shows that PV-RCNN++ outperforms the previous PV-RCNN on all three categories of the KITTI dataset with a remarkable average performance margin, demonstrating its effectiveness in handling different kinds of scenes and different LiDAR sensors.

5.3 Ablation Study

In this section, we investigate the individual components of our PV-RCNN++ framework with extensive ablation experiments. We conduct all experiments on the large-scale Waymo Open Dataset (Sun et al., 2020). To conduct the ablation experiments efficiently, we generate a small representative training set by uniformly sampling \(20\%\) of the frames (about 32k frames) from the training set, and all results are evaluated on the full validation set (about 40k frames) with the official evaluation tool. All models are trained for 30 epochs with a batch size of 16 on 8 GPUs.

We conduct all ablation experiments with the center-based RPN head (Yin et al., 2021) on three categories (vehicle, pedestrian and cyclist) of Waymo Open Dataset (Sun et al., 2020), and the mAPH of LEVEL 2 difficulty is adopted as the evaluation metric.

Table 5 Effects of voxel set abstraction (VSA) and RoI-grid pooling modules, where the UNet-decoder and RoI-aware pooling are the same with (Shi et al., 2020b). All experiments are based on PV-RCNN++ framework with a center-based RPN head
Table 6 Effects of different feature components for voxel set abstraction. “Frame Rate” indicates frames per second in terms of testing speed. All experiments are conducted on the PV-RCNN++ framework with a center-based RPN head. Note that the default setting of PV-RCNN++ does not use the voxel features \(f_i^{(\text {pv}_{1, 2})}\) considering their negligible gain and higher latency

Effects of Voxel-to-Keypoint Scene Encoding In Sec. 3.2, we propose the voxel-to-keypoint scene encoding strategy to encode the global scene features into a small set of keypoints, which serves as a bridge between the backbone network and the proposal refinement network. As shown in the \(2^{nd}\) and \(4^{th}\) rows of Table 5, our proposed voxel-to-keypoint scene encoding strategy achieves better performance than the UNet-based decoder while summarizing the scene into far fewer point-wise features. For instance, our voxel set abstraction module encodes the whole scene into around 4k keypoints for feeding into the RoI-grid pooling module, while the UNet-based decoder network needs to summarize the scene into around 80k point-wise features in most cases, which validates the effectiveness of our proposed voxel-to-keypoint scene encoding strategy. We consider that it might benefit from the fact that the keypoint features are aggregated from multi-scale feature volumes and raw point clouds with large receptive fields, while also keeping accurate point locations. Besides, we should also note that the feature dimension of the UNet-based decoder is generally smaller than that of our keypoints, since the UNet-based decoder is limited by its large memory consumption on large-scale point clouds, which may degrade its performance.

We also notice that our voxel set abstraction module achieves worse performance (the \(1^{st}\) and \(3^{rd}\) rows of Table 5) than the UNet-decoder when it is combined with RoI-aware pooling (Shi et al., 2020b). This is to be expected, since the RoI-aware pooling module generates lots of empty voxels in each proposal when taking only about 4k keypoints as input, which may degrade the performance. In contrast, our voxel set abstraction module can be ideally combined with our RoI-grid pooling module, and they benefit each other by taking a small number of keypoints as the intermediate connection.

Effects of Different Features for Voxel Set Abstraction The voxel set abstraction module incorporates multiple feature components (see Sec. 3.2), and their effects are explored in Table 6. We summarize the observations as follows: (i) The performance drops a lot if we only aggregate features from the high-level bird’s-eye-view semantic features (\(f_i^{(\text {bev})}\)) or the accurate point locations (\(f_i^{(\text {raw})}\)), since neither 2D semantic features alone nor raw points alone are sufficient for proposal refinement. (ii) As shown in the 6\(^{th}\) row of Table 6, \(f_i^{(\text {pv}_3)}\) and \(f_i^{(\text {pv}_4)}\) contain both 3D structure information and high-level semantic features, which improve the performance significantly when combined with the bird’s-eye-view semantic features \(f_i^{(\text {bev})}\) and the raw point locations \(f_i^{(\text {raw})}\). (iii) The shallow semantic features \(f_i^{(\text {pv}_1)}\) and \(f_i^{(\text {pv}_2)}\) can slightly improve the performance but also greatly increase the training cost. Hence, the proposed PV-RCNN++ framework does not use these shallow semantic features.

Table 7 Effects of Predicted Keypoint Weighting module. All experiments are conducted on our PV-RCNN++ framework with a center-based RPN head

Effects of Predicted Keypoint Weighting The predicted keypoint weighting is proposed in Sec. 3.2 to re-weight the point-wise features of keypoints with extra keypoint segmentation supervision. As shown in Table 7, the performance slightly drops after removing this module, which demonstrates that the predicted keypoint weighting enables better multi-scale feature aggregation by focusing more on the foreground keypoints, since they are more important for the succeeding proposal refinement network. Although this module adds only a small cost to our frameworks, it remains optional considering its limited gains.

Table 8 Effects of different keypoint sampling algorithms. The running time is the average running time of the keypoint sampling process on the validation set of the Waymo Open Dataset. The coverage rate is averaged over all scenes of the same validation set. “FPS” indicates farthest point sampling and “PC-Filter” indicates our proposal-centric filtering strategy. All experiments apply the different keypoint sampling algorithms to our PV-RCNN++ framework with a center-based RPN head
Fig. 6 Illustration of the keypoint distributions from different keypoint sampling strategies. Dashed circles highlight the missing regions and the clustered keypoints produced by the different sampling strategies. Our Sectorized-FPS generates more uniformly distributed keypoints that cover more input points and thus better encode the scene features for proposal refinement, while the other strategies may miss important regions or produce clustered keypoints

Effects of RoI-grid Pooling Module The RoI-grid pooling module is proposed in Sec. 3.3 for aggregating RoI features from the very sparse keypoints. Here we investigate its effects by replacing it with RoI-aware pooling (Shi et al., 2020b) while keeping the other modules unchanged. As shown in the \(3^{rd}\) and \(4^{th}\) rows of Table 5, the performance drops significantly when RoI-grid pooling is replaced. This validates that our proposed RoI-grid pooling module aggregates much richer contextual information to generate more discriminative RoI features.

Compared with the previous RoI-aware pooling module (Shi et al., 2020b), our RoI-grid pooling module generates a denser grid-wise feature representation by allowing different grid points to have overlapping ball neighborhoods, while the RoI-aware pooling module may produce many zero entries due to the sparsity of points inside the RoIs. Hence, our RoI-grid pooling module is especially effective for aggregating local features from very sparse point-wise features, such as the small number of keypoints in our PV-RCNN framework.
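The following simplified sketch (PyTorch; hypothetical function name, a single radius and mean aggregation instead of the multi-radius set abstraction used in the actual module) illustrates how grid points with overlapping ball neighborhoods aggregate features from sparse keypoints:

```python
import torch

def roi_grid_pooling(keypoints, keypoint_feats, grid_points, radius=0.8):
    """Simplified sketch of RoI-grid pooling: for each RoI grid point, average
    the features of keypoints that fall inside its ball neighborhood. Overlapping
    balls among grid points are allowed, so every grid point receives context.
    keypoints: (N, 3), keypoint_feats: (N, C), grid_points: (G, 3).
    The radius value is illustrative."""
    dist = torch.cdist(grid_points, keypoints)              # (G, N) pairwise distances
    mask = (dist < radius).float()                          # membership of each ball
    counts = mask.sum(dim=1, keepdim=True).clamp(min=1.0)   # avoid division by zero
    return mask @ keypoint_feats / counts                   # (G, C) grid-wise features
```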

Effects of Proposal-Centric Filtering In the \(1^{st}\) and \(2^{nd}\) rows of Table 8, we investigate the effectiveness of our proposal-centric keypoint filtering (see Sec. 4.1). Compared with the strong baseline PV-RCNN++ framework equipped with vanilla farthest point sampling, our proposal-centric keypoint filtering further improves the average detection performance by 1.12 mAPH on LEVEL 2 difficulty (65.87% vs. 66.99%). This supports our argument that the proposal-centric keypoint sampling strategy generates more representative keypoints by concentrating the limited keypoint budget on the more informative regions around the proposals. Moreover, with proposal-centric keypoint filtering, keypoint sampling is about five times faster (133ms for vanilla farthest point sampling vs. 27ms with filtering), since the number of candidate keypoints is greatly reduced.
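A minimal sketch of the filtering idea is shown below (PyTorch; it uses axis-aligned enlarged proposal regions and an assumed margin for brevity, whereas the actual implementation operates on the oriented proposal boxes):

```python
import torch

def proposal_centric_filter(points, proposal_centers, proposal_sizes, extra=1.6):
    """Sketch of proposal-centric filtering: keep only the raw points lying within
    an (enlarged) axis-aligned neighborhood of any proposal, so the subsequent
    farthest point sampling runs on far fewer candidates.
    points: (N, 3), proposal_centers: (M, 3), proposal_sizes: (M, 3).
    `extra` is an assumed enlargement margin in meters."""
    half = proposal_sizes / 2.0 + extra                                   # (M, 3)
    offset = (points[:, None, :] - proposal_centers[None, :, :]).abs()    # (N, M, 3)
    inside_any = (offset <= half[None, :, :]).all(dim=-1).any(dim=-1)     # (N,)
    return points[inside_any]
```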

Table 9 Effects of different components in our proposed PV-RCNN++ frameworks. All models are trained with \(20\%\) frames from the training set and are evaluated on the full validation set of the Waymo Open Dataset, and the evaluation metric is the mAPH in terms of LEVEL_1 (L1) and LEVEL_2 (L2) difficulties as used in (Sun et al., 2020). “FPS” denotes farthest point sampling, “SPC-FPS” denotes our proposed sectorized proposal-centric keypoint sampling strategy, “VSA” denotes the voxel set abstraction module, “SA” denotes the set abstraction operation and “VP” denotes our proposed VectorPool aggregation. All models adopt the center-based RPN head for proposal generation

Effects of Sectorized Keypoint Sampling To investigate the effects of sectorized farthest point sampling (Sec. 4.1), we compare it with four alternative strategies for accelerating the keypoint sampling process: (i) Random Sampling: the keypoints are randomly chosen from the raw points. (ii) Voxelized-FPS-Voxel: the raw points are first voxelized to reduce the number of points (i.e., voxels), and farthest point sampling is then applied to the voxels, taking the voxel centers as keypoints. (iii) Voxelized-FPS-Point: unlike Voxelized-FPS-Voxel, a raw point is randomly selected within each selected voxel as the keypoint. (iv) RandomParallel-FPS: the raw points are randomly split into several groups, and farthest point sampling is applied to these groups in parallel for faster keypoint sampling.

As shown in Table 8, compared with the vanilla farthest point sampling algorithm (\(2^{nd}\) row), the detection performance of all four alternative strategies drops considerably. In contrast, our proposed sectorized farthest point sampling performs on par with vanilla farthest point sampling (66.99% vs. 66.87%) while being about three times faster (9ms vs. 27ms).
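For clarity, the following sketch (PyTorch; a plain sequential loop, whereas the actual implementation parallelizes the per-sector sampling on GPU and balances the per-sector keypoint budget exactly) illustrates how sectorized farthest point sampling splits the points by azimuth before sampling:

```python
import math
import torch

def farthest_point_sampling(points, k):
    """Plain O(k*N) farthest point sampling on (N, 3) points; returns k indices."""
    n = points.shape[0]
    k = min(k, n)
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = 0
    for i in range(k):
        idx[i] = farthest
        d = ((points - points[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)
        farthest = int(torch.argmax(dist))
    return idx

def sectorized_fps(points, num_keypoints, num_sectors=6):
    """Sketch of sectorized FPS: split the points into angular sectors around the
    LiDAR origin (exploiting their radial distribution) and run FPS per sector."""
    angle = torch.atan2(points[:, 1], points[:, 0])                 # azimuth in (-pi, pi]
    sector = ((angle + math.pi) / (2 * math.pi) * num_sectors).long()
    sector = sector.clamp(max=num_sectors - 1)
    sampled = []
    for s in range(num_sectors):
        pts = points[sector == s]
        if pts.shape[0] == 0:
            continue
        # proportional per-sector budget; the real implementation balances this exactly
        k_s = max(1, round(num_keypoints * pts.shape[0] / points.shape[0]))
        sampled.append(pts[farthest_point_sampling(pts, k_s)])
    return torch.cat(sampled, dim=0)
```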

Analysis of the Coverage Rate of Keypoints We argue that uniformly distributed keypoints are important for proposal refinement: a better keypoint distribution should cover more of the input points. Hence, to evaluate the quality of different keypoint sampling strategies, we propose an evaluation metric, the coverage rate, defined as the ratio of input points that lie within the coverage region of at least one keypoint. Specifically, for a set of input points \({\mathcal {P}}=\{p_i\}_{i=1}^m\) and a set of sampled keypoints \({\mathcal {K}}=\{p'_j\}_{j=1}^n\), the coverage rate \({\textbf{C}}\) is formulated as:

$$\begin{aligned} {\textbf{C}} = \frac{1}{m} \sum _{i=1}^{m} \min \Big (1,\; \sum _{j=1}^{n}\mathbbm {1}\big (\Vert p_i - p'_j\Vert < R_c\big )\Big ), \end{aligned}$$
(13)

where \(R_c\) is a scalar that denotes the coverage radius of each keypoint, and \(\mathbbm {1}\left( \cdot \right) \in \{0, 1\}\) is the indicator function to indicate whether \(p_i\) is covered by \(p'_j\).
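Since taking \(\min(1, \cdot)\) over the indicator sum simply checks whether a point is covered by at least one keypoint, Eq. (13) can be computed directly as in the following sketch (PyTorch; hypothetical function name):

```python
import torch

def coverage_rate(points, keypoints, radius):
    """Coverage rate of Eq. (13): the fraction of input points lying within the
    coverage radius R_c of at least one keypoint.
    points: (m, 3), keypoints: (n, 3), radius: scalar R_c."""
    dist = torch.cdist(points, keypoints)          # (m, n) pairwise distances
    covered = (dist < radius).any(dim=1).float()   # 1 if covered by any keypoint
    return covered.mean().item()
```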

As shown in Table 8, we evaluate the coverage rate of different keypoint sampling algorithms under multiple coverage radii. Our sectorized farthest point sampling achieves an average coverage rate (84.76%) similar to that of vanilla farthest point sampling (84.78%), and both are much higher than those of the other sampling algorithms. The higher average coverage rate demonstrates that our sectorized farthest point sampling produces more uniformly distributed keypoints that better cover the input points, which is consistent with the qualitative results of the different sampling strategies in Fig. 6.

In short, our sectorized farthest point sampling generates uniformly distributed keypoints that better cover the input points by splitting the raw points into groups according to the radial distribution of LiDAR points. Although a very small number of clustered keypoints may still appear at the boundaries between sectors, the experiments show that they have a negligible effect on the performance. We attribute this to the fact that the clustered keypoints are mostly located near the scene center, where objects are generally easier to detect since the points there are much denser than in distant regions.

Effects of VectorPool Aggregation In Sec. 4.2, to tackle the resource-consumption problem of set abstraction, we propose the VectorPool aggregation module to effectively and efficiently summarize structure-preserved local features from point clouds. As shown in Table 9, by adopting VectorPool aggregation in both the voxel set abstraction module and the RoI-grid pooling module, the PV-RCNN++ framework requires much less computation (a reduction of 4.679 GFLOPS) and GPU memory (from 10.62GB to 7.58GB) than the original PV-RCNN framework, while the performance is consistently increased from 65.92% to 66.87% in terms of the average mAPH (LEVEL 2) over three categories. Note that the batch size is set to only 2 in all of our settings, and the savings in memory and computation become more significant with larger batch sizes.

The significant reduction of resource consumption demonstrates the effectiveness of our VectorPool aggregation for feature learning from large-scale point clouds, making our PV-RCNN++ framework a more practical 3D detector for resource-limited devices. Moreover, the PV-RCNN++ framework also benefits from the structure-preserved spatial features produced by VectorPool aggregation, which are critical for the subsequent fine-grained proposal refinement.

We further analyze the effects of VectorPool aggregation by removing channel reduction (Sun et al., 2018) from it. As shown in Table 10, our VectorPool aggregation reduces memory consumption regardless of whether channel reduction is used (comparing the \(1^{st}\)/\(3^{rd}\) rows or the \(2^{nd}\)/\(4^{th}\) rows), since the activations in our VectorPool aggregation modules consume much less memory than those in set abstraction by adopting a single local vector representation before the multi-layer perceptron networks. Table 10 also shows that our VectorPool aggregation achieves better performance than set abstraction (Qi et al., 2017b) in both cases (with or without channel reduction). Meanwhile, we notice that VectorPool aggregation slightly increases the number of parameters compared with the previous set abstraction module (e.g., from 13.05M to 14.11M in the setting with channel reduction), which is generally negligible given that VectorPool aggregation consumes less GPU memory.

Table 10 Effects of VectorPool aggregation with and without channel reduction (Sun et al., 2018). “SA” denotes set abstraction, “VP” denotes VectorPool aggregation module and “CR” denotes channel reduction. “#Param.” indicates the number of parameters of the model. All experiments are based on our PV-RCNN++ framework with a center-based RPN head for proposal generation, and only the local feature extraction modules are changed during the ablation experiments
Table 11 Effects of the feature aggregation strategies to generate the local sub-voxel features of VectorPool aggregation. All experiments are based on our PV-RCNN++ framework with a center-based RPN head for proposal generation

Effects of Different Feature Aggregation Strategies for Local Sub-Voxels As mentioned in Sec. 4.2, in addition to our adopted interpolation-based method, there are two alternative strategies (average pooling and random selection) for aggregating features of local sub-voxels. Table 11 demonstrates that our interpolation-based feature aggregation achieves much better performance than the other two strategies, especially for small objects such as pedestrians and cyclists. We consider that our strategy generates more effective features by interpolating from the three nearest neighbors (possibly beyond the sub-voxel), while both alternatives may produce many zero features on empty sub-voxels, which can degrade the final performance.
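The following sketch (PyTorch; hypothetical function name, plain inverse-distance weighting) illustrates the interpolation-based aggregation: each local sub-voxel center interpolates from its three nearest neighbor points, so even empty sub-voxels receive informative features:

```python
import torch

def interpolate_subvoxel_features(subvoxel_centers, neighbor_xyz, neighbor_feats,
                                  k=3, eps=1e-8):
    """Sketch of interpolation-based sub-voxel aggregation: each sub-voxel center
    takes an inverse-distance-weighted average of the features of its k (=3)
    nearest neighbor points, which may lie outside the sub-voxel itself.
    subvoxel_centers: (V, 3), neighbor_xyz: (N, 3), neighbor_feats: (N, C)."""
    dist = torch.cdist(subvoxel_centers, neighbor_xyz)        # (V, N)
    knn_dist, knn_idx = dist.topk(k, dim=1, largest=False)    # (V, k) nearest neighbors
    weights = 1.0 / (knn_dist + eps)
    weights = weights / weights.sum(dim=1, keepdim=True)      # normalized IDW weights
    return (neighbor_feats[knn_idx] * weights.unsqueeze(-1)).sum(dim=1)  # (V, C)
```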

Table 12 Effects of separate local kernel weights and the number of dense voxels in our proposed VectorPool aggregation module. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation

Effects of Separate Local Kernel Weights in VectorPool Aggregation We adopt separate local kernel weights (see Eq. (11)) on different local sub-voxels to generate position-sensitive features. The \(1^{st}\) and \(2^{nd}\) rows of Table 12 show that the performance drops slightly if we remove the separate local kernel weights and adopt shared kernel weights for relative position encoding. This validates that separate local kernel weights are better than the previous shared-parameter MLP for local feature encoding and are important in our proposed VectorPool aggregation module.
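A hedged sketch of this design choice is given below (PyTorch, illustrative channel sizes): concatenating the sub-voxel features into a single local vector and applying one linear projection is equivalent to assigning each sub-voxel position its own kernel weights, in contrast to a shared MLP applied independently to every sub-voxel.

```python
import torch
import torch.nn as nn

class SeparateLocalKernelEncoding(nn.Module):
    """Sketch of the separate-local-kernel-weights idea: the V sub-voxel features
    are concatenated into one local vector and projected with a single linear
    layer, so each sub-voxel position effectively receives its own kernel weights
    (position-sensitive encoding). Channel sizes are illustrative."""
    def __init__(self, num_subvoxels=27, in_channels=32, out_channels=128):
        super().__init__()
        self.proj = nn.Linear(num_subvoxels * in_channels, out_channels)

    def forward(self, subvoxel_feats):            # (B, V, C_in)
        local_vector = subvoxel_feats.flatten(1)  # (B, V * C_in); order encodes position
        return self.proj(local_vector)            # (B, C_out)
```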

Effects of Dense Voxel Numbers in VectorPool Aggregation Table 12 investigates the number of dense voxels \(n_x\times n_y \times n_z\) in VectorPool aggregation for the voxel set abstraction module and the RoI-grid pooling module. VectorPool aggregation with \(3\times 3\times 3\) and \(4 \times 4 \times 4\) dense voxels achieves similar performance, while the performance of the \(2\times 2\times 2\) setting drops considerably. We consider that our interpolation-based VectorPool aggregation can generate effective voxel-wise features even with many dense voxels, which is why the \(4\times 4 \times 4\) setting achieves slightly better performance than the \(3\times 3\times 3\) setting. However, since the \(4 \times 4 \times 4\) setting greatly increases the computation and memory consumption, we finally choose the \(3\times 3\times 3\) dense voxel representation in both the voxel set abstraction module (except the raw-point layer) and the RoI-grid pooling module of our PV-RCNN++ framework.

Table 13 Effects of the number of keypoints for encoding the global scene. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation

Effects of the Number of Keypoints In Table 13, we investigate the effect of the number of keypoints used to encode the scene features. Table 13 shows that more keypoints lead to better performance, and the performance saturates beyond 4,096 keypoints. Hence, to balance performance and computation cost, we empirically encode the whole scene into 4,096 keypoints for the Waymo Open Dataset. These experiments show that our method can effectively encode the whole scene into 4,096 keypoints while maintaining performance comparable to using many more keypoints, which demonstrates the effectiveness of the keypoint feature encoding strategy of our proposed PV-RCNN/PV-RCNN++ frameworks.

Table 14 Effects of the grid size in RoI-grid pooling module. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation
Table 15 Comparison of PV-RCNN/PV-RCNN++ on different sizes of scenes. “FoV” indicates the field of view of each scene: for each scene in the Waymo Open Dataset, we crop a specific angle (e.g., 90\(^{\circ }\), 180\(^{\circ }\)) of the frontal view for training and testing, and 360\(^{\circ }\) FoV indicates the original scene. “#Points” indicates the average number of points in each scene. “Frame Rate” indicates frames per second in terms of testing speed

Effects of the Grid Size in RoI-grid Pooling. Table 14 shows the performance of different RoI-grid sizes for the RoI-grid pooling module. The performance increases with the RoI-grid size from \(3\times 3 \times 3\) to \(6\times 6\times 6\), and settings with RoI-grid sizes larger than \(6\times 6\times 6\) achieve similar performance. Hence we adopt an RoI-grid size of \(6\times 6\times 6\) for the RoI-grid pooling module. Moreover, from Tables 14 and 9, we also notice that PV-RCNN++ with a much smaller RoI-grid size of \(4\times 4 \times 4\) (66.45% mAPH@L2) still outperforms PV-RCNN with the larger RoI-grid size of \(6\times 6\times 6\) (65.24% mAPH@L2), which further validates the effectiveness of our proposed sectorized proposal-centric sampling strategy and the VectorPool aggregation module.

Comparison on Different Sizes of Scenes. To investigate the ability of our proposed PV-RCNN++ to handle large-scale scenes, we conduct ablation experiments to compare the effectiveness and efficiency of the PV-RCNN and PV-RCNN++ frameworks on scenes of different sizes. As shown in Table 15, we compare the two frameworks on three scene sizes by cropping different angles of the frontal view of each scene in the Waymo Open Dataset for training and testing. The PV-RCNN++ framework consistently outperforms the previous PV-RCNN framework on all three scene sizes with large performance gains. Table 15 also shows that as the scenes get larger, PV-RCNN++ becomes much more efficient than PV-RCNN. In particular, on the original scenes of the Waymo Open Dataset, PV-RCNN++ runs about three times faster than PV-RCNN, demonstrating its efficiency in handling large-scale scenes.

6 Conclusion

In this paper, we present two novel frameworks, PV-RCNN and PV-RCNN++, for accurate 3D object detection from point clouds. Our PV-RCNN framework adopts a novel voxel set abstraction module to deeply integrate both the multi-scale 3D voxel CNN features and the PointNet-based features into a small set of keypoints, and the learned discriminative keypoint features are then aggregated to the RoI-grid points through our proposed RoI-grid pooling module to capture much richer contextual information for proposal refinement. Our PV-RCNN++ further improves the PV-RCNN framework by efficiently generating more representative keypoints with our novel sectorized proposal-centric keypoint sampling strategy, and by adopting our proposed VectorPool aggregation module to learn structure-preserved local features in both the voxel set abstraction module and the RoI-grid pooling module. As a result, PV-RCNN++ achieves better performance with a much faster running speed than the original PV-RCNN framework.

Our final PV-RCNN++ framework significantly outperforms previous 3D detection methods and achieves state-of-the-art performance on both the validation and test sets of the large-scale Waymo Open Dataset. Extensive experiments have been designed and conducted to thoroughly investigate the individual components of our proposed frameworks.