Abstract
3D bounding boxes are a widespread intermediate representation in many computer vision applications. However, predicting them is a challenging task, largely due to partial observability, which motivates the need for a strong sense of uncertainty. While many recent methods have explored better architectures for consuming sparse and unstructured point cloud data, we hypothesize that there is room for improvement in the modeling of the output distribution and explore how this can be achieved using an autoregressive prediction head. Additionally, we release a simulated dataset, COB3D, which highlights new types of ambiguity that arise in real-world robotics applications, where 3D bounding box prediction has largely been underexplored. We propose methods for leveraging our autoregressive model to make high-confidence predictions and meaningful uncertainty measures, achieving strong results on SUN RGB-D, ScanNet, KITTI, and our new dataset (code and dataset are available at bbox.yuxuanliu.com).
References
Choi, J., Chun, D., Kim, H., Lee, H.J.: Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 502–511 (2019)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Freitag, M., Al-Onaizan, Y.: Beam search strategies for neural machine translation. In: Proceedings of the First Workshop on Neural Machine Translation, pp. 56–60. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/W17-3207. https://aclanthology.org/W17-3207
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)
Gilitschenski, I., Sahoo, R., Schwarting, W., Amini, A., Karaman, S., Rus, D.: Deep orientation uncertainty learning based on a Bingham loss. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=ryloogSKDS
Hall, D., et al.: Probabilistic object detection: definition and evaluation, November 2018
He, Y., Zhu, C., Wang, J., Savvides, M., Zhang, X.: Bounding box regression with uncertainty for accurate object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2888–2897 (2019)
Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/f0bda020d2470f2e74990a07a607ebd9-Paper.pdf
Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3D object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2949–2958 (2021)
Metz, L., Ibarz, J., Jaitly, N., Davidson, J.: Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035 (2017)
Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: an efficient probabilistic 3D object detector for autonomous driving. In: CVPR, pp. 12677–12686. Computer Vision Foundation/IEEE (2019). https://dblp.uni-trier.de/db/conf/cvpr/cvpr2019.html
Meyer, G.P., Thakurdesai, N.: Learning an uncertainty-aware object detector for autonomous driving. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10521–10527 (2020)
Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)
Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082 (2017)
van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
Peretroukhin, V., Giamou, M., Rosen, D.M., Greene, W.N., Roy, N., Kelly, J.: A smooth representation of SO(3) for deep rotation learning with uncertainty. In: Proceedings of Robotics: Science and Systems (RSS 2020), 12–16 July 2020 (2020)
Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: ImVoteNet: boosting 3D object detection in point clouds with image votes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: ICCV, pp. 9276–9285. IEEE (2019). https://dblp.uni-trier.de/db/conf/iccv/iccv2019.html
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
Rukhovich, D., Vorontsova, A., Konushin, A.: FCAF3D: fully convolutional anchor-free 3D object detection. arXiv preprint arXiv:2112.00322 (2021)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019 (2019)
Shi, S., et al.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020, pp. 10526–10535. Computer Vision Foundation/IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.01054. https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_PV-RCNN_Point-Voxel_Feature_Set_Abstraction_for_3D_Object_Detection_CVPR_2020_paper.html
Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)
Van Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International Conference on Machine Learning, pp. 1747–1756. PMLR (2016)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
Xie, Q., et al.: MLCVNet: multi-level context VoteNet for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
Zhong, Y., Zhu, M., Peng, H.: Uncertainty-aware voxel-based 3D object detection and tracking with von Mises loss. ArXiv abs/2011.02553 (2020)
Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5738–5746 (2019)
Appendices
A Model Architecture and Training
A.1 Autoregressive 3D Bounding Box Estimation
For bounding box estimation, our model operates on 2D detection patch outputs of size 96\(\,\times \,\)96. We use the 2D bounding box from object detection to crop and resize the following features for each object: 3D point cloud, depth uncertainty score, normals, instance mask, and amodal instance mask (which includes the occluded regions of the object). We normalize each point p in the point cloud with the 0.25 (\(Q_1\)) and 0.75 (\(Q_3\)) quantiles per dimension using \(\frac{p-c_0}{s}\) for \(c_0=\frac{Q_1+Q_3}{2}\), \(s=Q_3-Q_1\). We omit RGB, since we found it was not necessary for training and omitting it improved generalization.
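The per-dimension quantile normalization above can be sketched as follows. This is a minimal NumPy sketch; the function name is our own, and degenerate scales (\(Q_3 = Q_1\) in some dimension) are not handled here.

```python
import numpy as np

def normalize_point_cloud(points):
    """Normalize each dimension by its interquartile center and scale.

    points: (N, 3) array. Returns (points - c0) / s, where, per dimension,
    c0 = (Q1 + Q3) / 2 and s = Q3 - Q1 for the 0.25 / 0.75 quantiles.
    """
    q1 = np.quantile(points, 0.25, axis=0)
    q3 = np.quantile(points, 0.75, axis=0)
    c0 = (q1 + q3) / 2
    s = q3 - q1
    return (points - c0) / s
```

This makes the per-object point cloud roughly scale- and translation-invariant while being more robust to depth outliers than min/max normalization.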
We stack each 2D feature along the channel dimension and embed the features using a 2D ResNet U-Net. The features from the top of the U-Net are fed into a series of self-attention modules across the embeddings of all objects in a scene, so that information can be shared across objects. The resulting self-attention features are tiled across the spatial dimensions before the downward pass of the U-Net. Finally, the features from the highest spatial resolution of the U-Net are passed through several strided convolutions, flattened, and projected to a 128-dimensional feature h per object. Figure 6 shows an overview of our model architecture.
For the autoregressive layers, we use 9 MLPs with hidden layers (128, 256, 512, 1024). For the baselines, we keep the same architecture through h and use different-sized MLPs depending on the box parameterization. We train using Adam with learning rate 1e−5 and a batch size of 24 scenes per step, with a varying number of objects per scene. We train for 10,000 steps or until convergence.
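To make the autoregressive head concrete, a sampling pass can be sketched as below: head i consumes the object embedding h together with the i box parameters already sampled and outputs a categorical distribution over discrete bins for the next parameter. This is an illustrative NumPy sketch with random weights standing in for trained parameters; the bin count, value range, and helper names are our own assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP layers (a stand-in for the trained heads)."""
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0)                     # ReLU between layers
    return x

def sample_box(h, heads, n_bins=64):
    """Autoregressively sample 9 discretized box parameters.

    Head i sees h plus the i previously sampled values and outputs
    logits over n_bins discrete bins for parameter i.
    """
    params = []
    for head in heads:
        x = np.concatenate([h, np.array(params)])
        logits = forward(head, x)
        p = np.exp(logits - logits.max())
        p /= p.sum()                                 # softmax over bins
        bin_idx = rng.choice(n_bins, p=p)
        params.append(bin_idx / (n_bins - 1) - 0.5)  # bin -> value in [-0.5, 0.5]
    return np.array(params)

d_h = 128
heads = [mlp([d_h + i, 128, 256, 512, 1024, 64]) for i in range(9)]
box = sample_box(rng.normal(size=d_h), heads)
```

Because each head conditions on the values already drawn, repeated sampling yields a joint distribution over boxes rather than independent per-parameter marginals, which is what enables the confidence and quantile constructions in Appendix B.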
A.2 Autoregressive 3D Object Detection
For Autoregressive FCAF3D, we add 7 autoregressive MLPs with hidden dimensions (128, 256, 512). All other parameters of FCAF3D are the same, and we train with the same hyperparameters as the released code for 30 epochs. For the baseline FCAF3D, we trained the author-released model for 30 epochs on 8 GPUs. We found that our benchmarked numbers for \(AP_{0.25}\) and \(AP_{0.50}\) were slightly lower than those reported in the original paper, so in our table we use the reported average AP across trials from the original paper. \(AP_{all}\) was calculated in a similar way as in MS-COCO, by averaging AP over the IoU thresholds 0.05, 0.10, 0.15, ..., 0.95.
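The \(AP_{all}\) average can be sketched as follows; the per-threshold evaluation callback is hypothetical and stands in for the detector's AP computation at a fixed IoU threshold.

```python
import numpy as np

def ap_all(ap_at_iou):
    """Average AP over IoU thresholds 0.05, 0.10, ..., 0.95.

    ap_at_iou: callable returning the AP at a given IoU threshold
    (a stand-in for the detector's per-threshold evaluation).
    """
    thresholds = np.arange(1, 20) * 0.05      # 0.05 .. 0.95 in steps of 0.05
    return float(np.mean([ap_at_iou(t) for t in thresholds]))
```

Averaging over the full 0.05–0.95 range rewards boxes that remain accurate under strict overlap requirements, unlike a single fixed-threshold AP.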
B Quantile Box
B.1 Proof of Quantile-Confidence Box
Proof Sketch: Let P(b) be a distribution over an ordered set of boxes where for any two distinct boxes \(b_1, b_2\) in the sample space, one must be contained in the other, \(b_1 \subset b_2\) or \(b_2 \subset b_1\). We'll show that a quantile box \(b_q\) is a confidence box with \(p=1-q\) by 1) constructing a confidence box \(b_p\) for any given q, 2) showing that any \(x\in b_p\) must have \(O(x) > q\), and 3) therefore \(b_p \subseteq Q(q) \subseteq b_q\), so the quantile box is a confidence box.
1) Confidence Box:
For any \(p=1-q\), we'll show how to construct a confidence box \(b_p\). Using the ordered object distribution property of P(b), we can define ordering as containment, \(b_1 < b_2 \equiv b_1 \subset b_2\). This ordering defines an inverse cdf:
Let \(b_p = F^{-1}(1-q)\) be the inverse cdf at p; by definition \(b_p\) is a confidence box with confidence p, since \(P(b \le b_p) = P(b \subseteq b_p) \ge p\).
2) Occupancy of \(b_p\):
We'll show that any \(x \in b_p\) satisfies \(O(x) > 1-p\). First we'll prove that \(P(b \ge b_p) > 1-p\). Let \(b_0 = \sup \{ b : b < b_p \}\), the largest box that is strictly contained in \(b_p\). (If no such \(b_0\) exists, then \(b_p\) must be the smallest box in the distribution order, so \(P(b \ge b_p)=1\) and \(P(b \ge b_p) > 1-p\) for \(p\ne 0\).)
Since \(b_p\) is the inverse cdf at p, we know that \(P(b \le b_0) < p\); otherwise \(b_0\) would be the inverse cdf at p (i.e., \(b_0=b_p\), a contradiction). It follows that
$$P(b \ge b_p) = 1 - P(b < b_p) = 1 - P(b \le b_0) > 1 - p$$
Now consider any point \(x \in b_p\):
$$O(x) = \int \mathbbm{1}\{ x \in b\}\, p(b)\, db \ge \int_{b \ge b_p} \mathbbm{1}\{ x \in b\}\, p(b)\, db \qquad (14)$$
$$= \int_{b \ge b_p} p(b)\, db = P(b \ge b_p) > 1 - p \qquad (15)$$
where (14) follows from the non-negativity of \(\mathbbm{1}\{ x \in b\}\, p(b)\), and (15) follows from \(x\in b_p\) and \(b_p \subseteq b\), which together imply \(x \in b\).
3) Quantile-Confidence Box:
Since any \(x \in b_p\) satisfies \(O(x) > 1-p\), it follows that \(b_p \subseteq Q(1-p)\), where \(Q(q) = \{ x : O(x) > q \}\) is the occupancy quantile at quantile q. The quantile box by construction must contain the occupancy quantile, \(Q(q) \subseteq b_q\); therefore we have \(b_p \subseteq Q(1-p) \subseteq b_q\), and
$$P(b \subseteq b_q) \ge P(b \subseteq b_p) \ge p$$
So \(b_q\) is a confidence box with confidence requirement p.
B.2 Quantile Box Algorithm
We propose a fast quantile box algorithm (Algorithm 1) that runs in polynomial time and is easily batchable on GPU. We use a finite sample of k boxes to approximate the occupancy and a sample of km points to approximate the occupancy quantile Q(q). To find the minimum-volume box, we assume that one of the sampled box rotations will be close to the optimal quantile box rotation. For each sampled rotation, we calculate the volume of the rotation-axis-aligned bounding box of the occupancy quantile. The minimum-volume rotation is selected for the quantile box, and the corresponding dimensions/center are calculated accordingly.
Empirically, we find that \(k=64\), \(m=4^3\) provides a good tradeoff between variance and inference time. We can efficiently batch all operations on GPU and find that quantile box inference for 15 objects takes no more than 10 ms on an NVIDIA 1080 Ti.
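The procedure described above can be sketched as follows. This is a NumPy sketch, not the released implementation: boxes are assumed to be (center, dims, rotation) triples with the rotation matrix's columns as box axes, each box is sampled on a 4×4×4 interior grid (the \(m=4^3\) points per box), and an empty occupancy quantile (q set too high for the sample) is not handled.

```python
import numpy as np

def quantile_box(boxes, q, m=4):
    """Approximate the minimum-volume quantile box from k sampled boxes.

    boxes: list of k (center (3,), dims (3,), rot (3, 3)) samples, where
    rot's columns are the box axes. q: occupancy quantile threshold.
    """
    k = len(boxes)
    # 1) Sample an m x m x m grid of points inside every sampled box.
    lin = (np.arange(m) + 0.5) / m - 0.5                  # in (-0.5, 0.5)
    grid = np.stack(np.meshgrid(lin, lin, lin), -1).reshape(-1, 3)
    pts = np.concatenate([ctr + (grid * dim) @ rot.T
                          for ctr, dim, rot in boxes])

    # 2) Occupancy O(x): fraction of sampled boxes containing each point.
    occ = np.zeros(len(pts))
    for ctr, dim, rot in boxes:
        local = (pts - ctr) @ rot                          # into the box frame
        occ += np.all(np.abs(local) <= dim / 2, axis=1)
    occ /= k

    # 3) Occupancy quantile Q(q): points whose occupancy exceeds q.
    quantile_pts = pts[occ > q]

    # 4) For each sampled rotation, compute the rotation-axis-aligned
    #    bounding box of Q(q); keep the minimum-volume one.
    best = None
    for _, _, rot in boxes:
        local = quantile_pts @ rot
        lo, hi = local.min(axis=0), local.max(axis=0)
        vol = float(np.prod(hi - lo))
        if best is None or vol < best[0]:
            best = (vol, rot @ ((lo + hi) / 2), hi - lo, rot)
    _, center, dims, rot = best
    return center, dims, rot
```

Every step is a dense array operation over the k boxes and km points, which is what makes the algorithm straightforward to batch on GPU.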
C Dataset
Our dataset consists of almost 7000 simulated scenes of common objects in bins. See Fig. 7 for examples. Each scene consists of the following data:
- RGB image of shape (H, W, 3)
- Depth map of shape (H, W)
- Intrinsic matrix of the camera (3, 3)
- Normals map of shape (H, W, 3)
- Instance masks of shape (N, H, W), where N is the number of objects
- Amodal instance masks of shape (N, H, W), which include the occluded regions of each object
- 3D bounding box of each object (N, 9), as determined by dimensions, center, and rotation
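The fields above can be summarized as a per-scene record. The dictionary keys, dtypes, and example sizes below are illustrative only; the released dataset defines the actual file format.

```python
import numpy as np

H, W, N = 480, 640, 5          # example image size and object count
scene = {
    "rgb": np.zeros((H, W, 3), dtype=np.uint8),
    "depth": np.zeros((H, W), dtype=np.float32),
    "intrinsics": np.eye(3, dtype=np.float32),
    "normals": np.zeros((H, W, 3), dtype=np.float32),
    "instance_masks": np.zeros((N, H, W), dtype=bool),
    "amodal_masks": np.zeros((N, H, W), dtype=bool),
    # 3 dims + 3 center + 3 rotation parameters = 9 values per box
    "boxes": np.zeros((N, 9), dtype=np.float32),
}
```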
D Visualizations
In this section, we show various qualitative comparisons and visualizations of our method (Figs. 8, 9, 10 and 11).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, Y., Mishra, N., Sieb, M., Shentu, Y., Abbeel, P., Chen, X. (2022). Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13670. Springer, Cham. https://doi.org/10.1007/978-3-031-20080-9_39
Print ISBN: 978-3-031-20079-3
Online ISBN: 978-3-031-20080-9