Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13670)

Abstract

3D bounding boxes are a widespread intermediate representation in many computer vision applications. However, predicting them is a challenging task, largely due to partial observability, which motivates the need for a strong sense of uncertainty. While many recent methods have explored better architectures for consuming sparse and unstructured point cloud data, we hypothesize that there is room for improvement in the modeling of the output distribution and explore how this can be achieved using an autoregressive prediction head. Additionally, we release a simulated dataset, COB-3D, which highlights new types of ambiguity that arise in real-world robotics applications, where 3D bounding box prediction has largely been underexplored. We propose methods for leveraging our autoregressive model to make high confidence predictions and meaningful uncertainty measures, achieving strong results on SUN-RGBD, Scannet, KITTI, and our new dataset (Code and dataset are available at bbox.yuxuanliu.com.).

References

  1. Choi, J., Chun, D., Kim, H., Lee, H.J.: Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 502–511 (2019)

  2. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR). IEEE (2017)

  3. Freitag, M., Al-Onaizan, Y.: Beam search strategies for neural machine translation. In: Proceedings of the First Workshop on Neural Machine Translation, pp. 56–60. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/W17-3207. https://aclanthology.org/W17-3207

  4. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)

  5. Gilitschenski, I., Sahoo, R., Schwarting, W., Amini, A., Karaman, S., Rus, D.: Deep orientation uncertainty learning based on a Bingham loss. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=ryloogSKDS

  6. Hall, D., et al.: Probabilistic object detection: definition and evaluation, November 2018

  7. He, Y., Zhu, C., Wang, J., Savvides, M., Zhang, X.: Bounding box regression with uncertainty for accurate object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2888–2897 (2019)

  8. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/f0bda020d2470f2e74990a07a607ebd9-Paper.pdf

  9. Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3D object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2949–2958 (2021)

  10. Metz, L., Ibarz, J., Jaitly, N., Davidson, J.: Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035 (2017)

  11. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: an efficient probabilistic 3D object detector for autonomous driving. In: CVPR, pp. 12677–12686. Computer Vision Foundation/IEEE (2019). https://dblp.uni-trier.de/db/conf/cvpr/cvpr2019.html

  12. Meyer, G.P., Thakurdesai, N.: Learning an uncertainty-aware object detector for autonomous driving. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10521–10527 (2020)

  13. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)

  14. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082 (2017)

  15. van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)

  16. Peretroukhin, V., Giamou, M., Rosen, D.M., Greene, W.N., Roy, N., Kelly, J.: A smooth representation of SO(3) for deep rotation learning with uncertainty. In: Proceedings of Robotics: Science and Systems (RSS 2020), 12–16 July 2020 (2020)

  17. Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: ImVoteNet: boosting 3D object detection in point clouds with image votes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  18. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: ICCV, pp. 9276–9285. IEEE (2019). https://dblp.uni-trier.de/db/conf/iccv/iccv2019.html

  19. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)

  20. Rukhovich, D., Vorontsova, A., Konushin, A.: FCAF3D: fully convolutional anchor-free 3D object detection. arXiv preprint arXiv:2112.00322 (2021)

  21. Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019 (2019)

  22. Shi, S., et al.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020, pp. 10526–10535. Computer Vision Foundation/IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.01054. https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_PV-RCNN_Point-Voxel_Feature_Set_Abstraction_for_3D_Object_Detection_CVPR_2020_paper.html

  23. Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)

  24. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)

  25. Van Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International Conference on Machine Learning, pp. 1747–1756. PMLR (2016)

  26. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)

  27. Xie, Q., et al.: MLCVNet: multi-level context VoteNet for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  28. Zhong, Y., Zhu, M., Peng, H.: Uncertainty-aware voxel based 3D object detection and tracking with von-Mises loss. arXiv preprint arXiv:2011.02553 (2020)

  29. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5738–5746 (2019)

Author information

Corresponding author

Correspondence to YuXuan Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7386 KB)

Appendices

A Model Architecture and Training

Fig. 6. Overview of Autoregressive Bounding Box Estimation architecture.

A.1 Autoregressive 3D Bounding Box Estimation

For bounding box estimation, our model operates on 2D detection patch outputs of size 96\(\,\times \,\)96. We use the 2D bounding box from object detection to crop and resize the following features for each object: 3D point cloud, depth uncertainty score, normals, instance mask, and amodal instance mask (which includes the occluded regions of the object). We normalize each point p in the point cloud with the 0.25 (\(Q_1\)) and 0.75 (\(Q_3\)) quantiles per dimension using \(\frac{p-c_0}{s}\) for \(c_0=\frac{Q_1+Q_3}{2}\), \(s=Q_3-Q_1\). We omit RGB input, since we found it was not necessary for training and leaving it out improved generalization.
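
As a concrete illustration, here is a minimal NumPy sketch of this per-dimension quantile normalization (the small floor on the scale is our own numerical guard, not a detail from the paper):

```python
import numpy as np

def normalize_points(points: np.ndarray) -> np.ndarray:
    """Normalize an (N, 3) point cloud per dimension with the interquartile
    range, as described above: p -> (p - c0) / s, where c0 = (Q1 + Q3) / 2
    and s = Q3 - Q1."""
    q1 = np.quantile(points, 0.25, axis=0)   # Q1 per dimension, shape (3,)
    q3 = np.quantile(points, 0.75, axis=0)   # Q3 per dimension, shape (3,)
    c0 = (q1 + q3) / 2.0                     # robust center
    s = np.maximum(q3 - q1, 1e-6)            # robust scale; guard against zero IQR
    return (points - c0) / s
```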

We stack each 2D feature along the channel dimension and embed the features using a 2D ResNet U-Net. The features from the top of the U-Net are passed through a series of self-attention modules across the embeddings of all objects in a scene, so that information can be shared across objects. The resulting features from self-attention are tiled across the spatial dimensions before the downward pass of the U-Net. Finally, the features from the highest spatial resolution of the U-Net are passed through several strided convolutions, flattened, and projected to a 128-dimensional feature h per object. Figure 6 shows an overview of our model architecture.
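
The exact layer sizes of the attention modules are not given here; the following PyTorch sketch only illustrates the cross-object self-attention and spatial-tiling step, with assumed channel count, head count, pooling, and residual fusion:

```python
import torch
import torch.nn as nn

class CrossObjectAttention(nn.Module):
    """Share information across the N objects of a scene: pool one token per
    object, run self-attention over the tokens, then tile the result back
    over the spatial grid (layer sizes are illustrative assumptions)."""

    def __init__(self, channels: int = 128, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(channels, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N_objects, C, H, W) -- bottleneck features, one map per object.
        n, c, h, w = feats.shape
        tokens = feats.mean(dim=(2, 3)).unsqueeze(0)       # (1, N, C): one token per object
        for attn in self.attn_layers:
            out, _ = attn(tokens, tokens, tokens)           # attend across objects
            tokens = tokens + out                           # residual connection (our choice)
        tiled = tokens.squeeze(0)[:, :, None, None].expand(n, c, h, w)
        return feats + tiled                                # fuse scene context back in
```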

For the autoregressive layers, we use 9 MLPs with hidden layers (128, 256, 512, 1024). For the baselines, we keep the same architecture up to h and use differently sized MLPs depending on the box parameterization. We train using Adam with a learning rate of 1e−5 and a batch size of 24 scenes per step, with a varying number of objects per scene, for 10000 steps or until convergence.
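
As a rough sketch of how such an autoregressive head can be wired up (one plausible reading, not the released implementation): the i-th MLP consumes the object embedding h together with the previously decoded box parameters and outputs a distribution over the i-th parameter. The class name, the 64-bin discretization, and the way previous values are fed back are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoregressiveBoxHead(nn.Module):
    """Illustrative autoregressive head: MLP i predicts a categorical
    distribution over a discretized value of box parameter i, conditioned on
    the object embedding h and the i previously decoded parameters."""

    def __init__(self, feat_dim: int = 128, num_params: int = 9,
                 num_bins: int = 64, hidden=(128, 256, 512, 1024)):
        super().__init__()
        self.num_params, self.num_bins = num_params, num_bins
        self.mlps = nn.ModuleList()
        for i in range(num_params):
            dims = [feat_dim + i, *hidden, num_bins]
            layers = []
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                layers += [nn.Linear(d_in, d_out), nn.ReLU()]
            layers = layers[:-1]                       # no activation on the logits
            self.mlps.append(nn.Sequential(*layers))

    @torch.no_grad()
    def sample(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, feat_dim) object embeddings -> (B, num_params) sampled bin indices.
        prev = h.new_zeros(h.shape[0], 0)
        outputs = []
        for mlp in self.mlps:
            logits = mlp(torch.cat([h, prev], dim=-1))
            idx = torch.multinomial(logits.softmax(-1), 1)           # (B, 1)
            outputs.append(idx)
            prev = torch.cat([prev, idx.float() / self.num_bins], dim=-1)
        return torch.cat(outputs, dim=-1)
```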

A.2 Autoregressive 3D Object Detection

For Autoregressive FCAF3D, we add 7 autoregressive MLPs with hidden dimensions (128, 256, 512). All other parameters of FCAF3D are unchanged, and we train with the same hyperparameters as the released code for 30 epochs. For the baseline FCAF3D, we trained the author-released model for 30 epochs on 8 GPUs. We found that our benchmarked numbers for \(AP_{0.25}\) and \(AP_{0.50}\) were slightly lower than those reported in the original paper, so in our table we use the reported average AP across trials from the original paper. \(AP_{all}\) was calculated in a similar way to MS-COCO, by averaging AP over IoU thresholds 0.05, 0.10, 0.15, ..., 0.95.
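
A trivial sketch of this averaging (the per-threshold AP computation itself is not shown):

```python
import numpy as np

def ap_all(ap_at_threshold) -> float:
    """Average AP over IoU thresholds 0.05, 0.10, ..., 0.95 (19 values), in
    the spirit of the MS-COCO metric.  `ap_at_threshold` is any callable that
    returns the AP evaluated at a given IoU threshold."""
    thresholds = np.arange(1, 20) * 0.05          # 0.05, 0.10, ..., 0.95
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))
```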

B Quantile Box

B.1 Proof of Quantile-Confidence Box

Proof Sketch: Let P(b) be a distribution over an ordered set of boxes where for any two distinct boxes \(b_1, b_2\) in the sample space, one must be contained in the other, \(b_1 \subset b_2\) or \(b_2 \subset b_1\). We’ll show that a quantile box \(b_q\) is a confidence box with \(p=1-q\) by 1) constructing a confidence box \(b_p\) for any given q, 2) showing that any \(x\in b_p\) must have \(O(x) > q\), and 3) therefore \(b_p \subseteq Q(q) \subseteq b_q\) so the quantile box is a confidence box.

1) Confidence Box

For any \(p=1-q\), we’ll show how to construct a confidence box \(b_p\). Using the ordered object distribution property of P(b), we can define ordering as containment \(b_1 < b_2 \equiv b_1 \subset b_2\). This ordering defines an inverse cdf:

$$\begin{aligned} F^{-1}(p) = \inf \{ x: P(b \le x) \ge p \} \end{aligned}$$
(8)

Let \(b_p = F^{-1}(1-q)\) be the inverse cdf evaluated at p; by definition, \(b_p\) is a confidence box with confidence p, since \(P(b \le b_p) = P(b \subseteq b_p) \ge p\).

2) Occupancy of \(b_p\)

We’ll show that any \(x \in b_p\) satisfies \(O(x) > 1-p\). First we’ll prove that \( P(b \ge b_p) > 1-p\). Let \(b_0 = \inf \{ b : b < b_p \}\), the smallest box that is strictly contained in \(b_p\). (If no such \(b_0\) exists, then \(b_p\) must be the smallest box in the distribution order, so that \(P(b \ge b_p)=1\) and hence \( P(b \ge b_p) > 1-p\) for \(p\ne 0\).)

Since \(b_p\) is the inverse cdf evaluated at p, we know that \(P(b \le b_0) < p\); otherwise \(b_0\) would be the inverse cdf evaluated at p (i.e., \(b_0=b_p\), a contradiction). It follows that

$$\begin{aligned} P(b \ge b_p)&= P(b > b_0)\end{aligned}$$
(9)
$$\begin{aligned}&= 1 - P(b \le b_0)\end{aligned}$$
(10)
$$\begin{aligned}&> 1-p \end{aligned}$$
(11)

Now consider any point \(x \in b_p\):

$$\begin{aligned} O(x)&= P(x\in b)\end{aligned}$$
(12)
$$\begin{aligned}&= \int _{b} \mathbbm {1}\{ x \in b\} p(b)db \end{aligned}$$
(13)
$$\begin{aligned}&\ge \int _{b \ge b_p } \mathbbm {1}\{ x \in b\} p(b)db \end{aligned}$$
(14)
$$\begin{aligned}&= \int _{b \ge b_p } p(b)db\end{aligned}$$
(15)
$$\begin{aligned}&= P( b \ge b_p )\end{aligned}$$
(16)
$$\begin{aligned}&> 1-p \end{aligned}$$
(17)

where (14) follows from the nonnegativity of \(\mathbbm {1}\{ x \in b\} p(b)\), and (15) follows from \(x\in b_p\) and \(b_p \subseteq b\), which together imply \(x \in b\).

3) Quantile-Confidence Box

Since any \(x \in b_p\) satisfies \(O(x) > 1-p\), it follows that \(b_p \subseteq Q(1-p)\), where \(Q(q) = \{ x : O(x) > q \}\) is the occupancy quantile at quantile q. The quantile box by construction must contain the occupancy quantile, \(Q(q) \subseteq b_q\); since \(q = 1-p\), we therefore have \(b_p \subseteq Q(q) \subseteq b_q\), and

$$\begin{aligned} P(b\subseteq b_q)&\ge P(b\subseteq b_p)\end{aligned}$$
(18)
$$\begin{aligned}&\ge p \end{aligned}$$
(19)

So \(b_q\) is a confidence box with confidence requirement p.
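
As a toy sanity check of this claim, the following Monte Carlo sketch uses nested 1-D intervals \([-r, r]\) (an arbitrary radius distribution of our choosing) and verifies empirically that the quantile box at quantile q contains a sampled box with probability of roughly at least \(1-q\):

```python
import numpy as np

# Toy totally ordered family: 1-D "boxes" are intervals [-r, r], so any two
# are nested.  The exponential radius distribution is arbitrary.
rng = np.random.default_rng(0)
radii = rng.exponential(scale=1.0, size=100_000)

q = 0.3
# O(x) = P(|x| <= r), so the occupancy quantile Q(q) = {x : O(x) > q} is
# (up to ties) the interval |x| < (1-q)-quantile of r, and the quantile box
# b_q is that interval.
r_q = np.quantile(radii, 1.0 - q)

# Empirical confidence: fraction of sampled boxes fully contained in b_q.
coverage = np.mean(radii <= r_q)
print(f"q = {q}: empirical P(b contained in b_q) = {coverage:.3f} >= {1 - q}")
```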

B.2 Quantile Box Algorithm

Algorithm 1. Quantile box computation (pseudocode).

We propose a fast quantile box algorithm (Algorithm 1) that runs in polynomial time and is easily batched on a GPU. We use a finite sample of k boxes to approximate the occupancy and a sample of km points to approximate the occupancy quantile Q(q). To find the minimum-volume box, we assume that one of the sampled box rotations will be close to the optimal quantile box rotation. For each sampled rotation, we compute the volume of the rotation-axis-aligned bounding box of the occupancy quantile; the minimum-volume rotation is selected for the quantile box, and the corresponding dimensions and center are calculated accordingly.

Empirically, we find that \(k=64\), \(m=4^3\) provides a good trade-off between variance and inference time. We can efficiently batch all operations on the GPU, and find that quantile box inference for 15 objects takes no more than 10 ms on an NVIDIA 1080 Ti.
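
A hedged NumPy sketch of this procedure is given below. It follows the description above but is not the authors' released implementation; in particular, the box convention (a rotation matrix mapping the box frame to the world frame) and the degenerate-case fallback are assumptions.

```python
import numpy as np

def quantile_box(centers, dims, rots, q=0.3, m=64, rng=None):
    """Sketch of quantile-box inference in the spirit of Algorithm 1.

    centers: (k, 3), dims: (k, 3), rots: (k, 3, 3) -- k sampled boxes.
    Returns (center, dims, rotation) of the minimum-volume box, over the
    sampled rotations, containing the estimated occupancy quantile Q(q)."""
    rng = rng or np.random.default_rng()
    k = centers.shape[0]

    # 1) Sample m points uniformly inside each sampled box (k*m points total).
    local = rng.uniform(-0.5, 0.5, size=(k, m, 3)) * dims[:, None, :]
    pts = np.einsum('kij,kmj->kmi', rots, local) + centers[:, None, :]
    pts = pts.reshape(-1, 3)

    # 2) Approximate occupancy O(x): fraction of sampled boxes containing each point.
    diff = pts[None, :, :] - centers[:, None, :]               # (k, n, 3)
    rel = np.einsum('kji,knj->kni', rots, diff)                # points in each box frame
    inside = np.all(np.abs(rel) <= dims[:, None, :] / 2, axis=-1)
    occ = inside.mean(axis=0)                                  # (n,)
    quantile_pts = pts[occ > q]                                # estimate of Q(q)
    if quantile_pts.shape[0] == 0:                             # degenerate fallback
        quantile_pts = pts

    # 3) For each sampled rotation, bound Q(q) with a box aligned to that
    #    rotation and keep the minimum-volume one.
    best = None
    for rot in rots:
        p = quantile_pts @ rot                                 # coordinates in rot's frame
        lo, hi = p.min(axis=0), p.max(axis=0)
        vol = float(np.prod(hi - lo))
        if best is None or vol < best[0]:
            best = (vol, rot @ ((lo + hi) / 2), hi - lo, rot)
    return best[1], best[2], best[3]
```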

C Dataset

Fig. 7. Examples of scenes from our dataset.

Our dataset consists of almost 7000 simulated scenes of common objects in bins. See Fig. 7 for examples. Each scene consists of the following data (a minimal container sketch follows the list):

  • RGB image of shape (H, W, 3)

  • Depth map of shape (H, W)

  • Intrinsic matrix of the camera, of shape (3, 3)

  • Normals map of shape (H, W, 3)

  • Instance masks of shape (N, H, W), where N is the number of objects

  • Amodal instance masks of shape (N, H, W), which include the occluded regions of each object

  • 3D bounding box of each object, of shape (N, 9), determined by dimensions, center, and rotation
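
As referenced above, here is a minimal container sketch of one scene record; the field names and dtypes are illustrative, so consult the released dataset for the exact format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Scene:
    """Per-scene record mirroring the fields listed above (names illustrative)."""
    rgb: np.ndarray             # (H, W, 3) color image
    depth: np.ndarray           # (H, W) depth map
    intrinsics: np.ndarray      # (3, 3) camera intrinsic matrix
    normals: np.ndarray         # (H, W, 3) surface normals
    instance_masks: np.ndarray  # (N, H, W) visible-region masks, N objects
    amodal_masks: np.ndarray    # (N, H, W) masks including occluded regions
    boxes: np.ndarray           # (N, 9): dimensions, center, rotation per box
```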

D Visualizations

In this section, we show various qualitative comparisons and visualizations of our method (Figs. 8, 9, 10 and 11).

Fig. 8. Visualization of our model predictions on objects with rotational symmetry. The blue boxes show various samples from our model. The orange point cloud is the occupancy quantile. The white box is the quantile box. (Color figure online)

Fig. 9. Visualization of our dimension conditioning method. The model is able to leverage the conditioning information to accurately predict the correct pose and dimensions for each object’s 3D bounding box. The prediction is shown in red-blue-green and the ground truth in turquoise-yellow-pink. Left: image of the scene. Middle: vanilla beam search. Right: beam search with dimension conditioning. (Color figure online)

Fig. 10. Visualization of bounding box samples from our autoregressive model on a rotationally symmetric water bottle. Our model is able to sample different modes for symmetric objects, whereas a deterministic model would only be able to predict a single mode.

Fig. 11. Visualization of bounding box predictions with different quantiles. We can see that lower quantiles lead to larger boxes in the direction of uncertainty. Top: image of the scene. Left: quantile 0.1. Middle: quantile 0.3. Right: quantile 0.5.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, Y., Mishra, N., Sieb, M., Shentu, Y., Abbeel, P., Chen, X. (2022). Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13670. Springer, Cham. https://doi.org/10.1007/978-3-031-20080-9_39

  • DOI: https://doi.org/10.1007/978-3-031-20080-9_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20079-3

  • Online ISBN: 978-3-031-20080-9

  • eBook Packages: Computer Science (R0)
