Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13670)

Abstract

3D bounding boxes are a widespread intermediate representation in many computer vision applications. However, predicting them is a challenging task, largely due to partial observability, which motivates the need for a strong sense of uncertainty. While many recent methods have explored better architectures for consuming sparse and unstructured point cloud data, we hypothesize that there is room for improvement in the modeling of the output distribution and explore how this can be achieved using an autoregressive prediction head. Additionally, we release a simulated dataset, COB-3D, which highlights new types of ambiguity that arise in real-world robotics applications, where 3D bounding box prediction has largely been underexplored. We propose methods for leveraging our autoregressive model to make high confidence predictions and meaningful uncertainty measures, achieving strong results on SUN-RGBD, Scannet, KITTI, and our new dataset (Code and dataset are available at bbox.yuxuanliu.com.).

References

  1. Choi, J., Chun, D., Kim, H., Lee, H.J.: Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 502–511 (2019)

  2. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR). IEEE (2017)

  3. Freitag, M., Al-Onaizan, Y.: Beam search strategies for neural machine translation. In: Proceedings of the First Workshop on Neural Machine Translation, pp. 56–60. Association for Computational Linguistics, Vancouver, August 2017. https://doi.org/10.18653/v1/W17-3207. https://aclanthology.org/W17-3207

  4. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)

  5. Gilitschenski, I., Sahoo, R., Schwarting, W., Amini, A., Karaman, S., Rus, D.: Deep orientation uncertainty learning based on a Bingham loss. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=ryloogSKDS

  6. Hall, D., et al.: Probabilistic object detection: definition and evaluation, November 2018

  7. He, Y., Zhu, C., Wang, J., Savvides, M., Zhang, X.: Bounding box regression with uncertainty for accurate object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2888–2897 (2019)

  8. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/f0bda020d2470f2e74990a07a607ebd9-Paper.pdf

  9. Liu, Z., Zhang, Z., Cao, Y., Hu, H., Tong, X.: Group-free 3D object detection via transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2949–2958 (2021)

  10. Metz, L., Ibarz, J., Jaitly, N., Davidson, J.: Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035 (2017)

  11. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: an efficient probabilistic 3D object detector for autonomous driving. In: CVPR, pp. 12677–12686. Computer Vision Foundation/IEEE (2019). https://dblp.uni-trier.de/db/conf/cvpr/cvpr2019.html

  12. Meyer, G.P., Thakurdesai, N.: Learning an uncertainty-aware object detector for autonomous driving. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10521–10527 (2020)

  13. Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917 (2021)

  14. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082 (2017)

  15. van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)

  16. Peretroukhin, V., Giamou, M., Rosen, D.M., Greene, W.N., Roy, N., Kelly, J.: A smooth representation of SO(3) for deep rotation learning with uncertainty. In: Proceedings of Robotics: Science and Systems (RSS 2020), 12–16 July 2020 (2020)

  17. Qi, C.R., Chen, X., Litany, O., Guibas, L.J.: ImVoteNet: boosting 3D object detection in point clouds with image votes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  18. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: ICCV, pp. 9276–9285. IEEE (2019). https://dblp.uni-trier.de/db/conf/iccv/iccv2019.html

  19. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)

  20. Rukhovich, D., Vorontsova, A., Konushin, A.: FCAF3D: fully convolutional anchor-free 3D object detection. arXiv preprint arXiv:2112.00322 (2021)

  21. Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019 (2019)

  22. Shi, S., et al.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020, pp. 10526–10535. Computer Vision Foundation/IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.01054. https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_PV-RCNN_Point-Voxel_Feature_Set_Abstraction_for_3D_Object_Detection_CVPR_2020_paper.html

  23. Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)

  24. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)

  25. Van Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International Conference on Machine Learning, pp. 1747–1756. PMLR (2016)

  26. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)

  27. Xie, Q., et al.: MLCVNet: multi-level context VoteNet for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  28. Zhong, Y., Zhu, M., Peng, H.: Uncertainty-aware voxel based 3D object detection and tracking with von-Mises loss. arXiv preprint arXiv:2011.02553 (2020)

  29. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5738–5746 (2019)

Author information

Corresponding author

Correspondence to YuXuan Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7386 KB)

Appendices

A Model Architecture and Training

Fig. 6. Overview of Autoregressive Bounding Box Estimation architecture.

A.1 Autoregressive 3D Bounding Box Estimation

For bounding box estimation, our model operates on 2D detection patch outputs of size 96\(\,\times \,\)96. We use the 2D bounding box from object detection to crop and resize the following features for each object: 3D point cloud, depth uncertainty score, normals, instance mask, and amodal instance mask (which includes the occluded regions of the object). We normalize each point p in the point cloud with the 0.25 (\(Q_1\)) and 0.75 (\(Q_3\)) quantiles per dimension using \(\frac{p-c_0}{s}\) for \(c_0=\frac{Q_1+Q_3}{2}\), \(s=Q_3-Q_1\). We omit RGB input, since we found it was not necessary for training and leaving it out improved generalization.
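
As a concrete illustration, here is a minimal NumPy sketch of this per-dimension quantile normalization (the small floor on the scale is our own numerical guard, not a detail from the paper):

```python
import numpy as np

def normalize_points(points: np.ndarray) -> np.ndarray:
    """Normalize an (N, 3) point cloud per dimension with the interquartile
    range, as described above: p -> (p - c0) / s, where c0 = (Q1 + Q3) / 2
    and s = Q3 - Q1."""
    q1 = np.quantile(points, 0.25, axis=0)   # Q1 per dimension, shape (3,)
    q3 = np.quantile(points, 0.75, axis=0)   # Q3 per dimension, shape (3,)
    c0 = (q1 + q3) / 2.0                     # robust center
    s = np.maximum(q3 - q1, 1e-6)            # robust scale; guard against zero IQR
    return (points - c0) / s
```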

We stack each 2D feature along the channel dimension and embed the features using a 2D ResNet U-Net. The features from the top of the U-Net are passed through a series of self-attention modules across the embeddings of all objects in a scene, so that information can be shared across objects. The resulting features from self-attention are tiled across the spatial dimensions before the downward pass of the U-Net. Finally, the features from the highest spatial resolution of the U-Net are passed through several strided convolutions, flattened, and projected to a 128-dimensional feature h per object. Figure 6 shows an overview of our model architecture.
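
The exact layer sizes of the attention modules are not given here; the following PyTorch sketch only illustrates the cross-object self-attention and spatial-tiling step, with assumed channel count, head count, pooling, and residual fusion:

```python
import torch
import torch.nn as nn

class CrossObjectAttention(nn.Module):
    """Share information across the N objects of a scene: pool one token per
    object, run self-attention over the tokens, then tile the result back
    over the spatial grid (layer sizes are illustrative assumptions)."""

    def __init__(self, channels: int = 128, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(channels, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N_objects, C, H, W) -- bottleneck features, one map per object.
        n, c, h, w = feats.shape
        tokens = feats.mean(dim=(2, 3)).unsqueeze(0)       # (1, N, C): one token per object
        for attn in self.attn_layers:
            out, _ = attn(tokens, tokens, tokens)           # attend across objects
            tokens = tokens + out                           # residual connection (our choice)
        tiled = tokens.squeeze(0)[:, :, None, None].expand(n, c, h, w)
        return feats + tiled                                # fuse scene context back in
```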

For the autoregressive layers, we use 9 MLPs with hidden layers (128, 256, 512, 1024). For the baselines, we keep the same architecture up to h and use differently sized MLPs depending on the box parameterization. We train using Adam with a learning rate of 1e−5 and a batch size of 24 scenes per step, with a varying number of objects per scene, for 10000 steps or until convergence.
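
As a rough sketch of how such an autoregressive head can be wired up (one plausible reading, not the released implementation): the i-th MLP consumes the object embedding h together with the previously decoded box parameters and outputs a distribution over the i-th parameter. The class name, the 64-bin discretization, and the way previous values are fed back are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoregressiveBoxHead(nn.Module):
    """Illustrative autoregressive head: MLP i predicts a categorical
    distribution over a discretized value of box parameter i, conditioned on
    the object embedding h and the i previously decoded parameters."""

    def __init__(self, feat_dim: int = 128, num_params: int = 9,
                 num_bins: int = 64, hidden=(128, 256, 512, 1024)):
        super().__init__()
        self.num_params, self.num_bins = num_params, num_bins
        self.mlps = nn.ModuleList()
        for i in range(num_params):
            dims = [feat_dim + i, *hidden, num_bins]
            layers = []
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                layers += [nn.Linear(d_in, d_out), nn.ReLU()]
            layers = layers[:-1]                       # no activation on the logits
            self.mlps.append(nn.Sequential(*layers))

    @torch.no_grad()
    def sample(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, feat_dim) object embeddings -> (B, num_params) sampled bin indices.
        prev = h.new_zeros(h.shape[0], 0)
        outputs = []
        for mlp in self.mlps:
            logits = mlp(torch.cat([h, prev], dim=-1))
            idx = torch.multinomial(logits.softmax(-1), 1)           # (B, 1)
            outputs.append(idx)
            prev = torch.cat([prev, idx.float() / self.num_bins], dim=-1)
        return torch.cat(outputs, dim=-1)
```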

A.2 Autoregressive 3D Object Detection

For Autoregressive FCAF3D, we add 7 autoregressive MLPs with hidden dimensions (128, 256, 512). All other parameters of FCAF3D are unchanged, and we train with the same hyperparameters as the released code for 30 epochs. For the baseline FCAF3D, we trained the author-released model for 30 epochs on 8 GPUs. We found that our benchmarked numbers for \(AP_{0.25}\) and \(AP_{0.50}\) were slightly lower than those reported in the original paper, so in our table we use the reported average AP across trials from the original paper. \(AP_{all}\) was calculated in a similar way to MS-COCO, by averaging AP over IoU thresholds 0.05, 0.10, 0.15, ..., 0.95.
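
A trivial sketch of this averaging (the per-threshold AP computation itself is not shown):

```python
import numpy as np

def ap_all(ap_at_threshold) -> float:
    """Average AP over IoU thresholds 0.05, 0.10, ..., 0.95 (19 values), in
    the spirit of the MS-COCO metric.  `ap_at_threshold` is any callable that
    returns the AP evaluated at a given IoU threshold."""
    thresholds = np.arange(1, 20) * 0.05          # 0.05, 0.10, ..., 0.95
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))
```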

B Quantile Box

B.1 Proof of Quantile-Confidence Box

Proof Sketch: Let P(b) be a distribution over an ordered set of boxes where for any two distinct boxes \(b_1, b_2\) in the sample space, one must be contained in the other, \(b_1 \subset b_2\) or \(b_2 \subset b_1\). We’ll show that a quantile box \(b_q\) is a confidence box with \(p=1-q\) by 1) constructing a confidence box \(b_p\) for any given q, 2) showing that any \(x\in b_p\) must have \(O(x) > q\), and 3) therefore \(b_p \subseteq Q(q) \subseteq b_q\) so the quantile box is a confidence box.

1) Confidence Box

For any \(p=1-q\), we’ll show how to construct a confidence box \(b_p\). Using the ordered object distribution property of P(b), we can define ordering as containment \(b_1 < b_2 \equiv b_1 \subset b_2\). This ordering defines an inverse cdf:

$$\begin{aligned} F^{-1}(p) = \inf \{ x: P(b \le x) \ge p \} \end{aligned}$$
(8)

Let \(b_p = F^{-1}(1-q)\) be the inverse cdf evaluated at p; by definition, \(b_p\) is a confidence box with confidence p, since \(P(b \le b_p) = P(b \subseteq b_p) \ge p\).

2) Occupancy of \(b_p\)

We’ll show that any \(x \in b_p\) satisfies \(O(x) > 1-p\). First we’ll prove that \( P(b \ge b_p) > 1-p\). Let \(b_0 = \inf \{ b : b < b_p \}\), the smallest box that is strictly contained in \(b_p\). (If no such \(b_0\) exists, then \(b_p\) must be the smallest box in the distribution order, so that \(P(b \ge b_p)=1\) and hence \( P(b \ge b_p) > 1-p\) for \(p\ne 0\).)

Since \(b_p\) is the inverse cdf evaluated at p, we know that \(P(b \le b_0) < p\); otherwise \(b_0\) would be the inverse cdf evaluated at p (i.e., \(b_0=b_p\), a contradiction). It follows that

$$\begin{aligned} P(b \ge b_p)&= P(b > b_0)\end{aligned}$$
(9)
$$\begin{aligned}&= 1 - P(b \le b_0)\end{aligned}$$
(10)
$$\begin{aligned}&> 1-p \end{aligned}$$
(11)

Now consider any point \(x \in b_p\):

$$\begin{aligned} O(x)&= P(x\in b)\end{aligned}$$
(12)
$$\begin{aligned}&= \int _{b} \mathbbm {1}\{ x \in b\} p(b)db \end{aligned}$$
(13)
$$\begin{aligned}&\ge \int _{b \ge b_p } \mathbbm {1}\{ x \in b\} p(b)db \end{aligned}$$
(14)
$$\begin{aligned}&= \int _{b \ge b_p } p(b)db\end{aligned}$$
(15)
$$\begin{aligned}&= P( b \ge b_p )\end{aligned}$$
(16)
$$\begin{aligned}&> 1-p \end{aligned}$$
(17)

where (14) follows from the nonnegativity of \(\mathbbm {1}\{ x \in b\} p(b)\), and (15) follows from \(x\in b_p\) and \(b_p \subseteq b\), which together imply \(x \in b\).

3) Quantile-Confidence Box

Since any \(x \in b_p\) satisfies \(O(x) > 1-p\), it follows that \(b_p \subseteq Q(1-p)\), where \(Q(q) = \{ x : O(x) > q \}\) is the occupancy quantile at quantile q. The quantile box by construction must contain the occupancy quantile, \(Q(q) \subseteq b_q\); since \(q = 1-p\), we therefore have \(b_p \subseteq Q(q) \subseteq b_q\), and

$$\begin{aligned} P(b\subseteq b_q)&\ge P(b\subseteq b_p)\end{aligned}$$
(18)
$$\begin{aligned}&\ge p \end{aligned}$$
(19)

So \(b_q\) is a confidence box with confidence requirement p.
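
As a toy sanity check of this claim, the following Monte Carlo sketch uses nested 1-D intervals \([-r, r]\) (an arbitrary radius distribution of our choosing) and verifies empirically that the quantile box at quantile q contains a sampled box with probability of roughly at least \(1-q\):

```python
import numpy as np

# Toy totally ordered family: 1-D "boxes" are intervals [-r, r], so any two
# are nested.  The exponential radius distribution is arbitrary.
rng = np.random.default_rng(0)
radii = rng.exponential(scale=1.0, size=100_000)

q = 0.3
# O(x) = P(|x| <= r), so the occupancy quantile Q(q) = {x : O(x) > q} is
# (up to ties) the interval |x| < (1-q)-quantile of r, and the quantile box
# b_q is that interval.
r_q = np.quantile(radii, 1.0 - q)

# Empirical confidence: fraction of sampled boxes fully contained in b_q.
coverage = np.mean(radii <= r_q)
print(f"q = {q}: empirical P(b contained in b_q) = {coverage:.3f} >= {1 - q}")
```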

B.2 Quantile Box Algorithm

Algorithm 1. Quantile box computation (pseudocode).

We propose a fast quantile box algorithm (Algorithm 1) that runs in polynomial time and is easily batched on a GPU. We use a finite sample of k boxes to approximate the occupancy and a sample of km points to approximate the occupancy quantile Q(q). To find the minimum-volume box, we assume that one of the sampled box rotations will be close to the optimal quantile box rotation. For each sampled rotation, we compute the volume of the rotation-axis-aligned bounding box of the occupancy quantile; the minimum-volume rotation is selected for the quantile box, and the corresponding dimensions and center are calculated accordingly.

Empirically, we find that \(k=64\), \(m=4^3\) provides a good trade-off between variance and inference time. We can efficiently batch all operations on the GPU, and find that quantile box inference for 15 objects takes no more than 10 ms on an NVIDIA 1080 Ti.
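
A hedged NumPy sketch of this procedure is given below. It follows the description above but is not the authors' released implementation; in particular, the box convention (a rotation matrix mapping the box frame to the world frame) and the degenerate-case fallback are assumptions.

```python
import numpy as np

def quantile_box(centers, dims, rots, q=0.3, m=64, rng=None):
    """Sketch of quantile-box inference in the spirit of Algorithm 1.

    centers: (k, 3), dims: (k, 3), rots: (k, 3, 3) -- k sampled boxes.
    Returns (center, dims, rotation) of the minimum-volume box, over the
    sampled rotations, containing the estimated occupancy quantile Q(q)."""
    rng = rng or np.random.default_rng()
    k = centers.shape[0]

    # 1) Sample m points uniformly inside each sampled box (k*m points total).
    local = rng.uniform(-0.5, 0.5, size=(k, m, 3)) * dims[:, None, :]
    pts = np.einsum('kij,kmj->kmi', rots, local) + centers[:, None, :]
    pts = pts.reshape(-1, 3)

    # 2) Approximate occupancy O(x): fraction of sampled boxes containing each point.
    diff = pts[None, :, :] - centers[:, None, :]               # (k, n, 3)
    rel = np.einsum('kji,knj->kni', rots, diff)                # points in each box frame
    inside = np.all(np.abs(rel) <= dims[:, None, :] / 2, axis=-1)
    occ = inside.mean(axis=0)                                  # (n,)
    quantile_pts = pts[occ > q]                                # estimate of Q(q)
    if quantile_pts.shape[0] == 0:                             # degenerate fallback
        quantile_pts = pts

    # 3) For each sampled rotation, bound Q(q) with a box aligned to that
    #    rotation and keep the minimum-volume one.
    best = None
    for rot in rots:
        p = quantile_pts @ rot                                 # coordinates in rot's frame
        lo, hi = p.min(axis=0), p.max(axis=0)
        vol = float(np.prod(hi - lo))
        if best is None or vol < best[0]:
            best = (vol, rot @ ((lo + hi) / 2), hi - lo, rot)
    return best[1], best[2], best[3]
```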

C Dataset

Fig. 7. Examples of scenes from our dataset.

Our dataset consists of almost 7000 simulated scenes of common objects in bins. See Fig. 7 for examples. Each scene consists of the following data (a minimal container sketch follows the list):

  • RGB image of shape (H, W, 3)

  • Depth map of shape (H, W)

  • Intrinsic matrix of the camera, of shape (3, 3)

  • Normals map of shape (H, W, 3)

  • Instance masks of shape (N, H, W), where N is the number of objects

  • Amodal instance masks of shape (N, H, W), which include the occluded regions of each object

  • 3D bounding box of each object, of shape (N, 9), determined by dimensions, center, and rotation
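
As referenced above, here is a minimal container sketch of one scene record; the field names and dtypes are illustrative, so consult the released dataset for the exact format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Scene:
    """Per-scene record mirroring the fields listed above (names illustrative)."""
    rgb: np.ndarray             # (H, W, 3) color image
    depth: np.ndarray           # (H, W) depth map
    intrinsics: np.ndarray      # (3, 3) camera intrinsic matrix
    normals: np.ndarray         # (H, W, 3) surface normals
    instance_masks: np.ndarray  # (N, H, W) visible-region masks, N objects
    amodal_masks: np.ndarray    # (N, H, W) masks including occluded regions
    boxes: np.ndarray           # (N, 9): dimensions, center, rotation per box
```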

D Visualizations

In this section, we show various qualitative comparisons and visualizations of our method (Figs. 8, 9, 10 and 11).

Fig. 8. Visualization of our model predictions on objects with rotational symmetry. The blue boxes show various samples from our model. The orange point cloud is the occupancy quantile. The white box is the quantile box. (Color figure online)

Fig. 9. Visualization of our dimension conditioning method. The model is able to leverage the conditioning information to accurately predict the correct pose and dimensions for each object’s 3D bounding box. The prediction is shown in red-blue-green and the ground truth in turquoise-yellow-pink. Left: image of the scene. Middle: vanilla beam search. Right: beam search with dimension conditioning. (Color figure online)

Fig. 10. Visualization of bounding box samples from our autoregressive model on a rotationally symmetric water bottle. Our model is able to sample different modes for symmetric objects, whereas a deterministic model would only be able to predict a single mode.

Fig. 11. Visualization of bounding box predictions with different quantiles. We can see that lower quantiles lead to larger boxes in the direction of uncertainty. Top: image of the scene. Left: quantile 0.1. Middle: quantile 0.3. Right: quantile 0.5.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, Y., Mishra, N., Sieb, M., Shentu, Y., Abbeel, P., Chen, X. (2022). Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13670. Springer, Cham. https://doi.org/10.1007/978-3-031-20080-9_39

  • DOI: https://doi.org/10.1007/978-3-031-20080-9_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20079-3

  • Online ISBN: 978-3-031-20080-9

  • eBook Packages: Computer Science (R0)
