Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12359)


The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single “bird’s-eye-view” coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird’s-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird’s-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird’s-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by “shooting” template trajectories into a bird’s-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code:
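The lift-splat mechanism summarized above can be sketched in a few lines of NumPy. This is an illustrative assumption-laden sketch, not the authors' implementation: the function name, grid parameters, and the simple floor-based scatter-add rasterization are invented here for clarity (the paper predicts the per-pixel depth distribution with a network and uses an efficient cumulative-sum pooling trick for the splat).

```python
import numpy as np

def lift_splat(feats, depth_probs, K, cam_to_ego, depths,
               bev_extent=50.0, bev_res=0.5):
    """Sketch of the lift-splat step for a single camera.

    feats:       (H, W, C) image feature map
    depth_probs: (D, H, W) per-pixel categorical depth distribution
    K:           (3, 3) camera intrinsics
    cam_to_ego:  (4, 4) camera-to-ego-frame extrinsics
    depths:      (D,) candidate depth values in meters
    Returns an (N, N, C) bird's-eye-view feature grid.
    """
    H, W, C = feats.shape
    D = len(depths)

    # "Lift": build a frustum of 3D points, one per (depth, pixel) pair.
    us, vs = np.meshgrid(np.arange(W), np.arange(H))       # pixel coords
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1)    # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                        # camera-frame rays
    pts_cam = depths[:, None, None, None] * rays[None]     # (D, H, W, 3)

    # Transform the frustum points into the shared ego frame.
    pts_h = np.concatenate([pts_cam, np.ones((D, H, W, 1))], axis=-1)
    pts_ego = (pts_h @ cam_to_ego.T)[..., :3]              # (D, H, W, 3)

    # Context features: outer product of the depth distribution and the
    # image features, so each depth bin gets a weighted copy of the pixel.
    ctx = depth_probs[..., None] * feats[None]             # (D, H, W, C)

    # "Splat": scatter-add every frustum feature into its BEV cell.
    n = int(2 * bev_extent / bev_res)
    bev = np.zeros((n, n, C))
    ix = np.floor((pts_ego[..., 0] + bev_extent) / bev_res).astype(int)
    iy = np.floor((pts_ego[..., 1] + bev_extent) / bev_res).astype(int)
    valid = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    np.add.at(bev, (ix[valid], iy[valid]), ctx[valid])
    return bev
```

Fusing a full rig amounts to calling this per camera with each camera's own `K` and `cam_to_ego` and summing the resulting grids, which is what makes the representation agnostic to the number and placement of cameras.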

Supplementary material (zip, 90.5 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. NVIDIA, Santa Clara, USA
  2. University of Toronto, Toronto, Canada
  3. Vector Institute, Toronto, Canada