Abstract
Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary \(2\)D translations. However, these vanilla blocks are not equivariant to arbitrary \(3\)D translations in the projective manifold. Even so, all monocular \(3\)D detectors use vanilla blocks to obtain the \(3\)D coordinates, a task for which the vanilla blocks are not designed. This paper takes the first step towards convolutions equivariant to arbitrary \(3\)D translations in the projective manifold. Since depth is the hardest quantity to estimate in monocular detection, this paper proposes Depth EquiVarIAnt NeTwork (DEVIANT), built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to depth translations in the projective manifold, whereas vanilla networks are not. The additional depth equivariance forces DEVIANT to learn consistent depth estimates; therefore, DEVIANT achieves state-of-the-art monocular \(3\)D detection results on the KITTI and Waymo datasets in the image-only category and performs competitively with methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation.
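The two geometric facts behind the abstract can be checked numerically. The sketch below is illustrative only and is not the DEVIANT implementation: it assumes periodic image boundaries, an ideal pinhole camera with the principal point at the origin, and a fronto-parallel planar patch. The helper names (`circular_correlate`, `project`) and the numeric values (focal length, depths) are hypothetical. Under those assumptions, it shows that a convolution-style filter commutes with \(2\)D shifts, and that translating the patch along the depth axis rescales its image uniformly, which is why scale equivariant blocks are a natural fit for depth translations.

```python
import numpy as np


def circular_correlate(image, kernel):
    """2D cross-correlation with periodic boundaries (illustrative helper)."""
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for i in range(kh):
        for j in range(kw):
            # np.roll by (-i, -j) places image[y + i, x + j] at position [y, x]
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out


def project(points, focal):
    """Pinhole projection of N x 3 camera-frame points to N x 2 image points.

    Assumes the principal point is at the image origin.
    """
    return focal * points[:, :2] / points[:, 2:3]


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))

# 1) 2D translation equivariance: shift-then-filter equals filter-then-shift.
shift = (3, 5)
shift_then_filter = circular_correlate(np.roll(image, shift, axis=(0, 1)), kernel)
filter_then_shift = np.roll(circular_correlate(image, kernel), shift, axis=(0, 1))
translation_equivariant = np.allclose(shift_then_filter, filter_then_shift)

# 2) Depth translation of a fronto-parallel patch is a uniform image scaling.
depth, depth_shift, focal = 10.0, 5.0, 700.0  # hypothetical values
xy = rng.uniform(-1.0, 1.0, size=(8, 2))
patch = np.column_stack([xy, np.full(8, depth)])         # all points at one depth
moved = patch + np.array([0.0, 0.0, depth_shift])        # translate along depth
scale = depth / (depth + depth_shift)                    # predicted image scale
depth_shift_is_scaling = np.allclose(
    project(moved, focal), scale * project(patch, focal)
)

print(translation_equivariant, depth_shift_is_scaling)
```

The second check holds exactly only because every point of the patch shares one depth; for general \(3\)D shapes the scaling relation is an approximation.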
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kumar, A., Brazil, G., Corona, E., Parchami, A., Liu, X. (2022). DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_39
DOI: https://doi.org/10.1007/978-3-031-20077-9_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20076-2
Online ISBN: 978-3-031-20077-9
eBook Packages: Computer Science, Computer Science (R0)