GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-Aware Supervision

Ke, Lei; Li, Shichao; Sun, Yanan; Tai, Yu-Wing; Tang, Chi-Keung

doi:10.1007/978-3-030-58555-6_31

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12360)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We present a novel end-to-end framework named GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street-view image. GSNet uses a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that this diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shapes in great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of performing both tasks jointly. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. The project page is available at https://lkeab.github.io/gsnet/.
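To make the two outputs named in the abstract concrete, the sketch below shows how a 6DoF pose (three rotation angles plus a 3D translation) could place mesh vertices in the camera frame and project them onto the image. It is a minimal illustration and not GSNet's implementation: the function names, the Euler-angle convention R = Rz·Ry·Rx, and the pinhole intrinsics (fx, fy, cx, cy) are all assumptions introduced here. It also checks that the quoted mesh size (1352 vertices, 2700 faces) is consistent with a closed triangle mesh, which is itself an assumption about the mesh.

```python
import math


def rotation_matrix(rx, ry, rz):
    """Rotation R = Rz @ Ry @ Rx from angles in radians (assumed convention)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def transform_and_project(vertices, angles, translation, fx, fy, cx, cy):
    """Apply a 6DoF pose to mesh vertices, then project with a pinhole camera."""
    R = rotation_matrix(*angles)
    pixels = []
    for v in vertices:
        # Camera-frame point: X_cam = R @ v + t
        x, y, z = (
            sum(R[i][j] * v[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        # Perspective division; assumes the point is in front of the camera (z > 0)
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def euler_characteristic(num_vertices, num_faces):
    """V - E + F for a closed triangle mesh, where every edge is shared by two faces (E = 3F / 2)."""
    num_edges = 3 * num_faces // 2
    return num_vertices - num_edges + num_faces
```

With these hypothetical helpers, `euler_characteristic(1352, 2700)` evaluates to 2, the value expected for a closed, sphere-like triangle mesh, and a vertex at the object origin translated 10 units along the optical axis projects to the principal point (cx, cy).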

Acknowledgement

This research is supported in part by the Research Grants Council of the Hong Kong SAR under grant no. 1620818.

Author information

Authors and Affiliations

The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Lei Ke, Shichao Li, Yanan Sun, Yu-Wing Tai & Chi-Keung Tang
Kwai Inc., Shenzhen, China
Yu-Wing Tai

Corresponding author

Correspondence to Yu-Wing Tai.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 147 KB)

Supplementary material 2 (mp4 44905 KB)

Cite this paper

Ke, L., Li, S., Sun, Y., Tai, YW., Tang, CK. (2020). GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-Aware Supervision. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12360. Springer, Cham. https://doi.org/10.1007/978-3-030-58555-6_31

DOI: https://doi.org/10.1007/978-3-030-58555-6_31
Published: 16 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58554-9
Online ISBN: 978-3-030-58555-6
eBook Packages: Computer Science, Computer Science (R0)
