GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-Aware Supervision

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)


We present a novel end-to-end framework named GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street view. GSNet utilizes a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that our diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shapes in great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of jointly performing both tasks. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. Project page is available at
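To make the two outputs named in the abstract concrete, the following is a minimal sketch of how a 6DoF pose (3 rotation and 3 translation parameters) might be applied to a dense vehicle mesh and projected into the image. It is not GSNet's implementation: the Euler-angle convention, the intrinsic matrix `K`, the pose values, and the random placeholder vertices are all assumptions made for illustration; only the vertex count (1352) comes from the abstract.

```python
import numpy as np


def euler_to_rotation(yaw, pitch, roll):
    """Build R = Ry(yaw) @ Rx(pitch) @ Rz(roll); this axis order is an assumed convention."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return Ry @ Rx @ Rz


def pose_mesh(vertices, rotation, translation):
    """Map (N, 3) object-frame vertices into the camera frame: x_cam = R x + t."""
    return vertices @ rotation.T + translation


def project(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


# Placeholder mesh: random points standing in for a 1352-vertex car shape.
rng = np.random.default_rng(0)
vertices = rng.uniform(-1.0, 1.0, size=(1352, 3))

# Hypothetical pose and camera intrinsics, chosen only for illustration.
R = euler_to_rotation(yaw=0.3, pitch=0.05, roll=0.0)
t = np.array([2.0, 1.5, 20.0])
K = np.array([[2000.0, 0.0, 1600.0],
              [0.0, 2000.0, 1300.0],
              [0.0, 0.0, 1.0]])

vertices_cam = pose_mesh(vertices, R, t)
pixels = project(vertices_cam, K)
```

With a dense mesh posed this way, every vertex has a 2D image location, which is what makes projection-based consistency terms between the predicted shape and pose possible in principle.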


Vehicle pose and shape reconstruction; 3D traffic scene understanding



This research is supported in part by the Research Grant Council of the Hong Kong SAR under grant no. 1620818.

Supplementary material

Supplementary material 1 (pdf 147 KB)

Supplementary material 2 (mp4 44905 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
  2. Kwai Inc., Shenzhen, China
