3D Pose Estimation for Fine-Grained Object Categories

  • Yaming Wang
  • Xiao Tan
  • Yi Yang
  • Xiao Liu
  • Errui Ding
  • Feng Zhou
  • Larry S. Davis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11129)


Existing object pose estimation datasets cover generic object types, and so far no dataset exists for fine-grained object categories. In this work, we introduce a new large dataset to benchmark pose estimation for fine-grained objects, made possible by the recent availability of both 2D and 3D fine-grained data. Specifically, we augment two popular fine-grained recognition datasets (StanfordCars and CompCars) by finding a fine-grained 3D CAD model for each sub-category and manually annotating each object in the images with its 3D pose. We show that, with enough training data, a full perspective model with continuous parameters can be estimated using 2D appearance information alone. We achieve this via a framework based on Faster/Mask R-CNN. This goes beyond previous work on category-level pose estimation, which only estimates discrete or continuous viewpoint angles, or recovers rotation matrices, often with the help of key points. Furthermore, with fine-grained 3D models available, we incorporate a dense 3D representation, called a location field, into the CNN-based pose estimation framework to further improve performance. The new dataset is available at
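To make the phrase "full perspective model with continuous parameters" concrete, the following is a minimal sketch of a generic pinhole projection that maps 3D CAD model points to image pixels. It is an illustration only, not the authors' implementation: the function names, the yaw/pitch/roll angle convention, and the choice of a single focal length plus a principal point are all assumptions, since the abstract does not spell out the exact parameterization.

```python
import math

def rot_x(a):
    """Rotation about the x-axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    """Rotation about the y-axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(a):
    """Rotation about the z-axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def project_points(points, yaw, pitch, roll, translation, focal, principal):
    """Project 3D model points to pixels with a pinhole camera.

    Assumed (hypothetical) convention: R = Rz(roll) * Rx(pitch) * Ry(yaw),
    camera coordinates X_c = R * X + t, and pixel coordinates
    (f * x_c / z_c + u0, f * y_c / z_c + v0).
    """
    R = matmul3(rot_z(roll), matmul3(rot_x(pitch), rot_y(yaw)))
    u0, v0 = principal
    pixels = []
    for X in points:
        # Rigid transform from model coordinates to camera coordinates.
        xc, yc, zc = [sum(R[i][k] * X[k] for k in range(3)) + translation[i]
                      for i in range(3)]
        if zc <= 0:
            raise ValueError("point lies behind the camera")
        # Perspective division followed by the intrinsic mapping.
        pixels.append((focal * xc / zc + u0, focal * yc / zc + v0))
    return pixels

# Usage: a point one unit to the right of the model origin, seen from
# five units away with no rotation.
example = project_points([(1.0, 0.0, 0.0)], 0.0, 0.0, 0.0,
                         (0.0, 0.0, 5.0), 1000.0, (320.0, 240.0))
```

Under these assumptions, every quantity passed to `project_points` (three angles, translation, focal length, principal point) is a continuous value, which is what separates this kind of model from approaches that only predict a binned viewpoint.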



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Yaming Wang (1), email author
  • Xiao Tan (2)
  • Yi Yang (2)
  • Xiao Liu (2)
  • Errui Ding (2)
  • Feng Zhou (2)
  • Larry S. Davis (1)

  1. University of Maryland, College Park, USA
  2. Baidu, Inc., Beijing, China
