
Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)

Abstract

Detecting objects and estimating their viewpoint in images are key tasks of 3D scene understanding. Recent approaches have achieved excellent results on very large benchmarks for object detection and viewpoint estimation. However, performance still lags behind for novel object categories with only a few samples. In this paper, we tackle the problems of few-shot object detection and few-shot viewpoint estimation. We propose a meta-learning framework that can be applied to both tasks, possibly including 3D data. Our models improve the results on objects of novel classes by leveraging rich feature information originating from base classes with many samples. A simple joint feature embedding module is proposed to make the most of this feature sharing. Despite its simplicity, our method outperforms state-of-the-art methods by a large margin on a range of datasets, including PASCAL VOC and MS COCO for few-shot object detection, and Pascal3D+ and ObjectNet3D for few-shot viewpoint estimation. Finally, for the first time, we tackle the combination of both few-shot tasks, on ObjectNet3D, and show promising results.
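The abstract does not detail the joint feature embedding module, so the sketch below is only an illustrative guess at how such a module could fuse query features with a class prototype aggregated from the few support samples, in the spirit of feature-reweighting approaches to few-shot detection. The module name, feature dimensions, and fusion scheme (elementwise product, difference, and concatenation followed by a 1x1 convolution) are assumptions for illustration, not the authors' architecture.

```python
# Hypothetical sketch of a joint feature embedding for few-shot detection /
# viewpoint estimation: it combines query features with a class prototype.
import torch
import torch.nn as nn


class JointFeatureEmbedding(nn.Module):
    """Fuses a query feature map with a class prototype vector (illustrative design)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Small head applied after concatenating the fused feature sources.
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * feat_dim, feat_dim, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, query_feat: torch.Tensor, class_proto: torch.Tensor) -> torch.Tensor:
        # query_feat: (B, C, H, W) features of the query image or region.
        # class_proto: (B, C) prototype averaged over the few support samples.
        proto = class_proto[:, :, None, None].expand_as(query_feat)
        # Multiplicative interaction, difference, and the raw query features.
        fused = torch.cat([query_feat * proto, query_feat - proto, query_feat], dim=1)
        return self.fuse(fused)


if __name__ == "__main__":
    module = JointFeatureEmbedding(feat_dim=256)
    q = torch.randn(2, 256, 7, 7)  # e.g. RoI features from a detection backbone
    p = torch.randn(2, 256)        # class prototype from the support set
    out = module(q, p)             # (2, 256, 7, 7), fed to downstream prediction heads
    print(out.shape)
```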

Keywords

Few-shot learning · Meta-learning · Object detection · Viewpoint estimation

Notes

Acknowledgements

We thank Vincent Lepetit and Yuming Du for helpful discussions.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
  2. valeo.ai, Paris, France
