
Visual Relationship Prediction via Label Clustering and Incorporation of Depth Information

  • Hsuan-Kung Yang
  • An-Chieh Cheng
  • Kuan-Wei Ho
  • Tsu-Jui Fu
  • Chun-Yi Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11130)

Abstract

In this paper, we investigate the use of an unsupervised label clustering technique and demonstrate that it enables substantial improvements in visual relationship prediction accuracy on the Person in Context (PIC) dataset. We propose to group object labels with similar patterns of relationship distribution in the dataset into fewer categories. Label clustering not only mitigates the large classification space and class imbalance issues, but also potentially increases the number of data samples for each clustered category. We further propose to incorporate depth information as an additional feature in the instance segmentation model. The additional depth prediction path supplements the relationship prediction model with cues that bounding boxes or segmentation masks alone are unable to deliver. We have rigorously evaluated the proposed techniques and performed various ablation analyses to validate their benefits.
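The abstract does not state which unsupervised clustering algorithm is used, so the following is only a minimal sketch of the general idea: represent each object label by the distribution of relationship predicates it participates in, then group labels with similar distributions. The sketch uses k-means from scikit-learn as one plausible instantiation; the triplets, relation names, label names, and cluster count are illustrative placeholders, not values from the PIC dataset.

# Minimal sketch (not the authors' code): cluster object labels whose
# relationship distributions look alike. All data below is hypothetical.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (subject, relation, object) annotations.
triplets = [
    ("person", "hold", "cup"), ("person", "hold", "phone"),
    ("person", "sit-on", "chair"), ("person", "sit-on", "bench"),
    ("person", "hold", "bottle"), ("person", "next-to", "dog"),
]

relations = sorted({r for _, r, _ in triplets})
labels = sorted({o for _, _, o in triplets})

# Build one relationship-distribution vector per object label.
counts = {lab: Counter() for lab in labels}
for _, rel, obj in triplets:
    counts[obj][rel] += 1

X = np.array(
    [[counts[lab][rel] for rel in relations] for lab in labels],
    dtype=float,
)
X /= X.sum(axis=1, keepdims=True)  # normalise counts to a distribution

# Group labels with similar relation usage into fewer clustered categories.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
label_to_cluster = dict(zip(labels, kmeans.labels_))
print(label_to_cluster)

Normalising each row to a distribution rather than using raw counts keeps frequent labels from dominating the clustering purely by sample count, which is in line with the stated goal of mitigating class imbalance.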

Keywords

Relationship prediction · Instance segmentation · Semantic segmentation · Unsupervised clustering · Depth information

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Hsuan-Kung Yang¹
  • An-Chieh Cheng¹
  • Kuan-Wei Ho¹
  • Tsu-Jui Fu¹
  • Chun-Yi Lee¹

  1. Elsa Lab, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
