Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12358)

Abstract

A scene graph aims to faithfully capture human perception of image content. When humans analyze a scene, they usually describe the image gist first, namely the major objects and the key relations among them. This inherent perceptual habit implies a hierarchical structure of human preference during scene parsing. We therefore argue that a desirable scene graph should also be constructed hierarchically, and we introduce a new scheme for modeling scene graphs. Concretely, a scene is represented by a human-mimetic Hierarchical Entity Tree (HET) composed of a series of image regions. To generate a scene graph from the HET, we parse it with a Hybrid Long Short-Term Memory (Hybrid-LSTM) that specifically encodes hierarchy and sibling context to capture the structured information embedded in the tree. To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) that dynamically adjusts relation rankings by learning humans' subjective perceptual habits from objective entity saliency and size. Experiments show that our method not only achieves state-of-the-art performance in scene graph generation but also excels at mining image-specific relations, which are valuable for downstream tasks.
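
As a rough illustration of the containment idea behind the HET, the Python sketch below builds a tree of detected regions by attaching each box to the smallest already-placed box that covers most of it, so coarse gist-level entities sit near the root and fine details sit deeper. The names (Region, build_het) and the 0.8 coverage threshold are assumptions made for this sketch, not the authors' implementation, which additionally parses the tree with the Hybrid-LSTM and ranks relations with the RRM.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Region:
        """A detected image region serving as a HET node (illustrative fields)."""
        box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
        label: str
        children: List["Region"] = field(default_factory=list)

    def area(box):
        x1, y1, x2, y2 = box
        return max(0, x2 - x1) * max(0, y2 - y1)

    def coverage(inner, outer):
        # Fraction of `inner` that lies inside `outer`.
        x1, y1 = max(inner[0], outer[0]), max(inner[1], outer[1])
        x2, y2 = min(inner[2], outer[2]), min(inner[3], outer[3])
        return max(0, x2 - x1) * max(0, y2 - y1) / max(area(inner), 1)

    def build_het(regions, image_box, thresh=0.8):
        # Each region hangs under the smallest already-placed region that
        # covers at least `thresh` of it; the whole image is the root.
        root = Region(image_box, "image")
        placed = [root]
        for r in sorted(regions, key=lambda r: -area(r.box)):  # parents first
            covering = [p for p in placed if coverage(r.box, p.box) >= thresh]
            parent = min(covering, key=lambda p: area(p.box), default=root)
            parent.children.append(r)
            placed.append(r)
        return root

    # Example: a cup on a table yields image -> table -> cup.
    root = build_het(
        [Region((40, 60, 600, 420), "table"), Region((100, 120, 200, 220), "cup")],
        image_box=(0, 0, 640, 480),
    )

Processing larger boxes first guarantees that every potential parent is already in the tree when a smaller region is placed; picking the smallest qualifying parent keeps each node under its tightest container.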

Keywords

Image gist · Key relation · Hierarchical Entity Tree · Hybrid-LSTM · Relation Ranking Module

Notes

Acknowledgements

This work is partially supported by the Natural Science Foundation of China under contracts No. 61922080, No. U19B2036, and No. 61772500, by CAS Frontier Science Key Research Project No. QYZDJ-SSWJSC009, and by Beijing Academy of Artificial Intelligence Project No. BAAI2020ZJ0201.

Supplementary material

504454_1_En_14_MOESM1_ESM.pdf (5.9 MB)
Supplementary material 1 (PDF, 6015 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
