
NODIS: Neural Ordinary Differential Scene Understanding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)

Abstract

Semantic image understanding is a challenging topic in computer vision. It requires not only detecting all objects in an image, but also identifying all the relations between them. The detected objects, their labels, and the discovered relations can be used to construct a scene graph, which provides an abstract semantic interpretation of an image. In previous works, relations were identified by solving an assignment problem formulated as a (Mixed-)Integer Linear Program. In this work, we interpret that formulation as an Ordinary Differential Equation (ODE). The proposed architecture performs scene graph inference by solving a neural variant of an ODE through end-to-end learning. The connection between (Mixed-)Integer Linear Programs and ODEs, combined with end-to-end training, amounts to learning how to solve assignment problems with image-specific objective functions. Intuitive, visual explanations are provided for the role of the single free variable of the ODE modules, which is associated with time in many natural processes. The proposed model achieves results equal to or above the state of the art on all three benchmark tasks on the Visual Genome benchmark: scene graph generation (SGGEN), scene graph classification (SGCLS), and visual relationship detection (PREDCLS). The strong results on scene graph classification support the claim that assignment problems can indeed be solved by neural ODEs.
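
To make the idea concrete, below is a minimal sketch of how a neural-ODE relation classifier can be set up, assuming PyTorch and the torchdiffeq package. The module names (ODEFunc, NeuralODERelationHead), the feature dimension, the predicate count (51, following the common Visual Genome convention of 50 predicates plus background), and the integration horizon are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: a neural ODE head for predicate classification.
# Assumes torchdiffeq (https://github.com/rtqichen/torchdiffeq) is installed.
import torch
import torch.nn as nn
from torchdiffeq import odeint


class ODEFunc(nn.Module):
    """Parameterizes the derivative dh/dt = f(h, t) with a small MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Linear(dim, dim),
        )

    def forward(self, t, h):
        # torchdiffeq passes the scalar time t; the dynamics here are
        # autonomous, so t is unused.
        return self.net(h)


class NeuralODERelationHead(nn.Module):
    """Evolves subject-object pair features through learned dynamics,
    then classifies the predicate. This loosely mirrors the idea of
    solving an assignment problem by integrating an ODE over 'time'."""

    def __init__(self, feat_dim: int = 512, num_predicates: int = 51):
        super().__init__()
        self.odefunc = ODEFunc(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_predicates)
        # The integration horizon corresponds to the single free
        # variable discussed in the abstract; t=1.0 is an arbitrary choice.
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, pair_features):
        # pair_features: (num_pairs, feat_dim) subject-object embeddings.
        # odeint returns the state at every requested time step.
        h = odeint(self.odefunc, pair_features, self.t, method="dopri5")
        return self.classifier(h[-1])  # state at the final time step


# Usage: score predicates for 10 hypothetical object pairs.
head = NeuralODERelationHead()
logits = head(torch.randn(10, 512))
print(logits.shape)  # torch.Size([10, 51])
```

Because the entire pipeline is differentiable, the adjoint method (or plain backpropagation through the solver) lets the dynamics and the classifier be trained end to end from relation labels.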

Keywords

Semantic image understanding · Scene graph · Visual relationship detection

Notes

Acknowledgement

This work was partially supported by the DFG grant COVMAP (RO 2497/12-2) and EXC 2122.

Supplementary material

Supplementary material 1: 504476_1_En_38_MOESM1_ESM.pdf (PDF, 1296 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute of Information Processing, Leibniz University, Hannover, Germany
  2. Scene Understanding Group, University of Twente, Enschede, The Netherlands
