Scene Understanding Using Deep Neural Networks—Objects, Actions, and Events: A Review

  • Ranjini Surendran
  • D. Jude Hemanth
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1087)

Abstract

Scene understanding plays an important role in various application fields such as autonomous driving and robotic navigation. A scene can be considered an association of a large number of objects, their actions, and the events that relate them in a relevant and valid combination. Scene understanding aims at giving machines a human-like ability to completely analyse visual scenes. The major objective is to understand the context of a complex scene and to provide accurate visual information, from the basic level of individual objects up to the relations between them. Deep neural networks, which can learn features from massive amounts of data, have outperformed conventional machine learning algorithms on these tasks.
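As a concrete illustration of the deep-learning-based semantic segmentation surveyed in this review, the minimal sketch below runs a pretrained DeepLabV3 model from torchvision on a single image and produces per-pixel class labels. This is not the paper's own method; the model choice, the "DEFAULT" weights (torchvision 0.13 or later assumed), and the input file name street_scene.jpg are illustrative assumptions only.

    # Minimal semantic-segmentation sketch with a pretrained DeepLabV3 model.
    # Assumptions: torchvision >= 0.13 and an example image "street_scene.jpg".
    import torch
    from torchvision import models, transforms
    from PIL import Image

    model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
    model.eval()

    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input
    batch = preprocess(image).unsqueeze(0)                  # shape: (1, 3, H, W)

    with torch.no_grad():
        logits = model(batch)["out"]                        # (1, num_classes, H, W)
    segmentation = logits.argmax(dim=1)                     # per-pixel class labels
    print(segmentation.shape, segmentation.unique())

The per-pixel label map produced here is the basic building block on which higher-level scene understanding (object relations, actions, and events) is built.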

Keywords

Scene understanding · Semantic segmentation · Instance segmentation · Deep learning

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Department of ECE, KITS, Karunya University, Coimbatore, India
