
International Journal of Computer Vision, Volume 126, Issue 2–4, pp 292–313

Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance

  • Chao-Yeh Chen
  • Kristen Grauman

Abstract

Understanding images with people often entails understanding their interactions with other objects or people. As such, given a novel image, a vision system ought to infer which other objects/people play an important role in a given person’s activity. However, existing methods are limited to learning action-specific interactions (e.g., how the pose of a tennis player relates to the position of his racquet when serving the ball) for improved recognition, making them unequipped to reason about novel interactions with actions or objects unobserved in the training data. We propose to predict the “interactee” in novel images—that is, to localize the object of a person’s action. Given an arbitrary image with a detected person, the goal is to produce a saliency map indicating the most likely positions and scales where that person’s interactee would be found. To that end, we explore ways to learn the generic, action-independent connections between (a) representations of a person’s pose, gaze, and scene cues and (b) the interactee object’s position and scale. We provide results on a newly collected UT Interactee dataset spanning more than 10,000 images from SUN, PASCAL, and COCO. We show that the proposed interaction-informed saliency metric has practical utility for four tasks: contextual object detection, image retargeting, predicting object importance, and data-driven natural language scene description. All four scenarios reveal the value in linking the subject to its object in order to understand the story of an image.
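To make the abstract's pipeline concrete, the sketch below is a minimal, hypothetical illustration (not the authors' model) of the general idea: encode the interactee's box as a position and scale relative to the detected person, learn a generic, action-independent regressor from person cues to that layout, and render the prediction as a 2D saliency map. All function names, the choice of regressor, and the Gaussian rendering are illustrative assumptions; feature extraction and annotated person-interactee pairs are assumed to come from elsewhere.

```python
# Illustrative sketch only: predict an "interactee saliency" map from person cues.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor


def relative_layout(person_box, interactee_box):
    """Encode the interactee box as (dx, dy, scale) relative to the person box.

    Boxes are (x, y, w, h); offsets are normalized by the person's size so the
    target is roughly invariant to the person's absolute position and scale.
    """
    px, py, pw, ph = person_box
    ix, iy, iw, ih = interactee_box
    dx = ((ix + iw / 2) - (px + pw / 2)) / pw
    dy = ((iy + ih / 2) - (py + ph / 2)) / ph
    scale = np.sqrt((iw * ih) / (pw * ph))
    return np.array([dx, dy, scale])


def train_interactee_regressor(person_features, layouts):
    """Fit a generic (action-independent) mapping from person cues to layout."""
    model = MultiOutputRegressor(GradientBoostingRegressor())
    model.fit(person_features, layouts)
    return model


def interactee_saliency_map(model, person_feature, person_box, image_hw,
                            sigma_frac=0.5):
    """Predict a layout for one person and render it as a 2D saliency map."""
    dx, dy, scale = model.predict(person_feature[None, :])[0]
    px, py, pw, ph = person_box
    cx = px + pw / 2 + dx * pw          # predicted interactee center (x)
    cy = py + ph / 2 + dy * ph          # predicted interactee center (y)
    size = scale * np.sqrt(pw * ph)     # predicted interactee size in pixels
    h, w = image_hw
    ys, xs = np.mgrid[0:h, 0:w]
    sigma = max(sigma_frac * size, 1.0)  # spread saliency around the prediction
    sal = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return sal / sal.max()
```

A point-estimate regressor is used here only for brevity; a model that outputs a full distribution over position and scale (rather than a single prediction) would more directly yield the multimodal saliency maps described in the abstract.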

Keywords

Human-object interaction, Importance, Objectness

Notes

Acknowledgments

We thank the anonymous reviewers for their feedback, and Larry Zitnick for sharing the captioning data with us. We also thank the Texas Advanced Computing Center (TACC) for providing computing resources. This research is supported in part by ONR PECASE N00014-15-1-2291.


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Computer Science, University of Texas at Austin, Austin, USA
