International Journal of Computer Vision, Volume 119, Issue 3, pp 346–373

Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data

  • Marcus Rohrbach
  • Anna Rohrbach
  • Michaela Regneri
  • Sikandar Amin
  • Mykhaylo Andriluka
  • Manfred Pinkal
  • Bernt Schiele

Abstract

Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset that provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are characterized by low inter-class variability and typically involve fine-grained body motions. We explore how human pose and hands can help address this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To address the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that their essential components can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposing composites into attributes allows information to be shared across them and is essential for attacking this hard task. Using script data, we can recognize novel composites without any training data for them.
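The zero-shot claim in the last sentence can be made concrete: attribute classifiers for fine-grained activities (e.g. cut, stir, peel) are applied to a video, and a novel composite (e.g. "prepare a carrot") is scored by combining those attribute scores with attribute-composite associations mined from textual script data. The Python sketch below illustrates this pipeline under simplifying assumptions; the function name, the linear scoring rule, and the toy numbers are ours, not the paper's exact model.

    import numpy as np

    # Illustrative sketch (our assumption, not the authors' exact model):
    # score composite activities from attribute classifier outputs and
    # script-derived attribute-composite association weights.
    def composite_scores(attribute_probs, script_associations):
        # attribute_probs: (A,) classifier scores for fine-grained
        #   attributes predicted on one video, e.g. P(cut), P(stir), P(peel).
        # script_associations: (C, A) weights mined from scripts, e.g.
        #   tf-idf-style counts of attribute mentions per composite.
        # Returns (C,) scores; the argmax composite can be predicted even
        # if no training video of that composite exists (zero-shot).
        return script_associations @ attribute_probs

    # Toy example: 3 attributes, 2 composites.
    attr = np.array([0.9, 0.1, 0.7])      # e.g. P(cut), P(stir), P(peel)
    assoc = np.array([[0.8, 0.0, 0.9],    # scripts for 'prepare a carrot'
                      [0.1, 0.9, 0.0]])   # scripts for 'scrambled eggs'
    print(composite_scores(attr, assoc))  # -> [1.35 0.18]: 'prepare a carrot'

Because the same attribute classifiers are shared by all composites, information learned from composites with training videos transfers to novel composites that are described only by scripts.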

Keywords

Activity recognition · Fine-grained recognition · Script data · Hand detection

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Marcus Rohrbach (1, 2)
  • Anna Rohrbach (1)
  • Michaela Regneri (3, 6)
  • Sikandar Amin (1, 4)
  • Mykhaylo Andriluka (1, 5)
  • Manfred Pinkal (3)
  • Bernt Schiele (1)
  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. UC Berkeley EECS and ICSI, Berkeley, USA
  3. Department of Computational Linguistics and Phonetics, Saarland University, Saarbrücken, Germany
  4. Department of Informatics, Technische Universität München, München, Germany
  5. Stanford University, Stanford, USA
  6. SPIEGEL-Verlag, IT Department, Hamburg, Germany
