World Wide Web, Volume 19, Issue 2, pp 265–276

From constrained to unconstrained datasets: an evaluation of local action descriptors and fusion strategies for interaction recognition

  • Chenqiang Gao
  • Luyu Yang
  • Yinhe Du
  • Zeming Feng
  • Jiang Liu


Interaction recognition is an important task in computer vision and has attracted extensive attention because of its wide range of potential applications. Existing methods mainly address interaction recognition on constrained datasets, which were collected for experimental purposes and contain few variations in scene, viewpoint, and background clutter. The performance of recently proposed methods on the available constrained dataset is approaching saturation, so it is no longer suitable for evaluating the robustness of new methods. In this paper, we introduce a new unconstrained dataset, called WEB-interaction, collected from the Internet. WEB-interaction is more representative of realistic scenes and is considerably more challenging than existing datasets. We also evaluate the state-of-the-art interaction recognition pipeline on both the WEB-interaction and UT-interaction datasets. The evaluation results reveal that MBHx and MBHy, the two components of the Motion Boundary Histogram (MBH), are important feature descriptors for interaction recognition, and that MBHx carries the relatively more dominant information. As for the fusion strategy, late fusion improves performance more than early fusion. The effects of filming conditions are also evaluated on the WEB-interaction dataset. In addition, the best average precision (AP) obtained with different features on our WEB-interaction dataset is 44.2 %, and the mean is around 38 %. Compared with the UT-interaction dataset, our dataset leaves more room for improvement, which makes it more useful for promoting new methods.
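The difference between the two fusion strategies, and the AP measure used to report results, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the classifier, the fusion rule, or the AP variant, so the sketch uses plain concatenation for early fusion, an unweighted mean of per-descriptor scores for late fusion, and non-interpolated AP. The function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Sequence


def early_fusion(descriptors: Dict[str, Sequence[float]]) -> List[float]:
    """Early fusion (assumed form): concatenate the feature vectors of all
    descriptors for one video, in sorted name order, before classification."""
    fused: List[float] = []
    for name in sorted(descriptors):
        fused.extend(descriptors[name])
    return fused


def late_fusion(scores: Dict[str, Sequence[float]]) -> List[float]:
    """Late fusion (assumed form): average, video by video, the scores of
    classifiers trained separately on each descriptor."""
    columns = list(scores.values())
    if not columns:
        raise ValueError("at least one descriptor is required")
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all descriptors must score the same videos")
    return [sum(col[i] for col in columns) / len(columns) for i in range(n)]


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of the precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


if __name__ == "__main__":
    # Hypothetical classifier scores for four videos; 1 marks a positive video.
    per_descriptor = {"MBHx": [0.9, 0.2, 0.6, 0.1], "MBHy": [0.7, 0.4, 0.3, 0.2]}
    labels = [1, 0, 1, 0]
    print(average_precision(late_fusion(per_descriptor), labels))
```

In this sketch, early fusion yields a single longer vector per video for one classifier, whereas late fusion combines the outputs of one classifier per descriptor, so the two strategies differ in where the combination happens rather than in the features themselves.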


Interaction recognition; Interaction dataset; Feature fusion; The Internet



This work is supported by the National Natural Science Foundation of China (No. 61102131, 61275099), the Natural Science Foundation of Chongqing Science and Technology Commission (No. cstc2014jcyjA40048), the Cooperation of Industry, Education and Academy of Chongqing University of Posts and Telecommunications (No. WF201404), and the Chongqing Distinguished Youth Foundation (No. CSTC2011jjjq40002).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Chenqiang Gao (1)
  • Luyu Yang (1)
  • Yinhe Du (1)
  • Zeming Feng (1)
  • Jiang Liu (1)

  1. Chongqing Key Laboratory of Signal and Information Processing, Chongqing University of Posts and Telecommunications, Chongqing, China
