Abstract
We propose two complementary techniques to improve the performance of action recognition systems. The first technique addresses the temporal interval ambiguity of actions by learning a classifier score distribution over video subsequences. A classifier based on this score distribution is shown to be more effective than using the maximum or average scores. The second technique learns a classifier for the relative values of action scores, capturing the correlation and exclusion between action classes. Both techniques are simple and have efficient implementations using a Least-Squares SVM. We demonstrate that, taken together, the techniques exceed state-of-the-art performance by a wide margin on challenging benchmarks for human actions.
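The two ideas in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the score distribution is represented by evenly spaced order statistics of the sorted subsequence scores, the relative values are obtained by subtracting each video's mean class score, and the Least-Squares SVM is a linear one solved in closed form. The function names `score_distribution`, `relative_score_features`, and `fit_ls_svm` are hypothetical and are not taken from the paper.

```python
import numpy as np


def score_distribution(scores, n_bins=5):
    """Summarise subsequence scores as a fixed-length sorted vector.

    Assumption: the distribution is represented by n_bins evenly spaced
    order statistics of the scores, sorted in descending order. The first
    entry is the maximum score, so max pooling is a special case.
    """
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    idx = np.linspace(0, len(s) - 1, n_bins).round().astype(int)
    return s[idx]


def relative_score_features(class_scores):
    """Centre each video's class scores so only relative values remain.

    Assumption: "relative values" are modelled here by subtracting the
    per-video mean over classes. class_scores has shape (videos, classes).
    """
    a = np.asarray(class_scores, dtype=float)
    return a - a.mean(axis=1, keepdims=True)


def fit_ls_svm(X, y, lam=1.0):
    """Linear least-squares SVM solved in closed form.

    Minimises lam * ||w||^2 + sum_i (w . x_i + b - y_i)^2 with labels
    y_i in {-1, +1}. The bias b is not regularised.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unregularised
    wb = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
    return wb[:-1], wb[-1]


if __name__ == "__main__":
    # Toy data: each video has a handful of made-up subsequence scores.
    videos = [[0.9, 0.8, 0.1], [0.7, 0.9, 0.6], [0.1, 0.2, 0.0], [0.3, 0.1, 0.2]]
    labels = [1, 1, -1, -1]
    X = np.array([score_distribution(v, n_bins=3) for v in videos])
    w, b = fit_ls_svm(X, labels, lam=0.1)
    print(np.sign(X @ w + b))
```

In this sketch the learned weights over the sorted scores play the role that a fixed rule such as the maximum or the average would otherwise play. The same `fit_ls_svm` routine could be applied to the output of `relative_score_features` for the second technique.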
Acknowledgements
This work was supported by the EPSRC grant EP/I012001/1 and a Royal Society Wolfson Research Merit Award.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hoai, M., Zisserman, A. (2015). Improving Human Action Recognition Using Score Distribution and Ranking. In: Cremers, D., Reid, I., Saito, H., Yang, MH. (eds) Computer Vision -- ACCV 2014. ACCV 2014. Lecture Notes in Computer Science, vol 9007. Springer, Cham. https://doi.org/10.1007/978-3-319-16814-2_1
DOI: https://doi.org/10.1007/978-3-319-16814-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16813-5
Online ISBN: 978-3-319-16814-2
eBook Packages: Computer Science, Computer Science (R0)