Action Recognition Using Co-trained Deep Convolutional Neural Networks

  • Conference paper

In: Artificial Intelligence. IJCAI 2019 International Workshops (IJCAI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12158)

Abstract

Deep convolutional networks have become ubiquitous in computer vision owing to their success in visual recognition tasks on still images. However, their adaptations to video classification have not clearly established their superiority over conventional hand-crafted features. Existing CNN methods for action recognition typically train multiple streams that handle spatial and temporal information independently and then combine their prediction scores. Relatively little is known, however, about the benefits of combining these modalities during the training process. In this work, we propose a novel semi-supervised learning approach that allows multiple streams to supervise each other in a co-training strategy, making training simultaneous across the two modalities. We show that transferring information between the networks by predicting labels on an unlabeled set outperforms state-of-the-art methods. Furthermore, we show that our approach achieves performance comparable to existing methods while using less data. We demonstrate the effectiveness of our approach through extensive experiments on the UCF101 and HMDB datasets.
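
For concreteness, below is a minimal sketch of the co-training loop the abstract describes: the spatial and temporal streams exchange confident pseudo-labels on an unlabeled set, so each stream supervises the other during training. This is an illustrative PyTorch-style reconstruction, not the authors' implementation (which used Caffe; see footnote 2); names such as spatial_net, temporal_net, and the confidence threshold tau are assumptions.

```python
import torch
import torch.nn.functional as F

def cotrain_step(spatial_net, temporal_net, labeled, unlabeled,
                 opt_s, opt_t, tau=0.95):
    """One co-training round (illustrative): each stream is trained on the
    labeled set plus confident pseudo-labels produced by the *other* stream
    on unlabeled clips, in the spirit of classic co-training."""
    rgb_l, flow_l, y = labeled    # labeled RGB frames, flow stacks, labels
    rgb_u, flow_u = unlabeled     # the same unlabeled clips in both views

    # 1. Each stream predicts class posteriors for the unlabeled set.
    with torch.no_grad():
        p_s = F.softmax(spatial_net(rgb_u), dim=1)    # spatial view
        p_t = F.softmax(temporal_net(flow_u), dim=1)  # temporal view

    # 2. Keep only confident predictions; they supervise the other view.
    conf_s, pseudo_s = p_s.max(dim=1)
    conf_t, pseudo_t = p_t.max(dim=1)
    teach_temporal = conf_s > tau   # spatial stream teaches temporal stream
    teach_spatial = conf_t > tau    # temporal stream teaches spatial stream

    # 3. Update the spatial stream on labeled data + temporal pseudo-labels.
    loss_s = F.cross_entropy(spatial_net(rgb_l), y)
    if teach_spatial.any():
        loss_s = loss_s + F.cross_entropy(
            spatial_net(rgb_u[teach_spatial]), pseudo_t[teach_spatial])
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # 4. Symmetric update for the temporal stream.
    loss_t = F.cross_entropy(temporal_net(flow_l), y)
    if teach_temporal.any():
        loss_t = loss_t + F.cross_entropy(
            temporal_net(flow_u[teach_temporal]), pseudo_s[teach_temporal])
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```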

Notes

  1. Since we consider the spatial and temporal aspects as two views of the data, we use the terms streams and views interchangeably.

  2. https://github.com/yjxiong/caffe.

Author information

Correspondence to Le Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, L., Varadarajan, J., Pei, Y. (2020). Action Recognition Using Co-trained Deep Convolutional Neural Networks. In: El Fallah Seghrouchni, A., Sarne, D. (eds) Artificial Intelligence. IJCAI 2019 International Workshops. IJCAI 2019. Lecture Notes in Computer Science (LNAI), vol 12158. Springer, Cham. https://doi.org/10.1007/978-3-030-56150-5_8

  • DOI: https://doi.org/10.1007/978-3-030-56150-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-56149-9

  • Online ISBN: 978-3-030-56150-5

  • eBook Packages: Computer Science, Computer Science (R0)
