Action Recognition Using Co-trained Deep Convolutional Neural Networks

  • Conference paper

In: Artificial Intelligence. IJCAI 2019 International Workshops (IJCAI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12158)

Abstract

Deep convolutional networks have become ubiquitous in computer vision owing to their success in visual recognition tasks on still images. However, their adaptations to video classification have not clearly established their superiority over conventional hand-crafted features. Existing CNN methods for action recognition typically train multiple streams that handle spatial and temporal information independently and then combine their prediction scores. Relatively little is known, however, about the benefits of combining these modalities during the training process. In this work, we propose a novel semi-supervised learning approach that allows multiple streams to supervise each other in a co-training strategy, making training simultaneous across the two modalities. We show that transferring information between the networks by predicting labels on an unlabeled set outperforms state-of-the-art methods. Furthermore, we show that our approach achieves performance comparable to existing methods while using less data. We demonstrate the effectiveness of our approach through extensive experiments on the UCF101 and HMDB datasets.
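
For concreteness, below is a minimal sketch of the co-training loop the abstract describes: the spatial and temporal streams exchange confident pseudo-labels on an unlabeled set, so each stream supervises the other during training. This is an illustrative PyTorch-style reconstruction, not the authors' implementation (which used Caffe; see footnote 2); names such as spatial_net, temporal_net, and the confidence threshold tau are assumptions.

```python
import torch
import torch.nn.functional as F

def cotrain_step(spatial_net, temporal_net, labeled, unlabeled,
                 opt_s, opt_t, tau=0.95):
    """One co-training round (illustrative): each stream is trained on the
    labeled set plus confident pseudo-labels produced by the *other* stream
    on unlabeled clips, in the spirit of classic co-training."""
    rgb_l, flow_l, y = labeled    # labeled RGB frames, flow stacks, labels
    rgb_u, flow_u = unlabeled     # the same unlabeled clips in both views

    # 1. Each stream predicts class posteriors for the unlabeled set.
    with torch.no_grad():
        p_s = F.softmax(spatial_net(rgb_u), dim=1)    # spatial view
        p_t = F.softmax(temporal_net(flow_u), dim=1)  # temporal view

    # 2. Keep only confident predictions; they supervise the other view.
    conf_s, pseudo_s = p_s.max(dim=1)
    conf_t, pseudo_t = p_t.max(dim=1)
    teach_temporal = conf_s > tau   # spatial stream teaches temporal stream
    teach_spatial = conf_t > tau    # temporal stream teaches spatial stream

    # 3. Update the spatial stream on labeled data + temporal pseudo-labels.
    loss_s = F.cross_entropy(spatial_net(rgb_l), y)
    if teach_spatial.any():
        loss_s = loss_s + F.cross_entropy(
            spatial_net(rgb_u[teach_spatial]), pseudo_t[teach_spatial])
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # 4. Symmetric update for the temporal stream.
    loss_t = F.cross_entropy(temporal_net(flow_l), y)
    if teach_temporal.any():
        loss_t = loss_t + F.cross_entropy(
            temporal_net(flow_u[teach_temporal]), pseudo_s[teach_temporal])
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```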

Notes

  1. Since we consider the spatial and temporal aspects as two views of the data, we use the terms streams and views interchangeably.

  2. https://github.com/yjxiong/caffe.

Author information

Correspondence to Le Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, L., Varadarajan, J., Pei, Y. (2020). Action Recognition Using Co-trained Deep Convolutional Neural Networks. In: El Fallah Seghrouchni, A., Sarne, D. (eds) Artificial Intelligence. IJCAI 2019 International Workshops. IJCAI 2019. Lecture Notes in Computer Science (LNAI), vol 12158. Springer, Cham. https://doi.org/10.1007/978-3-030-56150-5_8

  • DOI: https://doi.org/10.1007/978-3-030-56150-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-56149-9

  • Online ISBN: 978-3-030-56150-5

  • eBook Packages: Computer Science, Computer Science (R0)
