Graph Distillation for Action Detection with Privileged Modalities

  • Zelun Luo
  • Jun-Ting Hsieh
  • Lu Jiang
  • Juan Carlos Niebles
  • Li Fei-Fei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)

Abstract

We propose a technique that tackles action detection in multimodal videos under a realistic and challenging condition in which only limited training data and partially observed modalities are available. Common methods in transfer learning do not take advantage of the extra modalities potentially available in the source domain. On the other hand, previous work on multimodal learning only focuses on a single domain or task and does not handle the modality discrepancy between training and testing. In this work, we propose a method termed graph distillation that incorporates rich privileged information from a large-scale multimodal dataset in the source domain, and improves the learning in the target domain where training data and modalities are scarce. We evaluate our approach on action classification and detection tasks in multimodal videos, and show that our model outperforms the state-of-the-art by a large margin on the NTU RGB+D and PKU-MMD benchmarks. The code is released at http://alan.vision/eccv18_graph/.
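The abstract compresses the method into a single sentence, so a rough illustration may help. The sketch below is a minimal, hypothetical rendering of the general idea only, not the paper's actual formulation: each modality network is trained with its own cross-entropy loss plus pairwise distillation terms from the other modalities, weighted by a learned modality graph. All names are assumptions; in particular, the edge weights here are a static learned matrix, whereas the paper infers the graph dynamically, and the distillation term follows Hinton et al.'s softened-logit KL rather than the paper's exact objective.

```python
# Hypothetical sketch of a graph-weighted distillation loss across modalities.
# Assumptions (not from the paper): per-modality logits are already computed,
# edge weights come from a single learnable (M, M) score matrix, and the
# distillation term is KL divergence on temperature-softened logits.
import torch
import torch.nn.functional as F


def graph_distillation_loss(logits, labels, edge_logits, temperature=2.0, alpha=0.5):
    """logits: dict modality name -> (B, C) logits; edge_logits: (M, M) scores."""
    names = sorted(logits)
    m = len(names)
    # Mask the diagonal so a modality never distills from itself, then
    # normalize each row so the "trust" over teacher modalities sums to 1.
    mask = torch.eye(m, dtype=torch.bool)
    W = F.softmax(edge_logits.masked_fill(mask, float("-inf")), dim=1)
    total = 0.0
    for i, src in enumerate(names):
        ce = F.cross_entropy(logits[src], labels)  # supervised term
        distill = 0.0
        for j, tgt in enumerate(names):
            if i == j:
                continue
            # Soft targets from the teacher modality; detached so gradients
            # only flow into the student branch and the edge weights.
            p_teacher = F.softmax(logits[tgt].detach() / temperature, dim=1)
            log_p_student = F.log_softmax(logits[src] / temperature, dim=1)
            kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
            distill = distill + W[i, j] * (temperature ** 2) * kl
        total = total + ce + alpha * distill
    return total / m


# Toy usage with three modalities, batch size 8, 60 action classes (NTU-like):
mods = {"rgb": torch.randn(8, 60), "depth": torch.randn(8, 60),
        "skeleton": torch.randn(8, 60)}
labels = torch.randint(0, 60, (8,))
edge_logits = torch.zeros(3, 3, requires_grad=True)  # learned with the networks
graph_distillation_loss(mods, labels, edge_logits).backward()
```

In the privileged-modality setting the distillation graph matters only during source-domain training; at test time, only the branch for the modality actually observed in the target domain (e.g., RGB) is evaluated.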

Acknowledgement

This work was supported in part by the Stanford Computer Science Department and the Clinical Excellence Research Center. We especially thank Li-Jia Li, De-An Huang, Yuliang Zou, and all the anonymous reviewers for their valuable comments.

Supplementary material

Supplementary material 1 (MP4, 15,128 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Stanford University, Stanford, USA
  2. Google Inc., Mountain View, USA
