
PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-Modalities

  • Lan Wang
  • Chenqiang Gao
  • Luyu Yang
  • Yue Zhao
  • Wangmeng Zuo
  • Deyu Meng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Data of different modalities generally convey complementary but heterogeneous information, and a more discriminative representation is often obtained by combining multiple data modalities, such as RGB and infrared features. In reality, however, obtaining both data channels is challenging due to many limitations. For example, RGB surveillance cameras are often restricted from private spaces, which conflicts with the need for abnormal-activity detection for personal security. As a result, building a full multi-modal representation from only partial data channels is clearly desirable. In this paper, we propose novel Partial-modal Generative Adversarial Networks (PM-GANs) that learn a full-modal representation using data from only partial modalities. The full representation is achieved by a generated representation that takes the place of the missing data channel. Extensive experiments are conducted to verify the performance of our proposed method on action recognition, compared with four state-of-the-art methods. Meanwhile, a new Infrared-Visible Dataset for action recognition is introduced, which will be the first publicly available action dataset containing paired infrared and visible-spectrum videos. (The dataset will be available at http://www.escience.cn/people/gaochenqiang/Publications.html.)
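The core idea described above, generating a representation for the missing channel from the available one and fusing it into a full-modal descriptor, can be illustrated with a minimal feature-level GAN sketch. This is not the architecture of the paper: the feature dimensionality, the layer shapes, the names (Generator, Discriminator, train_step, full_modal_representation), and the simple concatenation-based fusion are all illustrative assumptions.

# Minimal sketch of a partial-modality GAN at the feature level (illustrative only).
# Assumed setup: visible-spectrum features are available at test time while
# infrared features are missing; a generator produces an infrared-like feature,
# and a discriminator separates generated from real infrared features in training.
import torch
import torch.nn as nn

FEAT_DIM = 1024  # assumed dimensionality of per-clip features from either stream

class Generator(nn.Module):
    """Maps a visible-spectrum feature to a surrogate infrared representation."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )
    def forward(self, vis_feat):
        return self.net(vis_feat)

class Discriminator(nn.Module):
    """Scores whether an infrared-like feature is real or generated."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(dim // 2, 1),
        )
    def forward(self, ir_feat):
        return self.net(ir_feat)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(vis_feat, ir_feat):
    """One adversarial step on a batch of paired visible/infrared features."""
    # Discriminator: push real infrared features toward 1, generated ones toward 0.
    fake_ir = G(vis_feat).detach()
    d_loss = bce(D(ir_feat), torch.ones(ir_feat.size(0), 1)) + \
             bce(D(fake_ir), torch.zeros(vis_feat.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make its generated infrared features look real to D.
    g_loss = bce(D(G(vis_feat)), torch.ones(vis_feat.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

def full_modal_representation(vis_feat):
    """At test time only the visible channel exists; pair it with the generated
    infrared-like feature to form a stand-in for the full-modal descriptor."""
    with torch.no_grad():
        return torch.cat([vis_feat, G(vis_feat)], dim=1)

The resulting full-modal descriptor would then feed an action classifier; the concatenation here merely stands in for whatever fusion scheme the paper actually uses.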

Keywords

Cross-modal representation · Generative adversarial networks · Infrared action recognition · Infrared dataset


Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 61571071, 61661166011, 61721002), the Natural Science Foundation of Chongqing Science and Technology Commission (No. cstc2018jcyjAX0227) and the Research Innovation Program for Postgraduate of Chongqing (No. CYS17222).


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China
  2. Chongqing Key Laboratory of Signal and Information Processing, Chongqing, China
  3. University of Maryland College Park, College Park, USA
  4. Harbin Institute of Technology, Harbin, China
  5. Xi’an Jiaotong University, Xi’an, China
