Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13691)

Abstract

We address the problem of data augmentation for video action recognition. Standard augmentation strategies in video are hand-designed and sample the space of possible augmented data points either at random, without knowing which augmented points will be better, or through heuristics. We propose to learn what makes a “good” video for action recognition and to select only high-quality samples for augmentation. In particular, we choose video compositing of a foreground and a background video as the data augmentation process, which results in diverse and realistic new samples. We learn which pairs of videos to augment without having to actually composite them. This reduces the space of possible augmentations, which has two advantages: it saves computational cost, and it increases the accuracy of the final trained classifier, since the augmented pairs are of higher quality than average. We present experimental results across the entire spectrum of training settings: few-shot, semi-supervised, and fully supervised. We observe consistent improvements over prior work and baselines on Kinetics, UCF101, and HMDB51, and achieve a new state of the art in settings with limited data, with improvements of up to 8.6% in the semi-supervised setting. Project page: https://sites.google.com/view/learn2augment/home.
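
To make the selection idea in the abstract concrete, below is a minimal PyTorch sketch of how a learned scorer might rank candidate (foreground, background) video pairs before any expensive compositing is performed. Everything here (the PairScorer architecture, the 512-d features, the select_pairs helper) is an illustrative assumption, not the paper's actual model or released code; it shows only the shape of the technique: score every candidate pair from cheap per-video features, then composite only the top-ranked pairs.

```python
# A minimal sketch under assumed names: PairScorer and select_pairs
# are hypothetical illustrations, not the authors' implementation.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a (foreground, background) candidate pair from cheap
    per-video features, so compositing runs only on promising pairs."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, fg_feat: torch.Tensor, bg_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two feature vectors and predict a scalar quality score.
        return self.mlp(torch.cat([fg_feat, bg_feat], dim=-1)).squeeze(-1)

@torch.no_grad()
def select_pairs(fg_feats: torch.Tensor, bg_feats: torch.Tensor,
                 scorer: PairScorer, k: int) -> torch.Tensor:
    """Rank all candidate pairs by predicted quality and keep the top-k,
    without ever rendering the composited videos."""
    scores = scorer(fg_feats, bg_feats)      # shape: (num_pairs,)
    return torch.topk(scores, k=k).indices   # indices of the k best pairs

# Toy usage: 100 candidate pairs of 512-d features, keep the 10 best.
scorer = PairScorer()
fg_feats = torch.randn(100, 512)
bg_feats = torch.randn(100, 512)
best = select_pairs(fg_feats, bg_feats, scorer, k=10)
```

Only the selected pairs would then be segmented and composited into new training clips. How the scorer itself is trained (e.g., against the downstream classifier's accuracy) is a detail of the paper not reproduced in this sketch.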



Author information


Correspondence to Shreyank N. Gowda.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 4244 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gowda, S.N., Rohrbach, M., Keller, F., Sevilla-Lara, L. (2022). Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13691. Springer, Cham. https://doi.org/10.1007/978-3-031-19821-2_14

  • DOI: https://doi.org/10.1007/978-3-031-19821-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19820-5

  • Online ISBN: 978-3-031-19821-2

  • eBook Packages: Computer Science; Computer Science (R0)