Abstract
Adversarial attacks on video recognition models have recently received growing attention. However, most existing works treat all video frames equally and ignore their temporal interactions. To overcome this drawback, a few methods first select key frames and then perform attacks based on them. Unfortunately, their selection strategy is independent of the attacking step, so the resulting performance is limited. We argue instead that the frame selection phase is closely coupled with the attacking phase: the key frames should be adjusted according to the attacking results. To this end, we formulate black-box video attacks as a Reinforcement Learning (RL) problem. Specifically, the environment in RL is the recognition model, and the agent in RL plays the role of frame selection. By continuously querying the recognition model and receiving attack feedback, the agent gradually adjusts its frame selection strategy, and the adversarial perturbations become smaller and smaller. We conduct a series of experiments with two mainstream video recognition models, C3D and LRCN, on the public UCF-101 and HMDB-51 datasets. The results demonstrate that the proposed method significantly reduces the adversarial perturbations while requiring few queries.
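The agent-environment loop the abstract describes can be illustrated with a minimal sketch: a REINFORCE-style agent learns which frames to select by repeatedly querying a black-box model and using the returned score as reward. This is not the authors' implementation; the toy `black_box_score` model, the hard-coded `key_frames`, and all hyperparameters are illustrative assumptions.

```python
# Toy sketch (hypothetical, not the paper's method): learn a per-frame
# selection policy from black-box query feedback via REINFORCE.
import numpy as np

rng = np.random.default_rng(0)
T = 16  # number of video frames

def black_box_score(mask):
    # Stand-in for the black-box recognition model's feedback: attacking
    # frames 3..6 is assumed most effective, while selecting more frames
    # (a larger perturbation) is penalized.
    key_frames = {3, 4, 5, 6}
    hit = sum(mask[t] for t in key_frames)
    return hit - 0.2 * mask.sum()

# Agent: independent Bernoulli selection probability per frame.
logits = np.zeros(T)
baseline = 0.0  # running reward baseline to reduce gradient variance
lr = 0.3

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-logits))
    mask = (rng.random(T) < p).astype(float)  # sample a frame selection
    reward = black_box_score(mask)            # one query to the model
    advantage = reward - baseline
    baseline = 0.9 * baseline + 0.1 * reward
    # REINFORCE gradient for a Bernoulli policy: (mask - p) * advantage
    logits += lr * (mask - p) * advantage

p = 1.0 / (1.0 + np.exp(-logits))
print(p.round(2))  # selection probabilities should peak on the key frames
```

Over the iterations, frames whose selection raises the reward are sampled ever more often, mirroring how the agent in the paper concentrates the perturbation on a few key frames as query feedback accumulates.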
References
Akhtar, N., & Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6, 14410–14430.
Bose, J. A., & Aarabi, P. (2018). Adversarial attacks on face detectors using neural net based constrained optimization. pp. 1–6.
Cheng, M., Le, T., Chen, P.-Y., Zhang, H., Yi, J., & Hsieh, C.-J. (2019). Query-efficient hard-label black-box attack: An optimization-based approach. In International conference on learning representations.
Cheng, M., Singh, S., Chen, P.-Y., Liu, S., & Hsieh, C.-J. (2020). Sign-opt: A query-efficient hard-label adversarial attack. In International conference on learning representations.
Croce, F., Rauber, J., & Hein, M. (2020). Scaling up the randomized gradient-free adversarial attack reveals overestimation of robustness using established attacks. International Journal of Computer Vision, 128(4), 1028–1046.
Das, N., Shanbhogue, M., Chen, S., Hohman, F., Li, S., Chen, L., Kounavis, M. E., & Chau, D. H. (2018). Shield: Fast, practical defense and vaccination for deep learning using jpeg compression. In Knowledge discovery and data mining, pp. 196–204.
Deng, L., Chen, J., Sun, Q., He, X., Tang, S., Ming, Z., Zhang, Y., & Chua, T.-S. (2019). Mixed-dish recognition with contextual relation networks. In Proceedings of the 27th ACM International conference on multimedia.
Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., & Darrell, T. (2017). Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 677–691.
Dong, W., Zhang, Z., & Tan, T. (2019). Attention-aware sampling via deep reinforcement learning for action recognition. In National Conference on Artificial Intelligence, 33, 8247–8254.
Dong, Y., Su, H., Wu, B., Li, Z., Liu, W., Zhang, T., & Zhu, J. (2019). Efficient decision-based black-box adversarial attacks on face recognition. In Computer vision and pattern recognition, pp. 7714–7722.
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J. M., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118.
Goodfellow, I. J., Jonathon, S., & Christian, S. (2015). Explaining and harnessing adversarial examples. In International conference on learning representations.
Goswami, G., Agarwal, A., Ratha, N., Singh, R., & Vatsa, M. (2019). Detecting and mitigating adversarial perturbations for robust face recognition. International Journal of Computer Vision, 127(6), 719–742.
Guo, C., Rana, M., Cisse, M., & Van Der Maaten, L. (2017). Countering adversarial images using input transformations. International conference on learning representations.
Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Computer vision and pattern recognition, pp. 6546–6555.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition, pp. 770–778.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Ilyas, A., Engstrom, L., Athalye, A., & Lin, J. (2018). Black-box adversarial attacks with limited queries and information. In International conference on machine learning.
Jia, X., Wei, X., & Cao, X. (2019). Identifying and resisting adversarial videos using temporal consistency. arXiv preprint arXiv:1909.04837.
Jia, X., Wei, X., Cao, X., & Foroosh, H. (2019). Comdefend: An efficient image compression model to defend adversarial examples. In 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp. 6084–6092.
Jiang, L., Ma, X., Chen, S., Bailey, J., & Jiang, Y.-G. (2019). Black-box adversarial attacks on video recognition models. In ACM multimedia, pp. 864–872.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In 2011 International conference on computer vision, IEEE, pp. 2556–2563.
Lecun, Y., Bengio, Y., & Hinton, G. E. (2015). Deep learning. Nature, 521(7553), 436–444.
Li, S., Neupane, A., Paul, S., Song, C., Krishnamurthy, S. V., Roy-Chowdhury, A. K., & Swami, A. (2019). Stealthy adversarial perturbations against real-time video classification systems. In Network and distributed system security symposium.
Li, Y. (2017). Deep reinforcement learning: An overview. arXiv: Learning.
Litjens, G. J. S., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., Van Der Laak, J. A., Van Ginneken, B., & Sanchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42, 60–88.
Liu, S., Chen, P. Y., Chen, X., & Hong, M. (2019). Signsgd via zeroth-order oracle. In 7th International conference on learning representations, ICLR 2019.
Lu, J., Sibai, H., Fabry, E., & Forsyth, D. (2017). No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. In International conference on learning representations.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moosavi-Dezfooli, S., Fawzi, A., & Frossard, P. (2016). Deepfool: A simple and accurate method to fool deep neural networks. In Computer vision and pattern recognition, pp. 2574–2582.
Nezami, O. M., Chaturvedi, A., Dras, M., & Garain, U. (2020). Pick-object-attack: Type-specific adversarial attack for object detection.
Prakash, A., Moran, N., Garber, S., DiLillo, A., & Storer, J. (2018). Deflecting adversarial attacks with pixel deflection. In 2018 IEEE/CVF conference on computer vision and pattern recognition, pp. 8571–8580.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T. P., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354–359.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Computer vision and pattern recognition, pp. 2818–2826.
Teng, S., Zhang, S., Huang, Q., & Sebe, N. (2021). Viewpoint and scale consistency reinforcement for UAV vehicle re-identification. International Journal of Computer Vision, 129(3), 719–735.
Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I. J., Boneh, D., McDaniel, P. D. (2018). Ensemble adversarial training: Attacks and defenses. In International conference on learning representations.
Wei, X., Liang, S., Chen, N., & Cao, X. (2019). Transferable adversarial attacks for image and video object detection. In International joint conference on artificial intelligence, pp. 954–960.
Wei, X., Zhu, J., Feng, S., & Su, H. (2018). Video-to-video translation with global temporal consistency. In Proceedings of the 26th ACM International conference on multimedia.
Wei, X., Zhu, J., Yuan, S., & Su, H. (2019). Sparse adversarial perturbations for videos. In National Conference on Artificial Intelligence, 33, 8973–8980.
Wei, Z., Chen, J., Wei, X., & Jiang, Y.-G. (2020). Heuristic black-box adversarial attacks on video recognition models. In National conference on artificial intelligence.
Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., & Schmidhuber, J. (2014). Natural evolution strategies. The Journal of Machine Learning Research, 15(1), 949–980.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, A. L. (2017). Mitigating adversarial effects through randomization. In International conference on learning representations.
Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, A. L. (2017). Adversarial examples for semantic segmentation and object detection. International conference on computer vision, pp. 1378–1387.
Zhang, H., & Wang, J. (2019). Towards adversarially robust object detection. International conference on computer vision, pp. 421–430.
Zhou, K., Qiao, Y., Xiang, T. (2018). Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In National conference on artificial intelligence.
Acknowledgements
This work is supported by the National Key R&D Program of China (Grant No. 2020AAA0104002) and the National Natural Science Foundation of China (No. 62076018). We also thank the anonymous reviewers for their valuable suggestions.
Additional information
Communicated by Wenjun Kevin Zeng.
Cite this article
Wei, X., Yan, H. & Li, B. Sparse Black-Box Video Attack with Reinforcement Learning. Int J Comput Vis 130, 1459–1473 (2022). https://doi.org/10.1007/s11263-022-01604-w