
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)

Abstract

Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs). Data augmentation, one of the standard techniques for training DNNs, has been widely used in many computer vision tasks. However, few works study data augmentation for VQA, and none of the existing image-based augmentation schemes (such as rotation and flipping) can be applied directly, because the semantic structure of an \(\langle image, question, answer\rangle \) triplet must be preserved. For example, a direction-related Question-Answer (QA) pair may no longer hold once the associated image is rotated or flipped. In this paper, instead of directly manipulating images and questions, we use generated adversarial examples of both images and questions as the augmented data. The augmented examples change neither the visual properties presented in the image nor the semantic meaning of the question, so the correctness of the \(\langle image, question, answer\rangle \) triplet is maintained. We then use adversarial learning to train a classic VQA model (BUTD) with our augmented data. Compared to the baseline model, we not only improve the overall performance on VQAv2, but also withstand adversarial attacks more effectively. The source code is available at https://github.com/zaynmi/seada-vqa.
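To make the idea concrete, the sketch below illustrates one way such adversarial data augmentation could be wired into training: perturb the visual input with a gradient-based attack (plain FGSM here, for brevity) and fit the model on both the clean and the perturbed \(\langle image, question, answer\rangle \) triplets. This is a minimal sketch under assumed interfaces, not the paper's exact pipeline: `vqa_model`, `features`, `question`, and `answer` are hypothetical placeholders, and the authors' method additionally generates semantically equivalent question paraphrases, which this sketch omits.

```python
# Minimal sketch (assumptions: a PyTorch VQA model taking region features and a
# tokenized question, and answers encoded as soft score vectors as in VQAv2).
import torch
import torch.nn.functional as F

def fgsm_features(vqa_model, features, question, answer, epsilon=0.3):
    """Return adversarially perturbed region features for one batch (FGSM)."""
    features = features.clone().detach().requires_grad_(True)
    logits = vqa_model(features, question)
    loss = F.binary_cross_entropy_with_logits(logits, answer)
    loss.backward()
    # One signed-gradient step; the answer stays valid because the visual
    # content is only changed imperceptibly, unlike rotation or flipping.
    return (features + epsilon * features.grad.sign()).detach()

def adversarial_training_step(vqa_model, optimizer, batch, epsilon=0.3):
    """Train on the clean triplet and its adversarially augmented counterpart."""
    features, question, answer = batch
    adv_features = fgsm_features(vqa_model, features, question, answer, epsilon)
    optimizer.zero_grad()  # clear gradients accumulated by the attack
    clean_loss = F.binary_cross_entropy_with_logits(
        vqa_model(features, question), answer)
    adv_loss = F.binary_cross_entropy_with_logits(
        vqa_model(adv_features, question), answer)
    loss = clean_loss + adv_loss  # jointly fit clean and augmented examples
    loss.backward()
    optimizer.step()
    return loss.item()
```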

Keywords

VQA · Data augmentation · Adversarial learning


Acknowledgements

This work was supported by the National Key Research and Development Program of China (2016YFB1001003), NSFC (U19B2035, 61527804, 60906119), and STCSM (18DZ1112300). C. Ma was sponsored by the Shanghai Pujiang Program.

Supplementary material

Supplementary material 1: 504475_1_En_26_MOESM1_ESM.pdf (16.2 MB)

References

  1. Agarwal, V., Shetty, R., Fritz, M.: Towards causal VQA: revealing and reducing spurious correlations by invariant and covariant semantic editing. arXiv preprint arXiv:1912.07538 (2019)
  2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
  3. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705 (2016)
  4. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48 (2016)
  5. Belinkov, Y., Bisk, Y.: Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173 (2017)
  6. Ben-Younes, H., Cadene, R., Cord, M., Thome, N.: MUTAN: multimodal Tucker fusion for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2612–2620 (2017)
  7. Blohm, M., Jagfeld, G., Sood, E., Yu, X., Vu, N.T.: Comparing attention-based convolutional and recurrent neural networks: success and limitations in machine reading comprehension. arXiv preprint arXiv:1808.08744 (2018)
  8. Carlini, N., et al.: On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705 (2019)
  9. Ebrahimi, J., Lowd, D., Dou, D.: On adversarial examples for character-level neural machine translation. arXiv preprint arXiv:1806.09030 (2018)
  10. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
  11. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
  12. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
  13. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)
  14. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: end-to-end module networks for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813 (2017)
  15. Iyyer, M., Wieting, J., Gimpel, K., Zettlemoyer, L.: Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059 (2018)
  16. Jain, U., Zhang, Z., Schwing, A.G.: Creativity: generating diverse questions using variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6485–6494 (2017)
  17. Kafle, K., Kanan, C.: Visual question answering: datasets, algorithms, and future challenges. Comput. Vis. Image Underst. 163, 3–20 (2017)
  18. Kafle, K., Yousefhussien, M., Kanan, C.: Data augmentation for visual question answering. In: Proceedings of the 10th International Conference on Natural Language Generation, pp. 198–202 (2017)
  19. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems, pp. 1564–1574 (2018)
  20. Kim, J.H., On, K.W., Lim, W., Kim, J., Ha, J.W., Zhang, B.T.: Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325 (2016)
  21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  22. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
  23. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016)
  24. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236 (2016)
  25. Li, Y., et al.: Visual question generation as dual task of visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6116–6124 (2018)
  26. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in Neural Information Processing Systems, pp. 289–297 (2016)
  27. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  28. Mallinson, J., Sennrich, R., Lapata, M.: Paraphrasing revisited with neural machine translation. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 881–893 (2017)
  29. Norcliffe-Brown, W., Vafeias, S., Parisot, S.: Learning conditioned graph structures for interpretable visual question answering. In: Advances in Neural Information Processing Systems, pp. 8334–8343 (2018)
  30. Patro, B., Namboodiri, V.P.: Differential attention for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7680–7688 (2018)
  31. Peng, G., et al.: Dynamic fusion with intra- and inter-modality attention flow for visual question answering. arXiv preprint arXiv:1812.05252 (2018)
  32. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
  33. Ray, A., Sikka, K., Divakaran, A., Lee, S., Burachas, G.: Sunny and dark outside?! Improving answer consistency in VQA through entailed question generation. arXiv preprint arXiv:1909.04696 (2019)
  34. Ribeiro, M.T., Singh, S., Guestrin, C.: Semantically equivalent adversarial rules for debugging NLP models. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 856–865 (2018)
  35. Shah, M., Chen, X., Rohrbach, M., Parikh, D.: Cycle-consistency for robust visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6649–6658 (2019)
  36. Sharma, V., Vaibhav, A., Chaudhary, S., Patel, L., Morency, L.: Attend and attack: attention guided adversarial attacks on visual question answering models (2018)
  37. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6(1), 60 (2019). https://doi.org/10.1186/s40537-019-0197-0
  38. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
  39. Teney, D., Anderson, P., He, X., Van Den Hengel, A.: Tips and tricks for visual question answering: learnings from the 2017 challenge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232 (2018)
  40. Wei, J.W., Zou, K.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019)
  41. Xie, C., Tan, M., Gong, B., Wang, J., Yuille, A., Le, Q.V.: Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665 (2019)
  42. Xie, Z., et al.: Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573 (2017)
  43. Xu, X., Chen, X., Liu, C., Rohrbach, A., Darrell, T., Song, D.: Fooling vision and language models despite localization and attention mechanism. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4951–4961 (2018)
  44. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29 (2016)
  45. Yu, A.W., et al.: QANet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018)
  46. Zhang, Y., Hare, J., Prügel-Bennett, A.: Learning to count objects in natural images for visual question answering. arXiv preprint arXiv:1802.05766 (2018)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
  2. University of Adelaide, Adelaide, Australia
