
Bold driver and static restart fused adaptive momentum for visual question answering

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Stacked attention networks (SANs) are among the most classic models for visual question answering (VQA) and have effectively advanced VQA research. Existing work optimizes SANs with momentum and obtains impressive results. However, error analysis shows that the fixed global learning rate in momentum makes the optimizer prone to falling into local optima. Many learning rate adaptation (LRA) algorithms (e.g., static restart, bold driver) have been proposed to address this issue by adjusting the global learning rate, but they still have notable defects: static restart uses a restart learning rate that is too high and adapts the global learning rate blindly, while bold driver avoids this blindness but sets its adaptive parameters improperly. To address these issues, we fuse bold driver and static restart (BDSR) into momentum, yielding our method, bold driver and static restart fused adaptive momentum (BDSRM). We then analyze its optimization process and time complexity, and conduct quantitative experiments on VQAv1, CIFAR-10 and similar models to verify that BDSRM outperforms state-of-the-art optimization algorithms on SANs. Ablation and visualization experiments further confirm the effectiveness of BDSR.




Notes

  1. The code of SANs is publicly available at https://github.com/zcyang/imageqa-san.

  2. The code of deep residual networks is publicly available at https://github.com/Lasagne/Recipes/tree/master/papers/deep_residual_learning.


Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (No. 2021ZY88) and the National Natural Science Foundation of China (Nos. 62202054, 62102376 and 62171043).

Author information


Correspondence to Chuanwen Luo or Yuqing Zhu.



About this article


Cite this article

Li, S., Luo, C., Zhu, Y. et al. Bold driver and static restart fused adaptive momentum for visual question answering. Knowl Inf Syst 65, 921–943 (2023). https://doi.org/10.1007/s10115-022-01775-5

