Abstract
Stacked attention networks (SANs) are among the classic models for visual question answering (VQA) and have substantially advanced VQA research. Existing work optimizes SANs with momentum and obtains impressive results. However, error analysis shows that the fixed global learning rate in momentum makes the optimizer prone to falling into local optima. Many learning rate adaptation (LRA) algorithms (e.g., static restart, bold driver) have been proposed to address this issue by adjusting the global learning rate, but they still have drawbacks: static restart suffers from an excessively high restart learning rate and adapts the global learning rate blindly, while bold driver avoids this blindness but sets its adaptive parameters improperly. To address these issues, we fuse bold driver and static restart (BDSR) into momentum to devise our method, called bold driver and static restart fused adaptive momentum (BDSRM). We then analyze its optimization process and time complexity and conduct quantitative experiments on VQAv1, CIFAR-10, and similar models to verify that BDSRM outperforms state-of-the-art optimization algorithms on SANs. Finally, ablation and visualization experiments confirm the effectiveness of BDSR.
Notes
The code of SANs is publicly available at https://github.com/zcyang/imageqa-san.
The code of deep residual networks is publicly available at https://github.com/Lasagne/Recipes/tree/master/papers/deep_residual_learning.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2021ZY88) and the National Natural Science Foundation of China (Nos. 62202054, 62102376 and 62171043).
About this article
Cite this article
Li, S., Luo, C., Zhu, Y. et al. Bold driver and static restart fused adaptive momentum for visual question answering. Knowl Inf Syst 65, 921–943 (2023). https://doi.org/10.1007/s10115-022-01775-5