Abstract
Stacked attention networks (SANs) are among the classic models for visual question answering (VQA) and have substantially advanced VQA research. Existing work optimizes SANs with momentum and obtains impressive results. However, error analysis shows that the fixed global learning rate in momentum makes the optimizer prone to falling into local optima. Many learning rate adaptation (LRA) algorithms (e.g., static restart, bold driver) have been proposed to address this issue by adjusting the global learning rate, but they still have drawbacks: static restart suffers from an excessively high restart learning rate and adapts the global learning rate blindly, while bold driver avoids this blindness but sets its adaptive parameters improperly. To address these issues, we fuse bold driver and static restart (BDSR) into momentum to devise our method, called bold driver and static restart fused adaptive momentum (BDSRM). We then analyze its optimization process and time complexity and conduct quantitative experiments on VQAv1, CIFAR-10, and similar models to verify that BDSRM outperforms state-of-the-art optimization algorithms on SANs. Finally, ablation and visualization experiments confirm the effectiveness of BDSR.
Notes
The code of SANs is publicly available at https://github.com/zcyang/imageqa-san.
The code of deep residual networks is publicly available at https://github.com/Lasagne/Recipes/tree/master/papers/deep_residual_learning.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2021ZY88) and the National Natural Science Foundation of China (Nos. 62202054, 62102376 and 62171043).
About this article
Cite this article
Li, S., Luo, C., Zhu, Y. et al. Bold driver and static restart fused adaptive momentum for visual question answering. Knowl Inf Syst 65, 921–943 (2023). https://doi.org/10.1007/s10115-022-01775-5