Abstract
In recent years, multimodal learning has gained traction owing to the availability of resource-efficient fusion techniques and robust, powerful deep learning architectures. Visual question answering (VQA) is an interdisciplinary research problem spanning natural language processing and computer vision. Previous work has optimized the VQA task primarily through refined bilinear fusion techniques. In this paper, we propose a novel question segregation framework for visual question answering in which the task is partitioned by question-type label. The main contribution of the proposed framework is a reduction in the execution time and computational resource requirements of VQA models. Six VQA models are evaluated under the proposed framework and show promising results. The framework can be extended to other VQA models and datasets, and it mitigates model bias toward question-type labels with larger sample volumes by training an individual model for each question-type label. The framework also produces more meaningful answers for a given question by restricting the answer space, thereby filtering out unrealistic answers.
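The segregation idea in the abstract can be pictured as a simple dispatch: classify the question's type, then route it to a model trained only on that type, whose candidate answers are restricted accordingly. The following is a minimal illustrative sketch, not the authors' implementation; every class, function, and answer list here is a hypothetical stand-in.

```python
# Illustrative sketch of question segregation for VQA (all names hypothetical):
# a question is routed by its predicted type label to a per-type model whose
# answer space is restricted to answers plausible for that type.

def classify_question_type(question: str) -> str:
    """Toy question-type classifier using a first-word heuristic."""
    first = question.lower().split()[0]
    if first in ("is", "are", "does", "do"):
        return "yes/no"
    if first == "how":
        return "number"
    return "other"

class TypeSpecificModel:
    """Stand-in for a VQA model trained on a single question type."""
    def __init__(self, answer_space):
        # Restricting candidates to this list is what filters out
        # unrealistic answers (e.g. "red" for a yes/no question).
        self.answer_space = answer_space

    def predict(self, image, question):
        # A real model would fuse image and question features; we just
        # return the first candidate so the sketch stays runnable.
        return self.answer_space[0]

class SegregatedVQA:
    """Dispatch each question to the model for its predicted type label."""
    def __init__(self, models):
        self.models = models

    def answer(self, image, question):
        qtype = classify_question_type(question)
        return qtype, self.models[qtype].predict(image, question)

vqa = SegregatedVQA({
    "yes/no": TypeSpecificModel(["yes", "no"]),
    "number": TypeSpecificModel(["0", "1", "2"]),
    "other":  TypeSpecificModel(["red", "dog", "table"]),
})
```

Because each type label gets its own model, no single model's parameters are dominated by the most frequent question type, which is the bias-mitigation argument made above.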
Ethics declarations
Conflict of interest
Souvik Chowdhury is a PhD student under the supervision of Dr. Badal Soni; both are at the National Institute of Technology, Silchar, India. They have received no financial benefit from any organization for this work and declare no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chowdhury, S., Soni, B. QSFVQA: A Time Efficient, Scalable and Optimized VQA Framework. Arab J Sci Eng 48, 10479–10491 (2023). https://doi.org/10.1007/s13369-023-07661-8