QSFVQA: A Time Efficient, Scalable and Optimized VQA Framework

  • Research Article: Computer Engineering and Computer Science
  • Published in: Arabian Journal for Science and Engineering

Abstract

In recent years, multimodal learning has gained acceptance owing to the availability of resource-efficient fusion techniques and robust, powerful deep learning architectures. Visual question answering (VQA) is an interdisciplinary research domain spanning natural language processing and computer vision. Previous work has sought to optimize the VQA problem primarily through optimized bilinear fusion techniques. In this paper, we propose a novel question segregation framework for visual question answering in which the VQA task is partitioned by question-type label. The main contribution of the proposed framework is a reduction in the execution time and computational resource requirements of VQA models. Six VQA models are tested under the proposed framework, with promising results. The framework can be extended to other VQA models and datasets, and the problem of model bias toward question-type labels with larger volumes of data is rectified by training an individual model for each question-type label. The framework also provides meaningful answers for a given question by restricting the answer space, thereby filtering out unrealistic answers.
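To make the segregation idea concrete, the following is a minimal Python sketch of routing questions to per-type models with restricted answer spaces, as the abstract describes. It is a sketch under stated assumptions, not the authors' implementation: the question-type classifier here is a toy rule-based heuristic, and all names (QuestionSegregator, register, question_type, answer) are illustrative placeholders.

```python
# Minimal sketch of question segregation for VQA, assuming a toy
# rule-based question-type classifier and per-type answer vocabularies.
# All names are illustrative placeholders, not the authors' code.
from typing import Any, Callable, Dict, List


class QuestionSegregator:
    """Routes each question to a dedicated per-type VQA model whose
    output is restricted to that type's answer space."""

    def __init__(self) -> None:
        self.models: Dict[str, Callable[[str, Any], List[float]]] = {}
        self.answer_spaces: Dict[str, List[str]] = {}

    def register(self, qtype: str,
                 model: Callable[[str, Any], List[float]],
                 answers: List[str]) -> None:
        # One independent model per question-type label, so no single
        # model is biased toward labels with more training examples.
        self.models[qtype] = model
        self.answer_spaces[qtype] = answers

    def question_type(self, question: str) -> str:
        # Toy heuristic; a real system would use the dataset's type
        # labels or a trained classifier.
        q = question.lower()
        if q.startswith(("is ", "are ", "does ", "do ")):
            return "yes/no"
        if q.startswith("how many"):
            return "number"
        return "other"

    def answer(self, question: str, image: Any) -> str:
        qtype = self.question_type(question)
        scores = self.models[qtype](question, image)
        space = self.answer_spaces[qtype]  # restricted answer space
        # Highest-scoring answer within the type's own vocabulary.
        return max(zip(scores, space))[1]


# Usage with dummy per-type models (placeholders for trained networks):
seg = QuestionSegregator()
seg.register("yes/no", lambda q, img: [0.8, 0.2], ["yes", "no"])
seg.register("number", lambda q, img: [0.1, 0.7, 0.2], ["0", "1", "2"])
seg.register("other", lambda q, img: [0.5, 0.5], ["red", "blue"])
print(seg.answer("How many dogs are in the picture?", None))  # -> "1"
```

Because each question type is served by its own model and vocabulary, an answer outside that type's space can never be produced, which is how restricting the answer space filters out unrealistic answers by construction.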



Author information


Corresponding author

Correspondence to Souvik Chowdhury.

Ethics declarations

Conflict of interest

Souvik Chowdhury is a PhD student under the supervision of Dr. Badal Soni; both are affiliated with the National Institute of Technology, Silchar, India, and have received no financial benefit from any organization for this work. Neither author has any conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chowdhury, S., Soni, B. QSFVQA: A Time Efficient, Scalable and Optimized VQA Framework. Arab J Sci Eng 48, 10479–10491 (2023). https://doi.org/10.1007/s13369-023-07661-8

