Abstract
In recent years, multimodal learning has gained traction owing to the availability of resource-efficient fusion techniques and robust, powerful deep learning architectures. Visual question answering (VQA) is an interdisciplinary research problem spanning natural language processing and computer vision. Previous work has optimized the VQA task primarily through refined bilinear fusion techniques. In this paper, we propose a novel question segregation framework for visual question answering in which the task is partitioned by question-type label. The main contribution of the proposed framework is a reduction in the execution time and computational resource requirements of VQA models. Six VQA models are evaluated under the proposed framework and show promising results. The framework can be extended to other VQA models and datasets, and it mitigates model bias toward question-type labels with larger sample volumes by training an individual model for each question-type label. The framework also produces more meaningful answers for a given question by restricting the answer space, thereby filtering out unrealistic answers.
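The segregation idea in the abstract can be pictured as a simple dispatch: classify the question's type, then route it to a model trained only on that type, whose candidate answers are restricted accordingly. The following is a minimal illustrative sketch, not the authors' implementation; every class, function, and answer list here is a hypothetical stand-in.

```python
# Illustrative sketch of question segregation for VQA (all names hypothetical):
# a question is routed by its predicted type label to a per-type model whose
# answer space is restricted to answers plausible for that type.

def classify_question_type(question: str) -> str:
    """Toy question-type classifier using a first-word heuristic."""
    first = question.lower().split()[0]
    if first in ("is", "are", "does", "do"):
        return "yes/no"
    if first == "how":
        return "number"
    return "other"

class TypeSpecificModel:
    """Stand-in for a VQA model trained on a single question type."""
    def __init__(self, answer_space):
        # Restricting candidates to this list is what filters out
        # unrealistic answers (e.g. "red" for a yes/no question).
        self.answer_space = answer_space

    def predict(self, image, question):
        # A real model would fuse image and question features; we just
        # return the first candidate so the sketch stays runnable.
        return self.answer_space[0]

class SegregatedVQA:
    """Dispatch each question to the model for its predicted type label."""
    def __init__(self, models):
        self.models = models

    def answer(self, image, question):
        qtype = classify_question_type(question)
        return qtype, self.models[qtype].predict(image, question)

vqa = SegregatedVQA({
    "yes/no": TypeSpecificModel(["yes", "no"]),
    "number": TypeSpecificModel(["0", "1", "2"]),
    "other":  TypeSpecificModel(["red", "dog", "table"]),
})
```

Because each type label gets its own model, no single model's parameters are dominated by the most frequent question type, which is the bias-mitigation argument made above.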
Ethics declarations
Conflict of interest
Souvik Chowdhury is a PhD student under the supervision of Dr. Badal Soni; both are at the National Institute of Technology, Silchar, India. They have received no financial benefit from any organization for this work and declare no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chowdhury, S., Soni, B. QSFVQA: A Time Efficient, Scalable and Optimized VQA Framework. Arab J Sci Eng 48, 10479–10491 (2023). https://doi.org/10.1007/s13369-023-07661-8