VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering

  • Marc Bolaños
  • Álvaro Peris
  • Francisco Casacuberta
  • Petia Radeva
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10255)


In this paper, we address the problem of visual question answering by proposing a novel model, called VIBIKNet. Our model integrates Kernelized Convolutional Neural Networks and Long Short-Term Memory (LSTM) units to generate an answer given a question about an image. We prove that VIBIKNet is an optimal trade-off between accuracy and computational load, in terms of memory and time consumption. We validate our method on the VQA challenge dataset and compare it to the top-performing methods in order to illustrate its performance and speed.


Visual Question Answering, Convolutional Neural Networks, Long short-term memory networks



This work was partially funded by TIN2015-66951-C2-1-R, SGR 1219, CERCA Programme/Generalitat de Catalunya, CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER), PrometeoII/2014/030 and R-MIPRCV. P. Radeva is partially supported by ICREA Academia2014. We acknowledge NVIDIA Corporation for the donation of a GPU used in this work.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marc Bolaños (1, 2), Email author
  • Álvaro Peris (3)
  • Francisco Casacuberta (3)
  • Petia Radeva (1, 2)
  1. Universitat de Barcelona, Barcelona, Spain
  2. Computer Vision Center, Bellaterra, Spain
  3. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
