Combining Multiple Cues for Visual Madlibs Question Answering

Published in: International Journal of Computer Vision

Abstract

This paper presents an approach for answering fill-in-the-blank multiple-choice questions from the Visual Madlibs dataset. Instead of the generic representations commonly obtained from networks trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with the candidate answers, into a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn how to combine the scores from nCCA models trained on multiple cues and select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions of a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
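
To make the pipeline above concrete, here is a minimal sketch, not the authors' released implementation: candidate answers and an image feature are projected into a shared embedding space by matrices standing in for learned nCCA projections, candidates are scored by cosine similarity, and per-cue scores are fused with nonnegative weights summing to one. All matrices, features, dimensions, and weights below are illustrative placeholders; the simplex projection follows the standard algorithm of Duchi et al. (2008), which appears in the reference list and plausibly constrains the learned cue weights.

```python
import numpy as np

def ncca_score(img_feat, ans_feats, W_img, W_txt):
    """Score candidate answers against an image in a joint embedding space.

    W_img and W_txt stand in for projections learned with normalized CCA
    (nCCA); here they are random placeholders, not trained models.
    """
    u = W_img @ img_feat                           # image -> joint space
    V = ans_feats @ W_txt.T                        # candidates -> joint space
    u /= np.linalg.norm(u)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return V @ u                                   # cosine similarity per candidate

def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al. 2008)."""
    u = np.sort(v)[::-1]                           # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def combine_cues(cue_scores, weights):
    """Fuse per-cue score vectors with nonnegative weights that sum to one."""
    w = project_simplex(np.asarray(weights, dtype=float))
    return sum(wi * s for wi, s in zip(w, cue_scores))

# Toy usage with random stand-ins for features and projections.
rng = np.random.default_rng(0)
d_img, d_txt, d_emb, n_cand = 512, 300, 128, 4
W_img = rng.standard_normal((d_emb, d_img))
W_txt = rng.standard_normal((d_emb, d_txt))
img = rng.standard_normal(d_img)
answers = rng.standard_normal((n_cand, d_txt))

scores_scene  = ncca_score(img, answers, W_img, W_txt)   # e.g., scene-CNN cue
scores_person = ncca_score(img, answers, W_img, W_txt)   # e.g., person-region cue
combined = combine_cues([scores_scene, scores_person], weights=[0.6, 0.4])
best = int(np.argmax(combined))                          # index of selected answer
```

In the paper the cue-combination weights are learned by solving an optimization problem; here they are fixed by hand purely for illustration.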

Notes

  1. Note that the images of the Visual Madlibs dataset are sampled from the MSCOCO dataset (Lin et al. 2014) such that each image contains at least one person.

  2. The Madlibs training set contains only the correct image descriptions, not the incorrect distractor choices.

References

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016a). Deep compositional question answering with neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. In IEEE International Conference on Computer Vision (ICCV).

  • Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. In International Semantic Web Conference, Asian Semantic Web Conference (ISWC + ASWC).

  • Bourdev, L., Maji, S., & Malik, J. (2011). Describing people: Poselet-based attribute classification. In IEEE International Conference on Computer Vision (ICCV).

  • Chao, Y. W., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A benchmark for recognizing human-object interactions in images. In IEEE International Conference on Computer Vision (ICCV).

  • Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML).

  • Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

  • Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? dataset and methods for multilingual image question answering. In Neural Information Processing Systems (NIPS).

  • Geman, D., Geman, S., Hallonquist, N., & Younes, L. (2015). Visual Turing test for computer vision systems. PNAS, 112(12), 3618–3623.

  • Girshick, R. (2015). Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).

  • Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106(2), 210–233.

  • Hardoon, D., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778.

  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

  • Ilievski, I., Yan, S., & Feng, J. (2016). A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485.

  • Lassila, O., & Swick, R. R. (1999). Resource Description Framework (RDF) Model and Syntax Specification. Tech. rep., W3C, http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.

  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV).

  • Liu, H., & Singh, P. (2004). ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4), 211–226.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2015). SSD: Single shot multibox detector. arXiv preprint arXiv:1512.02325.

  • Lyu, S. (2005). Mercer kernels for object recognition with local features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In Neural Information Processing Systems (NIPS).

  • Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In Neural Information Processing Systems (NIPS).

  • Mallya, A., & Lazebnik, S. (2016). Learning models for actions and person-object interactions with transfer to question answering. In European Conference on Computer Vision (ECCV).

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS).

  • Mokarian, A., Malinowski, M., & Fritz, M. (2016). Mean box pooling: A rich image representation and output embedding for the Visual Madlibs task. In British Machine Vision Conference (BMVC).

  • Pishchulin, L., Andriluka, M., & Schiele, B. (2014). Fine-grained activity recognition with holistic and pose based features. In German Conference on Pattern Recognition (GCPR).

  • Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2017). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 123(1), 74–93.

  • Ren, M., Kiros, R., & Zemel, R. (2015a). Exploring models and data for image question answering. In Neural Information Processing Systems (NIPS).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015b). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. IJCV, 115(3), 211–252.

  • Saito, K., Shin, A., Ushiku, Y., & Harada, T. (2017). DualNet: Domain-invariant network for visual question answering. In IEEE International Conference on Multimedia and Expo (ICME), pp 829–834.

  • Shih, K. J., Singh, S., & Hoiem, D. (2016). Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector grammars. In Annual Meeting of the Association for Computational Linguistics (ACL).

  • Sudowe, P., Spitzer, H., & Leibe, B. (2015). Person attribute recognition with a jointly-trained holistic CNN model. In ICCV'15 ChaLearn Looking at People Workshop.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Tandon, N., de Melo, G., Suchanek, F., & Weikum, G. (2014). Webchild: Harvesting and organizing commonsense knowledge from the web. In ACM International Conference on Web Search and Data Mining.

  • Tommasi, T., Mallya, A., Plummer, B., Lazebnik, S., Berg, A. C., & Berg, T. L. (2016). Solving Visual Madlibs with multiple cues. In British Machine Vision Conference (BMVC).

  • Wang, P., Wu, Q., Shen, C., Dick, A., & van den Hengel, A. (2017a). FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

  • Wang, P., Wu, Q., Shen, C., & van den Hengel, A. (2017b). The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Wu, Q., Shen, C., van den Hengel, A., Wang, P., & Dick, A. (2016a). Image captioning and visual question answering based on attributes and their related external knowledge. arXiv preprint arXiv:1603.02814.

  • Wu, Q., Wang, P., Shen, C., Dick, A. R., & van den Hengel, A. (2016b). Ask me anything: Free-form visual question answering based on knowledge from external sources. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4622–4630.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Xu, H., & Saenko, K. (2015). Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234.

  • Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. J. (2016). Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yu, D., Fu, J., Mei, T., & Rui, Y. (2017). Multi-level attention networks for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual Madlibs: Fill in the blank image generation and question answering. In IEEE International Conference on Computer Vision (ICCV).

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Neural Information Processing Systems (NIPS).

  • Zhu, Y., Zhang, C., Ré, C., & Fei-Fei, L. (2015). Building a large-scale multimodal knowledge base for visual question answering. arXiv preprint arXiv:1507.05670.

  • Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7W: Grounded question answering in images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants 1302438, 1563727, 1405822, 1444234, 1562098, 1633295, and 1452851, by Xerox UAC, by a Microsoft Research Faculty Fellowship, and by a Sloan Foundation Fellowship. T.T. was partially supported by ERC Grant 637076 (RoboExNovo).

Author information

Corresponding author

Correspondence to Tatiana Tommasi.

Additional information

Communicated by Xiaoou Tang.

About this article

Cite this article

Tommasi, T., Mallya, A., Plummer, B. et al. Combining Multiple Cues for Visual Madlibs Question Answering. Int J Comput Vis 127, 38–60 (2019). https://doi.org/10.1007/s11263-018-1096-0
