Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

International Journal of Computer Vision

Abstract

Humans tend to decompose a sentence into parts, such as "something does something at some place," and then fill each part with specific content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., the question) is fully observable, collocating visual-linguistic modules for captioning is more challenging because the language is only partially observable, so the modules must be collocated dynamically as the caption is generated. To sum up, we make the following technical contributions to design and train our CVLNM: (1) a distinguishable module design—four modules in the encoder, including one linguistic module for function words and three visual modules for different content words (i.e., nouns, adjectives, and verbs), plus another linguistic module in the decoder for commonsense reasoning; (2) a self-attention based module controller for robustifying the visual reasoning; and (3) a part-of-speech based syntax loss imposed on the module controller to further regularize the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, e.g., achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, e.g., being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Code is available at https://github.com/GCYZSL/CVLMN.
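
To make the abstract's description more concrete, the snippet below is a minimal PyTorch-style sketch of a soft module controller over three visual modules (content words) and one linguistic module (function words), with an optional part-of-speech supervision term standing in for the syntax loss. All names and dimensions here (e.g., `ModuleController`, `d_z`, the per-module linear heads) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModuleController(nn.Module):
    """Toy soft controller over four modules: three visual modules for
    content words (noun/adjective/verb) and one linguistic module for
    function words. Illustrative only."""

    def __init__(self, d_img: int, d_lang: int, d_z: int, n_modules: int = 4):
        super().__init__()
        # One small head per module; the first three read image features,
        # the last one reads the language context.
        self.heads = nn.ModuleList(
            [nn.Linear(d_img if i < 3 else d_lang, d_z) for i in range(n_modules)]
        )
        # The controller scores the modules from the partially generated caption.
        self.scorer = nn.Linear(d_lang, n_modules)

    def forward(self, img_feat, lang_feat, pos_label=None):
        outs = torch.stack(
            [head(img_feat if i < 3 else lang_feat)
             for i, head in enumerate(self.heads)],
            dim=1,
        )                                                   # (B, 4, d_z)
        logits = self.scorer(lang_feat)                     # (B, 4)
        weights = F.softmax(logits, dim=-1)                 # soft module collocation
        fused = (weights.unsqueeze(-1) * outs).sum(dim=1)   # (B, d_z), soft fusion

        # Optional part-of-speech supervision on the module weights
        # (a simplified stand-in for the paper's syntax loss).
        syntax_loss = F.cross_entropy(logits, pos_label) if pos_label is not None else None
        return fused, weights, syntax_loss

# Usage with random features: batch of 2, assumed feature sizes.
ctrl = ModuleController(d_img=2048, d_lang=512, d_z=256)
fused, w, loss = ctrl(torch.randn(2, 2048), torch.randn(2, 512),
                      pos_label=torch.tensor([0, 3]))
```

Because the weights are a softmax rather than a hard selection, the fusion stays differentiable and can be trained end-to-end while still being supervised toward the part-of-speech of the next word.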

Notes

  1. “Robustness” means that the soft fusion strategy helps generate more “accurate” captions.

  2. Learnable label embeddings are 4-dimensional one-hot vectors multiplied by a learnable \(4\times d_z\) matrix (a short sketch follows these notes).

  3. Intuitively, if a module were chosen uniformly at random at each time step, the probability of selecting the correct one of the 4 modules would be \(1/4=25\%\).
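
As referenced in note 2, the one-hot-times-matrix formulation is equivalent to an embedding lookup. The sketch below illustrates this; the embedding size `d_z` and all variable names are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

d_z = 8                                            # illustrative embedding size
label_matrix = nn.Parameter(torch.randn(4, d_z))   # learnable 4 x d_z matrix

one_hot = torch.eye(4)[2]                          # one-hot vector for label index 2
emb_by_matmul = one_hot @ label_matrix             # shape (d_z,)

# Equivalent lookup formulation, which is how it is usually implemented:
embedding = nn.Embedding(4, d_z)
embedding.weight.data.copy_(label_matrix.data)
emb_by_lookup = embedding(torch.tensor(2))

assert torch.allclose(emb_by_matmul, emb_by_lookup)
```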

Author information

Corresponding author

Correspondence to Hanwang Zhang.

Additional information

Communicated by Alexander Schwing.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, X., Zhang, H., Gao, C. et al. Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning. Int J Comput Vis 131, 82–100 (2023). https://doi.org/10.1007/s11263-022-01692-8
