Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

  • Published in: International Journal of Computer Vision

Abstract

Given a set of images and their corresponding paragraph captions, a challenging task is to learn how to produce a semantically coherent paragraph that describes the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between the image and text at multiple levels of abstraction and to learn the semantic topics from images, we design a variational inference network that builds the mapping from image features to textual captions. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory and the Transformer, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches in terms of standard evaluation metrics, can be used both to distill interpretable multi-layer semantic topics and to generate diverse and coherent captions.


References

  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6077–6086).

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

  • Burkhardt, S., & Kramer, S. (2019). Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. Journal of Machine Learning Research, 20(131), 1–27.

  • Chatterjee, M., & Schwing, A. G. (2018). Diverse and coherent paragraph generation from images. In Proceedings of the European conference on computer vision (ECCV) (pp. 729–744).

  • Chen, F., Xie, S., Li, X., Li, S., Tang, J., & Wang, T. (2019). What topics do images say: A neural image captioning model with topic representation. In 2019 IEEE international conference on multimedia & expo workshops (ICMEW) (pp. 447–452), IEEE.

  • Chen, Z., Song, Y., Chang, T. H., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 1439–1449).

  • Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1724–1734).

  • Cong, Y., Chen, B., Liu, H., & Zhou, M. (2017). Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In International conference on machine learning (pp. 864–873), PMLR.

  • Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10578–10587).

  • Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., Thoma, G. R., & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2), 304–310.

  • Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation (pp. 376–380).

  • Fan, H., Zhu, L., Yang, Y., & Wu, F. (2020). Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(3), 1–16.

  • Fu, K., Jin, J., Cui, R., Sha, F., & Zhang, C. (2017). Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2321–2334.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).

  • Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649), IEEE.

  • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.

  • Guo, D., Chen, B., Lu, R., & Zhou, M. (2020). Recurrent hierarchical topic-guided RNN for language generation. In International conference on machine learning (pp. 3810–3821), PMLR.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Huang, L., Wang, W., Chen, J., & Wei, X. Y. (2019). Attention on attention for image captioning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4634–4643).

  • Johnson, A. E., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., Peng, Y., Lu, Z., Mark, R. G., Berkowitz, S. J., & Horng, S. (2019). MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.

  • Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128–3137).

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

  • Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In 2nd International conference on learning representations.

  • Krause, J., Johnson, J., Krishna, R., & Fei-Fei, L. (2017). A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 317–325).

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L. J., Shamma, D. A., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.

  • Li, G., Zhu, L., Liu, P., & Yang, Y. (2019). Entangled transformer for image captioning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8928–8937).

  • Liang, X., Hu, Z., Zhang, H., Gan, C., & Xing, E. P. (2017). Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE international conference on computer vision (pp. 3362–3371).

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations (pp. 55–60).

  • Mao, Y., Zhou, C., Wang, X., & Li, R. (2018). Show and tell more: Topic-oriented multi-sentence image captioning. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (pp. 4258–4264).

  • Melas-Kyriazi, L., Rush, A. M., & Han, G. (2018). Training for diversity in image paragraph captioning. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 757–761).

  • Miao, Y., Yu, L., & Blunsom, P. (2016). Neural variational inference for text processing. In International conference on machine learning (pp. 1727–1736), PMLR.

  • Ordonez, V., Han, X., Kuznetsova, P., Kulkarni, G., Mitchell, M., Yamaguchi, K., Stratos, K., Goyal, A., Dodge, J., Mensch, A., et al. (2016). Large scale retrieval and generation of image descriptions. International Journal of Computer Vision, 119(1), 46–59.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 311–318).

  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Blog

  • Ren, S., He, K., Girshick, R. B., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).

  • Shi, Y., Liu, Y., Feng, F., Li, R., Ma, Z., & Wang, X. (2021). S2TD: A tree-structured decoder for image paragraph captioning. In ACM Multimedia Asia (pp. 1–7).

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In 3rd international conference on learning representations.

  • Srivastava, A., & Sutton, C. (2017). Autoencoding variational inference for topic models. In International conference on learning representations

  • Tang, J., Wang, J., Li, Z., Fu, J., & Mei, T. (2019). Show, reward, and tell: Adversarial visual story generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(2), 1–20.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566–4575).

  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156–3164).

  • Wang, J., Pan, Y., Yao, T., Tang, J., & Mei, T. (2019). Convolutional auto-encoding of sentence topics for image paragraph generation. In Proceedings of the twenty-eighth international joint conference on artificial intelligence (pp. 940–946).

  • Wang, J., Tang, J., Yang, M., Bai, X., & Luo, J. (2021). Improving OCR-based image captioning by incorporating geometrical relationship. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1306–1315).

  • Xu, C., Li, Y., Li, C., Ao, X., Yang, M., & Tian, J. (2020). Interactive key-value memory-augmented attention for image paragraph captioning. In Proceedings of the 28th international conference on computational linguistics (pp. 3132–3142).

  • Xu, C., Yang, M., Ao, X., Shen, Y., Xu, R., & Tian, J. (2021). Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning. Knowledge-Based Systems, 214, 106730.

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057), PMLR.

  • Xu, Z., Yang, Y., & Hauptmann, A. G. (2015b). A discriminative CNN video representation for event detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1798–1807).

  • Yu, N., Hu, X., Song, B., Yang, J., & Zhang, J. (2018). Topic-oriented image captioning based on order-embedding. IEEE Transactions on Image Processing, 28(6), 2743–2754.

  • Zhang, H., Chen, B., Guo, D., & Zhou, M. (2018). WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In International conference on learning representations

  • Zhang, H., Chen, B., Tian, L., Wang, Z., & Zhou, M. (2020). Variational hetero-encoder randomized generative adversarial networks for joint image-text modeling. In International conference on learning representations

  • Zhao, H., Phung, D., Huynh, V., Jin, Y., Du, L., & Buntine, W. (2021). Topic modelling meets deep neural networks: A survey. In The 30th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 4713–4720).

  • Zhou, M., Hannah, L., Dunson, D., & Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In Artificial intelligence and statistics (pp. 1462–1471), PMLR.

  • Zhou, M., Cong, Y., & Chen, B. (2016). Augmentable gamma belief networks. Journal of Machine Learning Research, 17(163), 1–44.

  • Zhu, L., Fan, H., Luo, Y., Xu, M., & Yang, Y. (2022). Temporal cross-layer correlation mining for action recognition. IEEE Transactions on Multimedia, 24, 668–676. https://doi.org/10.1109/TMM.2021.3057503

  • Zhu, Z., Xue, Z., & Yuan, Z. (2018). Topic-guided attention for image captioning. In 2018 25th IEEE international conference on image processing (ICIP) (pp. 2615–2619), IEEE.

Acknowledgements

B. Chen acknowledges the support of NSFC (U21B2006 and 61771361), Shaanxi Youth Innovation Team Project, the 111 Project (No. B18039) and the Program for Oversea Talent by Chinese Central Government. M. Zhou acknowledges the support of NSF IIS-1812699.

Author information

Corresponding author

Correspondence to Bo Chen.

Additional information

Communicated by Kwan-Yee Kenneth Wong.


A Appendix

1.1 A.1 The Variational Topic Encoder of VTCM

Inspired by Zhang et al. (2018), to approximate the gamma-distributed topic weight vector \(\varvec{\theta } ^{l}\) with a Weibull distribution, we define the topic encoder as \(q(\varvec{\theta } ^{l}|{\overline{{\varvec{v}} }}) = \text{ Weibull }({\varvec{k}} ^{l},\varvec{\lambda } ^{l})\), where the parameters \({\varvec{k}} ^{l}\) and \(\varvec{\lambda } ^{l}\) of \(\varvec{\theta } ^{l}\) are computed as

$$\begin{aligned}&{\varvec{k}} ^{l} = \ln [1+\exp (\mathbf{W} _{hk}^{l}{\varvec{h}} ^{l}+ {\varvec{b}} _1^{l})] , \end{aligned}$$
(15)
$$\begin{aligned}&\varvec{\lambda } ^{l} = \ln [1+\exp (\mathbf{W} _{h \lambda }^{l}{\varvec{h}} ^{l}+ {\varvec{b}} _2^{l})] , \end{aligned}$$
(16)

where \({\varvec{h}} ^{l}\) is obtained from the pooled image representation through deterministic nonlinear transformations, with \({\varvec{h}} ^{0}={\overline{{\varvec{v}} }}\) and \( {\varvec{h}} ^{l}\!=\!\text{ tanh } \left( \mathbf{W} _v^{l}{\varvec{h}} ^{l-1}+{\varvec{b}} _v^{l}\right) \).
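For concreteness, the following PyTorch-style sketch shows one possible implementation of this Weibull topic encoder, also folding in the reparameterized Weibull sampling of Eq. (22); the module name, the layer widths, and the choice of tying the size of \({\varvec{h}} ^{l}\) to the topic number \(K_l\) are our own assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeibullTopicEncoder(nn.Module):
    """Sketch of q(theta^l | v_bar) = Weibull(k^l, lambda^l), Eqs. (15)-(16)."""

    def __init__(self, feat_dim, topic_dims):      # topic_dims = [K_1, ..., K_L] (assumed)
        super().__init__()
        dims = [feat_dim] + list(topic_dims)
        self.h_fc = nn.ModuleList(nn.Linear(dims[l], dims[l + 1])
                                  for l in range(len(topic_dims)))
        self.k_fc = nn.ModuleList(nn.Linear(dims[l + 1], dims[l + 1])
                                  for l in range(len(topic_dims)))
        self.lam_fc = nn.ModuleList(nn.Linear(dims[l + 1], dims[l + 1])
                                    for l in range(len(topic_dims)))

    def forward(self, v_bar):
        h, thetas = v_bar, []
        for f_h, f_k, f_lam in zip(self.h_fc, self.k_fc, self.lam_fc):
            h = torch.tanh(f_h(h))                  # h^l = tanh(W_v^l h^{l-1} + b_v^l)
            k = F.softplus(f_k(h)) + 1e-6           # k^l = ln(1 + exp(W_hk^l h^l + b_1^l)), Eq. (15)
            lam = F.softplus(f_lam(h)) + 1e-6       # lambda^l, Eq. (16)
            eps = torch.rand_like(k).clamp(1e-6, 1 - 1e-6)          # uniform noise
            theta = lam * (-torch.log(1.0 - eps)) ** (1.0 / k)      # Weibull sample, cf. Eq. (22)
            thetas.append(theta)
        return thetas                               # [theta^1, ..., theta^L]
```

The softplus keeps both Weibull parameters positive, and the uniform-noise reparameterization keeps the sampled \(\varvec{\theta } ^{1:L}\) differentiable with respect to the encoder parameters.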

1.2 A.2 The Gating Unit in VTCM-LSTM

Note that the input \({\varvec{u}} _{j,t}^{l}\) of the sentence-level LSTM at layer l combines the topic weight vector \(\varvec{\theta } ^{l}\) with the hidden output \({\varvec{h}} _{j,t}^{s,l}\) of the sentence-level LSTM at each time step t. To realize \({\varvec{u}} _{j,t}^{l}=g\left( {\varvec{h}} _{j,t}^{s,l}, \varvec{\theta } ^{l}\right) \), we adopt a gating unit similar to the gated recurrent unit (GRU) (Cho et al. 2014), defined as

$$\begin{aligned} {\varvec{u}} _{j,t}^{l}&=\left( 1-{\varvec{z}} _{j,t}^{l}\right) \odot {\varvec{h}} _{j,t}^{s,l} + {\varvec{z}} _{j,t}^{l} \odot {\hat{{\varvec{h}} }}_{j,t}^{s,l}\,\, . \end{aligned}$$
(17)

where

$$\begin{aligned} {\varvec{z}} _{j,t}^{l}&= \sigma \left( \mathbf{W} _{z}^{l} \varvec{\theta } ^{l} + \mathbf{U} _{z}^{l} {\varvec{h}} _{j,t}^{s,l} + {\varvec{b}} _{z}^{l} \right) , \nonumber \\ {\varvec{r}} _{j,t}^{l}&= \sigma \left( \mathbf{W} _{r}^{l} \varvec{\theta } ^{l} + \mathbf{U} _{r}^{l} {\varvec{h}} _{j,t}^{s,l} + {\varvec{b}} _{r}^{l} \right) , \nonumber \\ {\hat{{\varvec{h}} }}_{j,t}^{s,l}&=\tanh \left( \mathbf{W} _{h}^{l} \varvec{\theta } ^{l} + \mathbf{U} _{h}^{l}\left( {\varvec{r}} _{j,t}^{l} \odot {\varvec{h}} _{j, t}^{s,l}\right) +{\varvec{b}} _{h}^{l}\right) . \end{aligned}$$
(18)

Defining \({\varvec{u}} _{j,t}^{1:L}\) as the concatenation of \({\varvec{u}} _{j,t}^{l}\) across all layers and \(\mathbf{W} _o\) as a weight matrix with V rows, the conditional probability \(p\left( w_{j,t} \,|\,w_{j,<t}, Img \right) \) of \(w_{j,t}\) becomes

$$\begin{aligned} p\left( w_{j,t} \,|\,w_{j,<t}, {\varvec{v}} _{1:M},\varvec{\theta } ^{1:L}\right) = {\text {softmax}}\left( \mathbf{W} _o {\varvec{u}} _{j,t}^{1:L}\right) . \end{aligned}$$
(19)

There are two advantages to combining \({\varvec{u}} _{j,t}^{l}\) across all layers for language generation. First, the combination enhances representation power, since different stochastic layers of the deep topic model exhibit different statistical properties. Second, owing to the “skip connections” from all hidden layers to the output, the number of processing steps between the bottom of the network and the top is reduced, mitigating the “vanishing gradient” problem (Graves et al. 2013).
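As a concrete reference, here is a hedged PyTorch sketch of this gating unit; the class name and the decision to fold the bias terms of Eq. (18) into the linear layers are our own choices, not the authors' code.

```python
import torch
import torch.nn as nn

class TopicGate(nn.Module):
    """Sketch of u_{j,t}^l = g(h_{j,t}^{s,l}, theta^l), Eqs. (17)-(18)."""

    def __init__(self, topic_dim, hidden_dim):
        super().__init__()
        self.W_z, self.U_z = nn.Linear(topic_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)
        self.W_r, self.U_r = nn.Linear(topic_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)
        self.W_h, self.U_h = nn.Linear(topic_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h, theta):
        z = torch.sigmoid(self.W_z(theta) + self.U_z(h))        # update gate
        r = torch.sigmoid(self.W_r(theta) + self.U_r(h))        # reset gate
        h_hat = torch.tanh(self.W_h(theta) + self.U_h(r * h))   # candidate state
        return (1 - z) * h + z * h_hat                          # Eq. (17)
```

The per-layer outputs \({\varvec{u}} _{j,t}^{1:L}\) would then be concatenated and passed through the \(\mathbf{W} _o\) projection and softmax of Eq. (19) to predict the next word.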

1.3 A.3 Likelihood and Inference of VTCM-Transformer

Given an image \(Img \), we can also represent the paragraph as \({\varvec{Y}}=\{y_1,...,y_I\}\), which is suitable for a flat language model such as a Transformer-based model. Under the deep topic model (VTCM) and the Transformer-based LM, the joint likelihood of the target ground-truth paragraph \({\varvec{Y}}\) of \(Img \) and its corresponding BoW count vector \({\varvec{d}} \) is defined as

$$\begin{aligned} p\left( {\varvec{Y}},{\varvec{d}} \,|\,{\varvec{v}} _{1:M},\varvec{\varPhi }^{1:L}\right) =\int \left[ \prod _{i=1}^{I} p\left( y_{i}\,|\,y_{<i},{\varvec{v}} _{1:M},\varvec{\theta } ^{1:L}\right) \right] p\left( {\varvec{d}} \,|\,\varvec{\varPhi }^{1}\varvec{\theta } ^{1}\right) \left[ \prod _{l=1}^{L-1}p\left( \varvec{\theta } ^{l}\,|\,\varvec{\varPhi }^{l+1}\varvec{\theta } ^{l+1}\right) \right] p\left( \varvec{\theta } ^{L}\right) d\varvec{\theta } ^{1:L}. \end{aligned}$$
(20)

Since we introduce a variational topic encoder to learn the multi-layer topic weight vectors \(\varvec{\theta } ^{1:L}\) with the image features \({\overline{{\varvec{v}} }}\) as the input, a lower bound of the log of (20) can be constructed as

$$\begin{aligned} L_\text {all}&= {\mathbb {E}}_{q(\varvec{\theta } ^{1}|{\overline{{\varvec{v}} }})}\left[ \ln p\left( {\varvec{d}} \,|\,\, \varvec{\varPhi }^{1}\varvec{\theta } ^{1} \right) \right] \nonumber \\&- \sum _{l=1}^L {\mathbb {E}}_{q(\varvec{\theta } ^{l}|{\overline{{\varvec{v}} }})} \left[ \ln \frac{q\left( \varvec{\theta } ^{l}\,|\,{\overline{{\varvec{v}} }} \right) }{p \left( \varvec{\theta } ^{l} \,|\,\varvec{\varPhi }^{l + 1}\varvec{\theta } ^{l+ 1}\right) } \right] \nonumber \\&+\sum _{l=1}^L {\mathbb {E}}_{q(\varvec{\theta } ^{l}|{\overline{{\varvec{v}} }})} \left[ \sum _{i=1}^{I} \ln p\left( {y_{i}}\,|\,y_{<i},{\varvec{v}} _{1:M},\varvec{\theta } _j^{1:L}\right) \right] , \end{aligned}$$
(21)

where the first two terms are primarily responsible for training the topic model component and the last term for training the Transformer-based LM component. The parameters \(\varvec{\Omega }_{\text {TM}}\) of the variational topic encoder and the parameters \(\varvec{\Omega }_\text {Trans}\) of the Transformer-based LM can be jointly updated by maximizing \(L_\text {all}\). Besides, the global parameters \(\varvec{\varPhi }^{1:L}\) of the topic decoder can be sampled with the TLASGR-MCMC of Cong et al. (2017), as presented below. The training strategy of VTCM-Transformer is similar to that of VTCM-LSTM.
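The sketch below illustrates how the three terms of \(L_\text {all}\) in Eq. (21) could be assembled into a single training loss, assuming a Poisson likelihood for the BoW reconstruction (as in Poisson factor analysis, Zhou et al. 2012) and an externally supplied closed-form Weibull–gamma KL term (Zhang et al. 2018); the function signature and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vtcm_transformer_loss(d, Phi1, thetas, lm_logits, targets, kl_terms):
    """Negative of the lower bound L_all in Eq. (21), sketched for one mini-batch.

    d:         (B, V_c)   bag-of-words counts of the paragraphs
    Phi1:      (V_c, K_1) first-layer topic matrix (held fixed inside this loss)
    thetas:    list of (B, K_l) sampled topic weight vectors theta^{1:L}
    lm_logits: (B, T, V)  Transformer outputs for the paragraph tokens
    targets:   (B, T)     ground-truth token ids
    kl_terms:  list of per-layer KL(q(theta^l | v_bar) || p(theta^l | .)) scalars
    """
    rate = thetas[0] @ Phi1.t()                                    # Poisson rate Phi^1 theta^1
    bow_ll = (d * torch.log(rate + 1e-10) - rate
              - torch.lgamma(d + 1.0)).sum()                       # first term of Eq. (21)
    kl = torch.stack(kl_terms).sum()                               # second term
    lm_ll = -F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             targets.reshape(-1), reduction='sum') # third term
    return -(bow_ll - kl + lm_ll)                                  # maximize L_all = minimize loss
```

Since the Weibull–gamma KL has an analytic form (Zhang et al. 2018), the whole objective remains amenable to standard stochastic gradient optimization of \(\varvec{\Omega }_{\text {TM}}\) and \(\varvec{\Omega }_\text {Trans}\).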

Table 4 Comparisons of our proposed VTCM-M-Transformer with M-Transformer on the test sets of IU X-RAY and MIMIC-CXR, where RG-L is ROUGE-L

1.4 A.4 Inference of Global Parameters \(\varvec{\varPhi }^{1:L}\) of VTCM

For scale identifiability and ease of inference and interpretation, a Dirichlet prior is placed on each column of \(\varvec{\varPhi }^{l} \in \) \({\mathbb {R}}_{+}^{K_{l-1} \times K_{l}}\), which means \(0 \le \varPhi ^{l}_{k^{\prime }, k} \le 1 \) and \(\sum _{k^{\prime }=1}^{K_{l-1}} \varPhi ^{l}_{k^{\prime }, k}= 1 \). To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in Cong et al. (2017) and Zhang et al. (2018), which can sample simplex-constrained global parameters in a mini-batch manner. It improves sampling efficiency via the Fisher information matrix (FIM), with adaptive step sizes for the topics at different layers. Here, we discuss how to update the global parameters \(\{\varvec{\varPhi }^{l}\}_{l=1}^L\) of VTCM in detail and give the complete procedure in Algorithm 1.

Sample the auxiliary counts: This step is about the “upward” pass. For the given mini-batch \(\{ Img _n, P_n, {\varvec{d}} _n \}_{n=1}^{N}\) in the training set, \({\varvec{d}} _n\) is the bag of words (BoW) count vector of paragraph \(P_n\) for input image \(Img _n\) and \(\varvec{\theta } _n^{1:L}\) denotes the latent features of the nth image. By transforming standard uniform noises \({\varvec{\epsilon } _n^{l}}\), we can sample \(\varvec{\theta } _n^{l}\) as

$$\begin{aligned} \varvec{\theta } _n^{l}&= {\varvec{\lambda } _n^{l}} \left( -\ln (1-{\varvec{\epsilon } _n^{l}})\right) ^ {1/{{\varvec{k}} _n^{l}}}. \end{aligned}$$
(22)

Working upward for \(l = 1,...,L\), we can propagate the latent counts \(x_{vn}^{l}\) of layer l upward to layer \(l + 1\) as

$$\begin{aligned}&\left( A_{v1n}^{l},\ldots ,A_{vK_{l}n}^{l}\right) \sim \text{ Multinomial }\left( x_{vn}^{l};\,\frac{\varPhi ^{l}_{v1}\theta _{1n}^{l}}{\sum _{k=1}^{K_{l}}\varPhi ^{l}_{vk}\theta _{kn}^{l}},\ldots ,\frac{\varPhi ^{l}_{vK_{l}}\theta _{K_{l}n}^{l}}{\sum _{k=1}^{K_{l}}\varPhi ^{l}_{vk}\theta _{kn}^{l}}\right) , \end{aligned}$$
(23)
$$\begin{aligned}&m_{kn}^{l} = \sum _{v=1}^{K_{l-1}} A_{vkn}^{l}, \end{aligned}$$
(24)
$$\begin{aligned}&x_{kn}^{(l+1)} \sim \text{ CRT }\left( m_{kn}^{l},\varvec{\varPhi }^{l+1}_{k:}\varvec{\theta } _{n}^{l+1}\right) , \end{aligned}$$
(25)

where \(x_{vn}^{1}=d_{vn}\), \({\varvec{d}} _{n}=\{d_{1n},..,d_{vn},..,d_{{V_c}n}\}\), \(V_c\) is the size of vocabulary in VTCM, and \(x_{kn}^{(l+1)}\) denotes the latent counts at layer \(l+1\).
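If Eqs. (23)–(25) follow the standard augment-and-conquer construction of gamma belief networks (Zhou et al. 2016; Cong et al. 2017), which we assume here, the upward pass for one image can be sketched in NumPy as follows, with CRT denoting the Chinese restaurant table distribution; the function names and shapes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def crt(m, r):
    """Draw l ~ CRT(m, r): the number of occupied tables after seating m customers."""
    return int(sum(rng.random() < r / (r + i) for i in range(int(m))))

def upward_pass(x_l, Phi_l, theta_l, prior_next):
    """One layer of the assumed upward pass: split the layer-l counts over topics
    (cf. Eqs. (23)-(24)), then draw the layer-(l+1) latent counts (cf. Eq. (25)).

    x_l:        (K_{l-1},)       latent counts of layer l (x^1 = d for the bottom layer)
    Phi_l:      (K_{l-1}, K_l)   topic matrix of layer l
    theta_l:    (K_l,)           topic weights of layer l
    prior_next: (K_l,)           the shape parameters Phi^{l+1} theta^{l+1}
    """
    A = np.zeros(Phi_l.shape, dtype=int)
    for v in range(Phi_l.shape[0]):
        p = Phi_l[v] * theta_l
        A[v] = rng.multinomial(int(x_l[v]), p / p.sum())     # augmented counts A^l
    m = A.sum(axis=0)                                        # per-topic totals
    x_next = np.array([crt(m[k], prior_next[k]) for k in range(len(m))])
    return A, x_next
```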

Fig. 8 Illustrations of reports from ground-truth, M-Transformer, and VTCM-M-Transformer models for two X-ray chest images

Sample the hierarchical components \(\{\varvec{\varPhi }^{l}\}_{l=1}^L\): For \(\varvec{\phi } _k^{l}\), the kth column of the loading matrix \(\varvec{\varPhi }^{l}\) of layer l, its sampling can be efficiently realized as

$$\begin{aligned} \varvec{\phi } _k^{l,(q+1)} = \left[ \varvec{\phi } _k^{l,(q)} + \frac{\varepsilon _q}{P_k^{l}}\left( \left( \rho {{{\tilde{{\varvec{A}} }}}_{:k\varvec{\cdot }}^{l}}+\eta _0^{l}\right) -\left( \rho {{\tilde{A}}_{\varvec{\cdot }k\varvec{\cdot }}^{l}}+K_{l-1}\eta _0^{l}\right) \varvec{\phi } _k^{l,(q)}\right) +{\mathcal {N}}\left( 0,\frac{2\varepsilon _q}{P_k^{l}}\text{ diag }\left( \varvec{\phi } _k^{l,(q)}\right) \right) \right] _{\angle }, \end{aligned}$$
(26)

where \(\varepsilon _q\) denotes the learning rate at the qth iteration, \(\rho \) the ratio of the dataset size to the mini-batch size, \(P_k^{l}\) is calculated using the estimated FIM, \(\tilde{A}_{{k'}k\varvec{\cdot }}^{{l}} = \sum _{{n} = 1}^{N} {A_{{k'}kn}^{ {l}}}, {{{\tilde{{\varvec{A}} }}}_{:k\varvec{\cdot }}^{l }} = \{{\tilde{A}}_{{1}k\varvec{\cdot }}^{{l}},\cdots , {\tilde{A}}_{{K'}k\varvec{\cdot }}^{{l}} \}^{T} \) and \({{\tilde{A}}_{\varvec{\cdot }k\varvec{\cdot }}^{l}}= \sum _{{k'} = 1}^{K'} {\tilde{A}}_{{k'}k\varvec{\cdot }}^{{l}} \), \({A_{{k'}kn}^{ {l}}}\) comes from the augmented latent counts \(A^{l}\) in (23), \(\eta _{0}^{l}\) is the prior of \({\varvec{\phi } _k^{l}}\), and \([\cdot ]_\angle \) denotes a simplex constraint. More details about TLASGR-MCMC for our proposed model can be found in the Equations (18–19) of Cong et al. (2017).
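For readers who want the update in code form, the following NumPy sketch performs one TLASGR-MCMC step for a single column \(\varvec{\phi } _k^{l}\); the variable names and the simple clip-and-renormalize stand-in for the simplex constraint \([\cdot ]_\angle \) are our assumptions, so consult Cong et al. (2017) for the precise scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def tlasgr_step(phi_k, A_col, A_sum, eps_q, rho, P_k, eta0):
    """One stochastic-gradient MCMC update of phi_k under a Dirichlet prior,
    sketched after Eq. (26).

    phi_k: (K_{l-1},) current topic column (on the simplex)
    A_col: (K_{l-1},) aggregated mini-batch counts A_tilde_{:k.}
    A_sum: scalar     total counts A_tilde_{.k.}
    """
    K_prev = phi_k.shape[0]
    drift = (rho * A_col + eta0) - (rho * A_sum + K_prev * eta0) * phi_k
    noise = rng.normal(0.0, np.sqrt(2.0 * eps_q / P_k * phi_k))   # preconditioned noise
    phi_new = phi_k + (eps_q / P_k) * drift + noise
    phi_new = np.clip(phi_new, 1e-10, None)    # crude stand-in for [.]_angle
    return phi_new / phi_new.sum()             # renormalize onto the simplex
```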

1.5 A.5 Additional Experimental Results

To validate the generalizability of our proposed model, we also conduct experiments on generating radiology reports for chest X-ray images, an important task in applying artificial intelligence to the medical domain. We consider the memory-driven Transformer (M-Transformer) designed for the radiology report generation task (Chen et al. 2020) as our baseline, which introduces a relational memory (RM) to record the information from previous generation processes and a memory-driven conditional layer normalization (MCLN) to incorporate the memory into the Transformer. For a fair comparison, we adopt the same implementation for our model; see Chen et al. (2020) for more details. Our experiments are performed on two prevailing radiology report datasets: IU X-RAY (Demner-Fushman et al. 2016), collected by Indiana University, consists of 7,471 chest X-ray images and 3,955 reports; MIMIC-CXR (Johnson et al. 2019) includes 473,057 chest X-ray images and 206,563 reports from 63,478 patients. Following Chen et al. (2020), we exclude the samples without reports.

Note that we can flexibly select the language model for our plug-and-play system, since we focus on assimilating the multi-layer semantic topic weight vectors into the paragraph generator. Here we adopt the same Transformer encoder as the M-Transformer and introduce the three-layer semantic topics into its Transformer decoder, where we only add the concatenated topic proportion \(\varvec{\theta } ^{1:L}\) to the embedding vector of the input token \(y_{t-1}\), which is then embedded to calculate the keys \( {\varvec{K}}\) and values \( {\varvec{V}}\) of the decoder; see the sketch after this paragraph. As summarized in Table 4, our model (VTCM-M-Transformer) outperforms the M-Transformer on METEOR, ROUGE-L, BLEU-1, BLEU-2, and BLEU-3, and is competitive on BLEU-4. This indicates that the multi-layer semantic topics learned by VTCM can enhance radiology report generation without purposely designing a complex language model, demonstrating the generalizability and flexibility of our model.

To qualitatively show the effectiveness of our proposed method, we show in Fig. 8 the reports generated by the different methods for two randomly sampled chest X-ray images, together with the ground-truth reports. Compared with the M-Transformer, our VTCM-M-Transformer produces more detailed and coherent reports. For the first image, our VTCM-M-Transformer describes the “heart” and “mediastinal silhouette” in a natural way, while the M-Transformer ignores the “heart”. For the second image, the report generated by our model is closer to the ground-truth report, noting that the “lungs are hyperexpanded”. These observations show that using the hierarchical semantic topics can enhance radiology report generation.
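The snippet below gives a rough picture of the topic injection described above: the concatenated \(\varvec{\theta } ^{1:L}\) is added to the embedding of the previous token before the decoder computes its keys and values. The projection layer, class name, and dimensions are our illustrative assumptions, not the M-Transformer implementation.

```python
import torch
import torch.nn as nn

class TopicAugmentedEmbedding(nn.Module):
    """Sketch: fuse concatenated topic proportions theta^{1:L} with the embedding
    of the input token y_{t-1} before the Transformer decoder layers."""

    def __init__(self, vocab_size, d_model, topic_dim):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.topic_proj = nn.Linear(topic_dim, d_model)      # assumed projection to d_model

    def forward(self, prev_tokens, theta_concat):
        # prev_tokens: (B, T) token ids; theta_concat: (B, sum_l K_l)
        tok = self.tok_embed(prev_tokens)                    # (B, T, d_model)
        topic = self.topic_proj(theta_concat).unsqueeze(1)   # broadcast over time steps
        return tok + topic   # the decoder then derives its keys K and values V from this
```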


Cite this article

Guo, D., Lu, R., Chen, B. et al. Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning. Int J Comput Vis 130, 1920–1937 (2022). https://doi.org/10.1007/s11263-022-01624-6
