Towards Captioning an Image Collection from a Combined Scene Graph Representation Approach

Phueaksri, Itthisak; Kastner, Marc A.; Kawanishi, Yasutomo; Komamizu, Takahiro; Ide, Ichiro

doi:10.1007/978-3-031-27077-2_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13833)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Most content summarization models from the field of natural language processing summarize the textual contents of a collection of documents or paragraphs. In contrast, summarizing the visual contents of a collection of images has received far less attention. In this paper, we present a framework for summarizing the visual contents of an image collection. The key idea is to collect the scene graphs of all images in the collection, create a combined representation, and then generate a visually summarizing caption using a scene-graph captioning model. Note that the aim is to summarize the contents common to all images in a single caption rather than to describe each image individually. After aggregating all the scene graphs of an image collection into a single scene graph, we normalize it using an additional concept generalization component. This component selects the common concept in each sub-graph with ConceptNet based on word embedding techniques. Lastly, we refine the captioning results by replacing a specific noun phrase with a common concept from the concept generalization component. We construct a dataset for this task based on the MS-COCO dataset using techniques from image classification and image-caption retrieval. An evaluation of the proposed method on this dataset shows promising performance.
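
To make the aggregation and generalization steps in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each scene graph is a list of (subject, predicate, object) triples, and it replaces the ConceptNet and word-embedding lookup with a small hand-written table. The names `TOY_HYPERNYMS`, `generalize`, `combine_scene_graphs`, and `common_triples`, as well as the `min_share` threshold, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Tuple

# Assumption: a scene graph is a list of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = List[Triple]

# Toy "is-a" table standing in for ConceptNet plus word embeddings.
# The entries are made up for illustration only.
TOY_HYPERNYMS = {
    "dog": "animal",
    "cat": "animal",
    "horse": "animal",
    "sofa": "furniture",
    "chair": "furniture",
}


def generalize(word: str) -> str:
    """Map a word to a broader concept if the toy table has one."""
    return TOY_HYPERNYMS.get(word, word)


def combine_scene_graphs(graphs: List[SceneGraph]) -> Counter:
    """Merge all triples into one weighted graph.

    The weight of a triple is the number of images in which it
    appears after generalization.
    """
    combined: Counter = Counter()
    for graph in graphs:
        # A set ensures a triple counts at most once per image.
        generalized = {(generalize(s), p, generalize(o)) for s, p, o in graph}
        combined.update(generalized)
    return combined


def common_triples(combined: Counter, num_images: int,
                   min_share: float = 0.5) -> List[Triple]:
    """Keep triples that occur in at least `min_share` of the images."""
    return [t for t, n in combined.most_common() if n / num_images >= min_share]


graphs = [
    [("dog", "on", "sofa"), ("dog", "near", "window")],
    [("cat", "on", "chair")],
    [("horse", "on", "grass")],
]
combined = combine_scene_graphs(graphs)
shared = common_triples(combined, num_images=len(graphs))
print(shared)
```

In this toy collection, only the generalized triple (animal, on, furniture) is shared by at least half of the images, so it is the only one a downstream captioning model would receive as the common content.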

Acknowledgements

Parts of this work were supported by JSPS Grant-in-aid for Scientific Research (21H03519) and a joint research project with National Institute of Informatics.

Author information

Authors and Affiliations

Nagoya University, Nagoya, Aichi, Japan
Itthisak Phueaksri, Yasutomo Kawanishi, Takahiro Komamizu & Ichiro Ide
Kyoto University, Kyoto, Japan
Marc A. Kastner
RIKEN, Seika, Kyoto, Japan
Yasutomo Kawanishi

Corresponding author

Correspondence to Itthisak Phueaksri.

Editor information

Editors and Affiliations

University of Bergen, Bergen, Norway
Duc-Tien Dang-Nguyen
Dublin City University, Dublin, Ireland
Cathal Gurrin
Radboud University Nijmegen, Nijmegen, The Netherlands
Martha Larson
Dublin City University, Dublin, Ireland
Alan F. Smeaton
University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
National Institute of Information and Communications Technology, Tokyo, Japan
Minh-Son Dao
Department of Information Science and Media Studies, University of Bergen, Bergen, Norway
Christoph Trattner
La Trobe University, Melbourne, VIC, Australia
Phoebe Chen

Cite this paper

Phueaksri, I., Kastner, M.A., Kawanishi, Y., Komamizu, T., Ide, I. (2023). Towards Captioning an Image Collection from a Combined Scene Graph Representation Approach. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_14

DOI: https://doi.org/10.1007/978-3-031-27077-2_14
Published: 29 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27076-5
Online ISBN: 978-3-031-27077-2
eBook Packages: Computer Science (R0)
