Abstract
Most content summarization models in natural language processing summarize the textual contents of a collection of documents or paragraphs. Summarizing the visual contents of a collection of images, in contrast, has received far less attention. In this paper, we present a framework for summarizing the visual contents of an image collection. The key idea is to collect the scene graphs of all images in the collection, create a combined representation, and then generate a caption that summarizes the visual content using a scene-graph captioning model. Note that the aim is to summarize the contents common to all images in a single caption rather than to describe each image individually. After aggregating all the scene graphs of an image collection into a single scene graph, we normalize it with an additional concept generalization component. This component selects the common concept in each sub-graph using ConceptNet and word embedding techniques. Lastly, we refine the generated caption by replacing a specific noun phrase with a common concept from the concept generalization component. We construct a dataset for this task based on the MS-COCO dataset using techniques from image classification and image-caption retrieval. An evaluation of the proposed method on this dataset shows promising performance.
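The aggregation and generalization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy triples, the `GENERALIZE` lookup table (a hard-coded stand-in for the ConceptNet and word-embedding based component), the function names, and the `min_ratio` threshold are all hypothetical. As a simplification, the sketch also generalizes each triple before counting, rather than normalizing an already combined graph.

```python
from collections import Counter

# Hypothetical toy scene graphs: one set of (subject, predicate, object) triples per image.
scene_graphs = [
    {("man", "riding", "horse"), ("horse", "on", "grass")},
    {("woman", "riding", "horse"), ("horse", "on", "grass")},
    {("man", "riding", "horse"), ("horse", "near", "fence")},
]

# Hypothetical stand-in for concept generalization; the paper uses ConceptNet
# with word embeddings, which this fixed lookup table does not reproduce.
GENERALIZE = {"man": "person", "woman": "person"}


def generalize(triple):
    """Map subject and object to a more general concept when one is known."""
    subj, pred, obj = triple
    return (GENERALIZE.get(subj, subj), pred, GENERALIZE.get(obj, obj))


def combine_scene_graphs(graphs):
    """Merge per-image triples into one graph, counting images per triple."""
    counts = Counter()
    for triples in graphs:
        # Build a set so each generalized triple counts at most once per image.
        counts.update({generalize(t) for t in triples})
    return counts


def common_triples(counts, num_images, min_ratio=0.5):
    """Keep triples that appear in at least min_ratio of the images."""
    return sorted(t for t, c in counts.items() if c / num_images >= min_ratio)


combined = combine_scene_graphs(scene_graphs)
shared = common_triples(combined, len(scene_graphs))
print(shared)
```

In this toy collection, "man" and "woman" collapse into "person", so the riding relation is shared by all three images, while the fence relation appears in only one image and is filtered out. A captioning model would then be applied to the retained triples.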
Notes
1. https://www.tensorflow.org/datasets/catalog/wikipedia/ (accessed Sept. 9, 2022)
Acknowledgements
Parts of this work were supported by a JSPS Grant-in-Aid for Scientific Research (21H03519) and a joint research project with the National Institute of Informatics.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Phueaksri, I., Kastner, M.A., Kawanishi, Y., Komamizu, T., Ide, I. (2023). Towards Captioning an Image Collection from a Combined Scene Graph Representation Approach. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_14
DOI: https://doi.org/10.1007/978-3-031-27077-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27076-5
Online ISBN: 978-3-031-27077-2
eBook Packages: Computer Science, Computer Science (R0)