Abstract
Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images paired with a long text narrative, such as the photos in a news article. In this work, we explore a novel setting where the goal is to learn a self-supervised visual-language representation from longer text paired with a set of photos, which we call visual summaries. In addition, unlike prior work which assumed captions have a literal relation to the image, we assume images have only a loose illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset called NewsStories containing over 31M articles, 22M images, and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives paired with multiple images, and introduce an intuitive baseline that outperforms these methods, e.g., by 10% on zero-shot image-set retrieval on the GoodNews dataset. (https://github.com/NewsStoriesData/newsstories.github.io).
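The zero-shot image-set retrieval setting mentioned above can be illustrated with a minimal sketch. This is not the paper's actual model: it simply assumes a text encoder and an image encoder that map into a shared embedding space, and scores each candidate photo set by its mean cosine similarity to the article embedding (random toy vectors stand in for encoder outputs here).

```python
import numpy as np

def l2_normalize(x):
    # normalize along the last axis; small eps avoids division by zero
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def score_image_set(text_emb, image_embs):
    # mean cosine similarity between the article and each photo in the set
    sims = l2_normalize(image_embs) @ l2_normalize(text_emb)
    return float(sims.mean())

# toy stand-ins for encoder outputs: one article, three candidate photo sets
rng = np.random.default_rng(0)
article = rng.normal(size=128)
photo_sets = [rng.normal(size=(k, 128)) for k in (3, 5, 2)]

# rank the candidate sets by score, best first
ranked = sorted(range(len(photo_sets)),
                key=lambda i: score_image_set(article, photo_sets[i]),
                reverse=True)
print(ranked)
```

Mean pooling over the set is only one plausible aggregation; max pooling or attention over images are natural alternatives when only some photos in a set illustrate the narrative.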
R. Tan—Work done as part of an internship at Google
K. Saenko—Also affiliated with MIT-IBM Watson AI Lab.
Notes
1. CommonCrawl [7] can be used to fetch web articles.
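The note above points to CommonCrawl as a source of web articles. As a hedged sketch, articles matching a URL pattern can be located through CommonCrawl's public CDX index server; the crawl ID below is a placeholder assumption (current IDs are listed on the index server).

```python
import urllib.parse

# Crawl ID is a placeholder assumption; consult the index server for current IDs.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2022-05-index"

def cdx_query_url(url_pattern: str, page: int = 0) -> str:
    """Build a CDX index query URL for captures matching `url_pattern`."""
    params = {"url": url_pattern, "output": "json", "page": str(page)}
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)

# Fetching this URL would return one JSON record per matching capture,
# including the WARC filename and byte offsets needed to retrieve the page.
print(cdx_query_url("www.bbc.com/news/*"))
```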
References
Aneja, S., Bregler, C., Nießner, M.: COSMOS: catching out-of-context misinformation with self-supervised learning. CoRR abs/2101.06278 (2021). https://arxiv.org/abs/2101.06278
Antol, S., et al.: VQA: visual question answering. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015, pp. 2425–2433. IEEE Computer Society (2015). https://doi.org/10.1109/ICCV.2015.279
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. CoRR abs/2104.00650 (2021). https://arxiv.org/abs/2104.00650
Biten, A.F., Gomez, L., Rusinol, M., Karatzas, D.: Good news, everyone! context driven entity-aware captioning for news images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12466–12475 (2019)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, pp. 1597–1607 (2020). https://proceedings.mlr.press/v119/chen20j.html
Chun, S., Oh, S.J., de Rezende, R.S., Kalantidis, Y., Larlus, D.: Probabilistic embeddings for cross-modal retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021 (2021). https://openaccess.thecvf.com/content/CVPR2021/html/Chun_Probabilistic_Embeddings_for_Cross-Modal_Retrieval_CVPR_2021_paper.html
Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021, pp. 11162–11173. Computer Vision Foundation / IEEE (2021). https://openaccess.thecvf.com/content/CVPR2021/html/Desai_VirTex_Learning_Visual_Representations_From_Textual_Annotations_CVPR_2021_paper.html
Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1–2), 31–71 (1997). https://doi.org/10.1016/S0004-3702(96)00034-3
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, 3–6 September 2018, p. 12. BMVA Press (2018). https://bmvc2018.org/contents/papers/0344.pdf
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting held 5–8 December 2013, Lake Tahoe, Nevada, United States (2013). https://proceedings.neurips.cc/paper/2013/hash/7cce53cf90577442771720a370c3c723-Abstract.html
Gu, X., et al.: Generating representative headlines for news stories. In: Proceedings of The Web Conference 2020 (2020)
Gurevych, I., Miyao, Y. (eds.): Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, 15–20 July 2018, Volume 1: Long Papers. Association for Computational Linguistics (2018). https://aclanthology.org/volumes/P18-1/
Huang, T.K., et al.: Visual storytelling. CoRR abs/1604.03968 (2016). https://arxiv.org/abs/1604.03968
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning, ICML (2021)
Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Computer Vision - ECCV 2016–14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VII (2016). https://doi.org/10.1007/978-3-319-46478-7_5
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp. 3128–3137. IEEE Computer Society (2015). https://doi.org/10.1109/CVPR.2015.7298932
Kim, G., Moon, S., Sigal, L.: Ranking and retrieval of image sequences from multiple paragraph queries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1993–2001 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). https://arxiv.org/abs/1412.6980
Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216 (2018)
Li, A., Jabri, A., Joulin, A., van der Maaten, L.: Learning visual n-grams from web data. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017, pp. 4193–4202. IEEE Computer Society (2017). https://doi.org/10.1109/ICCV.2017.449
Li, M., Chen, X., Gao, S., Chan, Z., Zhao, D., Yan, R.: VMSMO: learning to generate multimodal summary for video-based news articles. arXiv preprint arXiv:2010.05406 (2020)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, F., Wang, Y., Wang, T., Ordonez, V.: Visual news: benchmark and challenges in news image captioning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6761–6771 (2021)
Liu, J., Liu, T., Yu, C.: NewsEmbed: modeling news through pre-trained document representations. arXiv preprint arXiv:2106.00590 (2021)
Loper, E., Bird, S.: NLTK: the natural language toolkit. CoRR cs.CL/0205028 (2002). https://dblp.uni-trier.de/db/journals/corr/corr0205.html#cs-CL-0205028
Miech, A., Alayrac, J., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020 (2020). https://doi.org/10.1109/CVPR42600.2020.00990, https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html
Oh, S.J., Murphy, K.P., Pan, J., Roth, J., Schroff, F., Gallagher, A.C.: Modeling uncertainty with hedged instance embeddings. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019 (2019). https://openreview.net/forum?id=r1xQQhAqKX
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018). https://arxiv.org/abs/1807.03748
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, ICML (2021)
Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Computer Vision - ECCV 2020–16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part VIII (2020). https://doi.org/10.1007/978-3-030-58598-3_10
Song, Y., Soleymani, M.: Polysemous visual-semantic embedding for cross-modal retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2019)
Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., Artzi, Y.: A corpus for reasoning about natural language grounded in photographs. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019, Volume 1: Long Papers (2019). https://doi.org/10.18653/v1/p19-1644
Tan, R., Plummer, B., Saenko, K.: Detecting cross-modal inconsistency to defend against neural fake news. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2081–2106. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.163, https://aclanthology.org/2020.emnlp-main.163
Thomas, C., Kovashka, A.: Preserving semantic neighborhoods for robust cross-modal retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 317–335. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5_19
Tran, A., Mathews, A., Xie, L.: Transform and tell: entity-aware news image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13035–13045 (2020)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013 (2016)
Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: deep contextualized entity representations with entity-aware self-attention. arXiv preprint arXiv:2010.01057 (2020)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. CoRR abs/2010.00747 (2020). https://arxiv.org/abs/2010.00747
Zhu, Y., Groth, O., Bernstein, M.S., Fei-Fei, L.: Visual7W: grounded question answering in images. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 4995–5004. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.540
Acknowledgements
This material is based upon work supported, in part, by DARPA under agreement number HR00112020054.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tan, R., Plummer, B.A., Saenko, K., Lewis, J., Sud, A., Leung, T. (2022). NewsStories: Illustrating Articles with Visual Summaries. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5
eBook Packages: Computer Science (R0)