
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13695)

Abstract

Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful pieces of information amid the large amount of redundancy that is perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained status changes inside them. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in the generic events of 12K videos. Building upon this new dataset, we propose three tasks that support the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines on our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) Modeling method for visual difference and achieve significant performance improvements. The results also show that current methods still face formidable challenges in utilizing different granularities, representing visual differences, and accurately localizing status changes. Further analysis shows that our dataset can drive the development of more powerful methods for understanding status changes and thus improve video-level comprehension. The dataset is available at https://github.com/Yuxuan-W/GEB-Plus.
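
The abstract describes the TPD (Temporal-based Pairwise Difference) Modeling method only at a high level. The sketch below is a hypothetical illustration of the underlying idea, not the authors' implementation: representing a boundary by the pairwise differences between frame features sampled before and after it, which a downstream captioning, grounding, or retrieval head could then pool or attend over. The function name, window size, and feature shapes are assumptions made for illustration.

```python
import torch


def temporal_pairwise_difference(frame_feats: torch.Tensor,
                                 boundary_idx: int,
                                 window: int = 4) -> torch.Tensor:
    """Hypothetical sketch: pairwise differences around an event boundary.

    frame_feats: (T, D) per-frame features from a visual backbone.
    boundary_idx: index of the annotated boundary frame.
    window: number of frames sampled on each side of the boundary.

    Returns a (num_after, num_before, D) tensor of visual differences.
    The actual TPD module in GEB+ may differ.
    """
    num_frames = frame_feats.shape[0]
    # Frames immediately preceding and following the boundary frame.
    before = frame_feats[max(0, boundary_idx - window):boundary_idx]
    after = frame_feats[boundary_idx + 1:min(num_frames, boundary_idx + 1 + window)]
    # Broadcast subtraction: every "after" frame minus every "before" frame.
    return after.unsqueeze(1) - before.unsqueeze(0)


# Example: 32 sampled frames with 512-D features and a boundary at frame 16.
feats = torch.randn(32, 512)
diff = temporal_pairwise_difference(feats, boundary_idx=16)
print(diff.shape)  # torch.Size([4, 4, 512])
```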

Acknowledgements

This project is supported by the National Research Foundation, Singapore under its NRFF Award NRF-NRFF13-2021-0008, and Mike Zheng Shou’s Start-Up Grant from NUS. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore.

Author information

Corresponding author

Correspondence to Mike Zheng Shou.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 8204 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Y., Gao, D., Yu, L., Lei, W., Feiszli, M., Shou, M.Z. (2022). GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_41

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19833-5_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science, Computer Science (R0)
