
Time-Aware Circulant Matrices for Question-Based Temporal Localization

  • Conference paper
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Episodic memory involves the ability to recall specific events, experiences, and locations from one’s past. Humans use this ability to understand the context and significance of past events, while also being able to plan for future endeavors. Unfortunately, episodic memory can decline with age and with certain neurological conditions. By using machine learning and computer vision techniques, it could be possible to “observe” the daily routines of elderly individuals from their point of view and provide customized healthcare and support. For example, it could help an elderly person remember whether or not they have taken their daily medication. Given its potential impact on healthcare and societal assistance, this problem has recently been addressed by the research community under the name Episodic Memory via Natural Language Queries. Recent approaches mostly build on the literature from related fields, but the contextual information carried by past and future clips is often left unexplored. To address this limitation, in this paper we propose the Time-aware Circulant Matrices technique, which introduces awareness of the surrounding clips into the model. In the experimental results, we demonstrate the robustness of our method by ablating its components and confirm its effectiveness on the public Ego4D dataset, achieving an absolute improvement of more than 1% in R@5.
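The abstract does not spell out how the circulant structure is applied inside the model, so the following Python sketch only illustrates the general idea of enriching each clip feature with its temporal neighbours through a circulant weighting matrix. The window size, decay weighting, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def circulant(first_row):
    """Build a circulant matrix whose i-th row is the first row cyclically shifted by i."""
    n = len(first_row)
    return np.stack([np.roll(first_row, i) for i in range(n)], axis=0)

def time_aware_context(clip_feats, window=2, decay=0.5):
    """Hypothetical sketch: enrich each clip feature with its temporal neighbours
    via a (T x T) circulant weighting matrix, so every clip becomes "aware" of
    the surrounding clips. clip_feats has shape (T, D): T clips, D-dim features.
    Note that the circulant structure wraps around at the sequence boundaries."""
    T = clip_feats.shape[0]
    row = np.zeros(T)
    row[0] = 1.0                          # the clip itself
    for k in range(1, window + 1):
        row[k % T] = decay ** k           # k clips into the future
        row[-k % T] = decay ** k          # k clips into the past
    W = circulant(row)
    W = W / W.sum(axis=1, keepdims=True)  # each output row is a weighted average
    return W @ clip_feats                 # (T, D) context-enriched clip features

# Usage: 8 clips with 16-dimensional features
feats = np.random.randn(8, 16)
print(time_aware_context(feats, window=2).shape)  # (8, 16)
```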


Notes

  1. https://github.com/EGO4D/episodic-memory/tree/main/NLQ/VSLNet.


Acknowledgements

This work was supported by the Department Strategic Plan (PSD) of the University of Udine - Interdepartmental Project on Artificial Intelligence (2020-25). This work was partially funded by the Horizon EU project MUSAE (No. 01070421), 2021-SGR-01094 (AGAUR), ICREA Academia 2022 (Generalitat de Catalunya), Robo STEAM (2022-1-BG01-KA220-VET-000089434, Erasmus+ EU), DeepSense (ACE053/22/000029, ACCIÓ), DeepFoodVol (AEI-MICINN, PDC2022-133642-I00), and the CERCA Programme/Generalitat de Catalunya.

Author information


Corresponding author

Correspondence to Alex Falcon.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bruni, P., Falcon, A., Radeva, P. (2023). Time-Aware Circulant Matrices for Question-Based Temporal Localization. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_16


  • DOI: https://doi.org/10.1007/978-3-031-43153-1_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43152-4

  • Online ISBN: 978-3-031-43153-1

  • eBook Packages: Computer Science, Computer Science (R0)
