
Time-Aware Circulant Matrices for Question-Based Temporal Localization

  • Conference paper
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Episodic memory involves the ability to recall specific events, experiences, and locations from one’s past. Humans use this ability to understand the context and significance of past events, while also being able to plan for future endeavors. Unfortunately, episodic memory can decline with age and with certain neurological conditions. By using machine learning and computer vision techniques, it could be possible to “observe” the daily routines of elderly individuals from their point of view and provide customized healthcare and support. For example, it could help an elderly person remember whether or not they have taken their daily medication. Given its potential impact on healthcare and societal assistance, this problem has recently been addressed by the research community under the name Episodic Memory via Natural Language Queries. Recent approaches mostly build on the literature from related fields, but the contextual information carried by past and future clips is often left unexplored. To address this limitation, in this paper we propose the Time-aware Circulant Matrices technique, which introduces awareness of the surrounding clips into the model. In the experimental results, we demonstrate the robustness of our method by ablating its components and confirm its effectiveness on the public Ego4D dataset, achieving an absolute improvement of more than 1% in R@5.
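The abstract does not spell out how the circulant structure is applied inside the model, so the following Python sketch only illustrates the general idea of enriching each clip feature with its temporal neighbours through a circulant weighting matrix. The window size, decay weighting, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def circulant(first_row):
    """Build a circulant matrix whose i-th row is the first row cyclically shifted by i."""
    n = len(first_row)
    return np.stack([np.roll(first_row, i) for i in range(n)], axis=0)

def time_aware_context(clip_feats, window=2, decay=0.5):
    """Hypothetical sketch: enrich each clip feature with its temporal neighbours
    via a (T x T) circulant weighting matrix, so every clip becomes "aware" of
    the surrounding clips. clip_feats has shape (T, D): T clips, D-dim features.
    Note that the circulant structure wraps around at the sequence boundaries."""
    T = clip_feats.shape[0]
    row = np.zeros(T)
    row[0] = 1.0                          # the clip itself
    for k in range(1, window + 1):
        row[k % T] = decay ** k           # k clips into the future
        row[-k % T] = decay ** k          # k clips into the past
    W = circulant(row)
    W = W / W.sum(axis=1, keepdims=True)  # each output row is a weighted average
    return W @ clip_feats                 # (T, D) context-enriched clip features

# Usage: 8 clips with 16-dimensional features
feats = np.random.randn(8, 16)
print(time_aware_context(feats, window=2).shape)  # (8, 16)
```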


Notes

  1. https://github.com/EGO4D/episodic-memory/tree/main/NLQ/VSLNet.


Acknowledgements

This work was supported by the Department Strategic Plan (PSD) of the University of Udine - Interdepartmental Project on Artificial Intelligence (2020-25). This work was partially funded by the Horizon EU project MUSAE (No. 01070421), 2021-SGR-01094 (AGAUR), ICREA Academia 2022 (Generalitat de Catalunya), Robo STEAM (2022-1-BG01-KA220-VET-000089434, Erasmus+ EU), DeepSense (ACE053/22/000029, ACCIÓ), DeepFoodVol (AEI-MICINN, PDC2022-133642-I00), and the CERCA Programme/Generalitat de Catalunya.

Author information


Corresponding author

Correspondence to Alex Falcon.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bruni, P., Falcon, A., Radeva, P. (2023). Time-Aware Circulant Matrices for Question-Based Temporal Localization. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_16


  • DOI: https://doi.org/10.1007/978-3-031-43153-1_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43152-4

  • Online ISBN: 978-3-031-43153-1

  • eBook Packages: Computer Science, Computer Science (R0)
