Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13696)

Abstract

Temporal grounding aims to locate a target video moment that semantically corresponds to a given sentence query in an untrimmed video. However, recent works find that existing methods suffer from a severe temporal bias problem: they do not reason about target moment locations based on visual-textual semantic alignment but instead over-rely on the temporal biases of queries in the training sets. To this end, this paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy. Our framework introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote the training of the grounding model. The cross-modal matching task leverages the content consistency between shuffled and original videos to force the grounding model to mine visual contents that semantically match queries. The temporal order discrimination task leverages the difference in temporal order to strengthen the understanding of long-term temporal contexts. Extensive experiments on Charades-STA and ActivityNet Captions demonstrate the effectiveness of our method for mitigating the reliance on temporal biases and strengthening the model's generalization ability against different temporal distributions. Code is available at https://github.com/haojc/ShufflingVideosForTSG.
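To make the two auxiliary tasks described above concrete, the sketch below is a minimal, hypothetical PyTorch illustration of the idea, not the authors' implementation (see the linked repository for the actual code). The names `shuffle_clips`, `TemporalOrderDiscriminator`, `matcher`, and `auxiliary_losses`, as well as the specific loss forms, are assumptions chosen to mirror the abstract: a consistency loss between the query-matching scores of shuffled and original videos, and a binary order-discrimination loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def shuffle_clips(video_feats):
    """Randomly permute clip features along the temporal axis.

    video_feats: (batch, num_clips, dim) tensor of clip-level features.
    Returns the shuffled features and the permutation used.
    """
    num_clips = video_feats.size(1)
    perm = torch.randperm(num_clips, device=video_feats.device)
    return video_feats[:, perm, :], perm


class TemporalOrderDiscriminator(nn.Module):
    """Predicts whether a video's clip order is original (1) or shuffled (0)."""

    def __init__(self, dim):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, video_feats):
        # (batch, num_clips, dim) -> (batch, dim) via temporal average pooling
        pooled = video_feats.mean(dim=1)
        return self.head(pooled).squeeze(-1)  # order logits


def auxiliary_losses(video_feats, query_feats, matcher, discriminator):
    """Compute the two auxiliary losses for one batch.

    matcher(video_feats, query_feats) is an assumed callable returning a
    video-query matching score per pair (higher = better aligned).
    """
    shuffled, _ = shuffle_clips(video_feats)

    # Cross-modal matching: shuffling keeps the visual content, so the
    # matching score with the query should stay consistent with the original.
    score_orig = matcher(video_feats, query_feats)
    score_shuf = matcher(shuffled, query_feats)
    match_loss = F.mse_loss(score_shuf, score_orig.detach())

    # Temporal order discrimination: telling original from shuffled order
    # requires long-term temporal context rather than query priors.
    logits = torch.cat([discriminator(video_feats), discriminator(shuffled)])
    labels = torch.cat([
        torch.ones(video_feats.size(0), device=logits.device),
        torch.zeros(video_feats.size(0), device=logits.device),
    ])
    order_loss = F.binary_cross_entropy_with_logits(logits, labels)

    return match_loss, order_loss
```

In practice, such auxiliary losses would presumably be added to the grounding model's main localization loss with weighting coefficients; the exact loss definitions and weights used by the paper are given in the full text and repository.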


Notes

  1. As the grounding model is not our key contribution, more details about the grounding network and the inference stage are provided in our supplementary material.

  2. https://github.com/yytzsy/grounding_changing_distribution.

  3. We used the models with trained parameters provided by their authors where available.

  4. The length distribution of predictions can be found in our supplementary material.


Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 62071067, 62001054, 62101064, and 62171057; in part by the Ministry of Education and China Mobile Joint Fund (MCM20200202); in part by the China Postdoctoral Science Foundation under Grant 2022M710468; and in part by the Beijing University of Posts and Telecommunications-China Mobile Research Institute Joint Innovation Center.

Author information

Corresponding authors

Correspondence to Haifeng Sun or Jingyu Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2148 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hao, J., Sun, H., Ren, P., Wang, J., Qi, Q., Liao, J. (2022). Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_8


  • DOI: https://doi.org/10.1007/978-3-031-20059-5_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science, Computer Science (R0)
