
Cross-Graph Transformer Network for Temporal Sentence Grounding

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14259)


Abstract

Temporal sentence grounding aims to retrieve the moments in an untrimmed video that correspond to a given sentence. It is a multi-modal problem that requires an adequate understanding of both the sentence and the video structure, as well as accurate interaction between the two modalities. In this paper, we propose a cross-graph Transformer network (CGTN) to address this problem, in which the sentence is represented as a dependency tree and the video as a graph, reflecting their non-linear structures. Building on these graph structures, we design self-graph attention and cross-graph attention to model the relationships between nodes within each graph and across the two graphs. We evaluate the proposed model on two challenging datasets, and extensive experiments demonstrate the strength of our method.
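The abstract names two mechanisms, self-graph attention and cross-graph attention, without giving their equations on this page. The PyTorch sketch below is one plausible, minimal reading of that description: self-graph attention restricts scaled dot-product attention to the edges of each modality's graph (the dependency tree for the sentence, a clip graph for the video), while cross-graph attention lets the nodes of each graph attend to all nodes of the other. All class names, dimensions, and the placeholder adjacency matrices are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Scaled dot-product attention, optionally restricted by an adjacency mask."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_q, x_kv, adj=None):
        # x_q: (B, Nq, D) query nodes; x_kv: (B, Nk, D) key/value nodes.
        scores = self.q(x_q) @ self.k(x_kv).transpose(-2, -1) * self.scale
        if adj is not None:
            # Self-graph attention: attend only along graph edges.
            scores = scores.masked_fill(adj == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(x_kv)

class CrossGraphBlock(nn.Module):
    """Self-graph attention within each modality, then cross-graph attention."""

    def __init__(self, dim):
        super().__init__()
        self.self_sent = GraphAttention(dim)
        self.self_vid = GraphAttention(dim)
        self.cross_sent = GraphAttention(dim)  # sentence nodes attend to video nodes
        self.cross_vid = GraphAttention(dim)   # video nodes attend to sentence nodes
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, sent, vid, sent_adj, vid_adj):
        # Self-graph attention over the dependency tree / video graph.
        sent = self.norm_s(sent + self.self_sent(sent, sent, sent_adj))
        vid = self.norm_v(vid + self.self_vid(vid, vid, vid_adj))
        # Cross-graph attention: every node sees all nodes of the other graph.
        return sent + self.cross_sent(sent, vid), vid + self.cross_vid(vid, sent)

if __name__ == "__main__":
    B, Nw, Nc, D = 2, 8, 16, 64                # batch, words, clips, feature dim
    words = torch.randn(B, Nw, D)              # word-node features (e.g. from GloVe)
    clips = torch.randn(B, Nc, D)              # clip-node features (e.g. from C3D)
    dep_adj = torch.eye(Nw).expand(B, Nw, Nw)  # placeholder dependency-tree edges
    vid_adj = torch.ones(B, Nc, Nc)            # placeholder fully connected video graph
    s, v = CrossGraphBlock(D)(words, clips, dep_adj, vid_adj)
    print(s.shape, v.shape)                    # torch.Size([2, 8, 64]) torch.Size([2, 16, 64])

In a real pipeline, the dependency-tree adjacency would come from a syntactic parser and the video-graph adjacency from clip relations (e.g. temporal neighbors); the identity and all-ones placeholders above merely keep the sketch self-contained.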



Acknowledgement

This research was supported by grants from the National Natural Science Foundation of China (No. 62088102), the Youth Innovation Team of Shaanxi Universities, and the Fundamental Research Funds for the Central Universities.

Author information

Corresponding author

Correspondence to Ping Wei.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shang, J., Wei, P., Zheng, N. (2023). Cross-Graph Transformer Network for Temporal Sentence Grounding. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14259. Springer, Cham. https://doi.org/10.1007/978-3-031-44223-0_28

  • DOI: https://doi.org/10.1007/978-3-031-44223-0_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44222-3

  • Online ISBN: 978-3-031-44223-0

  • eBook Packages: Computer Science; Computer Science (R0)
