Abstract
Temporal sentence grounding aims to retrieve the moments in an untrimmed video that correspond to a given sentence. It is a multi-modal problem that requires an adequate understanding of both the sentence and the video structure, as well as accurate interaction between the two modalities. In this paper, we propose a cross-graph Transformer network (CGTN) to address this problem, in which the sentence is modeled as a dependency tree and the video as a graph, reflecting their non-linear structures. Based on these graph structures, we design self-graph attention and cross-graph attention to model the relationships between nodes within each graph and across the two graphs. We evaluate the proposed model on two challenging datasets, and extensive experiments demonstrate the strength of our method.
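The abstract does not give the exact formulation of the two attention mechanisms, but the general idea can be sketched as follows: self-graph attention is scaled dot-product attention masked by a graph's adjacency matrix (a node attends only to its neighbours), while cross-graph attention lets the nodes of one graph attend densely to the nodes of the other. All function names, shapes, and the toy chain graph below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_graph_attention(X, adj):
    # Attention restricted to graph edges: node i attends only to
    # nodes j with adj[i, j] > 0 (assumed to include self-loops).
    scores = X @ X.T / np.sqrt(X.shape[-1])
    mask = np.where(adj > 0, 0.0, -1e9)   # block non-adjacent pairs
    return softmax(scores + mask) @ X

def cross_graph_attention(Xq, Xk):
    # Dense attention from the nodes of one graph (queries) to the
    # nodes of the other graph (keys/values); no mask across graphs.
    scores = Xq @ Xk.T / np.sqrt(Xq.shape[-1])
    return softmax(scores) @ Xk

# Toy example: 4 video-clip nodes, 3 word nodes, feature dim 8.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))   # video graph node features
S = rng.normal(size=(3, 8))   # sentence dependency-tree node features
# Hypothetical video graph: a temporal chain with self-loops.
adj = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)

V_self = self_graph_attention(V, adj)    # update video nodes within their graph
V_cross = cross_graph_attention(V, S)    # video nodes attend to word nodes
print(V_self.shape, V_cross.shape)       # (4, 8) (4, 8)
```

In a full model these operations would use learned query/key/value projections and be stacked inside Transformer layers; the sketch keeps only the masking logic that distinguishes within-graph from cross-graph attention.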
Acknowledgement
This research was supported by the National Natural Science Foundation of China (Grant No. 62088102), the Youth Innovation Team of Shaanxi Universities, and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Shang, J., Wei, P., Zheng, N. (2023). Cross-Graph Transformer Network for Temporal Sentence Grounding. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14259. Springer, Cham. https://doi.org/10.1007/978-3-031-44223-0_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44222-3
Online ISBN: 978-3-031-44223-0
eBook Packages: Computer Science (R0)