
Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction

Published in: International Journal of Computer Vision

Abstract

Localizing moments in a video via natural language queries is a challenging task in which a model must identify the start and end timestamps of the queried moment. However, obtaining such temporal endpoint annotations is labor intensive. In this paper, we focus on a weakly supervised setting, where the temporal endpoints of moments are not available during training. We develop a decoupled consistent concept prediction (DCCP) framework to learn the relations between videos and query texts. Specifically, the atomic objects and actions are decoupled from the query text to facilitate the recognition of these concepts in videos. We introduce a concept pairing module to temporally localize the objects and actions in the video, and propose a classification loss and a concept consistency loss that exploit the mutual benefits of object and action cues for building relations between language and video. Extensive experiments on DiDeMo, Charades-STA, and ActivityNet Captions demonstrate the effectiveness of our model.
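To make the "decouple, then pair" idea in the abstract concrete, the sketch below illustrates one plausible reading of it: object and action words are separated from the query with part-of-speech tags, and each word is then scored against every video segment. This is a minimal sketch, not the authors' implementation; the function names (decouple_concepts, pair_concepts), the POS-tag heuristic, and the cosine-similarity pairing are assumptions for illustration, whereas the actual DCCP model uses learned encoders, a trained concept pairing module, and the classification and concept consistency losses described in the paper.

```python
# Illustrative sketch only (hypothetical names, not the paper's code).
# Assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' data installed.

import numpy as np
import nltk

OBJECT_TAGS = {"NN", "NNS", "NNP", "NNPS"}                 # nouns -> object concepts
ACTION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}    # verbs -> action concepts


def decouple_concepts(query: str):
    """Split a query into object words and action words via POS tagging."""
    tokens = nltk.word_tokenize(query.lower())
    tagged = nltk.pos_tag(tokens)
    objects = [w for w, t in tagged if t in OBJECT_TAGS]
    actions = [w for w, t in tagged if t in ACTION_TAGS]
    return objects, actions


def pair_concepts(segment_feats: np.ndarray, concept_embs: np.ndarray) -> np.ndarray:
    """Score each video segment against each concept with cosine similarity.

    segment_feats: (T, D) per-segment visual features
    concept_embs:  (C, D) embeddings of the decoupled object/action words
    Returns a (T, C) matrix of temporal concept scores.
    """
    v = segment_feats / (np.linalg.norm(segment_feats, axis=1, keepdims=True) + 1e-8)
    c = concept_embs / (np.linalg.norm(concept_embs, axis=1, keepdims=True) + 1e-8)
    return v @ c.T


if __name__ == "__main__":
    objs, acts = decouple_concepts("A man throws a ball to the dog")
    print("objects:", objs, "| actions:", acts)

    # Toy features: 6 video segments and the decoupled concepts in a shared 16-d space.
    rng = np.random.default_rng(0)
    scores = pair_concepts(rng.normal(size=(6, 16)),
                           rng.normal(size=(len(objs) + len(acts), 16)))
    print("per-segment concept scores:", scores.shape)  # (6, n_concepts)
```

In the full model, such per-segment concept scores would be produced by learned projections rather than raw cosine similarity, and the classification and concept consistency losses would supervise them using only video-level query labels, since segment-level endpoints are unavailable in the weakly supervised setting.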



Notes

  1. https://github.com/LisaAnne/LocalizingMoments.

  2. https://allenai.org/plato/charades/.

  3. http://activity-net.org.


Acknowledgements

This research was partially supported by ARC DP200100938.

Author information

Corresponding author

Correspondence to Yi Yang.

Additional information

Communicated by Deva Ramanan.



About this article


Cite this article

Ma, F., Zhu, L. & Yang, Y. Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction. Int J Comput Vis 130, 1244–1258 (2022). https://doi.org/10.1007/s11263-022-01600-0

