Abstract
Building a system that can hold a meaningful conversation with humans about what they watch would be a major technological feat. A step toward that goal is the video dialog task, in which the system is asked to generate natural utterances in response to a question in an ongoing dialog. The task poses great visual, linguistic, and reasoning challenges that cannot easily be overcome without a representation scheme over video and dialog that supports high-level reasoning. To tackle these challenges we present a new object-centric framework for video dialog that supports neural reasoning, dubbed \(\texttt{COST}\) (Conversation about Objects in Space-Time). Here, the dynamic space-time visual content of a video is first parsed into object trajectories. Given this video abstraction, \(\texttt{COST}\) maintains and tracks object-associated dialog states, which are updated upon receiving new questions. Object interactions are dynamically and conditionally inferred for each question, and these serve as the basis for relational reasoning over the objects. \(\texttt{COST}\) also maintains a history of previous answers, which allows retrieval of relevant object-centric information to enrich the answer-forming process. Language production then proceeds in a step-wise manner, taking into account the context of the current utterance, the existing dialog, and the current question. We evaluate \(\texttt{COST}\) on the DSTC7 and DSTC8 test splits of the AVSD benchmark, demonstrating its competitiveness against state-of-the-art methods.
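To make the pipeline described above concrete, the following is a minimal, hypothetical sketch (in PyTorch) of one dialog turn of object-centric state tracking: per-object dialog states are updated from question-gated trajectory features after a round of question-conditioned relational reasoning. All names (ObjectStateTracker, d_model, and so on) and design choices here are illustrative assumptions for exposition, not the authors' implementation.

    # Hypothetical sketch of the object-centric dialog-state idea, not COST itself.
    import torch
    import torch.nn as nn


    class ObjectStateTracker(nn.Module):
        """Tracks one hidden state per object trajectory across dialog turns."""

        def __init__(self, d_model: int = 256):
            super().__init__()
            # GRUCell updates each object's dialog state from its turn input.
            self.state_update = nn.GRUCell(d_model, d_model)
            # Attention over objects stands in for question-conditioned
            # relational reasoning among object trajectories.
            self.interact = nn.MultiheadAttention(d_model, num_heads=4,
                                                  batch_first=True)
            self.q_proj = nn.Linear(d_model, d_model)

        def forward(self, obj_feats, obj_states, q_emb):
            # obj_feats:  (B, N, d) per-object trajectory features
            # obj_states: (B, N, d) object dialog states from previous turns
            # q_emb:      (B, d)    encoding of the current question
            B, N, d = obj_feats.shape
            # Gate each object's visual evidence by the current question.
            q = self.q_proj(q_emb).unsqueeze(1)         # (B, 1, d)
            conditioned = obj_feats * torch.sigmoid(q)  # question-gated features
            # Relational reasoning: objects attend to one another.
            relations, _ = self.interact(conditioned, conditioned, conditioned)
            # Update per-object dialog states with the relational summary.
            new_states = self.state_update(
                relations.reshape(B * N, d), obj_states.reshape(B * N, d)
            ).reshape(B, N, d)
            return new_states


    # Toy usage: 2 videos, 5 tracked objects each, one dialog turn.
    tracker = ObjectStateTracker(d_model=256)
    feats = torch.randn(2, 5, 256)
    states = torch.zeros(2, 5, 256)   # empty dialog state before the first turn
    question = torch.randn(2, 256)
    states = tracker(feats, states, question)  # carried over to the next turn
    print(states.shape)  # torch.Size([2, 5, 256])

In a full system of this kind, the updated object states would then condition a decoder that produces the answer step by step, together with the dialog history, as the abstract outlines.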
Notes
1. This is related to, but distinct from, the dialog state tracking in typical task-oriented dialogs in NLP [10].
References
Alamri, H., et al.: Audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7558–7567 (2019)
Baradel, F., Neverova, N., Wolf, C., Mille, J., Mori, G.: Object level visual reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 106–122. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_7
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139 (2018)
Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
Chu, Y.W., Lin, K.Y., Hsu, C.C., Ku, L.W.: Multi-step joint-modality attention network for scene-aware dialogue system. arXiv preprint arXiv:2001.06206 (2020)
Dang, L.H., Le, T.M., Le, V., Tran, T.: Hierarchical object-oriented spatio-temporal reasoning for video question answering. In: IJCAI (2021)
Das, A., et al.: Visual Dialog. IEEE Trans. Pattern Anal. Mach. Intell. 41(5), 1242–1256 (2019). https://doi.org/10.1109/TPAMI.2018.2828437
Desta, M.T., Chen, L., Kornuta, T.: Object-based reasoning in VQA. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1814–1823. IEEE (2018)
Gao, S., Sethi, A., Agarwal, S., Chung, T., Hakkani-Tur, D.: Dialog state tracking: a neural reading comprehension approach. In: Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 264–273 (2019)
Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual turing test for computer vision systems. Proc. Natl. Acad. Sci. 112(12), 3618–3623 (2015)
Geng, S., et al.: Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1643–1653, June 2021
Hori, C., et al.: End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In: ICASSP, pp. 2352–2356. IEEE (2019)
Hori, C., Cherian, A., Marks, T.K., Hori, T.: Joint student-teacher learning for audio-visual scene-aware dialog. In: INTERSPEECH, pp. 1886–1890 (2019)
Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., Gan, C.: Location-aware graph convolutional networks for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11021–11028 (2020)
Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4405–4413 (2017)
Kim, J., Yoon, S., Kim, D., Yoo, C.D.: Structured co-reference graph attention for video-grounded dialogue. In: AAAI (2021)
Kim, S., et al.: The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394 (2019)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2014)
Kottur, S., Moura, J.M.F., Parikh, D., Batra, D., Rohrbach, M.: Visual coreference resolution in visual dialog using neural module networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 160–178. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_10
Le, H., Chen, N.F.: Multimodal transformer with pointer network for the DSTC8 AVSD challenge. arXiv preprint arXiv:2002.10695 (2020)
Le, H., Chen, N.F., Hoi, S.C.: Learning reasoning paths over semantic graphs for video-grounded dialogues. arXiv preprint arXiv:2103.00820 (2021)
Le, H., Hoi, S., Sahoo, D., Chen, N.: End-to-end multimodal dialog systems with hierarchical multimodal attention on video features. In: DSTC7 at AAAI2019 Workshop (2019)
Le, H., Sahoo, D., Chen, N., Hoi, S.: Multimodal transformer networks for end-to-end video-grounded dialogue systems. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5612–5623 (2019)
Le, H., Sahoo, D., Chen, N., Hoi, S.C.: BiST: bi-directional spatio-temporal reasoning for video-grounded dialogues. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1846–1859 (2020)
Le, T.M., Le, V., Venkatesh, S., Tran, T.: Dynamic language binding in relational visual reasoning. In: IJCAI, pp. 818–824 (2020)
Le, T.M., Le, V., Venkatesh, S., Tran, T.: Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9972–9981 (2020)
Lee, H., Yoon, S., Dernoncourt, F., Kim, D.S., Bui, T., Jung, K.: DSTC8-AVSD: multimodal semantic transformer network with retrieval style word generator. arXiv preprint arXiv:2004.08299 (2020)
Lin, K.Y., Hsu, C.C., Chen, Y.N., Ku, L.W.: Entropy-enhanced multimodal attention model for scene-aware dialogue generation. arXiv preprint arXiv:1908.08191 (2019)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (2017)
Murahari, V., Batra, D., Parikh, D., Das, A.: Large-scale pretraining for visual dialog: a simple state-of-the-art baseline. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 336–352. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5_20
Nguyen, D.T., Sharma, S., Schulz, H., Asri, L.E.: From film to video: multi-turn question answering with multi-modal context. In: DSTC7 Workshop at AAAI 2019 (2019)
Pan, B., et al.: Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10870–10879 (2020)
Sanabria, R., Palaskar, S., Metze, F.: CMU Sinbad’s submission for the DSTC7 AVSD challenge. In: DSTC7 at AAAI2019 Workshop, vol. 6 (2019)
Seo, P.H., Lehrmann, A., Han, B., Sigal, L.: Visual reference resolution using attention memory for visual dialog. arXiv preprint arXiv:1709.07992 (2017)
Serban, I., et al.: A hierarchical latent variable encoder-decoder model for generating dialogues. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_31
Spelke, E.S., Kinzler, K.D.: Core knowledge. Dev. Sci. 10(1), 89–96 (2007)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems 28 (2015)
Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_25
Wang, Y., Joty, S., Lyu, M.R., King, I., Xiong, C., Hoi, S.C.: VD-BERT: a unified vision and dialog transformer with BERT. arXiv preprint arXiv:2004.13278 (2020)
Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP, pp. 3645–3649. IEEE (2017)
Xie, H., Iacobacci, I.: Audio visual scene-aware dialog system using dynamic memory networks. In: DSTC8 at AAAI2020 Workshop (2020)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 1492–1500 (2017)
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., Takemura, H.: BERT representations for video question answering. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1556–1565 (2020)
Yeh, Y.T., Lin, T.C., Cheng, H.H., Deng, Y.H., Su, S.Y., Chen, Y.N.: Reactive multi-stage feature fusion for multimodal dialogue modeling. arXiv preprint arXiv:1908.05067 (2019)
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)
Yoshino, K., et al.: Dialog system technology challenge 7. arXiv preprint arXiv:1901.03461 (2019)
Zeng, R., et al.: Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7094–7103 (2019)
Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12647–12657, June 2021
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Pham, H.A., Le, T.M., Le, V., Phuong, T.M., Tran, T. (2022). Video Dialog as Conversation About Objects Living in Space-Time. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_41
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7