Video Dialog as Conversation About Objects Living in Space-Time

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)

Abstract

Building a system that can hold a meaningful conversation with humans about what they watch would be a technological feat. A step toward that goal is the video dialog task, in which the system must generate natural utterances in response to a question within an ongoing dialog. The task poses substantial visual, linguistic, and reasoning challenges that cannot be easily overcome without a representation scheme over video and dialog that supports high-level reasoning. To tackle these challenges we present COST (Conversation about Objects in Space-Time), a new object-centric framework for video dialog that supports neural reasoning. The dynamic space-time visual content of a video is first parsed into object trajectories. Given this video abstraction, COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions. Object interactions are dynamically and conditionally inferred for each question and serve as the basis for relational reasoning among the objects. COST also maintains a history of previous answers, allowing it to retrieve relevant object-centric information to enrich the answer-forming process. Language production then proceeds step-wise, conditioned on the context of the current utterance, the existing dialog, and the current question. We evaluate COST on the AVSD test splits (DSTC7 and DSTC8), demonstrating its competitiveness against state-of-the-art methods.
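
A minimal, hypothetical sketch of the object-centric flow described above may help make the idea concrete: one state vector is kept per tracked object and is refreshed at every dialog turn by attending to the encoded question. The module name, tensor sizes, and the single attention-plus-GRU update used here are illustrative assumptions, not the authors' architecture; the official implementation is linked in the Notes below.

```python
# Hypothetical sketch only: per-object dialog states updated turn by turn.
# Shapes and modules are assumptions, not the COST implementation.
import torch
import torch.nn as nn


class ObjectDialogState(nn.Module):
    """Keeps one state vector per tracked object and refreshes it with each question."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.question_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.state_update = nn.GRUCell(dim, dim)

    def forward(self, object_states: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # object_states: (num_objects, dim); question: (num_words, dim)
        # 1) Each object attends over the question words to gather the evidence it needs.
        ctx, _ = self.question_attn(
            object_states.unsqueeze(0), question.unsqueeze(0), question.unsqueeze(0)
        )
        # 2) The per-object dialog state is carried across turns by a recurrent update.
        return self.state_update(ctx.squeeze(0), object_states)


if __name__ == "__main__":
    dim, num_objects, num_words = 256, 5, 12
    tracker = ObjectDialogState(dim)
    states = torch.zeros(num_objects, dim)        # initial per-object dialog states
    for turn in range(3):                          # three dialog turns
        question = torch.randn(num_words, dim)     # encoded question for this turn
        states = tracker(states, question)
    print(states.shape)                            # torch.Size([5, 256])
```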


Notes

  1. This is related to, but distinct from, the dialog state tracking in typical task-oriented dialogs in NLP [10].

  2. https://github.com/hoanganhpham1006/COST.

References

  1. Alamri, H., et al.: Audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7558–7567 (2019)

  2. Baradel, F., Neverova, N., Wolf, C., Mille, J., Mori, G.: Object level visual reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 106–122. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_7

  3. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017)

  4. Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139 (2018)

  5. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)

  6. Chu, Y.W., Lin, K.Y., Hsu, C.C., Ku, L.W.: Multi-step joint-modality attention network for scene-aware dialogue system. arXiv preprint arXiv:2001.06206 (2020)

  7. Dang, L.H., Le, T.M., Le, V., Tran, T.: Hierarchical object-oriented spatio-temporal reasoning for video question answering. In: IJCAI (2021)

  8. Das, A., et al.: Visual Dialog. IEEE Trans. Pattern Anal. Mach. Intell. 41(5), 1242–1256 (2019). https://doi.org/10.1109/TPAMI.2018.2828437

  9. Desta, M.T., Chen, L., Kornuta, T.: Object-based reasoning in VQA. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1814–1823. IEEE (2018)

  10. Gao, S., Sethi, A., Agarwal, S., Chung, T., Hakkani-Tur, D.: Dialog state tracking: a neural reading comprehension approach. In: Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 264–273 (2019)

  11. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual turing test for computer vision systems. Proc. Natl. Acad. Sci. 112(12), 3618–3623 (2015)

  12. Geng, S., et al.: Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In: Proceedings of AAAI Conference on Artificial Intelligence (2021)

  13. Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1643–1653, June 2021

  14. Hori, C., et al.: End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2352–2356. IEEE (2019)

  15. Hori, C., Cherian, A., Marks, T.K., Hori, T.: Joint student-teacher learning for audio-visual scene-aware dialog. In: INTERSPEECH, pp. 1886–1890 (2019)

  16. Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., Gan, C.: Location-aware graph convolutional networks for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11021–11028 (2020)

  17. Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4405–4413 (2017)

  18. Kim, J., Yoon, S., Kim, D., Yoo, C.D.: Structured co-reference graph attention for video-grounded dialogue. In: AAAI (2021)

  19. Kim, S., et al.: The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394 (2019)

  20. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2014)

  21. Kottur, S., Moura, J.M.F., Parikh, D., Batra, D., Rohrbach, M.: Visual coreference resolution in visual dialog using neural module networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 160–178. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_10

  22. Le, H., Chen, N.F.: Multimodal transformer with pointer network for the DSTC8 AVSD challenge. arXiv preprint arXiv:2002.10695 (2020)

  23. Le, H., Chen, N.F., Hoi, S.C.: Learning reasoning paths over semantic graphs for video-grounded dialogues. arXiv preprint arXiv:2103.00820 (2021)

  24. Le, H., Hoi, S., Sahoo, D., Chen, N.: End-to-end multimodal dialog systems with hierarchical multimodal attention on video features. In: DSTC7 at AAAI2019 Workshop (2019)

  25. Le, H., Sahoo, D., Chen, N., Hoi, S.: Multimodal transformer networks for end-to-end video-grounded dialogue systems. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5612–5623 (2019)

  26. Le, H., Sahoo, D., Chen, N., Hoi, S.C.: BiST: bi-directional spatio-temporal reasoning for video-grounded Dialogues. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1846–1859 (2020)

  27. Le, T.M., Le, V., Venkatesh, S., Tran, T.: Dynamic language binding in relational visual reasoning. In: IJCAI, pp. 818–824 (2020)

  28. Le, T.M., Le, V., Venkatesh, S., Tran, T.: Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9972–9981 (2020)

  29. Lee, H., Yoon, S., Dernoncourt, F., Kim, D.S., Bui, T., Jung, K.: DSTC8-AVSD: multimodal semantic transformer network with retrieval style word generator. arXiv preprint arXiv:2004.08299 (2020)

  30. Lin, K.Y., Hsu, C.C., Chen, Y.N., Ku, L.W.: Entropy-enhanced multimodal attention model for scene-aware dialogue generation. arXiv preprint arXiv:1908.08191 (2019)

  31. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (2017)

  32. Murahari, V., Batra, D., Parikh, D., Das, A.: Large-scale pretraining for visual dialog: a simple state-of-the-art baseline. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 336–352. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5_20

  33. Nguyen, D.T., Sharma, S., Schulz, H., Asri, L.E.: From film to video: multi-turn question answering with multi-modal context. In: DSTC7 Workshop at AAAI 2019 (2019)

  34. Pan, B., et al.: Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10870–10879 (2020)

  35. Sanabria, R., Palaskar, S., Metze, F.: CMU Sinbad’s submission for the DSTC7 AVSD challenge. In: DSTC7 at AAAI2019 Workshop, vol. 6 (2019)

  36. Seo, P.H., Lehrmann, A., Han, B., Sigal, L.: Visual reference resolution using attention memory for visual dialog. arXiv preprint arXiv:1709.07992 (2017)

  37. Serban, I., et al.: A hierarchical latent variable encoder-decoder model for generating dialogues. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)

  38. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_31

  39. Spelke, E.S., Kinzler, K.D.: Core knowledge. Dev. Sci. 10(1), 89–96 (2007)

  40. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  41. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems 28 (2015)

  42. Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_25

  43. Wang, Y., Joty, S., Lyu, M.R., King, I., Xiong, C., Hoi, S.C.: VD-BERT: a unified vision and dialog transformer with BERT. arXiv preprint arXiv:2004.13278 (2020)

  44. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP, pp. 3645–3649. IEEE (2017)

  45. Xie, H., Iacobacci, I.: Audio visual scene-aware dialog system using dynamic memory networks. In: DSTC8 at AAAI2020 Workshop (2020)

  46. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 1492–1500 (2017)

  47. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19

  48. Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., Takemura, H.: BERT representations for video question answering. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1556–1565 (2020)

  49. Yeh, Y.T., Lin, T.C., Cheng, H.H., Deng, Y.H., Su, S.Y., Chen, Y.N.: Reactive multi-stage feature fusion for multimodal dialogue modeling. arXiv preprint arXiv:1908.05067 (2019)

  50. Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)

  51. Yoshino, K., et al.: Dialog system technology challenge 7. arXiv preprint arXiv:1901.03461 (2019)

  52. Zeng, R., et al.: Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7094–7103 (2019)

  53. Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12647–12657, June 2021

Author information

Corresponding author

Correspondence to Hoang-Anh Pham.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4249 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pham, H.A., Le, T.M., Le, V., Phuong, T.M., Tran, T. (2022). Video Dialog as Conversation About Objects Living in Space-Time. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_41

  • DOI: https://doi.org/10.1007/978-3-031-19842-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
