
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13696)

Abstract

A long-standing goal of intelligent assistants such as AR glasses or robots has been to assist users in affordance-centric real-world scenarios, such as “how can I run the microwave for 1 min?”. However, there is still no clear task definition or suitable benchmark. In this paper, we define a new task called Affordance-centric Question-driven Task Completion (AQTC), where the AI assistant should learn from instructional videos to provide step-by-step help in the user’s view. To support the task, we constructed AssistQ, a new dataset comprising 531 question-answer samples from 100 newly filmed instructional videos. We also developed a novel Question-to-Actions (Q2A) model to address the AQTC task and validated it on the AssistQ dataset. The results show that our model significantly outperforms several VQA-related baselines while still leaving large room for improvement. We expect our task and dataset to advance the development of egocentric AI assistants. Our project page is available at: https://showlab.github.io/assistq/.
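To give a concrete picture of the task setting, the sketch below shows one way an affordance-centric question-answer sample could be represented in code. The abstract does not specify AssistQ's actual schema, so every field name here (video_path, script, candidate_actions, correct_indices) is a hypothetical illustration, not the dataset's real format.

```python
# Hypothetical sketch of one AQTC-style sample; AssistQ's actual schema is not
# given here, so all field names are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AQTCSample:
    """One affordance-centric question grounded in an instructional video."""
    video_path: str                      # path to the instructional video
    script: str                          # transcribed narration / instructions
    question: str                        # e.g. "How can I run the microwave for 1 min?"
    candidate_actions: List[List[str]]   # per-step candidate actions (multiple choice)
    correct_indices: List[int] = field(default_factory=list)  # gold action per step


sample = AQTCSample(
    video_path="videos/microwave_demo.mp4",
    script="Press the power button, then turn the dial to set the time...",
    question="How can I run the microwave for 1 minute?",
    candidate_actions=[
        ["press the power button", "open the door", "turn the dial"],
        ["turn the dial to 1:00", "press start", "press stop"],
    ],
    correct_indices=[0, 0],
)
```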

B. Wong, J. Chen, and Y. Wu contributed equally.

Notes

  1. \(Q_U\) in Sect. 3.2 denotes the question under the user's view. For simplicity, we write it as Q here.

  2. We use the same notation as in Sect. 5. V: video, S: script, Q: question, A: answer (illustrated in the sketch below).
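To make the notation in note 2 concrete, here is a minimal, hypothetical sketch of scoring candidate answers from pooled (V, S, Q, A) features. It is not the paper's Q2A architecture; the fusion layer, feature dimensions, and pooling are assumptions made purely for illustration.

```python
# Minimal sketch of scoring candidate answers from (V, S, Q, A) features.
# This is NOT the paper's Q2A model; it only illustrates the notation:
# V = video features, S = script features, Q = question features,
# A = one embedding per candidate answer. Dimensions are assumed.
import torch
import torch.nn as nn


class CandidateScorer(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Fuse the context (V, S, Q) into a single query vector.
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, V, S, Q, A):
        # V, S, Q: (batch, dim) pooled features; A: (batch, num_candidates, dim).
        context = self.fuse(torch.cat([V, S, Q], dim=-1))   # (batch, dim)
        scores = torch.einsum("bd,bkd->bk", context, A)     # (batch, num_candidates)
        return scores                                       # higher = more likely answer


# Usage with random features: a batch of 2 questions, 3 candidate answers each.
scorer = CandidateScorer(dim=512)
V, S, Q = (torch.randn(2, 512) for _ in range(3))
A = torch.randn(2, 3, 512)
print(scorer(V, S, Q, A).shape)  # torch.Size([2, 3])
```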

Acknowledgements

This project is supported by the National Research Foundation, Singapore under its NRFF Award NRF-NRFF13-2021-0008, and Mike Zheng Shou’s Start-Up Grant from NUS. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore.

Author information

Corresponding author

Correspondence to Mike Zheng Shou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6474 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wong, B. et al. (2022). AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_28

  • DOI: https://doi.org/10.1007/978-3-031-20059-5_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science, Computer Science (R0)
