Abstract
A long-standing goal of intelligent assistants such as AR glasses and robots has been to assist users in affordance-centric real-world scenarios, such as “how can I run the microwave for 1 min?”. However, there is still no clear task definition or suitable benchmark. In this paper, we define a new task called Affordance-centric Question-driven Task Completion (AQTC), where the AI assistant should learn from instructional videos to provide step-by-step help in the user’s view. To support the task, we constructed AssistQ, a new dataset comprising 531 question-answer samples from 100 newly filmed instructional videos. We also developed a novel Question-to-Actions (Q2A) model to address the AQTC task and validated it on the AssistQ dataset. The results show that our model significantly outperforms several VQA-related baselines while still leaving large room for improvement. We expect our task and dataset to advance the development of egocentric AI assistants. Our project page is available at: https://showlab.github.io/assistq/.
B. Wong, J. Chen and Y. Wu—Equal Contribution.
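The abstract describes the AQTC setting only at a high level: a question, an instructional video, and a sequence of steps the assistant must guide the user through. As an illustration only (this is not the official AssistQ data format nor the paper's Q2A code), the minimal Python sketch below shows one plausible way to represent such a sample and score step-wise multiple-choice predictions; the AQTCSample/AQTCStep classes, their field names, and the recall_at_1 helper are hypothetical.

# Hypothetical sketch of an AQTC-style sample: each step is a multiple-choice
# selection among candidate actions, and evaluation is per-step top-1 accuracy.
from dataclasses import dataclass
from typing import List

@dataclass
class AQTCStep:
    candidates: List[str]   # candidate actions presented for this step
    answer: int             # index of the correct action

@dataclass
class AQTCSample:
    video_id: str           # instructional video the assistant learns from
    question: str           # e.g. "How can I run the microwave for 1 min?"
    steps: List[AQTCStep]   # ordered steps the user must perform

def recall_at_1(samples: List[AQTCSample], predictions: List[List[int]]) -> float:
    """Fraction of steps whose top-1 predicted action matches the ground truth."""
    correct = total = 0
    for sample, preds in zip(samples, predictions):
        for step, pred in zip(sample.steps, preds):
            correct += int(pred == step.answer)
            total += 1
    return correct / max(total, 1)

# Usage: one two-step sample, both steps predicted correctly -> 1.0
sample = AQTCSample(
    video_id="microwave_demo",
    question="How can I run the microwave for 1 min?",
    steps=[AQTCStep(["open door", "press start", "turn dial"], 2),
           AQTCStep(["press start", "press stop"], 0)],
)
print(recall_at_1([sample], [[2, 0]]))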
Acknowledgements
This project is supported by the National Research Foundation, Singapore under its NRFF Award NRF-NRFF13-2021-0008, and Mike Zheng Shou’s Start-Up Grant from NUS. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wong, B. et al. (2022). AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5