Commands for Autonomous Vehicles by Progressively Stacking Visual-Linguistic Representations

Part of the Lecture Notes in Computer Science book series (LNIP, volume 12536)

Abstract

In this work, we focus on the object referral problem in the autonomous driving setting. We use a stacked visual-linguistic BERT model to learn a generic visual-linguistic representation. Each element of the input is either a word or a region of interest from the input image. To train the deep model efficiently, we use a stacking algorithm to transfer knowledge from a shallow BERT model to a deep BERT model.
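The stacking step described above can be illustrated with a minimal sketch. It assumes the transfer works in the style of progressive stacking, where a trained shallow encoder with L layers initializes a 2L-layer encoder by copying its layers on top of themselves; the abstract does not spell out the exact procedure, so this is an illustration rather than the authors' implementation. The function name `stack_layers` and the dictionary-based "layers" are hypothetical stand-ins for real Transformer layers.

```python
import copy


def stack_layers(shallow_layers):
    """Build a model twice as deep by duplicating trained layers.

    Layer i + L of the deep model starts as a copy of layer i of the
    shallow model, so the deep model begins from learned weights
    instead of a random initialization. (Assumed stacking scheme.)
    """
    lower = [copy.deepcopy(layer) for layer in shallow_layers]
    upper = [copy.deepcopy(layer) for layer in shallow_layers]
    return lower + upper


# Hypothetical stand-in for a trained 3-layer encoder:
# each "layer" is just a dict holding a small weight list.
shallow = [{"name": f"layer{i}", "w": [0.1 * i, 0.2 * i]} for i in range(3)]

# The deep model has 6 layers; layers 3..5 mirror layers 0..2.
deep = stack_layers(shallow)
```

Because the copies are deep copies, continued training of the stacked model would not alter the shallow model's weights, and the same step could be repeated to keep doubling the depth.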

  • Chapter length: 6 pages

Acknowledgement

This work was supported in part by the National Key Research and Development Program of China (2018YFE0183900).

Author information

Corresponding authors

Correspondence to Hang Dai or Yong Ding.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Dai, H., Luo, S., Ding, Y., Shao, L. (2020). Commands for Autonomous Vehicles by Progressively Stacking Visual-Linguistic Representations. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_2

  • DOI: https://doi.org/10.1007/978-3-030-66096-3_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66095-6

  • Online ISBN: 978-3-030-66096-3

  • eBook Packages: Computer Science, Computer Science (R0)