C4AV: Learning Cross-Modal Representations from Transformers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12536)

Abstract

In this paper, we focus on the object referral problem in the autonomous driving setting. We propose a novel framework for learning cross-modal representations with transformers. To extract the linguistic feature, we feed the input command to a transformer encoder; for the image feature, we use a ResNet backbone. The image features are flattened and used as the query inputs to a transformer decoder, in which the image and linguistic features are aggregated. Region-of-interest (RoI) alignment is then applied to the feature map output by the transformer decoder to crop RoI features for the region proposals. Finally, a multi-layer classifier performs object referral on the features of the proposal regions.
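The pipeline described above can be summarized in a short PyTorch sketch. This is a minimal illustration under assumed settings, not the authors' implementation: the class name, embedding size, number of layers, feature stride, and classifier widths are all placeholders; it uses the standard torch transformer encoder/decoder modules and torchvision's RoI alignment.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.ops import roi_align


class CrossModalReferral(nn.Module):
    """Sketch of the described pipeline; all sizes are illustrative."""

    def __init__(self, vocab_size=30000, d_model=256, num_layers=6, num_classes=1):
        super().__init__()
        # Linguistic branch: embed the command tokens, then encode them
        # with a transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)

        # Visual branch: ResNet backbone (final pooling/fc removed),
        # projected to the transformer width with a 1x1 convolution.
        resnet = resnet50()
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)

        # Cross-modal fusion: flattened image features act as decoder
        # queries; the encoded command is attended to as decoder memory.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8)
        self.fusion = nn.TransformerDecoder(dec_layer, num_layers)

        # Multi-layer classifier scoring each region proposal.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(d_model * 7 * 7, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, image, tokens, proposals):
        # tokens: (B, T) command token ids -> (T, B, D) encoded text.
        text = self.text_encoder(self.embed(tokens).permute(1, 0, 2))

        # image: (B, 3, H, W) -> feature map (B, D, h, w), h = H/32.
        fmap = self.proj(self.backbone(image))
        b, d, h, w = fmap.shape

        # Flatten spatial positions into decoder queries: (h*w, B, D).
        queries = fmap.flatten(2).permute(2, 0, 1)
        fused = self.fusion(queries, text)
        fused = fused.permute(1, 2, 0).reshape(b, d, h, w)

        # proposals: list of per-image (N_i, 4) boxes in pixel coordinates;
        # spatial_scale maps them onto the stride-32 feature map.
        rois = roi_align(fused, proposals, output_size=(7, 7),
                         spatial_scale=1.0 / 32)
        return self.classifier(rois)  # one score per proposal
```

Under this reading, inference would simply take the highest-scoring region proposal as the object referred to by the command.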



Acknowledgement

This work was supported in part by the National Key Research and Development Program of China (2018YFE0183900).

Author information

Correspondence to Hang Dai or Yong Ding.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Luo, S., Dai, H., Shao, L., Ding, Y. (2020). C4AV: Learning Cross-Modal Representations from Transformers. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_3

  • DOI: https://doi.org/10.1007/978-3-030-66096-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66095-6

  • Online ISBN: 978-3-030-66096-3

  • eBook Packages: Computer Science, Computer Science (R0)
