Abstract
In this paper, we address the object referral problem in the autonomous driving setting. We propose a novel framework that learns cross-modal representations with transformers. To extract linguistic features, we feed the input command to a transformer encoder. Meanwhile, we use a ResNet backbone to learn image features. The image features are flattened and used as the query inputs to a transformer decoder, where they are aggregated with the linguistic features. Region-of-interest (RoI) alignment is then applied to the feature map output by the transformer decoder to crop RoI features for the region proposals. Finally, a multi-layer classifier performs object referral from the features of the proposal regions.
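A minimal PyTorch-style sketch of the pipeline described above, assuming a ResNet-50 backbone, a small transformer, and externally supplied region proposals; module names, layer counts, and dimensions are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torchvision


class CrossModalReferral(nn.Module):
    """Sketch: transformer encoder for the command, flattened ResNet features
    as decoder queries, RoI-aligned proposal features fed to a classifier."""

    def __init__(self, vocab_size=10000, d_model=256, n_heads=8):
        super().__init__()
        # Linguistic branch: token embedding + transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
        # Visual branch: ResNet backbone, projected to d_model channels.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Decoder aggregates image-feature queries with the text memory.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads), num_layers=2)
        # Multi-layer classifier on RoI-aligned proposal features.
        self.classifier = nn.Sequential(
            nn.Linear(d_model * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, image, command_tokens, proposal_boxes):
        # image: (B, 3, H, W); command_tokens: (B, T);
        # proposal_boxes: list of (N_i, 4) boxes in image coordinates.
        text = self.text_encoder(
            self.embed(command_tokens).transpose(0, 1))        # (T, B, C)
        feat = self.proj(self.backbone(image))                  # (B, C, h, w)
        b, c, h, w = feat.shape
        queries = feat.flatten(2).permute(2, 0, 1)              # (h*w, B, C)
        fused = self.decoder(queries, text)                      # (h*w, B, C)
        fused_map = fused.permute(1, 2, 0).view(b, c, h, w)
        # RoI Align crops a fixed-size feature per proposal box.
        rois = torchvision.ops.roi_align(
            fused_map, proposal_boxes, output_size=(7, 7),
            spatial_scale=h / image.shape[2])
        return self.classifier(rois.flatten(1))                  # one score per proposal
```

At inference, the proposal with the highest score would be taken as the object referred to by the command.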
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China (2018YFE0183900).