Abstract
In this paper, we present a simple baseline for visual grounding for autonomous driving that outperforms state-of-the-art methods while keeping its design minimal. Our framework minimizes a cross-entropy loss over the cosine similarities between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing the promise of simpler alternatives, our investigation suggests a reconsideration of approaches that rely on sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
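To make the recipe concrete, the sketch below scores each candidate region by the cosine similarity between its pre-extracted ROI feature and a projected sentence embedding, then trains with softmax cross-entropy over the candidates. This is a minimal PyTorch illustration under assumed dimensions and names (`CosineSoftmaxGrounder`, `text_proj` are ours for exposition), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSoftmaxGrounder(nn.Module):
    """Sketch of the baseline: cosine similarity between each ROI
    feature and a projected text embedding, softmax over ROIs."""

    def __init__(self, text_dim=768, roi_dim=2048):
        super().__init__()
        # The only learned component: a transformation layer mapping
        # the frozen text embedding into the ROI feature space.
        self.text_proj = nn.Linear(text_dim, roi_dim)

    def forward(self, roi_feats, text_emb):
        # roi_feats: (B, K, roi_dim) features of K region proposals
        # text_emb:  (B, text_dim) pre-trained sentence embedding
        t = self.text_proj(text_emb).unsqueeze(1)         # (B, 1, roi_dim)
        sims = F.cosine_similarity(roi_feats, t, dim=-1)  # (B, K), in [-1, 1]
        return sims  # treated as logits over the K candidate regions

# Toy training step; the target is the index of the proposal that
# best overlaps the ground-truth box of the referred object.
model = CosineSoftmaxGrounder()
roi_feats = torch.randn(4, 32, 2048)  # 4 commands, 32 proposals each
text_emb = torch.randn(4, 768)
target = torch.randint(0, 32, (4,))
loss = F.cross_entropy(model(roi_feats, text_emb), target)
loss.backward()
```

Because both the ROI features and the text embedding come from frozen pre-trained networks, only the single projection layer is trained, which is what keeps the baseline so light.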
References
Akbari, H., Karaman, S., Bhargava, S., Chen, B., Vondrick, C., Chang, S.F.: Multi-level multimodal common semantic space for image-phrase grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12476–12486 (2019)
Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving (2019)
Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 824–832 (2017)
Datta, S., Sikka, K., Roy, A., Ahuja, K., Parikh, D., Divakaran, A.: Align2Ground: weakly supervised phrase grounding guided by image-caption alignment. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2601–2610 (2019)
Deng, C., Wu, Q., Wu, Q., Hu, F., Lyu, F., Tan, M.: Visual grounding via accumulated attention. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7746–7755 (2018)
Deruyttere, T., Collell, G., Moens, M.F.: Giving commands to a self-driving car: a multimodal reasoner for visual grounding (2020)
Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., Moens, M.F.: Talk2Car: taking control of your self-driving car. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019). https://doi.org/10.18653/v1/D19-1215
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018)
Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: keypoint triplets for object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Engilberge, M., Chevallier, L., Pérez, P., Cord, M.: Finding beans in burgers: deep semantic-visual embedding with localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3984–3993 (2018)
Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning (2018)
Javed, S.A., Saxena, S., Gandhi, V.: Learning unsupervised visual grounding through semantic self-supervision. arXiv preprint arXiv:1803.06506 (2018)
Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping (2014)
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
Musgrave, K., Belongie, S., Lim, S.N.: A metric learning reality check. arXiv preprint arXiv:2003.08505 (2020)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). http://arxiv.org/abs/1908.10084
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019)
Sriram, N.N., Maniar, T., Kalyanasundaram, J., Gandhi, V., Bhowmick, B., Madhava Krishna, K.: Talk to the vehicle: language conditioned autonomous navigation of self-driving cars. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5284–5290 (2019)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks (2019)
Vandenhende, S., Deruyttere, T., Grujicic, D.: A baseline for the commands for autonomous vehicles challenge (2020)
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)
Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013 (2016)
Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315 (2018)
Acknowledgement
This work was supported in part by a Qualcomm Innovation Fellowship (QIF 2020) from Qualcomm Technologies, Inc.