Abstract
In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. In particular, SiRi conveys a significant principle for visual grounding research: a better-initialized vision-language encoder helps the model converge to a better local minimum, improving performance accordingly. Concretely, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be better optimized on top of the enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. For example, our method achieves 83.04% Top-1 accuracy on RefCOCO+ testA, outperforming state-of-the-art approaches (trained from scratch) by more than 10.21%. Additionally, we show that SiRi performs surprisingly well even with limited training data. We also extend it to other transformer-based visual grounding models and to additional vision-language tasks to verify its generality. Code is available at https://github.com/qumengxue/siri-vg.git.
M. Qu—Work done during an internship at JD Explore Academy.
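To make the mechanism concrete, below is a minimal PyTorch-style sketch of the selective-retraining loop described in the abstract; it is not the authors' released implementation (see the repository linked above). The module names (model.encoder, model.decoder, model.head), the Xavier re-initialization scheme, and the three-round schedule are illustrative assumptions.

```python
import torch.nn as nn

def reinit_module(module: nn.Module) -> None:
    """Re-initialize a module in place. Xavier-uniform on linear layers
    is an assumed scheme; the paper's actual initialization may differ."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def siri_train(model: nn.Module, train_one_round, num_rounds: int = 3) -> nn.Module:
    """Selective Retraining sketch: the vision-language encoder keeps its
    weights across rounds, while the other modules periodically restart from
    a fresh initialization and are re-optimized on the enhanced encoder."""
    for round_idx in range(num_rounds):
        train_one_round(model)  # ordinary training for one round
        if round_idx < num_rounds - 1:
            # Encoder parameters survive; decoder and head are re-initialized.
            reinit_module(model.decoder)
            reinit_module(model.head)
    return model
```

In this sketch, train_one_round encapsulates one full optimization round (epochs, optimizer, and schedule); the essential design choice is simply which parameters persist and which are reset between rounds.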
Notes
- 1.
We train for more epochs, until convergence, in the small-scale experiments.
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2021ZD0112100), the National NSF of China (No. U1936212, No. 62120106009), and the Fundamental Research Funds for the Central Universities (No. K22RC00010). We thank Princeton Visual AI Lab members (Dora Zhao, Jihoon Chung, and others) for their helpful suggestions.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qu, M. et al. (2022). SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_32
DOI: https://doi.org/10.1007/978-3-031-19833-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5