Abstract
In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. In particular, SiRi conveys a significant principle for visual grounding research: a better-initialized vision-language encoder helps the model converge to a better local minimum, improving performance accordingly. Concretely, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be better optimized on top of the enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. For example, our method achieves 83.04% Top-1 accuracy on RefCOCO+ testA, outperforming state-of-the-art approaches (trained from scratch) by more than 10.21%. Additionally, we show that SiRi performs surprisingly well even with limited training data. We also extend it to other transformer-based visual grounding models and to additional vision-language tasks to verify its generality. Code is available at https://github.com/qumengxue/siri-vg.git.
M. Qu—Work done during an internship at JD Explore Academy.
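To make the mechanism concrete, below is a minimal PyTorch-style sketch of the selective-retraining loop described in the abstract; it is not the authors' released implementation (see the repository linked above). The module names (model.encoder, model.decoder, model.head), the Xavier re-initialization scheme, and the three-round schedule are illustrative assumptions.

```python
import torch.nn as nn

def reinit_module(module: nn.Module) -> None:
    """Re-initialize a module in place. Xavier-uniform on linear layers
    is an assumed scheme; the paper's actual initialization may differ."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def siri_train(model: nn.Module, train_one_round, num_rounds: int = 3) -> nn.Module:
    """Selective Retraining sketch: the vision-language encoder keeps its
    weights across rounds, while the other modules periodically restart from
    a fresh initialization and are re-optimized on the enhanced encoder."""
    for round_idx in range(num_rounds):
        train_one_round(model)  # ordinary training for one round
        if round_idx < num_rounds - 1:
            # Encoder parameters survive; decoder and head are re-initialized.
            reinit_module(model.decoder)
            reinit_module(model.head)
    return model
```

In this sketch, train_one_round encapsulates one full optimization round (epochs, optimizer, and schedule); the essential design choice is simply which parameters persist and which are reset between rounds.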
Notes
- 1.
We train for more epochs, until convergence, in the small-scale experiments.
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2021ZD0112100), the National NSF of China (No. U1936212, No. 62120106009), and the Fundamental Research Funds for the Central Universities (No. K22RC00010). We thank Princeton Visual AI Lab members (Dora Zhao, Jihoon Chung, and others) for their helpful suggestions.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qu, M. et al. (2022). SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_32
DOI: https://doi.org/10.1007/978-3-031-19833-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5