
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

Published in: International Journal of Computer Vision

Abstract

Pre-trained vision transformers learn strong representations that benefit a wide range of downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning on diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT) that leverages task-specific information: we forward task images through the model to select a subset of channels in the feature map, which enables us to tune only 1/8 of the channels and leads to significantly lower parameter costs. SCT outperforms full fine-tuning on 18 of the 19 tasks in the VTAB-1K benchmark while adding only 0.11M parameters to ViT-B, 780× fewer than its full fine-tuning counterpart. Furthermore, experiments on domain generalization and few-shot learning show that SCT surpasses other PEFT methods at a lower parameter cost, demonstrating the strong capability and effectiveness of our tuning technique in the low-data regime. The code will be available at https://github.com/zhaohengyuan1/SCT.git
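
To make the idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration rather than the authors' released implementation: it assumes the frozen backbone returns token features of shape (B, N, C), scores channel saliency by mean absolute activation over a few task batches (the paper's actual selection criterion may differ), keeps the top 1/8 of channels, and fine-tunes only a per-channel additive offset on those channels together with the task head. All function and parameter names here are illustrative assumptions.

import torch
import torch.nn as nn


@torch.no_grad()
def select_salient_channels(backbone, task_loader, ratio=1 / 8):
    """Score each channel by its mean absolute activation over a few task batches."""
    backbone.eval()
    scores = None
    for images, _ in task_loader:
        feats = backbone(images)                     # assumed shape: (B, N, C) token features
        batch_scores = feats.abs().mean(dim=(0, 1))  # one saliency score per channel, shape (C,)
        scores = batch_scores if scores is None else scores + batch_scores
    k = max(1, int(scores.numel() * ratio))          # keep the top 1/8 of channels by default
    return torch.topk(scores, k).indices


class SalientChannelTuner(nn.Module):
    """Frozen backbone plus a learnable additive offset on the selected channels."""

    def __init__(self, backbone, channel_idx, dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():         # the pre-trained weights stay frozen
            p.requires_grad = False
        self.register_buffer("channel_idx", channel_idx)
        self.delta = nn.Parameter(torch.zeros(channel_idx.numel()))  # tuned per-channel offsets
        self.head = nn.Linear(dim, num_classes)      # task-specific classification head

    def forward(self, x):
        feats = self.backbone(x)                     # (B, N, C)
        # Scatter the learned offsets into a full-width vector and add them only to the
        # salient channels; all other channels pass through the frozen backbone unchanged.
        delta_full = torch.zeros(feats.size(-1), device=feats.device).index_add(
            0, self.channel_idx, self.delta
        )
        feats = feats + delta_full
        return self.head(feats.mean(dim=1))          # average-pool tokens, then classify


# Example usage (hypothetical names):
#   idx = select_salient_channels(vit_backbone, task_loader, ratio=1 / 8)
#   model = SalientChannelTuner(vit_backbone, idx, dim=768, num_classes=102)
# Only model.delta and model.head receive gradients during fine-tuning.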


Data Availability

The datasets analyzed during the current study are available as follows:

  • VTAB-1K (Zhai et al., 2019): https://github.com/google-research/task_adaptation

  • ImageNet-1K (Deng et al., 2009): https://www.image-net.org/

  • ImageNet-V2 (Recht et al., 2019): https://github.com/modestyachts/ImageNetV2

  • ImageNet-Sketch (Wang et al., 2019): https://github.com/HaohanWang/ImageNet-Sketch

  • ImageNet-A (Hendrycks et al., 2021b): https://github.com/hendrycks/natural-adv-examples

  • ImageNet-R (Hendrycks et al., 2021a): https://github.com/hendrycks/imagenet-r

  • Food101 (Bossard et al., 2014): https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101

  • Stanford Cars (Krause et al., 2013): http://ai.stanford.edu/~jkrause/car196

  • Oxford-Flowers102 (Nilsback & Zisserman, 2006): https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz

  • FGVC-Aircraft (Maji et al., 2013): https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/archives/fgvc-aircraft-2013b.tar.gz

  • Oxford-Pets (Parkhi et al., 2012): https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz

References

  • Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al. (2021). Xcit: Cross-covariance image transformers. In NeurIPS.

  • Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., & Efros, A.A. (2022). Visual prompting via image inpainting. arXiv preprint arXiv:2209.00647.

  • Beattie, C., Leibo, J.Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., et al. (2016). Deepmind lab. arXiv preprint arXiv:1612.03801.

  • Bossard, L., Guillaumin, M., & Gool, L.V. (2014). Food-101–mining discriminative components with random forests. In European conference on computer vision (ECCV), Springer, pp 446–461.

  • Cai, H., Gan, C., Zhu, L., & Han, S. (2020). Tinytl: Reduce memory, not parameters for efficient on-device learning. In NeurIPS.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision, Springer, pp 213–229.

  • Chen, C.F.R., Fan, Q., & Panda, R. (2021a). Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp 357–366.

  • Chen, H., Tao, R., Zhang, H., Wang, Y., Ye, W., Wang, J., Hu, G., & Savvides, M. (2022a). Conv-adapter: Exploring parameter efficient transfer learning for convnets. arXiv preprint arXiv:2208.07463.

  • Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., & Luo, P. (2022b). Adaptformer: Adapting vision transformers for scalable visual recognition. arXiv preprint arXiv:2205.13535.

  • Chen, X., Xie, S., & He, K. (2021b). An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9640–9649.

  • Cheng, G., Han, J., & Lu, X. (2017). Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10), 1865–1883.

  • Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 248–255.

  • Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., & Guo, B. (2022). Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12124–12134.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., & Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, PMLR, pp 2286–2296.

  • Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 6824–6835.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Conference on computer vision and pattern recognition workshop, IEEE, pp 178–178.

  • Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11), 1231–1237.

  • Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. In NeurIPS.

  • Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28.

  • He, Y., Kang, G., Dong, X., Fu, Y., & Yang, Y. (2018). Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI International Joint Conference on Artificial Intelligence.

  • Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217–2226. https://doi.org/10.1109/JSTARS.2019.2918242

  • Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. (2021a). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 8340–8349.

  • Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021b). Natural adversarial examples. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 15262–15271.

  • Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for nlp. In: International conference on machine learning (ICML), PMLR, pp 2790–2799.

  • Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

  • Jia, M., Wu, Z., Reiter, A., Cardie, C., Belongie, S., & Lim, S.N. (2021). Exploring visual engagement signals for representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4206–4217.

  • Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S.N. (2022). Visual prompt tuning. In ECCV.

  • Jie, S., & Deng, Z.H. (2022). Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039.

  • Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., & Girshick, R. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 2901–2910.

  • Kaggle & EyePacs (2015). Kaggle diabetic retinopathy detection. https://www.kaggle.com/c/diabetic-retinopathy-detection/data.

  • Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp 554–561.

  • Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

  • LeCun, Y., Huang, F.J., & Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (CVPR), IEEE, vol 2, pp II–104.

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H.P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H.P. (2017). Pruning filters for efficient convnets. In: International conference on learning representations, https://openreview.net/forum?id=rJqFGTslg.

  • Li, Y., Xie, S., Chen, X., Dollar, P., He, K., & Girshick, R. (2021). Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429.

  • Lian, D., Zhou, D., Feng, J., & Wang, X. (2022). Scaling & shifting your features: A new baseline for efficient model tuning. In Advances in neural information processing systems (NeurIPS).

  • Liao, N., Shi, B., Cao, M., Zhang, X., Tian, Q., & Yan, J. (2023). Rethinking visual prompt learning as masked visual token modeling. arXiv preprint arXiv:2303.04998.

  • Liu, L., Yu, B.X., Chang, J., Tian, Q., & Chen, C.W. (2022). Prompt-matched semantic segmentation. arXiv preprint arXiv:2208.10159.

  • Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell, T. (2018). Rethinking the value of network pruning. In: International conference on learning representations.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022.

  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  • Luo, J.H., Wu, J., & Lin, W. (2017). Thinet: A filter level pruning method for deep neural network compression. In: Proceedings of the IEEE international conference on computer vision, pp 5058–5066.

  • Luo, X., Xu, J., & Xu, Z. (2022). Channel importance matters in few-shot image classification. In: International conference on machine learning, PMLR, pp 14542–14559.

  • Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., & Van Der Maaten, L. (2018). Exploring the limits of weakly supervised pretraining. In: Proceedings of the European conference on computer vision (ECCV), pp 181–196.

  • Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

  • Manli, S., Weili, N., De-An, H., Zhiding, Y., Tom, G., Anima, A., & Chaowei, X. (2022). Test-time prompt tuning for zero-shot generalization in vision-language models. In NeurIPS.

  • Matthey, L., Higgins, I., Hassabis, D., & Lerchner, A. (2017). dsprites: Disentanglement testing sprites dataset.

  • Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A.Y. (2011). Reading digits in natural images with unsupervised feature learning.

  • Nie, X., Ni, B., Chang, J., Meng, G., Huo, C., Zhang, Z., Xiang, S., Tian, Q., & Pan, C. (2022). Pro-tuning: Unified prompt tuning for vision tasks. arXiv preprint arXiv:2207.14381.

  • Nilsback, M.E., & Zisserman, A. (2006). A visual vocabulary for flower classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), IEEE, vol 2, pp 1447–1454.

  • Pan, J., Lin, Z., Zhu, X., Shao, J., & Li, H. (2022). St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35, 26462–26477.

  • Parkhi, O.M., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), IEEE, pp 3498–3505.

  • Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., & Hsieh, C.J. (2021). Dynamicvit: Efficient vision transformers with dynamic token sparsification. In NeurIPS.

  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In: International conference on machine learning (ICML), PMLR, pp 5389–5400.

  • Strudel, R., Garcia, R., Laptev, I., & Schmid, C. (2021). Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7262–7272.

  • Sung, Y.L., Cho, J., & Bansal, M. (2022). Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 5227–5237.

  • Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Going deeper with image transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 32–42.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.

  • Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., & Welling, M. (2018). Rotation equivariant cnns for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, Springer, pp 210–218.

  • Wang, H., Ge, S., Lipton, Z., & Xing, E.P. (2019). Learning robust global representations by penalizing local predictive power. In NeurIPS.

  • Wang, P., Wang, X., Wang, F., Lin, M., Chang, S., Xie, W., Li, H., & Jin, R. (2021). Kvt: k-nn attention for boosting vision transformers. arXiv preprint arXiv:2106.00515.

  • Wang, S., Chang, J., Wang, Z., Li, H., Ouyang, W., & Tian, Q. (2022). Fine-grained retrieval prompt tuning. arXiv preprint arXiv:2207.14465.

  • Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, pp 3485–3492.

  • Xing, Y., Wu, Q., Cheng, D., Zhang, S., Liang, G., & Zhang, Y. (2022). Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340.

  • Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., & Huang, F. (2021). Raise a child in large language model: Towards effective and generalizable fine-tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics.

  • Yang, J., Zhou, K., Li, Y., & Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.

  • Yang, J., Wang, P., Zou, D., Zhou, Z., Ding, K., Peng, W., Wang, H., Chen, G., Li, B., Sun, Y., et al. (2022a). Openood: Benchmarking generalized out-of-distribution detection. arXiv preprint arXiv:2210.07242.

  • Yang, J., Zhou, K., & Liu, Z. (2022b). Full-spectrum out-of-distribution detection. arXiv preprint arXiv:2204.05306.

  • Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., & Yan, S. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International conference on computer vision, pp 558–567.

  • Yuan, L., Hou, Q., Jiang, Z., Feng, J., & Yan, S. (2022). Volo: Vision outlooker for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 6575–6586.

  • Zang, Y., Li, W., Zhou, K., Huang, C., & Loy, C.C. (2022). Unified vision and language prompt learning. arXiv preprint arXiv:2210.07225.

  • Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A.S., Neumann, M., Dosovitskiy, A., et al. (2019). A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867.

  • Zhang, B., Jin, X., Gong, W., Xu, K., Zhang, Z., Wang, P., Shen, X., & Feng, J. (2023a). Multimodal video adapter for parameter efficient video text retrieval. arXiv preprint arXiv:2301.07868.

  • Zhang, Y., Zhou, K., & Liu, Z. (2022). Neural prompt search. arXiv preprint arXiv:2206.04673.

  • Zhang, Y., Zhou, K., & Liu, Z. (2023b). What makes good examples for visual in-context learning?

  • Zhao, Y., Zhong, Z., Zhao, N., Sebe, N., & Lee, G.H. (2022). Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In ECCV.

  • Zheng, Z., Yue, X., Wang, K., & You, Y. (2022). Prompt vision transformer for domain generalization. arXiv preprint arXiv:2208.08914.

  • Zhou, J., Wang, P., Wang, F., Liu, Q., Li, H., & Jin, R. (2021a). Elsa: Enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786.

  • Zhou, K., Yang, Y., Qiao, Y., & Xiang, T. (2021b). Domain generalization with mixstyle. In ICLR.

  • Zhou, K., Yang, J., Loy, C.C., & Liu, Z. (2022a). Conditional prompt learning for vision-language models. In: IEEE/CVF Conference on computer vision and pattern recognition (CVPR).

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022b). Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9), 2337–2348.

  • Zhou, K., Zhang, Y., Zang, Y., Yang, J., Loy, C.C., & Liu, Z. (2022c). On-device domain generalization. arXiv preprint arXiv:2209.07521.


Funding

This research is supported by National Research Foundation, Singapore and A*STAR, under its RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) grant call (Grant No. I2001E0059) - SIA-NUS Digital Aviation Corp Lab. Mike Zheng Shou is supported by the National Research Foundation, Singapore, under its NRFF Award NRF-NRFF13-2021-0008, and Mike Zheng Shou’s Start-Up Grant from NUS. Henry Hengyuan Zhao is partially supported by Alibaba Group through Alibaba Research Intern Program.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mike Zheng Shou.

Ethics declarations

Conflict of interest

We declare that there are no conflicts of interest.

Additional information

Communicated by Kaiyang Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, H.H., Wang, P., Zhao, Y. et al. SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels. Int J Comput Vis 132, 731–749 (2024). https://doi.org/10.1007/s11263-023-01918-3

