
ViTAS: Vision Transformer Architecture Search

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Vision transformers (ViTs) inherited the success of transformers in NLP, but their structures have not been sufficiently investigated and optimized for visual tasks. One of the simplest solutions is to directly search for the optimal structure via neural architecture search (NAS), as widely used for CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and is frustratingly unstable during the training of the superformer. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, an imbalance of channels across different architectures worsens the weight-sharing assumption and thereby causes training instability. Therefore, we develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. Besides, we propose identity shifting to alleviate the many-to-one issue in the superformer and leverage weak augmentation and regularization techniques for empirically steadier training. Based on these, our proposed method, ViTAS, achieves significant superiority on both DeiT- and Twins-based ViTs. For example, with only a 1.4G FLOPs budget, our searched architecture achieves \(3.3\%\) higher accuracy than the baseline DeiT on the ImageNet-1k dataset. With 3.0G FLOPs, our results achieve \(82.0\%\) accuracy on ImageNet-1k and \(45.9\%\) mAP on COCO2017, which is \(2.4\%\) superior to other ViTs.
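
To make the cyclic weight-sharing idea concrete, the following is a minimal PyTorch sketch of slicing shared token-embedding weights with a cyclic (modular) indexing rule, so that every supernet channel is used roughly equally often across sampled widths. The function name, offset schedule, and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def cyclic_channel_indices(width: int, max_width: int, offset: int) -> torch.Tensor:
    """Pick `width` of the supernet's `max_width` embedding channels by
    walking the channel axis cyclically from `offset`; rotating the offset
    across sampled architectures spreads usage evenly over all channels."""
    return (offset + torch.arange(width)) % max_width

# Shared token-embedding weights of the superformer (hypothetical sizes).
max_width = 64
patch_dim = 3 * 16 * 16                      # flattened 16x16 RGB patch
embed = nn.Linear(patch_dim, max_width)      # weights shared by every candidate

offset = 0
for width in (32, 48, 56):                   # hypothetical candidate widths
    idx = cyclic_channel_indices(width, max_width, offset)
    w, b = embed.weight[idx], embed.bias[idx]     # this candidate's channel slice
    x = torch.randn(1, patch_dim)                 # one flattened patch
    tokens = nn.functional.linear(x, w, b)        # embed with the sliced weights
    offset = (offset + width) % max_width         # rotate so channels take turns
```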


Notes

  1. In the superformer, \(\text {Max}_\text {a}\) indicates the output of the first fully connected (FC) layer, which must be divisible by every combination of “Ratio” and “Heads”, i.e., \((Ratio \times Heads) \mid \text {Max}_\text {a}, \forall Ratio, Heads\). Therefore, we select the least common multiple of all \(Ratio \times Heads\) products for \(\text {Max}_\text {a}\); a worked example follows these notes.

  2. In the superformer, “Max Dim” indicates the output dimension of both the attention and MLP blocks.

  3. We constructed the ViTAS-Twins-T transformer space from Twins-S similarly to Table 1, and Twins-T was uniformly scaled from Twins-S.
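
As a worked example of the divisibility rule in Note 1, this snippet computes \(\text {Max}_\text {a}\) as the least common multiple of every \(Ratio \times Heads\) product; the ratio and head grids below are hypothetical placeholders rather than the actual search space of Table 1.

```python
from math import lcm  # variadic lcm requires Python 3.9+

ratios = [2, 3, 4]    # hypothetical MLP expansion ratios
heads = [3, 6, 12]    # hypothetical numbers of attention heads

# Max_a must satisfy (Ratio * Heads) | Max_a for every combination,
# so the smallest valid choice is the LCM of all the products.
max_a = lcm(*(r * h for r in ratios for h in heads))
print(max_a)  # -> 144 for these example grids
```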


Author information

Corresponding author

Correspondence to Shan You.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 600 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Su, X. et al. (2022). ViTAS: Vision Transformer Architecture Search. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13681. Springer, Cham. https://doi.org/10.1007/978-3-031-19803-8_9

  • DOI: https://doi.org/10.1007/978-3-031-19803-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19802-1

  • Online ISBN: 978-3-031-19803-8

  • eBook Packages: Computer Science, Computer Science (R0)
