
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

  • Published in: International Journal of Computer Vision

Abstract

Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and derives that the two share a consistent mathematical formulation. Inspired by effective EA variants, we then propose a novel pyramid EATFormer backbone composed solely of the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information, respectively. Moreover, we design a task-related head docked with the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification and downstream tasks, together with explanatory analyses, demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN armed with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
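
To make the block structure above concrete, the sketch below shows a minimal PyTorch-style EAT-like block built from the three residual parts named in the abstract. The module names (MultiScaleRegionAggregation, GlobalLocalInteraction, EATBlock), the dilated depth-wise convolutions, and the parallel attention-plus-convolution branch are illustrative assumptions that mirror the description rather than the authors' released implementation; normalization layers, the modulated deformable MSA, and the task-related head are omitted for brevity (see the linked repository for the reference code).

# Minimal sketch of the three-part residual structure described in the abstract.
# Module names and internals are illustrative assumptions, not the released code
# (see https://github.com/zhangzjn/EATFormer for the authors' implementation).
import torch
import torch.nn as nn


class MultiScaleRegionAggregation(nn.Module):
    """Aggregates local context at several dilation rates (multi-scale information)."""

    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim) for d in dilations]
        )
        self.proj = nn.Conv2d(dim * len(dilations), dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        return self.proj(torch.cat([branch(x) for branch in self.branches], dim=1))


class GlobalLocalInteraction(nn.Module):
    """Global self-attention in parallel with a local depth-wise convolution branch."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, HW, C) for attention
        glb, _ = self.attn(tokens, tokens, tokens)   # global (interactive) branch
        glb = glb.transpose(1, 2).reshape(b, c, h, w)
        return glb + self.local(x)                   # fuse global and local paths


class EATBlock(nn.Module):
    """Three residual parts modeling multi-scale, interactive, and individual information."""

    def __init__(self, dim):
        super().__init__()
        self.msra = MultiScaleRegionAggregation(dim)
        self.gli = GlobalLocalInteraction(dim)
        self.ffn = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + self.msra(x)  # multi-scale region aggregation
        x = x + self.gli(x)   # global and local interaction
        x = x + self.ffn(x)   # per-location feed-forward refinement
        return x


if __name__ == "__main__":
    block = EATBlock(dim=64)
    out = block(torch.randn(1, 64, 14, 14))
    print(out.shape)  # torch.Size([1, 64, 14, 14]) -- each residual part preserves shape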


Data Availability

All the datasets used in this paper are available online. ImageNet-1K (http://image-net.org), COCO 2017 (https://cocodataset.org), and ADE20K (http://sceneparsing.csail.mit.edu) can be downloaded from their respective official websites.

References

  • Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., & Jegou, H. (2021). Xcit: Cross-covariance image transformers. In NeurIPS.

  • Atito, S., Awais, M., & Kittler, J. (2021). Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602

  • Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML.

  • Bao, H., Dong, L., Piao, S., & Wei, F. (2022). BEit: BERT pre-training of image transformers. In ICLR.

  • Bartz-Beielstein, T., Branke, J., Mehnen, J., & Mersmann, O. (2014). Evolutionary algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

  • Bello, I. (2021). Lambdanetworks: Modeling long-range interactions without attention. In ICLR.

  • Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML.

  • Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., & Veit, A. (2021). Understanding robustness of transformers for image classification. In ICCV.

  • Bhowmik, P., Pantho, M. J. H., & Bobda, C. (2021). Bio-inspired smart vision sensor: Toward a reconfigurable hardware modeling of the hierarchical processing in the brain. Journal of Real-Time Image Processing, 18, 157–174.

  • Brest, J., Greiner, S., Boskovic, B., Mernik, M., & Zumer, V. (2006). Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. In TEC.

  • Brest, J., Zamuda, A., Boskovic, B., Maucec, M. S., & Zumer, V. (2008). High-dimensional real-parameter optimization using self-adaptive differential evolution algorithm with population size reduction. In CEC.

  • Brest, J., Zamuda, A., Fister, I., & Maučec, M. S. (2010). Large scale global optimization using self-adaptive differential evolution algorithm. In CEC.

  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. In NeurIPS.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In ICCV.

  • Chen, B., Li, P., Li, C., Li, B., Bai, L., Lin, C., Sun, M., Yan, J., & Ouyang, W. (2021). Glit: Neural architecture search for global and local image transformer. In ICCV.

  • Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., & Gao, W. (2021). Pre-trained image processing transformer. In CVPR.

  • Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., & Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306

  • Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., & Lin, D. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155

  • Chen, M., Peng, H., Fu, J., & Ling, H. (2021). Autoformer: Searching transformers for visual recognition. In ICCV.

  • Chen, M., Wu, K., Ni, B., Peng, H., Liu, B., Fu, J., Chao, H., & Ling, H. (2021). Searching the search space of vision transformer. In NeurIPS.

  • Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., & Wang, J. (2022). Mixformer: Mixing features across windows and dimensions. In CVPR.

  • Chen, T., Saxena, S., Li, L., Fleet, D. J., & Hinton, G. (2022). Pix2seq: A language modeling framework for object detection. In ICLR.

  • Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., & Wang, J. (2023). Context autoencoder for self-supervised representation learning. In IJCV.

  • Chen, X., Xie, S., & He, K. (2021). An empirical study of training self-supervised visual transformers. In ICCV.

  • Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., & Liu, Z. (2022). Mobile-former: Bridging mobilenet and transformer. In CVPR.

  • Chen, Z., & Kang, L. (2005). Multi-population evolutionary algorithm for solving constrained optimization problems. In AIAI.

  • Chen, Z., Zhu, Y., Zhao, C., Hu, G., Zeng, W., Wang, J., & Tang, M. (2021). Dpt: Deformable patch-based transformer for visual recognition. In ACM MM.

  • Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In CVPR.

  • Cheng, B., Schwing, A., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. In NeurIPS.

  • Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., & Weller, A. (2021). Rethinking attention with performers. In ICLR.

  • Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. In NeurIPS.

  • Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., & Shen, C. (2023). Conditional positional encodings for vision transformers. In ICLR.

  • Coello, C. A. C., & Lamont, G. B. (2004). Applications of multi-objective evolutionary algorithms (Vol. 1). World Scientific.

  • Cordonnier, J. B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. In ICLR.

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In ICCV.

  • Das, S., & Suganthan, P. N. (2010). Differential evolution: A survey of the state-of-the-art. TEC.

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

  • Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., & Guo, B. (2022). Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR.

  • Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N., & Guo, B. (2023). Peco: Perceptual codebook for Bert pre-training of vision transformers. In AAAI.

  • Dong, Y., Cordonnier, J. B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In ICML.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16 × 16 words: Transformers for image recognition at scale. In ICLR.

  • d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., & Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In ICML.

  • Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., & Liu, W. (2021). You only look at one sequence: Rethinking transformer in vision through object detection. In NeurIPS.

  • Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex.

  • Gao, P., Ma, T., Li, H., Lin, Z., Dai, J., & Qiao, Y. (2022). Mcmae: Masked convolution meets masked autoencoders. In NeurIPS.

  • García-Martínez, C., & Lozano, M. (2008). Local search based on genetic algorithms. In Advances in metaheuristics for hard optimization. Springer.

  • Goyal, A., & Bengio, Y. (2022). Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478, 20210068.

  • Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., & Xu, C. (2022). Cmt: Convolutional neural networks meet vision transformers. In CVPR.

  • Guo, M. H., Lu, C. Z., Liu, Z.N., Cheng, M. M., & Hu, S. M. (2023). Visual attention network. In CVM.

  • Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. In NeurIPS.

  • Hao, Y., Dong, L., Wei, F., & Xu, K. (2021). Self-attention attribution: Interpreting information interactions inside transformer. In AAAI.

  • Hart, W. E., Krasnogor, N., & Smith, J. E. (2005). Memetic evolutionary algorithms. In Recent advances in memetic algorithms (pp. 3–27). Springer.

  • Hassanat, A., Almohammadi, K., Alkafaween, E., Abunawas, E., Hammouri, A., & Prasath, V. (2019). Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information, 10, 390.

  • Hassani, A., Walton, S., Li, J., Li, S., & Shi, H. (2023). Neighborhood attention transformer. In CVPR.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • He, R., Ravula, A., Kanagal, B., & Ainslie, J. (2020). Realformer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747.

  • Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., & Le, Q. V. (2019). Searching for mobilenetv3. In ICCV.

  • Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., & Fu, B. (2021). Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650.

  • Hudson, D. A., & Zitnick, L. (2021). Generative adversarial transformers. In ICML.

  • Jiang, Y., Chang, S., & Wang, Z. (2021). Transgan: Two pure transformers can make one strong gan, and that can scale up. In NeurIPS.

  • Jiang, Z.H., Hou, Q., Yuan, L., Zhou, D., Shi, Y., Jin, X., Wang, A., & Feng, J. (2021). All tokens matter: Token labeling for training better vision transformers. In NeurIPS.

  • Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML.

  • Khare, V., Yao, X., & Deb, K. (2003). Performance scaling of multi-objective evolutionary algorithms. In EMO.

  • Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., & Hong, S. (2022). Pure transformers are powerful graph learners. In NeurIPS.

  • Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The efficient transformer. In ICLR.

  • Kolen, A., & Pesch, E. (1994). Genetic local search in combinatorial optimization. Discrete Applied Mathematics.

  • Kumar, S., Sharma, V. K., & Kumari, R. (2014). Memetic search in differential evolution algorithm. arXiv preprint arXiv:1408.0101.

  • Land, M. W. S. (1998). Evolutionary algorithms with local search for combinatorial optimization. University of California.

  • Lee, Y., Kim, J., Willette, J., & Hwang, S. J. (2022). Mpvit: Multi-path vision transformer for dense prediction. In CVPR.

  • Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., & Chang, X. (2021). Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In ICCV.

  • Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., & Qiao, Y. (2022). Uniformer: Unified transformer for efficient spatial-temporal representation learning. In ICLR.

  • Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., & Qiao, Y. (2023). Uniformer: Unifying convolution and self-attention for visual recognition. TPAMI.

  • Li, X., Wang, L., Jiang, Q., & Li, N. (2021). Differential evolution algorithm with multi-population cooperation and multi-strategy integration. Neurocomputing, 421, 285–302.

  • Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., & Ren, J. (2023). Rethinking vision transformers for mobilenet size and speed. In ICCV.

  • Li, Y., Zhang, K., Cao, J., Timofte, R., & Van Gool, L. (2021). Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707

  • Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., & Timofte, R. (2021). Swinir: Image restoration using swin transformer. In ICCV.

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In ECCV.

  • Liu, J., & Lampinen, J. (2005). A fuzzy adaptive differential evolution algorithm. Soft Computing, 9, 448–462.

  • Liu, Y., Li, H., Guo, Y., Kong, C., Li, J., & Wang, S. (2022). Rethinking attention-model explainability through faithfulness violation test. In ICML.

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., & Wei, F. (2022). Swin transformer v2: Scaling up capacity and resolution. In CVPR.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.

  • Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In ICLR.

  • Lu, J., Mottaghi, R., & Kembhavi, A. (2021). Container: Context aggregation networks. In NeurIPS.

  • Maaz, M., Shaker, A., Cholakkal, H., Khan, S., Zamir, S. W., Anwer, R. M., & Shahbaz Khan, F. (2023). Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In ECCVW.

  • Mehta, S., & Rastegari, M. (2022). Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In ICLR.

  • Min, J., Zhao, Y., Luo, C., & Cho, M. (2022). Peripheral vision transformer. In NeurIPS.

  • Moscato, P., et al. (1989). On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech Concurrent Computation Program, C3P Report, 826, 1989.

  • Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas v1, v2, and v4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919.

  • Nakashima, K., Kataoka, H., Matsumoto, A., Iwata, K., Inoue, N., & Satoh, Y. (2022). Can vision transformers learn without natural images? In AAAI.

  • Neimark, D., Bar, O., Zohar, M., & Asselmann, D. (2021). Video transformer network. In ICCV.

  • Opara, K. R., & Arabas, J. (2019). Differential evolution: A survey of theoretical analyses. Swarm and Evolutionary Computation, 44, 546–558.

  • Padhye, N., Mittal, P., & Deb, K. (2013). Differential evolution: Performances and analyses. In CEC.

  • Pan, X., Ge, C., Lu, R., Song, S., Chen, G., Huang, Z., & Huang, G. (2022). On the integration of self-attention and convolution. In CVPR.

  • Pant, M., Zaheer, H., Garcia-Hernandez, L., Abraham, A., et al. (2020). Differential evolution: A review of more than two decades of research. Engineering Applications of Artificial Intelligence, 90, 103479.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., & Desmaison, A. (2019). Pytorch: An imperative style, high-performance deep learning library. In NeurIPS.

  • Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In ICLR.

  • Qiang, Y., Pan, D., Li, C., Li, X., Jang, R., & Zhu, D. (2022). Attcat: Explaining transformers via attentive class activation tokens. In NeurIPS.

  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog.

  • Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. (2021). Do vision transformers see like convolutional neural networks? In NeurIPS.

  • Ren, S., Zhou, D., He, S., Feng, J., & Wang, X. (2022). Shunted self-attention via multi-scale token aggregation. In CVPR.

  • Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV.

  • Shi, E. C., Leung, F. H., & Law, B. N. (2014). Differential evolution with adaptive population size. In ICDSP.

  • Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., & Yan, S. (2022). Inception transformer. In NeurIPS.

  • Sloss, A. N., & Gustafson, S. (2020). 2019 evolutionary algorithms review. In Genetic programming theory and practice XVII.

  • Srinivas, A., Lin, T. Y., Parmar, N., Shlens, J., Abbeel, P., & Vaswani, A. (2021). Bottleneck transformers for visual recognition. In CVPR.

  • Storn, R., & Price, K. (1997). Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11, 341–359.

  • Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML.

  • Thatipelli, A., Narayan, S., Khan, S., Anwer, R. M., Khan, F. S., & Ghanem, B. (2022). Spatio-temporal relation modeling for few-shot action recognition. In CVPR.

  • Toffolo, A., & Benini, E. (2003). Genetic diversity as an objective in multi-objective evolutionary algorithms. Evolutionary Computation, 11, 151–167.

  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML.

  • Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Going deeper with image transformers. In ICCV.

  • Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., & Li, Y. (2022). Maxvit: Multi-axis vision transformer. In ECCV.

  • Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., & Patel, V. M. (2021). Medical transformer: Gated axial-attention for medical image segmentation. In MICCAI.

  • Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., & Shlens, J. (2021). Scaling local self-attention for parameter efficient visual backbones. In CVPR.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.

  • Vikhar, P. A. (2016). Evolutionary algorithms: A critical review and its future prospects. In ICGTSPICC.

  • Wan, Z., Chen, H., An, J., Jiang, W., Yao, C., & Luo, J. (2022). Facial attribute transformers for precise and robust makeup transfer. In WACV.

  • Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C., & Han, S. (2020). Hat: Hardware-aware transformers for efficient natural language processing. In ACL.

  • Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Jiang, Y. G., Zhou, L., & Yuan, L. (2022). Bevt: Bert pretraining of video transformers. In CVPR.

  • Wang, S., Li, B., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768

  • Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV.

  • Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. CVM.

  • Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., & Liu, W. (2022). Crossformer: A versatile vision transformer hinging on cross-scale attention. In ICLR.

  • Wang, Y., Yang, Y., Bai, J., Zhang, M., Bai, J., Yu, J., Zhang, C., Huang, G., & Tong, Y. (2021). Evolving attention with residual convolutions. In ICML.

  • Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In CVPR.

  • Wightman, R. (2019). Pytorch image models. https://github.com/rwightman/pytorch-image-models

  • Wightman, R., Touvron, H., & Jegou, H. (2021). Resnet strikes back: An improved training procedure in timm. In NeurIPSW.

  • Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In ICCV.

  • Xia, Z., Pan, X., Song, S., Li, L.E., & Huang, G. (2022). Vision transformer with deformable attention. In CVPR.

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In ECCV.

  • Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS.

  • Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. In CVPR.

  • Xu, L., Yan, X., Ding, W., & Liu, Z. (2023). Attribution rollout: a new way to interpret visual transformer. JAIHC.

  • Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., & Soatto, S. (2021). Long short-term transformer for online action detection. In NeurIPS.

  • Xu, W., Xu, Y., Chang, T., & Tu, Z. (2021). Co-scale conv-attentional image transformers. In ICCV.

  • Xu, Y., Zhang, Q., Zhang, J., & Tao, D. (2021). Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In NeurIPS.

  • Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z., & Yuille, A. (2022). Lite vision transformer with enhanced self-attention. In CVPR.

  • Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In CVPR.

  • Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., & Wu, W. (2021). Incorporating convolution designs into visual transformers. In ICCV.

  • Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z. H., Tay, F. E., Feng, J., & Yan, S. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV.

  • Yuan, L., Hou, Q., Jiang, Z., Feng, J., & Yan, S. (2022). Volo: Vision outlooker for visual recognition. In TPAMI.

  • Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., & Wang, J. (2021). Hrformer: High-resolution vision transformer for dense predict. In NeurIPS.

  • Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., & Yang, M. H. (2022). Restormer: Efficient transformer for high-resolution image restoration. In CVPR.

  • Zhang, J., Li, X., Li, J., Liu, L., Xue, Z., Zhang, B., Jiang, Z., Huang, T., Wang, Y., & Wang, C. (2023). Rethinking mobile block for efficient attention-based models. In ICCV.

  • Zhang, J., Xu, C., Li, J., Chen, W., Wang, Y., Tai, Y., Chen, S., Wang, C., Huang, F., & Liu, Y. (2021). Analogous to evolutionary algorithm: Designing a unified sequence model. In NeurIPS.

  • Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2023). Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. IJCV.

  • Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., & Zhang, L. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR.

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. In IJCV.

  • Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Hou, Q., & Feng, J. (2021). Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886

  • Zhu, X., Hu, H., Lin, S., & Dai, J. (2019). Deformable convnets v2: More deformable, better results. In CVPR.

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable {detr}: Deformable transformers for end-to-end object detection. In ICLR.

Acknowledgements

This work was supported by a grant from the National Natural Science Foundation of China (No. 62103363).

Author information

Corresponding author

Correspondence to Yong Liu.

Additional information

Communicated by Suha Kwak.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, J., Li, X., Wang, Y. et al. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02034-6
