Abstract
The idea of incorporating a mechanism inspired by the human visual system into neural networks was introduced to the machine learning literature long ago. This idea, named the attention mechanism, has gone through a long period of development; today, many works across a variety of tasks are devoted to it, and remarkable performance has recently been demonstrated. The goal of this paper is to provide an overview of the attention mechanism, from the early work on implementing the idea with neural networks up to recent trends. The review emphasizes the important milestones of this progress across different tasks. In this way, the study aims to provide a road map for researchers to explore current developments and to draw inspiration for novel approaches beyond attention.
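As a concrete illustration of the mechanism this survey traces (a sketch for orientation, not code from the paper), here is a minimal NumPy implementation of scaled dot-product attention, the formulation popularized by Vaswani et al. (2017); all variable names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax: weights sum to 1
    return weights @ V                                # weighted sum of value vectors

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4)
```

Each output row is a convex combination of the value rows, with mixing weights determined by query-key similarity; this is the core computation that later variants (self-attention, multi-head attention, sparse and linear attention) build on.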
Conflict of interest
The authors declare that there is no conflict of interest.
Cite this article
Soydaner, D. Attention mechanism in neural networks: where it comes and where it goes. Neural Comput & Applic 34, 13371–13385 (2022). https://doi.org/10.1007/s00521-022-07366-3