
Attention mechanism in neural networks: where it comes and where it goes

  • Review
  • Published in: Neural Computing and Applications

Abstract

The idea of incorporating a mechanism inspired by the human visual system into neural networks was introduced to the machine learning literature long ago. This idea, named the attention mechanism, has gone through a long development period, and today a large body of work applies it to a wide variety of tasks, with remarkable performance demonstrated in recent years. The goal of this paper is to provide an overview that spans the early work on implementing the attention idea in neural networks up to the recent trends. The review emphasizes the important milestones of this progress across different tasks. In this way, the study aims to provide a road map for researchers to explore current developments and to find inspiration for novel approaches beyond attention.
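As a concrete point of reference for the mechanism the review surveys, the sketch below shows scaled dot-product attention, the formulation popularized by Vaswani et al. (2017). It is a minimal illustrative example only: the NumPy implementation, function name, and toy shapes are assumptions made here for clarity, not code from the paper.

```python
# Minimal sketch of scaled dot-product attention (illustrative assumptions, not the paper's code).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax -> attention weights
    return weights @ V                                     # weighted sum of the values

# Toy usage: 3 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)          # (3, 8)
```

In the Transformer architecture, several such heads run in parallel on learned linear projections of the same inputs (multi-head attention), and their outputs are concatenated and projected back.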



Author information

Corresponding author

Correspondence to Derya Soydaner.

Ethics declarations

Conflict of interest

The author declares that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Soydaner, D. Attention mechanism in neural networks: where it comes and where it goes. Neural Comput & Applic 34, 13371–13385 (2022). https://doi.org/10.1007/s00521-022-07366-3

