Introduction

In recent years, the attention mechanism [1,2,3], as the core of the Transformer model [4], has replaced the convolution mechanism as the preferred choice for textual tasks [5,6,7,8] and is being actively extended to audio and visual tasks [9,10,11,12,13,14]. By adding flexibility to deep networks, i.e., maintaining trainability under a larger number of parameters, the attention mechanism further improves the performance and applicability of deep networks. However, researchers have gradually found that the attention mechanism relies strongly on network volume [15,16,17]. Its performance is unstable and hardly surpasses convolutional networks of equal volume that use fewer parameters [18,19,20]. In addition, such networks often require more careful parameter tuning to keep the learning process stable [21]. These requirements make the time cost and equipment demands unaffordable [22, 23]; thus, most subsequent research focuses on simplifying the attention mechanism to improve efficiency [8, 24,25,26] rather than optimizing it for better performance.

Although the attention mechanism empowers deep learning in various applications, some traditional tasks remain challenging, such as adaptive dynamic programming [27, 28], few-shot learning [29, 30], and multiple instance learning (MIL) [33]. Widrich, Ramsauer, et al. generalize the attention mechanism to a shallow network, the modern Hopfield neural network (MHNN), and extensively explore its application fields [31, 32, 34]. They demonstrate that the attention mechanism is actually a single iteration of MHNN, and that its automatic filtering of instances makes it a powerful feature extraction module for MIL tasks [35, 36]. This filtering effect of the attention mechanism is all the more significant in the shallow MHNN, since the functionality [37, 38] and the substitutability [39, 40] of the attention mechanism are still highly controversial. However, MHNN emphasizes its equivalence to the attention network rather than delving into the optimization of attention weights through network recurrence. The effect of adjusting attention weights therefore remains unclear.

This paper explores the effect of adjusting attention weights in attention networks on MIL tasks. Compared to the Hopfield network family, the synergetic neural network (SNN) [41] has a similar network structure and an identical convergence target. Its foundation in Synergetics [42] brings fewer unstable attractors and a polynomial-based activation function, providing the network with a stable and reversible convergence process. We exploit these two properties to propose the Syn layer, which takes the attention weights as input and adjusts them by forward or backward iteration: forward iteration concentrates attention, and backward iteration distracts it. We use the gradient bypass technique [43, 44] to circumvent the exploding or vanishing gradients caused by the polynomials. Experimental results show that Syn-based attention networks achieve state-of-the-art performance on multiple MIL benchmark datasets.

Background

Attention mechanism and MHNN

When the attention mechanism is formally proposed, its typical behavior can be summarized as the interaction of the query matrix \(Q\), the key matrix \(K\), and the value matrix \(V\) [4]

$$Z=V\,\mathrm{softmax}\left(\frac{1}{\sqrt{{d}_{k}}}Q{K}^{\mathrm{T}}\right);$$
(1)

\({d}_{k}\) is the dimension of the key. However, the reason and motivation for choosing softmax are not offered there. MHNN fills this gap by constructing a recurrent neural network with the softmax function as its centerpiece. For the query pattern \({\varvec{x}}\) and the matrix of static memory patterns \(V=[{\varvec{v}}_{1},\dots ,{\varvec{v}}_{N}]\), the update formulas of MHNN [32] are

$${\varvec{\xi}}=\beta {V}^{\mathrm{T}}{\varvec{x}}$$
(2)
$${{\varvec{\xi}}}^{\mathrm{new}}=\mathrm{softmax}\left({\varvec{\xi}}\right)$$
(3)
$${{\varvec{x}}}^{\mathrm{new}}=V{{\varvec{\xi}}}^{\mathrm{new}};$$
(4)

\(\beta \) is the scaling factor, and \({{\varvec{x}}}^{\mathrm{new}}\) is the new input of the network, forming the recurrence. The attention mechanism is actually a single iteration of the fine-tuned MHNN [32]: let \(\beta =1/\sqrt{{d}_{k}}\) and \(K=V\); then, with input \(Q\), MHNN reproduces the behavior of the attention mechanism

$$Z=V\,\mathrm{softmax}\left(\Xi \right)=V\,\mathrm{softmax}\left(\frac{1}{\sqrt{{d}_{k}}}Q{K}^{\mathrm{T}}\right).$$
(5)

Thus, MHNN can examine the usefulness of the softmax function from a dynamical-systems perspective. It shows that the softmax function sparsifies the weights and, with high probability, makes them converge after a single iteration to a result close to one-hot encoding, which acts as the concentration of attention.
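To make this correspondence concrete, the short NumPy sketch below (our own illustration, not the authors' code; the dimensions, seed, and number of iterations are arbitrary choices) runs the MHNN recurrence of Eqs. (2)-(4) on a noisy query and prints the resulting weights, which concentrate on the best-matching stored pattern as described above.

```python
# Illustrative sketch of the MHNN recurrence (Eqs. 2-4); not the authors' code.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, N = 16, 5
V = rng.normal(size=(d_k, N))              # stored patterns v_1..v_N as columns
x = V[:, 2] + 0.1 * rng.normal(size=d_k)   # query: a noisy copy of v_3
beta = 1.0 / np.sqrt(d_k)                  # the scaling used in Eq. (5)

for _ in range(3):
    xi = beta * V.T @ x                    # Eq. (2)
    xi_new = softmax(xi)                   # Eq. (3): the attention weights
    x = V @ xi_new                         # Eq. (4): the attention output Z

print(np.round(xi_new, 3))                 # should concentrate on index 2 (approximately one-hot)
```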

MHNN versus SNN

SNN and MHNN have similar working processes and identical working goals, so SNN can also be used for the attention mechanism. The update formulas of SNN [41] with the default hyperparameter settings are

$${\varvec{\xi}}={V}^{+}{\varvec{x}}$$
(6)
$${{\varvec{\xi}}}^{\mathrm{new}}=f\left({\varvec{\xi}}\right)=\gamma \left(\frac{{{\varvec{\xi}}}^{3}+{\varvec{\xi}}}{2{\Vert {\varvec{\xi}}\Vert }_{2}^{2}}+\left(\frac{1}{\gamma }-1\right){\varvec{\xi}}\right)$$
(7)
$${{\varvec{x}}}^{\mathrm{new}}=V{{\varvec{\xi}}}^{\mathrm{new}}.$$
(8)

\({V}^{+}\) is the Moore–Penrose inverse of \(V\) [45, 46]. SNN requires the number of \({\varvec{v}}\) to be less than their dimension and the patterns to be mutually independent, so that \({V}^{+}V\) is the identity matrix. \({\varvec{\xi}}\) is called the vector of order parameters, a name describing the fact that the variation of \({\varvec{x}}\) is entirely governed by \({\varvec{\xi}}\) [42]. \(f\) can be interpreted as a Synergetics-based activation function, and \(\gamma \) is the learning rate. If the \({\varvec{\xi}}\) of MHNN is regarded as a generalized order parameter vector, the two networks have very similar working processes. Moreover, both networks share the same working goal of letting \({\varvec{\xi}}\) converge to one-hot.
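For comparison, the following sketch (again illustrative; the dimensions, seed, and \(\gamma \) are our own choices) runs the SNN recurrence of Eqs. (6)-(8) and shows the order parameters converging to a strict one-hot vector.

```python
# Illustrative sketch of the SNN recurrence (Eqs. 6-8); not the authors' code.
import numpy as np

rng = np.random.default_rng(1)
d, N = 16, 5                              # N < d with independent columns, so V+ V = I
V = rng.normal(size=(d, N))               # prototype patterns v_1..v_N as columns
V_pinv = np.linalg.pinv(V)                # Moore-Penrose inverse V+
gamma = 0.3                               # learning rate (our choice for this demo)

x = V[:, 1] + 0.2 * rng.normal(size=d)    # query: a noisy copy of v_2
for _ in range(200):
    xi = V_pinv @ x                                        # Eq. (6)
    xi_new = gamma * ((xi**3 + xi) / (2 * np.sum(xi**2))   # Eq. (7)
                      + (1.0 / gamma - 1.0) * xi)
    x = V @ xi_new                                         # Eq. (8)

print(np.round(xi_new, 3))   # should be a strict one-hot on index 1
```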

SNN has the potential to form a better attention mechanism. Figure 1 shows all possible convergence cases of MHNN and SNN. The advantages of MHNN include (i) high recall efficiency: MHNN converges within fewer iterations (sometimes even a single one), while SNN requires more iterations to converge; (ii) a controllable range of order parameters: MHNN's softmax function keeps the order parameter values within the interval [0,1], while SNN may converge to a negative one-hot (Fig. 1h); and (iii) exponential memory capacity (not shown in Fig. 1): MHNN can store exponentially many memories, while SNN can only store a polynomial number. The advantages of SNN include (i) precise convergence results: SNN converges exactly to one-hot, while MHNN often converges to one-hot with an error; (ii) a deterministic and controllable convergence direction: SNN reaches a non-target stationary point only when \({\varvec{\xi}}\) is initialized to all zeros (the saddle point) or with multiple identical maximum values (metastable states), whereas MHNN converges to the single global stable point when all \({\varvec{v}}\) are similar and to metastable states (the arithmetic mean of \({\varvec{v}}\)) when some \({\varvec{v}}\) are similar; therefore, all non-target stationary points of SNN can be avoided by adding a bias to the initial value of \({\varvec{\xi}}\), while MHNN requires modifying multiple columns of \(V\) to reduce their similarity; and (iii) convergence independent of the memory patterns: MHNN first applies the matrix \({V}^{\mathrm{T}}V\) to \({{\varvec{\xi}}}^{\mathrm{new}}\) and then feeds the result into the activation function, which means that the update of the order parameters is affected by both \({\varvec{\xi}}\) and \(V\). As shown in Fig. 1d, we construct \({\varvec{x}}\) from the i-th row vector of \({V}^{+}\), so \({\varvec{x}}\) is orthogonal to all \({\varvec{v}}\) except \({{\varvec{v}}}_{i}\). The update of \({\varvec{\xi}}\) is then concentrated in the i-th term due to orthogonality, so \({\varvec{x}}\) converges to \({{\varvec{v}}}_{i}\) despite its similarities to other \({\varvec{v}}\). In contrast, SNN applies the matrix \({V}^{+}V\) to \({{\varvec{\xi}}}^{\mathrm{new}}\); since \({V}^{+}V\) is the identity matrix, the update of \({\varvec{\xi}}\) depends only on its initial value, and SNN converges to the target under similarity-based initialization. From this comparison, SNN offers a more accurate and controllable convergence with no retrieval error, contributing to better concentration.

Fig. 1

Comparison between the convergence of MHNN (top) and SNN (bottom). a, e Convergence to the target stable points: MHNN converges to an approximate one-hot, while SNN converges to a strict one-hot. b, f Convergence to metastable states: the metastable state of MHNN may be reached when multiple \({\varvec{v}}\) resemble each other, and the metastable state of SNN is reached when multiple absolute maxima are present in the initial value of \({\varvec{\xi}}\). c, g Convergence to the global stable point (MHNN) or the saddle point (SNN): the global stable point of MHNN has all values of \({\varvec{\xi}}\) equal to \(1/N\), and the saddle point of SNN has all values of \({\varvec{\xi}}\) equal to 0. Note that (g) is only for illustration; the divide-by-0 error terminates the iteration in the update formula. d, h Abnormal cases: the convergence of MHNN depends on both \({\varvec{x}}\) and \({\varvec{v}}\), so the network may converge to a less similar \({\varvec{v}}\) for certain \({\varvec{x}}\); SNN determines the convergence direction by the maximum element of \(\left|{\varvec{\xi}}\right|\), so the network may converge to a negative one-hot

Syn layer

SNN and MHNN have similar working processes and identical working goals, which establishes the basis for applying SNN to the attention mechanism. In addition, an SNN-based attention mechanism is more capable. Since the softmax function can only sparsify the weights for concentration, attention networks focus on only a few tokens under multiple faithfulness metrics [47] and tend to over-concentrate on specific tokens, which leads to misclassification [48]. This phenomenon implies that the attention focus is prone to being anchored during training regardless of whether it is right, making erroneous attention more difficult to correct. In contrast, SNN's activation function can be redesigned into a polynomial form whose precise inverse can be calculated, so that attention can be distracted to shift an erroneous focus. To take full advantage of SNN in the attention mechanism, we extract its activation function and convert it into the Syn layer for attention weight adjustment. The Syn layer takes attention weights as input and uses forward or backward iterations within the layer to control the concentration or distraction of attention, thereby adjusting the tendency of the focus to shift.

The function of the Syn layer

The Syn layer is a recurrent network layer for adjusting attention weights. It is designed with two modes of operation, concentration and distraction, whose activation functions are mutual inverses. Although the Synergetics-based activation function \(f\) is already polynomial-shaped, it is not suitable for the Syn layer. The normalization term in \(f\) couples the order parameters, so calculating the inverse function requires solving a system of \(n\) cubic equations in \(n\) unknowns (\(n\) is the number of order parameters), whose solutions are \({n}^{3}\) prohibitively long real or complex roots. Such a complicated inverse function is obviously unsuitable as an activation function. To make the inverse function tractable, we first normalize the order parameters so that \(f\) reduces to a univariate cubic polynomial

$$\widetilde{{\varvec{\xi}}}=\frac{{\varvec{\xi}}}{{\Vert {\varvec{\xi}}\Vert }_{2}}$$
(9)
$${{\varvec{\xi}}}^{\mathrm{new}}=f\left(\widetilde{{\varvec{\xi}}}\right)=\gamma {\widetilde{{\varvec{\xi}}}}^{3}+\left(1-\gamma \right)\widetilde{{\varvec{\xi}}}.$$
(10)

For the reduced \(f\), the corresponding cubic equation has one real root and two complex roots, so the real root gives the inverse function

$$\widetilde{{\varvec{\xi}}}={f}^{-1}\left({{\varvec{\xi}}}^{\mathrm{new}}\right)={\varvec{a}}+\frac{\gamma -1}{3\gamma {\varvec{a}}},\quad {\varvec{a}}={\left({\left({\left(\frac{{{\varvec{\xi}}}^{\mathrm{new}}}{2\gamma }\right)}^{2}-{\left(\frac{\gamma -1}{3\gamma }\right)}^{3}\right)}^{\frac{1}{2}}+\frac{{{\varvec{\xi}}}^{\mathrm{new}}}{2\gamma }\right)}^{\frac{1}{3}}.$$
(11)
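The closed-form root above can be checked numerically. The short sketch below (illustrative; we pick \(\gamma =0.5\) so that every term in Eq. (11) is well defined, whereas at \(\gamma =1\) the expression degenerates to a cube root) verifies that Eq. (11) inverts Eq. (10) elementwise.

```python
# Numerical check that Eq. (11) inverts the reduced activation of Eq. (10).
import numpy as np

gamma = 0.5                                   # any 0 < gamma < 1 keeps Eq. (11) well defined

def f(x):                                     # Eq. (10)
    return gamma * x**3 + (1 - gamma) * x

def f_inv(y):                                 # Eq. (11), the single real root
    a = np.cbrt(np.sqrt((y / (2 * gamma))**2 - ((gamma - 1) / (3 * gamma))**3)
                + y / (2 * gamma))
    return a + (gamma - 1) / (3 * gamma * a)

x = np.linspace(0.01, 1.0, 100)               # order parameters after normalization
assert np.allclose(f_inv(f(x)), x)            # the round trip recovers the input
```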

In summary, the recurrent functions of the Syn layer are

$$\widetilde{{\varvec{\xi}}}=\frac{{\varvec{\xi}}}{{\Vert {\varvec{\xi}}\Vert }_{2}}$$
(12)
$${{\varvec{\xi}}}^{\mathrm{new}}=\left\{\begin{array}{ll}f\left(\widetilde{{\varvec{\xi}}}\right), & iter=iter+1\\ {f}^{-1}\left(\widetilde{{\varvec{\xi}}}\right), & iter=iter-1,\end{array}\right.$$
(13)

and they are depicted in Fig. 2. For a set of order parameters that are not less than 0, Syn's forward iteration amplifies the difference between the maximum order parameter and the others until convergence to one-hot. Its backward iteration reduces the difference until all nonzero order parameters converge to the same value \(1/\sqrt{N}\) (\(N\) is the number of nonzero order parameters), while the zero-valued order parameters remain unchanged. The theoretical derivations are detailed in Appendix A.
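The two modes can be illustrated with the following sketch (hypothetical code, not a released implementation; \(\gamma =0.5\) is chosen for the same reason as above, and the printed vectors are shown after the normalization of Eq. (12)): forward iterations drive a row of softmax weights toward one-hot, while backward iterations drive its nonzero entries toward \(1/\sqrt{N}\).

```python
# Illustrative sketch of the Syn layer recurrence (Eqs. 12-13); names are hypothetical.
import numpy as np

GAMMA = 0.5   # demo choice; the closed-form inverse stays well defined for 0 < gamma < 1

def f(x):                                         # reduced activation, Eq. (10)
    return GAMMA * x**3 + (1 - GAMMA) * x

def f_inv(y):                                     # its real-root inverse, Eq. (11)
    a = np.cbrt(np.sqrt((y / (2 * GAMMA))**2 - ((GAMMA - 1) / (3 * GAMMA))**3)
                + y / (2 * GAMMA))
    return a + (GAMMA - 1) / (3 * GAMMA * a)

def syn(xi, n_iter, concentrate=True):
    for _ in range(n_iter):
        xi = xi / np.linalg.norm(xi)              # Eq. (12): normalize the order parameters
        xi = f(xi) if concentrate else f_inv(xi)  # Eq. (13): forward or backward step
    return xi / np.linalg.norm(xi)                # report the normalized order parameters

w = np.array([0.5, 0.3, 0.15, 0.05])              # e.g. one row of softmax attention weights
print(np.round(syn(w, 25, concentrate=True), 3))   # forward: approaches one-hot [1, 0, 0, 0]
print(np.round(syn(w, 25, concentrate=False), 3))  # backward: approaches 1/sqrt(4) = 0.5 each
```

With \(\gamma =1\), as used in our experiments, the forward and backward steps reduce to an elementwise cube and cube root for the nonnegative weights considered here.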

Fig. 2

The Syn layer has two modes. The forward iterations sparsify \({\varvec{\xi}}\), and the backward iterations densify \({\varvec{\xi}}\)

Syn layer as attention

The Syn layer controls the focus-shift tendency by adjusting the gap among attention weights. It is used inside the multi-head attention module: it takes the attention weights as input and iterates several times, outputting sparsified or densified attention weights for concentration or distraction, respectively. Syn's forward iteration increases the gap between the maximum weight and the other weights to sustain the attention focus. Conversely, its backward iteration narrows the gap to prompt a focus shift. The characteristics of the attention module circumvent most disadvantages of the Syn layer: the attention weights lie within the interval [0,1], so Syn does not converge to a negative one-hot, and introducing multi-head attention solves the network's storage capacity inefficiency [4]. The data flow of the multi-head attention module with the Syn layer is shown in Fig. 3.

Fig. 3

The data flow of the Syn layer. The attention matrix \(\Xi \) is split into multiple vectors along its last dimension and input to the Syn layer for attention concentration or distraction. The output \({\Xi }^{\prime}\) is then applied to \(V\). The gradient \({\nabla }_{Z}L\) (red line) from backpropagation bypasses the Syn layer and acts directly on \(\Xi \), so the exploding or vanishing gradients from Syn's polynomial activation function are circumvented

For error backpropagation, Syn repeatedly imposes a polynomial function on its input, which may lead to exploding or vanishing gradients. The gradient problem is so severe that conventional means such as gradient clipping can barely prevent non-convergence. To solve this problem, we adopt the gradient bypass technique [43, 44] used for partial derivatives that cannot be computed because of function discontinuities. We treat Syn as a discontinuous function and directly pass the gradient from the output \(Z\) to the unadjusted attention weights \(\Xi \).
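A minimal PyTorch sketch of this bypass is given below (our illustration of the idea rather than the released implementation; `syn_adjust`, `syn_attention`, and the tensor shapes are hypothetical, and \(\gamma =1\) is assumed so that \(f\) and \(f^{-1}\) reduce to the cube and cube root). It follows the straight-through pattern: the forward pass uses the Syn-adjusted weights, while the backward pass routes the gradient directly to the unadjusted weights \(\Xi \), as in Fig. 3.

```python
# Illustrative PyTorch sketch of Syn-adjusted attention with gradient bypass.
import torch

def syn_adjust(xi, n_iter=3, concentrate=True):
    """Syn recurrence with gamma = 1 (f(x) = x**3, f^-1(y) = y**(1/3)),
    applied along the last dimension of the attention weights."""
    for _ in range(n_iter):
        xi = xi / xi.norm(dim=-1, keepdim=True).clamp_min(1e-12)            # Eq. (12)
        xi = xi.pow(3) if concentrate else xi.clamp_min(0).pow(1.0 / 3.0)   # Eq. (13)
    return xi

def syn_attention(Q, K, V, n_iter=3, concentrate=True):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    xi = torch.softmax(scores, dim=-1)            # unadjusted attention weights, Xi
    with torch.no_grad():                         # Syn runs outside the autograd graph
        xi_adj = syn_adjust(xi, n_iter, concentrate)
    # Gradient bypass: the forward pass uses xi_adj, the backward pass updates xi.
    xi_bypassed = xi + (xi_adj - xi).detach()
    return xi_bypassed @ V                        # output Z

# Toy usage: gradients reach Q, K, V even though Syn itself is never differentiated.
Q = torch.randn(2, 4, 8, requires_grad=True)      # (batch, tokens, d_k)
K = torch.randn(2, 6, 8, requires_grad=True)
V = torch.randn(2, 6, 8, requires_grad=True)
syn_attention(Q, K, V).sum().backward()
print(Q.grad.shape, K.grad.shape, V.grad.shape)
```

Running Syn under `torch.no_grad()` and adding back the detached difference gives exactly the behavior described above: the adjusted weights are used in the forward pass, while the gradient never propagates through the polynomial iterations.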

Experiments

Datasets and results

We apply Syn to multiple MIL benchmark datasets: elephant, fox, and tiger (EFT), Musk 1 & 2, UCSB breast cancer classification, and Web Recommendation. The EFT datasets consist of color images from the Corel dataset that have been preprocessed and segmented. An image (bag) consists of a varying number of segments (instances), each characterized by color, texture, and shape descriptors. Each animal's dataset has 100 positive and 100 negative example images, with the negative images randomly drawn from a pool of photos of other animals. Elephant has 1391 instances, Fox has 1320, and Tiger has 1220; all instances have 230 features. For more information on the EFT datasets, see [49]. The Musk datasets, Musk1 and Musk2, contain molecular descriptions using multiple low-energy conformations. Each conformation is represented by a 166-dimensional feature vector derived from surface properties. Musk1 contains on average 6 conformations per bag, while Musk2 contains over 60 conformations per bag. For more information on the Musk datasets, see [50]. The UCSB breast cancer classification dataset consists of 58 bags and 2002 instances, where an instance represents a histopathological image of cancerous or normal tissue. For more information on the UCSB dataset, see [51]. The Web Recommendation datasets contain nine subsets, each derived from 113 web index pages annotated by a volunteer according to their interests. Pages are treated as bags and links as instances. The number of instances per bag is between 4 and 200, with an average of 30.29. For more information on the Web Recommendation datasets, see [52]. The default train/test split is 9:1 for EFT and Musk, 3:1 for UCSB, and 75:38 for Web. Cross-validation is performed on all datasets except Web, and all experiments are repeated five times. Results are reported as the area under the receiver operating characteristic curve (AUC).

Our approach sets a new state of the art and outperforms the other methods (see Table 1). Since Syn substantially outperforms the other methods on the Web datasets, we visualize the raw data and the features extracted by Syn using t-SNE; the results are shown in Fig. 4. Syn significantly simplifies the data distribution, resulting in excellent classification performance. These results reconfirm that the attention mechanism is more applicable to textual datasets.

Table 1 AUC and standard error (100×) results of MIL methods
Fig. 4

The distribution of the raw data and the Syn-extracted features from the Web datasets. All instances and features are labeled according to their bag label. Positive samples are in green, and negative samples are in red

Network structure

We adopt the structure of HopfieldPooling [32] and name our method SynPooling, in which the query pattern \(Q\) is randomly initialized and the key and value patterns \(K\) and \(V\) are the layer input. The trainable \(Q\) is used for averaging over class-indicative instances, enabling the compression of variable-sized bags into a fixed-sized discriminative representation. The network structure is shown in Fig. 5, with the following details: (i) dropout of 0.75 to avoid over-fitting; (ii) L fully connected linear embedding layers of D nodes with ReLU activation; (iii) SynPooling, in which Syn iterates I times with learning rate \(\gamma =1\), embedding dimension M, H heads, and scaling factor S; and (iv) a fully connected linear layer with sigmoid activation that performs the classification.

Fig. 5

Schematic representation of the network architecture
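A rough PyTorch reconstruction of this architecture is sketched below (illustrative only; `SynPooling`, `SynMIL`, `syn_adjust`, and the default hyperparameter values are hypothetical stand-ins for the components listed above).

```python
# Illustrative reconstruction of the SynPooling architecture (Fig. 5); hypothetical names.
import torch
from torch import nn

def syn_adjust(xi, n_iter):
    """Syn recurrence with gamma = 1; positive n_iter concentrates, negative distracts."""
    for _ in range(abs(n_iter)):
        xi = xi / xi.norm(dim=-1, keepdim=True).clamp_min(1e-12)
        xi = xi.pow(3) if n_iter > 0 else xi.clamp_min(0).pow(1.0 / 3.0)
    return xi

class SynPooling(nn.Module):
    """Multi-head attention pooling with a trainable query and Syn-adjusted weights."""
    def __init__(self, dim, n_heads=4, head_dim=16, n_iter=3, scale=None):
        super().__init__()
        self.n_heads, self.head_dim, self.n_iter = n_heads, head_dim, n_iter
        self.scale = scale or head_dim ** -0.5
        self.q = nn.Parameter(torch.randn(n_heads, 1, head_dim))    # trainable query Q
        self.k = nn.Linear(dim, n_heads * head_dim, bias=False)     # keys from instances
        self.v = nn.Linear(dim, n_heads * head_dim, bias=False)     # values from instances

    def forward(self, bag):                                  # bag: (n_instances, dim)
        n = bag.shape[0]
        K = self.k(bag).view(n, self.n_heads, -1).transpose(0, 1)   # (H, n, head_dim)
        V = self.v(bag).view(n, self.n_heads, -1).transpose(0, 1)
        xi = torch.softmax(self.q @ K.transpose(-2, -1) * self.scale, dim=-1)  # (H, 1, n)
        with torch.no_grad():
            xi_adj = syn_adjust(xi, self.n_iter)
        xi = xi + (xi_adj - xi).detach()                     # gradient bypass (Fig. 3)
        return (xi @ V).reshape(-1)                          # fixed-size bag representation

class SynMIL(nn.Module):
    def __init__(self, in_dim, emb_dim=64, n_layers=2, n_heads=4, head_dim=16, n_iter=3):
        super().__init__()
        layers, d = [nn.Dropout(0.75)], in_dim               # (i) dropout 0.75
        for _ in range(n_layers):                            # (ii) L embedding layers of D nodes
            layers += [nn.Linear(d, emb_dim), nn.ReLU()]
            d = emb_dim
        self.embed = nn.Sequential(*layers)
        self.pool = SynPooling(emb_dim, n_heads, head_dim, n_iter)               # (iii) SynPooling
        self.classify = nn.Sequential(nn.Linear(n_heads * head_dim, 1), nn.Sigmoid())  # (iv)

    def forward(self, bag):                                  # bag: (n_instances, in_dim)
        return self.classify(self.pool(self.embed(bag)))
```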

We perform a hyperparameter grid search; the search grid is shown in Table 2. All models are trained for 50 epochs using the AdamW optimizer [53] with a learning rate of 1e−3, a weight decay of 1e−4, and no learning rate decay. The weights of all network layers are initialized from the Xavier uniform distribution. The loss function is the negative log-likelihood. Gradients are clipped to a maximum Euclidean norm of 1.
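A matching training setup could look roughly as follows (illustrative only; it reuses the `SynMIL` sketch above, `train_bags` and `train_labels` are hypothetical placeholders, and the binary cross-entropy on the sigmoid output plays the role of the negative log-likelihood).

```python
# Illustrative training setup matching the reported configuration; hypothetical data names.
import torch
from torch import nn

model = SynMIL(in_dim=230)                         # e.g. the 230-feature EFT instances
for p in model.parameters():                       # Xavier uniform initialization
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCELoss()                           # negative log-likelihood of the sigmoid output

for epoch in range(50):                            # 50 epochs, no learning-rate decay
    for bag, label in zip(train_bags, train_labels):   # one variable-sized bag per step
        optimizer.zero_grad()
        loss = criterion(model(bag).squeeze(), label)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients to Euclidean norm 1
        optimizer.step()
```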

Table 2 The grid search space of the network's hyperparameters

Ablation study

We conduct an ablation study to verify the usefulness of the Syn layer and the SynPooling module. Table 3 shows the decrease in AUC after ablating Syn and SynPooling from the network architecture. The attention mechanism from the SynPooling module substantially improves the network's performance, while Syn's attention tuning further improves the results.

Table 3 The ablation study results of the Syn layer and the SynPooling module

Concentration versus distraction

We vary only the number of SynPooling iterations I under the otherwise optimal hyperparameter configuration to compare the effect of concentration versus distraction on network performance. We report the best and worst AUCs and the corresponding SynPooling iterations I in Table 4. While 80% of the optimal results are obtained with distraction, 53.3% of the worst results are also obtained with distraction. Among the worst results caused by distraction, 50% occur at iteration −20, and these results are close to those of existing methods. These worst results imply that a certain level of distraction can help improve network performance, yet excessive distraction is counterproductive. To measure the average performance of concentration versus distraction, we rank the network performance from best to worst and average the rankings in Table 5. The average ranking of concentration is smaller on most datasets, indicating that the average performance obtained by concentration is more stable.

Table 4 The best and worst AUC obtained by the network and the corresponding SynPooling iteration I
Table 5 Syn's average ranking under concentration and distraction (the smaller the better)

To reveal the reason for the performance improvement from distraction, we investigate the transfer rate of the winning attention parameter, i.e., the attention parameter with the largest value. We count a change of the winning parameter between two adjacent epochs as one transfer; the minimum transfer rates are shown in Fig. 6. The minimum transfer rate is more representative than the maximum or the average: the maximum transfer rate reaches 100% early in training for datasets with large numbers of instances (e.g., Web), and the average transfer rate is also close to 100%. There is a correlation between the minimum winner transfer rate and the best AUC results. Most datasets achieve higher transfer rates through distraction, so the network improves its performance by observing more instances. For the Musk 2, Web 4, and Web 5 datasets, distraction does not result in higher transfer rates, so the best AUC is reached by concentration.

Fig. 6

Minimum winner parameter transfer rate under concentration and distraction. Most datasets have higher transfer rates with distraction, yet there is no significant difference for the Musk2, Web4, and Web5 datasets

To verify the robustness enhancement from concentration, we add noise to each dataset under distraction-favored hyperparameters while varying only the Syn iterations I. Images are corrupted with Gaussian white noise at different signal-to-noise ratios (SNR), and texts are subjected to a certain percentage of text deletion, replacement, and swap operations following [59]. The optimal results and the corresponding I are shown in Table 6. Although the hyperparameters favor distraction, most image datasets obtain better results through concentration in the noisy environment. The accuracy on the textual datasets fluctuates widely, so the effect of concentration is not reflected there. These results imply that concentration improves the robustness of the network on image datasets.

Table 6 The optimal results and the corresponding iterations I under different noise proportions, with the distraction-favored hyperparameter configuration

Conclusions and future work

In this paper, we construct the SNN-based Syn network layer with forward and backward working modes and use it as an extension of the attention mechanism. The forward mode sparsifies attention weights for concentration, and the backward mode densifies attention weights for distraction. The experimental results show that the Syn layer achieves state-of-the-art results on multiple MIL benchmark datasets, and most of the best results are obtained with distraction. Our analyses show that distraction increases the chance of the attention focus shifting by raising the winner order parameter transfer rate, allowing the network to observe more instances. For noisy image samples, concentration enhances the robustness of the network and reduces the loss of accuracy.

It is worth noting that the MIL problem is a generalization of pattern classification with numerous branches. All the datasets in this paper satisfy the standard MIL assumption, i.e., a positive bag contains at least one positive instance, and a negative bag contains only negative instances. For more complex problems, e.g., when a positive bag requires the accumulation of a variable number of positive instances or when a negative bag contains positive instances, the Syn layer is not readily applicable. In addition, the hyperparameter configuration of Syn lacks regularity: datasets with similar properties may correspond to divergent hyperparameters. Therefore, the generalization and meta-learning abilities of the Syn layer require further research. In future work, we plan to apply Syn to more complex attention network models and explore its effectiveness in deep learning. Concentrated attention can be used as a plug-and-play module for pruning attention weights to improve the efficiency and robustness of large deep networks, while distracted attention can extend the parameter learning space and enhance the performance of small deep networks. In addition, fine-grained and dynamic tuning of attention weights based on Syn also has promising prospects.