Concentration or distraction? A synergetic-based attention weights optimization method

The attention mechanism empowers deep learning to a broader range of applications, but the contribution of the attention module is highly controversial. Research on modern Hopfield networks indicates that the attention mechanism can also be used in shallow networks. Its automatic sample filtering facilitates instance extraction in multiple instance learning (MIL) tasks. Since the attention mechanism has a clear contribution and intuitive performance in shallow networks, this paper further investigates its optimization method based on the recurrent neural network. Through comprehensive comparison, we find that the Synergetic Neural Network has the advantage of more accurate and controllable convergence and revertible converging steps. Therefore, we design the Syn layer based on the Synergetic Neural Network and propose a novel invertible activation function as the forward and backward update formula for attention weights concentration or distraction. Experimental results show that our method outperforms other methods on all MIL benchmark datasets. Concentration improves the robustness of the results, while distraction expands the instance observing space and yields better results. Code is available at https://github.com/wzh134/Syn.


Introduction
In recent years, the attention mechanism [1][2][3], as the core of the Transformer model [4], has replaced the convolution mechanism as the preferred choice for textual tasks [5][6][7][8] and is being actively extended to audio and visual tasks [9][10][11][12][13][14]. By adding more flexibility to deep networks, i.e., maintaining the trainability of the network under a larger number of parameters, the attention mechanism further improves the performance and applicability of deep networks. However, researchers have gradually found that the attention mechanism relies strongly on the network volume [15][16][17]. The performance of attention networks is unstable and hardly surpasses that of convolutional networks of equal volume with fewer parameters [18][19][20]. In addition, such networks often require more prudent parameter tuning to secure the learning process [21]. These requirements make the time cost and equipment requirements unaffordable [22,23], and thus most subsequent research focuses on simplifying the attention mechanism to improve efficiency [8,24,25,26] rather than optimizing it for improvement.
Although the attention mechanism empowers deep learning in various applications, some traditional tasks remain challenging, such as adaptive dynamic programming [27,28], few-shot learning [29,30], and multiple instance learning (MIL) [33]. Widrich and Ramsauer et al. generalize the attention mechanism to a shallow network, the modern Hopfield neural network (MHNN), and extensively explore its application fields [31,32,34]. They demonstrate that the attention mechanism is actually a single iteration of MHNN. Its automatic filtering of instances is a powerful feature extraction module for MIL tasks [35,36]. The effectiveness of the attention mechanism in the shallow MHNN is all the more significant because the functionality [37,38] and the substitutability [39,40] of the attention mechanism are still highly controversial. However, MHNN emphasizes its equivalence to the attention network instead of delving into attention weights optimization by network recurrence. The effect of attention weights adjustment remains unclear.
This paper explores the effect of attention weights adjustment in attention networks on MIL tasks. Compared to the Hopfield network family, the synergetic neural network (SNN) [41] has a similar network structure and an identical convergence target. Its Synergetics foundation [42] brings fewer unstable attractors and a polynomial-based activation function, providing the network with a stable and revertible convergence process. We use these two properties to propose the Syn layer, which takes the attention weights as input and adjusts the weights by forward or backward iteration. The forward iteration concentrates attention, and the backward iteration distracts attention. We use the gradient bypass technique [43,44] to circumvent the gradient exploding or vanishing caused by polynomials. Experimental results show that Syn-based attention networks achieve state-of-the-art performance on multiple MIL benchmark datasets.

Attention mechanism and MHNN
When the attention mechanism was formally proposed, its typical behavior was summarized as the interaction of the query matrix Q, key matrix K, and value matrix V:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the key. However, the reason and motivation for choosing softmax are not offered. MHNN complements this problem by constructing a recurrent neural network with the softmax function as the centerpiece. For the query pattern x and the matrix of the static memory patterns $V = (v_1, \dots, v_N)$, the update formula of MHNN [32] is

$$x^{new} = V\,\mathrm{softmax}\left(\beta V^{T} x\right),$$

where β is the scaling factor and $x^{new}$ is the new input of the network to form the recurrence. The attention mechanism is actually a single iteration of the fine-tuned MHNN [32]. Let $\beta = 1/\sqrt{d_k}$ and K = V; with input Q, MHNN reproduces the behavior of the attention mechanism:

$$Q^{new} = \mathrm{softmax}\left(\frac{1}{\sqrt{d_k}}QK^{T}\right)V.$$

Thus, MHNN can examine the usefulness of the softmax function from a dynamic system perspective. It shows that the softmax function can sparsify the weights and make them converge with high probability to a result close to the one-hot encoding after one iteration, which acts as the concentration of the attention.
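The equivalence stated above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `attention` and `mhnn_step`, the shapes, and the random data are illustrative assumptions. It applies one MHNN update per query with β = 1/√d_k and K = V and compares the result with scaled dot-product attention.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention; rows of Q, K, V are queries, keys, values.
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def mhnn_step(x, memory, beta):
    # One MHNN update for a single query x (hypothetical helper).
    # `memory` stores one pattern per column, as in x_new = V softmax(beta V^T x).
    xi = softmax(beta * (memory.T @ x))  # weights over the stored patterns
    return memory @ xi

rng = np.random.default_rng(0)
d_k, n_keys, n_queries = 8, 5, 3          # illustrative sizes
Q = rng.normal(size=(n_queries, d_k))
K = rng.normal(size=(n_keys, d_k))
V = K                                     # the text sets K = V

attn_out = attention(Q, K, V)
beta = 1.0 / np.sqrt(d_k)
mhnn_out = np.stack([mhnn_step(q, K.T, beta) for q in Q])
```

Under these assumptions the two outputs agree row by row, which is the sense in which attention is a single MHNN iteration.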

MHNN versus SNN
SNN and MHNN have similar working processes and identical working goals, so SNN can also be used for the attention mechanism. The update formula of SNN [41] with the default hyperparameter setting is

$$\xi = V^{+}x,\qquad \xi^{new} = f(\xi),\qquad x^{new} = V\xi^{new},$$

where $V^{+}$ is the Moore-Penrose inverse of V [45,46]. SNN requires the number of patterns v to be less than their dimension and the patterns to be linearly independent, such that $V^{+}V$ is the identity matrix. ξ is called the vector of order parameters to describe the phenomenon that the variation of x is entirely subject to ξ [42]. f can be interpreted as a synergetics-based activation function, and γ is its learning rate. If the ξ of MHNN is considered as a generalized order parameter vector, the two networks have very similar working processes. Moreover, both networks have the same working goal of letting ξ converge to one-hot.
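The role of the Moore-Penrose inverse can be illustrated with a short NumPy sketch. The matrix sizes and random patterns below are assumptions chosen only for illustration; the sketch checks that V⁺V is the identity when there are fewer patterns than dimensions and the patterns are linearly independent, and that the order parameters ξ = V⁺x recover the mixing coefficients of a query built from the stored patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_patterns = 6, 3                      # fewer patterns than dimensions
# Random Gaussian columns are linearly independent with probability 1.
V = rng.normal(size=(dim, n_patterns))

V_plus = np.linalg.pinv(V)                  # Moore-Penrose inverse, shape (3, 6)
gram = V_plus @ V                           # expected to be the 3x3 identity

# A query equal to a stored pattern yields a one-hot order parameter vector.
xi_pure = V_plus @ V[:, 1]

# A query that mixes two patterns yields exactly the mixing coefficients.
x_mixed = 0.7 * V[:, 0] + 0.3 * V[:, 2]
xi_mixed = V_plus @ x_mixed
```

Because V⁺V is the identity, the recurrence on ξ does not depend on how similar the stored patterns are to each other, which is the property the comparison with MHNN relies on.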
SNN has the potential to form a better attention mechanism. Figure 1 shows all possible convergence cases of MHNN and SNN. The advantages of MHNN include (i) High recall efficiency. MHNN converges in fewer iterations (sometimes even a single iteration), while SNN requires more iterations to converge. (ii) Controllable range of order parameters. MHNN's softmax function keeps the order parameter values within the interval [0,1], while SNN may converge to the negative one-hot (Fig. 1h). (iii) Exponential memory capacity (not shown in Fig. 1). MHNN can store exponentially many memories, while SNN can only store them at a polynomial level. The advantages of SNN include (i) Precise convergence results. SNN can converge exactly to one-hot, while MHNN often converges to one-hot with an error. (ii) Deterministic and controllable converging direction. SNN converges to the single local maximum when ξ is initialized to 0, and to metastable states when ξ is initialized with multiple identical maximum values. In contrast, MHNN converges to the single globally stable point when all v are similar, and to metastable states (the arithmetic mean of v) when some v are similar. Therefore, all non-target stationary points of SNN can be avoided by adding a bias to the initial value of ξ, while MHNN requires modifying multiple columns of V to reduce the similarity. (iii) Convergence independent of memory patterns. MHNN first applies the $V^{T}V$ matrix to $\xi^{new}$ and then puts it into the activation function, which means that the update of the order parameters is affected by both ξ and V. As shown in Fig. 1d, we construct x with the i-th row vector of $V^{+}$, so x is orthogonal to all v except $v_i$. The update of ξ is concentrated in the i-th term due to orthogonality, so x will converge to $v_i$ despite its similarities to other v. In contrast, SNN applies the $V^{+}V$ matrix to $\xi^{new}$. $V^{+}V$ is the identity matrix, so the update of ξ depends only on its initial value, and SNN will converge to the target with similarity-based initialization. From the comparison, SNN has a more accurate and controllable convergence with no retrieval error, contributing to better concentration.

Fig. 1 Comparison between the convergence of MHNN (top) and SNN (bottom). a, e Convergence to the target stable points. MHNN converges to approximate one-hot, while SNN converges to strict one-hot. b, f Convergence to metastable states. The metastable state of MHNN may be reached when multiple v resemble each other, and the metastable state of SNN is reached when the initial value of ξ contains multiple absolute maximums. c, g Convergence to the global stable point (MHNN) or saddle point (SNN). The stable point of MHNN is all values of ξ equal to 1/N, and the saddle point of SNN is all values of ξ equal to 0. Note that (g) is only for representation. The divide-by-0 error terminates the iteration in the update formula. d, h Abnormal cases. The convergence of MHNN depends on both x and v, so the network may converge to a less similar v with certain x. SNN determines the convergence direction by the maximum element of |ξ|, so the network may converge to a negative one-hot

Syn layer
SNN and MHNN have similar working orders and identical operating goals, establishing the basis of SNN's application to the attention mechanism. In addition, the SNN-based attention mechanism has more powerful functions. Since the softmax function can only sparsify the weights to concentrate, the attention networks focus on only a few tokens under multiple faithfulness metrics [47], with a tendency to overconcentrate on specific tokens that leads to misclassification [48]. This phenomenon implies that the attentional focus is prone to be anchored during training regardless of its correctness, making erroneous attention more difficult to correct. In contrast, SNN's activation function can be redesigned in polynomial form so that its exact inverse function can be computed, which distracts the attention and allows an erroneous focus to shift. To take full advantage of SNN's application in the attention mechanism, we extract its activation function and convert it to the Syn layer for attention weights adjustment. The Syn layer takes attention weights as input and uses the forward or backward iterations within the layer to control the concentration or distraction of attention, thereby adjusting the tendency of the focus to shift.

The function of Syn layer
The Syn layer is a recurrent network layer for adjusting attention weights. It is designed with two modes of operation, concentration and distraction, whose activation functions are mutual inverses. Although the Synergetics-based activation function f is already polynomial-shaped, it is not suitable for the Syn layer. The normalization term in f couples the order parameters, so calculating the inverse function requires solving a system of n cubic equations in n unknowns (n is the number of order parameters), which has $n^3$ extremely long real or complex roots. Such a complicated inverse function is not suitable as an activation function. To make the inverse function tractable, we first normalize the order parameters, so that f reduces to a univariate cubic polynomial. The reduced f corresponds to one real root and two complex roots, so the real root is taken as the inverse function. In summary, the recurrent functions of the Syn layer are the reduced cubic polynomial (forward) and its real-root inverse (backward), and they are depicted in Fig. 2. For a set of order parameters not less than 0, Syn's forward iteration amplifies the difference between the maximum order parameter and the others until convergence to one-hot. Its reverse iterations reduce the difference until all nonzero order parameters converge to the same value $1/\sqrt{N}$ (N is the number of nonzero order parameters), while the zero-valued order parameters remain invariant. Theoretical derivations are detailed in Appendix A.

Fig. 3 The data flow of the Syn layer. The attention matrix is split into multiple vectors along its last dimension and input to the Syn layer for attention concentration or distraction. The output is then applied on V. The gradient $\nabla_z L$ (red line) from the backpropagation bypasses the Syn layer and acts directly on the unadjusted attention weights, so the gradient exploding or vanishing from Syn's polynomial activation function can be circumvented
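The two operating modes can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's exact update: it fixes the learning rate at γ = 1, assumes the reduced forward function is then the plain cube ξ → ξ³ with the real cube root as its inverse, and applies the L2 normalization (Σξ² = 1) after every step. The function name `syn_iterate` and the signed iteration count (positive for concentration, negative for distraction) are hypothetical conventions.

```python
import numpy as np

def _normalize(xi):
    # L2 normalization, so that the squared order parameters sum to 1.
    norm = np.linalg.norm(xi)
    if norm == 0.0:
        raise ValueError("all-zero order parameters cannot be normalized")
    return xi / norm

def syn_iterate(weights, iterations):
    """Apply |iterations| Syn-like steps to non-negative weights.

    iterations > 0: concentration (assumed forward step xi -> xi**3).
    iterations < 0: distraction (assumed backward step, the real cube root).
    Both assume gamma = 1; the paper's general update may differ.
    """
    xi = _normalize(np.asarray(weights, dtype=float))
    step = (lambda v: v ** 3) if iterations > 0 else np.cbrt
    for _ in range(abs(iterations)):
        xi = _normalize(step(xi))
    return xi

weights = [0.5, 0.3, 0.2, 0.0]
concentrated = syn_iterate(weights, 5)      # approaches one-hot on the largest weight
distracted = syn_iterate(weights, -25)      # nonzero entries approach 1/sqrt(3)
round_trip = syn_iterate(syn_iterate(weights, 2), -2)
```

In this sketch the forward steps drive the vector toward one-hot, the backward steps drive the three nonzero entries toward 1/√3 while the zero entry stays at zero, and a backward pass undoes a forward pass of the same length, which mirrors the behavior described for the two modes.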

Syn layer as attention
The Syn layer controls the focus shift tendency by adjusting the gap among attention weights. The Syn layer for attention is used inside the multi-head attention module, where it takes the attention weights as input and iterates several times to output sparsified or densified attention weights as concentration or distraction. Syn's forward iteration increases the gap between the maximum weight and the other weights to sustain the attention focus. Conversely, its backward iteration narrows the gap to prompt the focus shift. The characteristics of the attention module can circumvent most disadvantages of the Syn layer. The attention weights are within the interval [0,1], so Syn does not converge to a negative one-hot. Introducing multi-head attention can compensate for the limited storage capacity of the network [4]. The data flow of the multi-head attention module with the Syn layer is shown in Fig. 3.
For error backpropagation, Syn repeatedly imposes a polynomial function onto the input, which may lead to the gradient exploding or vanishing. The gradient problem is so severe that conventional means like gradient clipping can barely prevent non-convergence. To solve this problem, we use the gradient bypass technique [43,44], which is used for partial derivatives that cannot be computed because of function discontinuity. We treat Syn as a discontinuous function and directly pass the gradient from the output Z to the unadjusted attention weights.

Table 1 notes: the compared methods include MHNN [32], quadratic programming [55], path encoding [54], MInD [56], MILES [57], and Citation-kNN. Results below quadratic programming are from either [33] or [54], whichever has the higher AUC. Best results are marked in bold.

Fig. 4 The distribution of raw data and Syn extracted features from Web datasets. All instances and features are labeled according to the bag label. Positive samples are in green, and negative samples are in red
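The gradient bypass used for the Syn layer can be illustrated with a manual forward and backward pass in NumPy. This is only a sketch of the idea under stated assumptions: the class `BypassedSyn`, the toy `cube_and_normalize` adjustment, and the toy loss (the sum of the output) are hypothetical, and a real implementation would live inside an automatic differentiation framework.

```python
import numpy as np

class BypassedSyn:
    """Wraps a weight-adjusting function and skips it in the backward pass."""

    def __init__(self, adjust):
        self.adjust = adjust

    def forward(self, attention_weights):
        # Forward pass: apply the (possibly ill-conditioned) adjustment.
        return self.adjust(attention_weights)

    def backward(self, grad_output):
        # Gradient bypass: hand the gradient of the adjusted weights
        # directly to the unadjusted weights, ignoring the Jacobian.
        return grad_output

def cube_and_normalize(a):
    # Assumed stand-in for one concentration step (gamma = 1).
    a = a ** 3
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

layer = BypassedSyn(cube_and_normalize)
A = np.array([[0.6, 0.3, 0.1]])                      # unadjusted attention weights
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # value matrix

Z_weights = layer.forward(A)          # adjusted weights
out = Z_weights @ V                   # attention output; toy loss L = out.sum()
grad_out = np.ones_like(out)          # dL/d(out)
grad_weights = grad_out @ V.T         # dL/d(adjusted weights)
grad_A = layer.backward(grad_weights) # bypass: reused as dL/d(A)
```

The design choice is that the gradient reaching the unadjusted weights never passes through the repeated polynomial, so its magnitude does not depend on the number of Syn iterations.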

Datasets and results
For more information on the EFT datasets, see [49]. The Musk datasets, including Musk1 and Musk2, contain molecular descriptions using multiple low-energy conformations. Each conformation is represented by a 166-dimensional feature vector derived from surface properties. Musk1 contains on average 6 conformations per bag, while Musk2 contains over 60 conformations per bag. For more information on the Musk datasets, see [50]. Our approach sets a new state of the art and outperforms other methods (see Table 1). Since Syn substantially outperforms other methods on the Web datasets, we visualize the raw data and the features extracted by Syn using the t-SNE method. Results are shown in Fig. 4. Syn significantly simplifies the data distribution, resulting in excellent classification performance. This performance revalidates that the attention mechanism is more applicable to textual datasets.

Network structure
We adopt the structure of HopfieldPooling [32] and name our method SynPooling, whose query pattern Q is randomly initialized, and whose key and value patterns K and V are the layer input. The trainable Q is used for averaging over class-indicative instances, enabling the compression of variable-sized bags to a fixed-sized discriminative representation. The network structure is shown in Fig. 5 with details including (i) Dropout 0.75 to avoid over-fitting. (ii) L fully connected linear embedding layers of D nodes with ReLU activation. (iii) SynPooling. Syn iterates I times with learning rate γ = 1, embedding dimension M, heads H, and scaling factor S. (iv) Fully connected linear layer with Sigmoid activation that performs the classification.
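How a single query compresses bags of different sizes into a fixed-size vector can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the SynPooling implementation: the projection matrices `w_key` and `w_value`, the function name `syn_pooling`, the cube and cube-root steps (γ = 1), and the L2 normalization of the adjusted weights are all hypothetical choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def syn_pooling(bag, query, w_key, w_value, scale, iterations):
    """Pool a bag of instances (n, D) into one vector of size M.

    iterations > 0 concentrates the attention weights and iterations < 0
    distracts them (assumed cube / cube-root steps with gamma = 1).
    """
    keys = bag @ w_key                           # (n, M)
    values = bag @ w_value                       # (n, M)
    weights = softmax(scale * (keys @ query))    # one weight per instance
    step = (lambda v: v ** 3) if iterations > 0 else np.cbrt
    for _ in range(abs(iterations)):
        weights = step(weights)
        weights = weights / np.linalg.norm(weights)
    return weights @ values                      # (M,), independent of n

rng = np.random.default_rng(0)
D, M = 16, 4                                     # illustrative dimensions
query = rng.normal(size=M)                       # stands in for the trainable Q
w_key = rng.normal(size=(D, M))
w_value = rng.normal(size=(D, M))

small_bag = rng.normal(size=(3, D))
large_bag = rng.normal(size=(40, D))
pooled_small = syn_pooling(small_bag, query, w_key, w_value, scale=0.5, iterations=-2)
pooled_large = syn_pooling(large_bag, query, w_key, w_value, scale=0.5, iterations=2)
```

Both bags map to a vector of length M, so the classifier that follows sees a fixed-size input regardless of how many instances a bag contains.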
We perform a hyperparameter grid search. The hyperparameter search grid is shown in Table 2. All models are trained for 50 epochs using the AdamW optimizer [53] with a learning rate of 1e−3, a weight decay of 1e−4, and no learning rate

Ablation study
We use the ablation study to verify the usefulness of the Syn layer and the SynPooling module. Table 3 shows the decrease of AUC after ablating Syn and SynPooling in the network architecture. The attention mechanism from the SynPooling module substantially improves the network's performance, while Syn's attention tuning further optimizes the results.

Concentration versus distraction
We vary only the number of SynPooling iterations I under the optimal hyperparameter configuration to compare the effect of concentration versus distraction on network performance. We present the best and worst AUCs and the corresponding SynPooling iteration I. The results are shown in Table 4. While 80% of the optimal results are obtained with distraction, 53.3% of the worst results are also obtained with distraction. Among the worst results caused by distraction, 50% of the iterations are −20, and these results are close to those of the existing methods. These worst results imply that a certain level of distraction can help improve network performance, yet excessive distraction can be counterproductive.
To measure the average performance of concentration versus distraction, we rank the network performance in descending order from best to worst and average the results in Table 5. The average ranking of concentration is smaller in most datasets, indicating that the average performance obtained by concentration is more stable.
To reveal the reason for the performance improvement from distraction, we investigate the transfer rate of the attention parameter winner, i.e., the attention parameter with the largest value. We record the change of the winning parameter between two adjacent epochs as one transfer, and the minimum transfer results are shown in Fig. 6. The minimum transfer rate is more representative than the maximum or average. The maximum transfer rate reaches 100% at the early stage of training for datasets with large instance amounts (e.g., Web), and the average transfer rate is also close to 100%. There is a correlation between the minimum winning transfer rate and the best AUC results. Most datasets achieve higher transfer rates through distraction, so the network improves performance by observing more instances. For the Musk 2, Web 4, and Web 5 datasets, distraction does not result in higher transfer rates, so the best AUC is reached by concentration.

Table note: better results are marked in bold.
To verify the robustness enhancement by concentration, we add noise to each data set, keep the distraction-favored hyperparameters, and vary only the number of Syn iterations I. Gaussian white noise of different signal-to-noise ratios (SNR) is added to the images. A certain percentage of text deletion, replacement, and swap is imposed on the texts according to [59]. The optimal results and the related I are shown in Table 6. Although the hyperparameters are configured in favor of distraction, most image datasets obtain better results by concentration in the noisy environment. There are large fluctuations in the accuracy of the textual datasets, and the effects of concentration fail to be reflected. The results imply that concentration improves the robustness of the network for image datasets.
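Adding Gaussian white noise at a prescribed SNR follows from the definition SNR(dB) = 10 log10(P_signal / P_noise). The NumPy sketch below is an illustration only: the function name `add_white_noise`, the array shape, and the choice to measure signal power as the mean squared value of the whole array are assumptions, since the text does not specify how the noise was scaled.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Return signal plus zero-mean Gaussian noise at the requested SNR in dB."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(128, 128))   # stand-in for image data
noisy = add_white_noise(image, snr_db=10.0, rng=rng)

# Empirical SNR of the generated sample, for a sanity check.
measured_snr_db = 10.0 * np.log10(
    np.mean(image ** 2) / np.mean((noisy - image) ** 2)
)
```

The measured SNR of a finite sample fluctuates slightly around the requested value; lower SNR values correspond to stronger noise.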

Conclusions and future work
In this paper, we construct the SNN-based Syn network layer with forward and backward working modes, and use it as an extension of the attention mechanism. The forward mode sparsifies attention weights for concentration, and the backward mode densifies attention weights for distraction. The experimental data show that the Syn layer achieves state-of-the-art results on multiple MIL benchmark datasets, and most of the best results are derived from distraction. Analyses show that distraction improves the chance of attention focus shifting by increasing the winning order parameter transfer rate to observe more instances. For noisy image samples, concentration enhances the robustness of the network and reduces the loss of accuracy.

Fig. 6 Minimum winner parameter transfer rate under concentration and distraction. Most datasets have higher transfer rates by distraction, yet no significant difference in the Musk2, Web4, and Web5 datasets
It is worth noting that the MIL problem is a generalization of pattern classification with numerous branches. All the datasets in this paper satisfy the standard MIL assumptions, i.e., the positive bag contains at least one positive instance, and the negative bag has only negative instances. For more complex problems, e.g., when the positive bag needs the accumulation of a variable number of positive instances or the negative bag contains positive instances, the Syn layer cannot be readily applied. In addition, the hyperparameter configuration of Syn lacks regularity. Datasets with similar properties may correspond to divergent hyperparameters. Therefore, the generalization and the meta-learning ability of the Syn layer require more research. In future work, we plan to apply Syn to more complex attention network models and explore its effectiveness in deep learning. Concentrated attention can be used as a plug-and-play module for pruning attention weights to improve the efficiency and robustness of large deep networks. Distracted attention can extend the parameter learning space and enhance the performance of small deep networks. In addition, the fine-grained and dynamic tuning of attention weights based on Syn also has promising prospects.

Construct one constant $0 < \beta_3 < \sqrt{(-k)/3}$ that satisfies $g(\beta_3) = g(\xi_m)$. Then $g(\xi_{m'}) < g(\xi_m) < 0$ when $\xi_{m'} \in (\beta_3, \xi_m)$, so $\xi_m$ has minimum decrement during the update. Suppose there exists $\xi_{m'}$ s.t. $g(\xi_m) < g(\xi_{m'}) < 0$ when $\xi_{m'} \in (0, \beta_3)$. Substitute Eq. (15) into the first two terms of the inequality and then shift the terms to obtain (18). The above formula is smaller than 0 when $\gamma \le 1$. $\xi_m$ and $\xi_{m'}$ are both greater than 0, so the left side of (18) is also greater than 0, and (18) does not hold. Therefore, no $\xi_{m'}$ satisfies $g(\xi_m) < g(\xi_{m'}) < 0$ when $\gamma \le 1$, and $\xi_m$ has minimum decrement during the update.
In summary, $\xi_m$ has maximum increment or minimum decrement during the update when it is greater than 0. Following the steps above, it is easy to prove that $\xi_m$ has minimum increment or maximum decrement during the update when it is smaller than 0. Therefore, the $\xi_m$ with the largest absolute value remains the largest for arbitrary k. The novel activation function normalizes ξ, which restricts k = −1, so it has the identical convergence direction and result to SNN.

A.2 The convergence target of Syn's backward mode
The backward mode of the Syn layer is the inverse of the forward mode, so the state vector retrogrades along the forward trajectory to reach the point with the highest energy. Its convergence target can be derived by calculating the maxima of the energy function. We introduce the energy function P of the SNN [41]. The normalization of Syn makes $\sum_{m=1}^{M} \xi_m^2 = 1$, which simplifies P. The extrema of P can be calculated by the Lagrange multiplier method. At all extrema, some order parameters equal zero, and the others have the same nonzero value. A nonzero order parameter maintains a nonzero value until convergence during Syn's forward iterations, so the zero-valued order parameters must have been initialized to zero. Let the number of nonzero order parameters be N. Since the order parameter vector is normalized, these order parameters are equal to $\pm 1/\sqrt{N}$. Syn's order parameters are not less than 0, and the update term is not less than 0 when $0 \le \gamma \le 1$, so the sign of each order parameter remains invariant. In summary, the convergence target of Syn's backward mode is: order parameters that are initialized with 0 remain unchanged, and the others converge to $1/\sqrt{N}$.

Fig. 2
The Syn layer has two modes. The forward iterations sparsify ξ, and the backward iterations densify ξ

The UCSB breast cancer classification data set consists of 58 bags and 2002 instances. An instance represents a histopathological image of cancerous or normal tissue. For more information on the UCSB data set, see [51]. The Web Recommendation datasets contain nine subsets. Each subset is derived from 113 web index pages annotated by a volunteer based on their interest. Pages are considered as bags and links are instances. The number of instances per bag is between 4 and 200, with an average of 30.29. For more information on the Web Recommendation datasets, see [52]. The default train and test set split is 9:1 for EFT and Musk, 3:1 for UCSB, and 75:38 for Web. All datasets except Web use cross-validation. All experiments were repeated five times. Results are organized

Fig. 7
The plot of $g(\xi_m)$ and the definition of constants when k < 0

Table 1
AUC and standard error (100×) results of MIL methods

We apply Syn to multiple MIL benchmark datasets: elephant, fox, and tiger (EFT), Musk 1 & 2, UCSB breast cancer classification, and Web Recommendation. The EFT datasets consist of color images from the Corel data set that have been preprocessed and segmented. An image (bag) consists of varying numbers of segments (instances), each characterized by color, texture, and shape descriptors. The data set of each animal has 100 positive and 100 negative example images. Negative images are randomly drawn from a photo pool of other animals. Elephant has 1391 instances, Fox has 1320, and Tiger has 1220. All instances have 230 features.

Table 2
The grid search space of the network's hyperparameters

Table 3
The ablation study results of the Syn layer and the SynPooling module

Table 4
The best and worst AUC obtained by the network and the corresponding SynPooling iteration I. I > 0 is concentration, and I < 0 is distraction

Table 5
Syn's average ranking under concentration and distraction (the smaller the better)

Table 6 The
I > 0 is concentration, I = 0 is the default setting, and I < 0 is distraction. (a) Results for image datasets, and (b) results for text datasets. Concentration achieves higher accuracies in most image datasets, but the effect in the text datasets lacks regularity

Funding The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.