Introduction

In recent years, the attention mechanism [1,2,3], as the core of the Transformer model [4], has replaced the convolution mechanism as the preferred choice for textual tasks [5,6,7,8] and is being actively extended to audio and visual tasks [9,10,11,12,13,14]. By adding flexibility to deep networks, i.e., maintaining trainability under a larger number of parameters, the attention mechanism further improves the performance and applicability of deep networks. However, researchers have gradually found that the attention mechanism relies strongly on network volume [15,16,17]. Its performance is unstable and hardly surpasses convolutional networks of equal volume that use fewer parameters [18,19,20]. In addition, such networks often require more careful parameter tuning to keep the learning process stable [21]. These requirements make the time cost and equipment demands unaffordable [22, 23]; thus, most subsequent research focuses on simplifying the attention mechanism to improve efficiency [8, 24,25,26] rather than optimizing it for better performance.

Although the attention mechanism empowers deep learning in various applications, some traditional tasks remain challenging, such as adaptive dynamic programming [27, 28], few-shot learning [29, 30], and multiple instance learning (MIL) [33]. Widrich, Ramsauer, et al. generalize the attention mechanism to a shallow network, the modern Hopfield neural network (MHNN), and extensively explore its application fields [31, 32, 34]. They demonstrate that the attention mechanism is actually a single iteration of MHNN, and that its automatic filtering of instances makes it a powerful feature extraction module for MIL tasks [35, 36]. This filtering effect of the attention mechanism is all the more significant in the shallow MHNN, since the functionality [37, 38] and the substitutability [39, 40] of the attention mechanism are still highly controversial. However, MHNN emphasizes its equivalence to the attention network rather than delving into the optimization of attention weights through network recurrence. The effect of adjusting attention weights therefore remains unclear.

This paper explores the effect of adjusting attention weights in attention networks on MIL tasks. Compared to the Hopfield network family, the synergetic neural network (SNN) [41] has a similar network structure and an identical convergence target. Its foundation in Synergetics [42] brings fewer unstable attractors and a polynomial-based activation function, providing the network with a stable and reversible convergence process. We exploit these two properties to propose the Syn layer, which takes the attention weights as input and adjusts them by forward or backward iteration: forward iteration concentrates attention, and backward iteration distracts it. We use the gradient bypass technique [43, 44] to circumvent the exploding or vanishing gradients caused by the polynomials. Experimental results show that Syn-based attention networks achieve state-of-the-art performance on multiple MIL benchmark datasets.

Background

Attention mechanism and MHNN

When the attention mechanism is formally proposed, its typical behavior can be summarized as the interaction of the query matrix \(Q\), the key matrix \(K\), and the value matrix \(V\) [4]

$$Z=V\,\mathrm{softmax}\left(\frac{1}{\sqrt{{d}_{k}}}Q{K}^{\mathrm{T}}\right);$$
(1)

\({d}_{k}\) is the dimension of the key. However, the reason and motivation for choosing softmax are not offered there. MHNN fills this gap by constructing a recurrent neural network with the softmax function as its centerpiece. For the query pattern \({\varvec{x}}\) and the matrix of static memory patterns \(V=[{\varvec{v}}_{1},\dots ,{\varvec{v}}_{N}]\), the update formulas of MHNN [32] are

$${\varvec{\xi}}=\beta {V}^{\mathrm{T}}{\varvec{x}}$$
(2)
$${{\varvec{\xi}}}^{\mathrm{new}}=\mathrm{softmax}\left({\varvec{\xi}}\right)$$
(3)
$${{\varvec{x}}}^{\mathrm{new}}=V{{\varvec{\xi}}}^{\mathrm{new}};$$
(4)

\(\beta \) is the scaling factor, and \({{\varvec{x}}}^{\mathrm{new}}\) is the new input of the network, forming the recurrence. The attention mechanism is actually a single iteration of the fine-tuned MHNN [32]: let \(\beta =1/\sqrt{{d}_{k}}\) and \(K=V\); then, with input \(Q\), MHNN reproduces the behavior of the attention mechanism

$$Z=V\,\mathrm{softmax}\left(\Xi \right)=V\,\mathrm{softmax}\left(\frac{1}{\sqrt{{d}_{k}}}Q{K}^{\mathrm{T}}\right).$$
(5)

Thus, MHNN can examine the usefulness of the softmax function from a dynamical-systems perspective. It shows that the softmax function sparsifies the weights and, with high probability, makes them converge after a single iteration to a result close to one-hot encoding, which acts as the concentration of attention.
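To make this correspondence concrete, the short NumPy sketch below (our own illustration, not the authors' code; the dimensions, seed, and number of iterations are arbitrary choices) runs the MHNN recurrence of Eqs. (2)-(4) on a noisy query and prints the resulting weights, which concentrate on the best-matching stored pattern as described above.

```python
# Illustrative sketch of the MHNN recurrence (Eqs. 2-4); not the authors' code.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, N = 16, 5
V = rng.normal(size=(d_k, N))              # stored patterns v_1..v_N as columns
x = V[:, 2] + 0.1 * rng.normal(size=d_k)   # query: a noisy copy of v_3
beta = 1.0 / np.sqrt(d_k)                  # the scaling used in Eq. (5)

for _ in range(3):
    xi = beta * V.T @ x                    # Eq. (2)
    xi_new = softmax(xi)                   # Eq. (3): the attention weights
    x = V @ xi_new                         # Eq. (4): the attention output Z

print(np.round(xi_new, 3))                 # should concentrate on index 2 (approximately one-hot)
```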

MHNN versus SNN

SNN and MHNN have similar working processes and identical working goals, so SNN can also be used for the attention mechanism. The update formulas of SNN [41] with the default hyperparameter settings are

$${\varvec{\xi}}={V}^{+}{\varvec{x}}$$
(6)
$${{\varvec{\xi}}}^{\mathrm{new}}=f\left({\varvec{\xi}}\right)=\gamma \left(\frac{{{\varvec{\xi}}}^{3}+{\varvec{\xi}}}{2{\Vert {\varvec{\xi}}\Vert }_{2}^{2}}+\left(\frac{1}{\gamma }-1\right){\varvec{\xi}}\right)$$
(7)
$${{\varvec{x}}}^{\mathrm{new}}=V{{\varvec{\xi}}}^{\mathrm{new}}.$$
(8)

\({V}^{+}\) is the Moore–Penrose inverse of \(V\) [45, 46]. SNN requires the number of \({\varvec{v}}\) to be less than their dimension and the patterns to be mutually independent, so that \({V}^{+}V\) is the identity matrix. \({\varvec{\xi}}\) is called the vector of order parameters, a name describing the fact that the variation of \({\varvec{x}}\) is entirely governed by \({\varvec{\xi}}\) [42]. \(f\) can be interpreted as a Synergetics-based activation function, and \(\gamma \) is the learning rate. If the \({\varvec{\xi}}\) of MHNN is regarded as a generalized order parameter vector, the two networks have very similar working processes. Moreover, both networks share the same working goal of letting \({\varvec{\xi}}\) converge to one-hot.
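For comparison, the following sketch (again illustrative; the dimensions, seed, and \(\gamma \) are our own choices) runs the SNN recurrence of Eqs. (6)-(8) and shows the order parameters converging to a strict one-hot vector.

```python
# Illustrative sketch of the SNN recurrence (Eqs. 6-8); not the authors' code.
import numpy as np

rng = np.random.default_rng(1)
d, N = 16, 5                              # N < d with independent columns, so V+ V = I
V = rng.normal(size=(d, N))               # prototype patterns v_1..v_N as columns
V_pinv = np.linalg.pinv(V)                # Moore-Penrose inverse V+
gamma = 0.3                               # learning rate (our choice for this demo)

x = V[:, 1] + 0.2 * rng.normal(size=d)    # query: a noisy copy of v_2
for _ in range(200):
    xi = V_pinv @ x                                        # Eq. (6)
    xi_new = gamma * ((xi**3 + xi) / (2 * np.sum(xi**2))   # Eq. (7)
                      + (1.0 / gamma - 1.0) * xi)
    x = V @ xi_new                                         # Eq. (8)

print(np.round(xi_new, 3))   # should be a strict one-hot on index 1
```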

SNN has the potential to form a better attention mechanism. Figure 1 shows all possible convergence cases of MHNN and SNN. The advantages of MHNN include (i) high recall efficiency: MHNN converges within fewer iterations (sometimes even a single one), while SNN requires more iterations to converge; (ii) a controllable range of order parameters: MHNN's softmax function keeps the order parameter values within the interval [0,1], while SNN may converge to a negative one-hot (Fig. 1h); and (iii) exponential memory capacity (not shown in Fig. 1): MHNN can store exponentially many memories, while SNN can only store a polynomial number. The advantages of SNN include (i) precise convergence results: SNN converges exactly to one-hot, while MHNN often converges to one-hot with an error; (ii) a deterministic and controllable convergence direction: SNN reaches a non-target stationary point only when \({\varvec{\xi}}\) is initialized to all zeros (the saddle point) or with multiple identical maximum values (metastable states), whereas MHNN converges to the single global stable point when all \({\varvec{v}}\) are similar and to metastable states (the arithmetic mean of \({\varvec{v}}\)) when some \({\varvec{v}}\) are similar; therefore, all non-target stationary points of SNN can be avoided by adding a bias to the initial value of \({\varvec{\xi}}\), while MHNN requires modifying multiple columns of \(V\) to reduce their similarity; and (iii) convergence independent of the memory patterns: MHNN first applies the matrix \({V}^{\mathrm{T}}V\) to \({{\varvec{\xi}}}^{\mathrm{new}}\) and then feeds the result into the activation function, which means that the update of the order parameters is affected by both \({\varvec{\xi}}\) and \(V\). As shown in Fig. 1d, we construct \({\varvec{x}}\) from the i-th row vector of \({V}^{+}\), so \({\varvec{x}}\) is orthogonal to all \({\varvec{v}}\) except \({{\varvec{v}}}_{i}\). The update of \({\varvec{\xi}}\) is then concentrated in the i-th term due to orthogonality, so \({\varvec{x}}\) converges to \({{\varvec{v}}}_{i}\) despite its similarities to other \({\varvec{v}}\). In contrast, SNN applies the matrix \({V}^{+}V\) to \({{\varvec{\xi}}}^{\mathrm{new}}\); since \({V}^{+}V\) is the identity matrix, the update of \({\varvec{\xi}}\) depends only on its initial value, and SNN converges to the target under similarity-based initialization. From this comparison, SNN offers a more accurate and controllable convergence with no retrieval error, contributing to better concentration.

Fig. 1

Comparison between the convergence of MHNN (top) and SNN (bottom). a, e Convergence to the target stable points: MHNN converges to an approximate one-hot, while SNN converges to a strict one-hot. b, f Convergence to metastable states: the metastable state of MHNN may be reached when multiple \({\varvec{v}}\) resemble each other, and the metastable state of SNN is reached when multiple absolute maxima are present in the initial value of \({\varvec{\xi}}\). c, g Convergence to the global stable point (MHNN) or the saddle point (SNN): the global stable point of MHNN has all values of \({\varvec{\xi}}\) equal to \(1/N\), and the saddle point of SNN has all values of \({\varvec{\xi}}\) equal to 0. Note that (g) is only for illustration; the divide-by-0 error terminates the iteration in the update formula. d, h Abnormal cases: the convergence of MHNN depends on both \({\varvec{x}}\) and \({\varvec{v}}\), so the network may converge to a less similar \({\varvec{v}}\) for certain \({\varvec{x}}\); SNN determines the convergence direction by the maximum element of \(\left|{\varvec{\xi}}\right|\), so the network may converge to a negative one-hot

Syn layer

SNN and MHNN have similar working processes and identical working goals, which establishes the basis for applying SNN to the attention mechanism. In addition, an SNN-based attention mechanism is more capable. Since the softmax function can only sparsify the weights for concentration, attention networks focus on only a few tokens under multiple faithfulness metrics [47] and tend to over-concentrate on specific tokens, which leads to misclassification [48]. This phenomenon implies that the attention focus is prone to being anchored during training regardless of whether it is right, making erroneous attention more difficult to correct. In contrast, SNN's activation function can be redesigned into a polynomial form whose precise inverse can be calculated, so that attention can be distracted to shift an erroneous focus. To take full advantage of SNN in the attention mechanism, we extract its activation function and convert it into the Syn layer for attention weight adjustment. The Syn layer takes attention weights as input and uses forward or backward iterations within the layer to control the concentration or distraction of attention, thereby adjusting the tendency of the focus to shift.

The function of the Syn layer

The Syn layer is a recurrent network layer for adjusting attention weights. It is designed with two modes of operation, concentration and distraction, whose activation functions are mutual inverses. Although the Synergetics-based activation function \(f\) is already polynomial-shaped, it is not suitable for the Syn layer. The normalization term in \(f\) couples the order parameters, so calculating the inverse function requires solving a system of \(n\) cubic equations in \(n\) unknowns (\(n\) is the number of order parameters), whose solutions are \({n}^{3}\) prohibitively long real or complex roots. Such a complicated inverse function is obviously unsuitable as an activation function. To make the inverse function tractable, we first normalize the order parameters so that \(f\) reduces to a univariate cubic polynomial

$$\widetilde{{\varvec{\xi}}}=\frac{{\varvec{\xi}}}{{\Vert {\varvec{\xi}}\Vert }_{2}}$$
(9)
$${{\varvec{\xi}}}^{\mathrm{new}}=f\left(\widetilde{{\varvec{\xi}}}\right)=\gamma {\widetilde{{\varvec{\xi}}}}^{3}+\left(1-\gamma \right)\widetilde{{\varvec{\xi}}}.$$
(10)

For the reduced \(f\), the corresponding cubic equation has one real root and two complex roots, so the real root gives the inverse function

$$\widetilde{{\varvec{\xi}}}={f}^{-1}\left({{\varvec{\xi}}}^{\mathrm{new}}\right)={\varvec{a}}+\frac{\gamma -1}{3\gamma {\varvec{a}}},\quad {\varvec{a}}={\left({\left({\left(\frac{{{\varvec{\xi}}}^{\mathrm{new}}}{2\gamma }\right)}^{2}-{\left(\frac{\gamma -1}{3\gamma }\right)}^{3}\right)}^{\frac{1}{2}}+\frac{{{\varvec{\xi}}}^{\mathrm{new}}}{2\gamma }\right)}^{\frac{1}{3}}.$$
(11)
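The closed-form root above can be checked numerically. The short sketch below (illustrative; we pick \(\gamma =0.5\) so that every term in Eq. (11) is well defined, whereas at \(\gamma =1\) the expression degenerates to a cube root) verifies that Eq. (11) inverts Eq. (10) elementwise.

```python
# Numerical check that Eq. (11) inverts the reduced activation of Eq. (10).
import numpy as np

gamma = 0.5                                   # any 0 < gamma < 1 keeps Eq. (11) well defined

def f(x):                                     # Eq. (10)
    return gamma * x**3 + (1 - gamma) * x

def f_inv(y):                                 # Eq. (11), the single real root
    a = np.cbrt(np.sqrt((y / (2 * gamma))**2 - ((gamma - 1) / (3 * gamma))**3)
                + y / (2 * gamma))
    return a + (gamma - 1) / (3 * gamma * a)

x = np.linspace(0.01, 1.0, 100)               # order parameters after normalization
assert np.allclose(f_inv(f(x)), x)            # the round trip recovers the input
```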

In summary, the recurrent functions of the Syn layer are

$$\widetilde{{\varvec{\xi}}}=\frac{{\varvec{\xi}}}{{\Vert {\varvec{\xi}}\Vert }_{2}}$$
(12)
$${{\varvec{\xi}}}^{\mathrm{new}}=\left\{\begin{array}{ll}f\left(\widetilde{{\varvec{\xi}}}\right), & iter=iter+1\\ {f}^{-1}\left(\widetilde{{\varvec{\xi}}}\right), & iter=iter-1,\end{array}\right.$$
(13)

and they are depicted in Fig. 2. For a set of order parameters that are not less than 0, Syn's forward iteration amplifies the difference between the maximum order parameter and the others until convergence to one-hot. Its backward iteration reduces the difference until all nonzero order parameters converge to the same value \(1/\sqrt{N}\) (\(N\) is the number of nonzero order parameters), while the zero-valued order parameters remain unchanged. The theoretical derivations are detailed in Appendix A.
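The two modes can be illustrated with the following sketch (hypothetical code, not a released implementation; \(\gamma =0.5\) is chosen for the same reason as above, and the printed vectors are shown after the normalization of Eq. (12)): forward iterations drive a row of softmax weights toward one-hot, while backward iterations drive its nonzero entries toward \(1/\sqrt{N}\).

```python
# Illustrative sketch of the Syn layer recurrence (Eqs. 12-13); names are hypothetical.
import numpy as np

GAMMA = 0.5   # demo choice; the closed-form inverse stays well defined for 0 < gamma < 1

def f(x):                                         # reduced activation, Eq. (10)
    return GAMMA * x**3 + (1 - GAMMA) * x

def f_inv(y):                                     # its real-root inverse, Eq. (11)
    a = np.cbrt(np.sqrt((y / (2 * GAMMA))**2 - ((GAMMA - 1) / (3 * GAMMA))**3)
                + y / (2 * GAMMA))
    return a + (GAMMA - 1) / (3 * GAMMA * a)

def syn(xi, n_iter, concentrate=True):
    for _ in range(n_iter):
        xi = xi / np.linalg.norm(xi)              # Eq. (12): normalize the order parameters
        xi = f(xi) if concentrate else f_inv(xi)  # Eq. (13): forward or backward step
    return xi / np.linalg.norm(xi)                # report the normalized order parameters

w = np.array([0.5, 0.3, 0.15, 0.05])              # e.g. one row of softmax attention weights
print(np.round(syn(w, 25, concentrate=True), 3))   # forward: approaches one-hot [1, 0, 0, 0]
print(np.round(syn(w, 25, concentrate=False), 3))  # backward: approaches 1/sqrt(4) = 0.5 each
```

With \(\gamma =1\), as used in our experiments, the forward and backward steps reduce to an elementwise cube and cube root for the nonnegative weights considered here.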

Fig. 2

The Syn layer has two modes. The forward iterations sparsify \({\varvec{\xi}}\), and the backward iterations densify \({\varvec{\xi}}\)

Syn layer as attention

The Syn layer controls the focus-shift tendency by adjusting the gap among attention weights. It is used inside the multi-head attention module: it takes the attention weights as input and iterates several times, outputting sparsified or densified attention weights for concentration or distraction, respectively. Syn's forward iteration increases the gap between the maximum weight and the other weights to sustain the attention focus. Conversely, its backward iteration narrows the gap to prompt a focus shift. The characteristics of the attention module circumvent most disadvantages of the Syn layer: the attention weights lie within the interval [0,1], so Syn does not converge to a negative one-hot, and introducing multi-head attention solves the network's storage capacity inefficiency [4]. The data flow of the multi-head attention module with the Syn layer is shown in Fig. 3.

Fig. 3

The data flow of the Syn layer. The attention matrix \(\Xi \) is split into multiple vectors along its last dimension and input to the Syn layer for attention concentration or distraction. The output \({\Xi }^{\prime}\) is then applied to \(V\). The gradient \({\nabla }_{Z}L\) (red line) from backpropagation bypasses the Syn layer and acts directly on \(\Xi \), so the exploding or vanishing gradients from Syn's polynomial activation function are circumvented

For error backpropagation, Syn repeatedly imposes a polynomial function on its input, which may lead to exploding or vanishing gradients. The gradient problem is so severe that conventional means such as gradient clipping can barely prevent non-convergence. To solve this problem, we adopt the gradient bypass technique [43, 44] used for partial derivatives that cannot be computed because of function discontinuities. We treat Syn as a discontinuous function and directly pass the gradient from the output \(Z\) to the unadjusted attention weights \(\Xi \).
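A minimal PyTorch sketch of this bypass is given below (our illustration of the idea rather than the released implementation; `syn_adjust`, `syn_attention`, and the tensor shapes are hypothetical, and \(\gamma =1\) is assumed so that \(f\) and \(f^{-1}\) reduce to the cube and cube root). It follows the straight-through pattern: the forward pass uses the Syn-adjusted weights, while the backward pass routes the gradient directly to the unadjusted weights \(\Xi \), as in Fig. 3.

```python
# Illustrative PyTorch sketch of Syn-adjusted attention with gradient bypass.
import torch

def syn_adjust(xi, n_iter=3, concentrate=True):
    """Syn recurrence with gamma = 1 (f(x) = x**3, f^-1(y) = y**(1/3)),
    applied along the last dimension of the attention weights."""
    for _ in range(n_iter):
        xi = xi / xi.norm(dim=-1, keepdim=True).clamp_min(1e-12)            # Eq. (12)
        xi = xi.pow(3) if concentrate else xi.clamp_min(0).pow(1.0 / 3.0)   # Eq. (13)
    return xi

def syn_attention(Q, K, V, n_iter=3, concentrate=True):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    xi = torch.softmax(scores, dim=-1)            # unadjusted attention weights, Xi
    with torch.no_grad():                         # Syn runs outside the autograd graph
        xi_adj = syn_adjust(xi, n_iter, concentrate)
    # Gradient bypass: the forward pass uses xi_adj, the backward pass updates xi.
    xi_bypassed = xi + (xi_adj - xi).detach()
    return xi_bypassed @ V                        # output Z

# Toy usage: gradients reach Q, K, V even though Syn itself is never differentiated.
Q = torch.randn(2, 4, 8, requires_grad=True)      # (batch, tokens, d_k)
K = torch.randn(2, 6, 8, requires_grad=True)
V = torch.randn(2, 6, 8, requires_grad=True)
syn_attention(Q, K, V).sum().backward()
print(Q.grad.shape, K.grad.shape, V.grad.shape)
```

Running Syn under `torch.no_grad()` and adding back the detached difference gives exactly the behavior described above: the adjusted weights are used in the forward pass, while the gradient never propagates through the polynomial iterations.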

Experiments

Datasets and results

We apply Syn to multiple MIL benchmark datasets: elephant, fox, and tiger (EFT), Musk 1 & 2, UCSB breast cancer classification, and Web Recommendation. The EFT datasets consist of color images from the Corel dataset that have been preprocessed and segmented. An image (bag) consists of a varying number of segments (instances), each characterized by color, texture, and shape descriptors. Each animal's dataset has 100 positive and 100 negative example images, with the negative images randomly drawn from a pool of photos of other animals. Elephant has 1391 instances, Fox has 1320, and Tiger has 1220; all instances have 230 features. For more information on the EFT datasets, see [49]. The Musk datasets, Musk1 and Musk2, contain molecular descriptions using multiple low-energy conformations. Each conformation is represented by a 166-dimensional feature vector derived from surface properties. Musk1 contains on average 6 conformations per bag, while Musk2 contains over 60 conformations per bag. For more information on the Musk datasets, see [50]. The UCSB breast cancer classification dataset consists of 58 bags and 2002 instances, where an instance represents a histopathological image of cancerous or normal tissue. For more information on the UCSB dataset, see [51]. The Web Recommendation datasets contain nine subsets, each derived from 113 web index pages annotated by a volunteer according to their interests. Pages are treated as bags and links as instances. The number of instances per bag is between 4 and 200, with an average of 30.29. For more information on the Web Recommendation datasets, see [52]. The default train/test split is 9:1 for EFT and Musk, 3:1 for UCSB, and 75:38 for Web. Cross-validation is performed on all datasets except Web, and all experiments are repeated five times. Results are reported as the area under the receiver operating characteristic curve (AUC).

Our approach sets a new state of the art and outperforms the other methods (see Table 1). Since Syn substantially outperforms the other methods on the Web datasets, we visualize the raw data and the features extracted by Syn using t-SNE; the results are shown in Fig. 4. Syn significantly simplifies the data distribution, resulting in excellent classification performance. These results reconfirm that the attention mechanism is more applicable to textual datasets.

Table 1 AUC and standard error (100×) results of MIL methods
Fig. 4

The distribution of the raw data and the Syn-extracted features from the Web datasets. All instances and features are labeled according to their bag label. Positive samples are in green, and negative samples are in red

Network structure

We adopt the structure of HopfieldPooling [32] and name our method SynPooling, in which the query pattern \(Q\) is randomly initialized and the key and value patterns \(K\) and \(V\) are the layer input. The trainable \(Q\) is used for averaging over class-indicative instances, enabling the compression of variable-sized bags into a fixed-sized discriminative representation. The network structure is shown in Fig. 5, with the following details: (i) dropout of 0.75 to avoid over-fitting; (ii) L fully connected linear embedding layers of D nodes with ReLU activation; (iii) SynPooling, in which Syn iterates I times with learning rate \(\gamma =1\), embedding dimension M, H heads, and scaling factor S; and (iv) a fully connected linear layer with sigmoid activation that performs the classification.

Fig. 5

Schematic representation of the network architecture
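A rough PyTorch reconstruction of this architecture is sketched below (illustrative only; `SynPooling`, `SynMIL`, `syn_adjust`, and the default hyperparameter values are hypothetical stand-ins for the components listed above).

```python
# Illustrative reconstruction of the SynPooling architecture (Fig. 5); hypothetical names.
import torch
from torch import nn

def syn_adjust(xi, n_iter):
    """Syn recurrence with gamma = 1; positive n_iter concentrates, negative distracts."""
    for _ in range(abs(n_iter)):
        xi = xi / xi.norm(dim=-1, keepdim=True).clamp_min(1e-12)
        xi = xi.pow(3) if n_iter > 0 else xi.clamp_min(0).pow(1.0 / 3.0)
    return xi

class SynPooling(nn.Module):
    """Multi-head attention pooling with a trainable query and Syn-adjusted weights."""
    def __init__(self, dim, n_heads=4, head_dim=16, n_iter=3, scale=None):
        super().__init__()
        self.n_heads, self.head_dim, self.n_iter = n_heads, head_dim, n_iter
        self.scale = scale or head_dim ** -0.5
        self.q = nn.Parameter(torch.randn(n_heads, 1, head_dim))    # trainable query Q
        self.k = nn.Linear(dim, n_heads * head_dim, bias=False)     # keys from instances
        self.v = nn.Linear(dim, n_heads * head_dim, bias=False)     # values from instances

    def forward(self, bag):                                  # bag: (n_instances, dim)
        n = bag.shape[0]
        K = self.k(bag).view(n, self.n_heads, -1).transpose(0, 1)   # (H, n, head_dim)
        V = self.v(bag).view(n, self.n_heads, -1).transpose(0, 1)
        xi = torch.softmax(self.q @ K.transpose(-2, -1) * self.scale, dim=-1)  # (H, 1, n)
        with torch.no_grad():
            xi_adj = syn_adjust(xi, self.n_iter)
        xi = xi + (xi_adj - xi).detach()                     # gradient bypass (Fig. 3)
        return (xi @ V).reshape(-1)                          # fixed-size bag representation

class SynMIL(nn.Module):
    def __init__(self, in_dim, emb_dim=64, n_layers=2, n_heads=4, head_dim=16, n_iter=3):
        super().__init__()
        layers, d = [nn.Dropout(0.75)], in_dim               # (i) dropout 0.75
        for _ in range(n_layers):                            # (ii) L embedding layers of D nodes
            layers += [nn.Linear(d, emb_dim), nn.ReLU()]
            d = emb_dim
        self.embed = nn.Sequential(*layers)
        self.pool = SynPooling(emb_dim, n_heads, head_dim, n_iter)               # (iii) SynPooling
        self.classify = nn.Sequential(nn.Linear(n_heads * head_dim, 1), nn.Sigmoid())  # (iv)

    def forward(self, bag):                                  # bag: (n_instances, in_dim)
        return self.classify(self.pool(self.embed(bag)))
```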

We perform a hyperparameter grid search; the search grid is shown in Table 2. All models are trained for 50 epochs using the AdamW optimizer [53] with a learning rate of 1e−3, a weight decay of 1e−4, and no learning rate decay. The weights of all network layers are initialized from the Xavier uniform distribution. The loss function is the negative log-likelihood. Gradients are clipped to a maximum Euclidean norm of 1.
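A matching training setup could look roughly as follows (illustrative only; it reuses the `SynMIL` sketch above, `train_bags` and `train_labels` are hypothetical placeholders, and the binary cross-entropy on the sigmoid output plays the role of the negative log-likelihood).

```python
# Illustrative training setup matching the reported configuration; hypothetical data names.
import torch
from torch import nn

model = SynMIL(in_dim=230)                         # e.g. the 230-feature EFT instances
for p in model.parameters():                       # Xavier uniform initialization
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCELoss()                           # negative log-likelihood of the sigmoid output

for epoch in range(50):                            # 50 epochs, no learning-rate decay
    for bag, label in zip(train_bags, train_labels):   # one variable-sized bag per step
        optimizer.zero_grad()
        loss = criterion(model(bag).squeeze(), label)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients to Euclidean norm 1
        optimizer.step()
```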

Table 2 The grid search space of the network's hyperparameters

Ablation study

We conduct an ablation study to verify the usefulness of the Syn layer and the SynPooling module. Table 3 shows the decrease in AUC after ablating Syn and SynPooling from the network architecture. The attention mechanism from the SynPooling module substantially improves the network's performance, while Syn's attention tuning further improves the results.

Table 3 The ablation study results of the Syn layer and the SynPooling module

Concentration versus distraction

We vary only the number of SynPooling iterations I under the otherwise optimal hyperparameter configuration to compare the effect of concentration versus distraction on network performance. We report the best and worst AUCs and the corresponding SynPooling iterations I in Table 4. While 80% of the optimal results are obtained with distraction, 53.3% of the worst results are also obtained with distraction. Among the worst results caused by distraction, 50% occur at iteration −20, and these results are close to those of existing methods. These worst results imply that a certain level of distraction can help improve network performance, yet excessive distraction is counterproductive. To measure the average performance of concentration versus distraction, we rank the network performance from best to worst and average the rankings in Table 5. The average ranking of concentration is smaller on most datasets, indicating that the average performance obtained by concentration is more stable.

Table 4 The best and worst AUC obtained by the network and the corresponding SynPooling iteration I
Table 5 Syn's average ranking under concentration and distraction (the smaller the better)

To reveal the reason for the performance improvement from distraction, we investigate the transfer rate of the winning attention parameter, i.e., the attention parameter with the largest value. We count a change of the winning parameter between two adjacent epochs as one transfer; the minimum transfer rates are shown in Fig. 6. The minimum transfer rate is more representative than the maximum or the average: the maximum transfer rate reaches 100% early in training for datasets with large numbers of instances (e.g., Web), and the average transfer rate is also close to 100%. There is a correlation between the minimum winner transfer rate and the best AUC results. Most datasets achieve higher transfer rates through distraction, so the network improves its performance by observing more instances. For the Musk 2, Web 4, and Web 5 datasets, distraction does not result in higher transfer rates, so the best AUC is reached by concentration.

Fig. 6

Minimum winner parameter transfer rate under concentration and distraction. Most datasets have higher transfer rates with distraction, yet there is no significant difference for the Musk2, Web4, and Web5 datasets

To verify the robustness enhancement from concentration, we add noise to each dataset under distraction-favored hyperparameters while varying only the Syn iterations I. Images are corrupted with Gaussian white noise at different signal-to-noise ratios (SNR), and texts are subjected to a certain percentage of text deletion, replacement, and swap operations following [59]. The optimal results and the corresponding I are shown in Table 6. Although the hyperparameters favor distraction, most image datasets obtain better results through concentration in the noisy environment. The accuracy on the textual datasets fluctuates widely, so the effect of concentration is not reflected there. These results imply that concentration improves the robustness of the network on image datasets.

Table 6 The optimal results and the corresponding iterations I under different noise proportions, with the distraction-favored hyperparameter configuration

Conclusions and future work

In this paper, we construct the SNN-based Syn network layer with forward and backward working modes and use it as an extension of the attention mechanism. The forward mode sparsifies attention weights for concentration, and the backward mode densifies attention weights for distraction. The experimental results show that the Syn layer achieves state-of-the-art results on multiple MIL benchmark datasets, and most of the best results are obtained with distraction. Our analyses show that distraction increases the chance of the attention focus shifting by raising the winner order parameter transfer rate, allowing the network to observe more instances. For noisy image samples, concentration enhances the robustness of the network and reduces the loss of accuracy.

It is worth noting that the MIL problem is a generalization of pattern classification with numerous branches. All the datasets in this paper satisfy the standard MIL assumption, i.e., a positive bag contains at least one positive instance, and a negative bag contains only negative instances. For more complex problems, e.g., when a positive bag requires the accumulation of a variable number of positive instances or when a negative bag contains positive instances, the Syn layer is not readily applicable. In addition, the hyperparameter configuration of Syn lacks regularity: datasets with similar properties may correspond to divergent hyperparameters. Therefore, the generalization and meta-learning abilities of the Syn layer require further research. In future work, we plan to apply Syn to more complex attention network models and explore its effectiveness in deep learning. Concentrated attention can be used as a plug-and-play module for pruning attention weights to improve the efficiency and robustness of large deep networks, while distracted attention can extend the parameter learning space and enhance the performance of small deep networks. In addition, fine-grained and dynamic tuning of attention weights based on Syn also has promising prospects.