Tolerance versus synaptic noise in dense associative memories

The retrieval capabilities of associative neural networks can be impaired by different kinds of noise: fast noise (which makes neurons more prone to failure), slow noise (stemming from interference among stored memories), and synaptic noise (due to possible flaws during the learning or the storing stage). In this work we consider dense associative neural networks, where neurons can interact in $p$-plets, in the absence of fast noise, and we investigate the interplay of slow and synaptic noise. In particular, leveraging the duality between associative neural networks and restricted Boltzmann machines, we analyze the effect of corrupted information, imperfect learning and storing errors. For $p=2$ (corresponding to the Hopfield model) any source of synaptic noise breaks down retrieval if the number of memories $K$ scales as the network size. For $p>2$, in the relatively low-load regime $K \sim N$, synaptic noise is tolerated up to a certain bound, depending on the density of the structure.


I. INTRODUCTION
Associative memories (AM) are devices able to store and then retrieve a set of information (see e.g., [1]). Since the '70s, several models of AM have been introduced, among which the Hopfield neural network (HNN) probably constitutes the best-known example [2,3]. In this model one has $N$ units, meant as stylized (on/off) neurons, able to process information through pairwise interactions. The performance of an AM is usually measured as the ratio $\alpha$ between the largest extent of information safely retrievable and the number of neurons employed for this task; in the HNN this ratio is of order 1. In the last decades many efforts have been spent trying to raise this ratio (see e.g., [4,5] and references therein). For instance, in the so-called dense associative memories (DAMs) neurons are embedded on hypergraphs in such a way that they are allowed to interact in $p$-plets and $\alpha \sim O(N^{p-1})$. However, this model also requires more resources, as the number of connections encoding the learned information scales as $N^p$ instead of $N^2$ as in the standard pairwise model [6,8].
Clearly, whatever the AM model considered, limitations on $\alpha$ are intrinsic, given that the amount of resources available (in terms of number of neurons and number of connections) necessarily bounds the extent of information storable. In particular, by increasing the pieces of information to be stored, the interference among them generates a so-called slow noise, which requires a relatively large number of neurons or of connections to be resolved. Beyond this, one also has to face another kind of noise, which has been less investigated in recent years and which is the focus of the current work.
In fact, classical AM models assume that communication among neurons is perfect and that the learning and storing stages can rely on exact knowledge of the information whereas, in general, communication can be disturbed and the information provided may be affected by some source of noise (see e.g., [9,10]). We refer to the noise stemming from this kind of flaws as synaptic noise and we envisage different ways to model it, mimicking different physical situations. In each case we investigate the effects of such noise on the retrieval capabilities of the system and the existence of bounds on the amount of noise above which the network can no longer work as an AM. More precisely, our analysis is conducted on hypergraphs with $p \geq 2$ and we highlight an interplay between slow noise, synaptic noise and network density: by increasing $p$ one can exploit some of the additional resources to soften the effect of slow noise and make a higher load affordable, and some to soften the effect of synaptic noise and make the system more robust. On the other hand, possible effects due to fast noise (also referred to as temperature) are here discarded and, since fast noise typically reduces tolerance, our results provide an upper bound for the system tolerance. Also, this particular setting allows addressing the problem analytically via a signal-to-noise approach [2].

* Correspondence email address: agliari@mat.uniroma1.it
In the following Sec. II, we will frame the problem more quantitatively exploiting, as a reference model, the HNN: we will review the signal-to-noise approach and introduce the necessary definitions. Next, in Sec. III, we will consider the $p$-neuron Hopfield model and we will find that i. when the information to be stored is provided with some mistakes, then the machine will store the defective pieces of information and retrieving the correct ones is possible as long as mistakes are "small"; ii. when the information is provided exactly but the learning process is imperfect, then retrieval is possible but the capacity $\alpha$ turns out to be downsized; iii. when information is provided exactly and it is correctly learned, but communication among neurons during retrieval is impaired, then retrieval is still possible but $\alpha$ is "moderately" reduced. These results are also successfully checked against numerical simulations. Finally, Sec. IV is left for our conclusive remarks. Since calculations for the $p$-neuron Hopfield model are pretty lengthy, they are not shown in detail for arbitrary $p$; instead, we report explicit calculations for the case $p=4$ in Appendix A.

II. NOISE TOLERANCE
In this section we introduce the main players of our investigation, taking advantage of the HNN as a reference framework. The HNN is made of $N$ neurons, each associated to a variable $\sigma_i \in \{-1,+1\}$, with $i=1,...,N$, representing the related status (either active or inactive), embedded in a complete graph with weighted connections. An HNN with $N$ neurons is able to learn pieces of information which can be encoded in binary vectors of length $N$, also called patterns. After the learning of $K$ such vectors $\{\xi^1,...,\xi^K\}$, with $\xi^\mu \in \{-1,+1\}^N$ for $\mu=1,...,K$, the weight for the coupling between neurons $i$ and $j$ is given by the so-called Hebbian rule $J^{\mathrm{Hebb}}_{ij} = \frac{1}{N}\sum_{\mu=1}^K \xi_i^\mu \xi_j^\mu$ for any $i \neq j$, while self-interactions are not allowed, i.e., $J_{ii}=0$ for any $i$. In the absence of external noise and external fields, the neuronal state evolves according to the dynamic
$$\sigma_i(t+1) = \mathrm{sign}\left[h_i(\boldsymbol\sigma(t))\right], \qquad (1)$$
where
$$h_i(\boldsymbol\sigma) = \sum_{j \neq i} J_{ij}\,\sigma_j \qquad (2)$$
is the internal field acting on the $i$-th neuron. This dynamical system corresponds to a steepest-descent algorithm where
$$H(\boldsymbol\sigma) = -\frac{1}{2}\sum_{i \neq j} J_{ij}\,\sigma_i \sigma_j \qquad (3)$$
plays as a Lyapunov function or, in a statistical-mechanics setting, as the Hamiltonian of the model (see e.g., [2,3]). The retrieval of a learned pattern $\xi^\mu$, starting from a certain input state $\boldsymbol\sigma(t=0)$, is therefore assured as long as this initial state belongs to the attraction basin of $\xi^\mu$ under the dynamic (1), in such a way that, eventually, the neuronal configuration reaches the stable state $\boldsymbol\sigma = \xi^\mu$. With these premises, the signal-to-noise analysis ascertains the stability of an arbitrary pattern $\xi^\mu$ by checking whether the inequality
$$h_i(\boldsymbol\xi^\mu)\,\xi_i^\mu > 0 \qquad (4)$$
is verified for any neuron $i=1,...,N$. Of course, this kind of analysis can be applied to an arbitrary AM model, by suitably defining the internal field appearing in the condition (4), as $h_i$ issues from the architecture characterizing the considered model. Before proceeding, a few remarks are in order.
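As a concrete illustration of the ingredients above, the following sketch builds the Hebbian couplings for random patterns and checks the stability condition (4) for a stored pattern; sizes, seed and variable names are ours, chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 10                      # network size and number of patterns (illustrative)

# Rademacher patterns: i.i.d. +/-1 entries
xi = rng.choice([-1, 1], size=(K, N))

# Hebbian couplings J_ij = (1/N) sum_mu xi^mu_i xi^mu_j, no self-interaction
J = (xi.T @ xi) / N
np.fill_diagonal(J, 0.0)

# Stability condition (4): h_i(xi^1) * xi^1_i > 0 for every neuron i
h = J @ xi[0]
stable = bool(np.all(h * xi[0] > 0))
print(stable)
```

At this low load ($K/N = 0.05$) the stored pattern is expected to satisfy the stability condition with overwhelming probability.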
The expression "signal-to-noise" refers to the fact that, as we will see, the l.h.s. in (4) can be split into a "signal" term $S$ and a "noise" term $R$, the latter typically stemming from interference among patterns and growing with $K$. Thus, the largest amount of patterns that the system can store and retrieve corresponds to the largest value of $K$ which still ensures $S/R \gg 1$. Further, since we are interested in storing the largest amount of information, rather than the largest amount of patterns, recalling the Shannon-Fano coding, the pattern entries shall be drawn according to
$$P(\xi_i^\mu = +1) = P(\xi_i^\mu = -1) = \frac{1}{2} \qquad (5)$$
for any $i,\mu$, that is, entries are taken as i.i.d. Rademacher random variables. Remarkably, the above-mentioned Hebbian rule accounts for perfect i. dataset, ii. learning and iii. storage of information whereas, in general, some source of noise may take place and, according to the stage where it occurs, we revise $J^{\mathrm{Hebb}}$ as explained hereafter.

Figure 1: RBM corresponding to faulty patterns. The machine is built over a hidden layer made of Gaussian neurons $\{z_\mu\}_{\mu=1,...,K}$ and a visible layer made of binary neurons $\{\sigma_i\}_{i=1,...,N}$; in this case a neuron $z_\mu$ belonging to the hidden layer can interact with one neuron $\sigma_i$ belonging to the visible layer and the coupling is $\eta_i^\mu = \xi_i^\mu + \omega\,\tilde\xi_i^\mu$, as described by Eq. 6. Since the machine is restricted, intra-layer interactions are not allowed. In the dual associative network the neurons interact pairwise ($p=2$) and the synaptic weight for the couple $(\sigma_i,\sigma_j)$ is $J_{ij} = \frac{1}{N}\sum_\mu \eta_i^\mu \eta_j^\mu$, as reported also in Eq. 7. This structure can be straightforwardly generalized for $p>2$. In this figure, seeking clarity, only a few connections are drawn for illustrative purposes.
(i) The first kind of noise we look at allows for corrupted patterns, referred to as $\{\eta^\mu\}_{\mu=1,...,K}$ and defined as
$$\eta_i^\mu = \xi_i^\mu + \omega\,\tilde\xi_i^\mu, \qquad (6)$$
where $\tilde\xi_i^\mu$ is a standard Gaussian random variable and $\omega$ is a real parameter that tunes the noise level. The Hebbian rule, in the case $p=2$, is therefore revised as
$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^K \eta_i^\mu \eta_j^\mu. \qquad (7)$$
This is an inner and rather strong kind of noise: in fact, as we will show, even in a low-load regime (i.e., $K/N^{p-1} \to 0$) and for relatively small values of $\omega$, it implies the breakdown of the pattern-recognition capability.

Figure 2: RBM corresponding to shortcomings in the learning stage. The machine is built over a hidden layer made of Gaussian neurons $\{z_\mu\}_{\mu=1,...,K}$ and a visible layer made of binary neurons $\{\sigma_i\}_{i=1,...,N}$; in this case a neuron $z_\mu$ belonging to the hidden layer can interact simultaneously with two neurons $(\sigma_i,\sigma_j)$ belonging to the visible layer and the coupling is $\xi_i^\mu \xi_j^\mu + \omega\,\tilde\xi_{ij}^\mu$, mimicking a situation where the correct patterns are learnt, yet the interaction among the two layers is disturbed. Since the machine is restricted, intra-layer interactions are not allowed. In the dual associative network the neurons interact 4-wise ($p=4$) and the synaptic weight for the 4-plet $(\sigma_i,\sigma_j,\sigma_k,\sigma_l)$ is $J_{ijkl} = \frac{1}{N^3}\sum_\mu \eta_{ij}^\mu \eta_{kl}^\mu$, as reported also in Eq. 18. Notice that this kind of noise is intrinsically defined only for associative networks where $p$ is even and that when $p=2$ we recover the case depicted in Fig. 1. Also in this figure, seeking clarity, only a few connections are drawn for illustrative purposes.

It is intuitive to see that this kind of noise leads
to such a dramatic effect if one looks at the dual representation of the associative neural network in terms of a restricted Boltzmann machine (RBM) [9,13–17], see Fig. 1. In fact, the coupling (7) is reminiscent of the fact that, during the learning stage, the system is fed with noisy patterns and therefore it learns the patterns along with their noise. Notice that, when neurons interact $p$-wise, the coupling turns out to be a polynomial of order $p$ in $\omega$.
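The marginalization underlying this duality can be checked numerically for $p=2$: integrating out a Gaussian hidden unit coupled to the visible layer with weights $\xi_i^\mu$ reproduces, up to constants, the Hopfield Boltzmann factor. A minimal sketch, with tiny sizes and a value of $\beta$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, beta = 6, 2, 0.5               # tiny illustrative sizes
xi = rng.choice([-1, 1], size=(K, N))
sigma = rng.choice([-1, 1], size=N)  # one visible configuration

# Marginalize each Gaussian hidden unit z_mu, coupled to the visible layer
# through exp(sqrt(beta/N) * z_mu * sum_i xi^mu_i sigma_i - z_mu^2 / 2)
z = np.linspace(-30.0, 30.0, 200001)
dz = z[1] - z[0]
log_w_rbm = 0.0
for mu in range(K):
    a = np.sqrt(beta / N) * (xi[mu] @ sigma)
    integral = np.sum(np.exp(a * z - z ** 2 / 2)) * dz   # ~ sqrt(2 pi) exp(a^2/2)
    log_w_rbm += np.log(integral / np.sqrt(2.0 * np.pi))

# Hopfield Boltzmann weight for the same configuration:
# (beta / 2N) * sum_mu (xi^mu . sigma)^2
log_w_hop = beta / (2.0 * N) * np.sum((xi @ sigma) ** 2)
print(bool(np.isclose(log_w_rbm, log_w_hop)))
```

The numerical Gaussian integral and the closed-form pairwise energy agree, which is the $p=2$ instance of the duality used throughout.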
(ii) The second kind of noise we look at can be thought of as due to flaws during the learning stage. Still looking at the RBM representation, in this case the couplings between visible and hidden units are noisy and, again, we quantify this noise by $\omega$ times a standard Gaussian variable, see Fig. 2. Notice that, when $p=2$ (as for the classical HNN), this kind of noise coincides with the previous one and, in general, it yields a revision of the coupling $J^{\mathrm{Hebb}}_{ij}$ given by additional terms up to second order in $\omega$. This suggests that, in this case, effects are milder than in the previous one. In fact, as we will see, in a low-load regime the degree of noise $\omega$ can grow algebraically with the system size without breaking the retrieval capabilities.

Figure 3: RBM corresponding to shortcomings in the storage stage. The machine is built over a hidden layer made of Gaussian neurons $\{z_\mu\}_{\mu=1,...,K}$ and a visible layer made of binary neurons $\{\sigma_i\}_{i=1,...,N}$; in this case a neuron $z_\mu$ belonging to the hidden layer can interact with one neuron $\sigma_i$ belonging to the visible layer and the coupling is $\xi_i^\mu$, namely the patterns are correctly learnt and communication between the two layers is devoid of flaws. Since the machine is restricted, intra-layer interactions are not allowed. In the dual associative network the neurons interact pairwise ($p=2$) and the synaptic weight for the couple $(\sigma_i,\sigma_j)$ is $J_{ij} = \frac{1}{N}\sum_\mu (\xi_i^\mu \xi_j^\mu + \omega\,\tilde\xi_{ij}^\mu)$, as reported also in Eq. 8. This structure can be straightforwardly generalized for $p>2$. In this figure, again, seeking clarity, only a few connections are drawn for illustrative purposes.
(iii) The third kind of noise we look at can be thought of as due to effective shortcomings in storage, as it directly affects the coupling among neurons in the AM system as
$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^K \left(\xi_i^\mu \xi_j^\mu + \omega\,\tilde\xi_{ij}^\mu\right), \qquad (8)$$
where, again, $\tilde\xi_{ij}^\mu$ is a standard Gaussian random variable and $\omega$ is a real parameter that tunes the noise level. In the RBM representation, this corresponds to perfect learning, while defects emerge just in the associative network, see Fig. 3. Notice that the coupling in (8) is linear in $\omega$ and it yields relatively weak effects; in fact, we will show that in a low-load regime $\omega$ can grow "fast" with the system size without breaking the retrieval capabilities. It is worth recalling that the problem of an HNN endowed with noisy couplings like in (8) has already been addressed in the past (see e.g., [2,19,20,22,23]). In particular, Sompolinsky [19,20] showed that, in the high-load regime (i.e., $K \sim N$), the strength of noise affecting couplings while still preserving retrieval is of order one. More precisely, denoting by $\delta_{ij}$ a centered Gaussian variable with variance $\delta^2$ and setting $J^s_{ij} = \sum_\mu \xi_i^\mu \xi_j^\mu / N + \delta_{ij}/\sqrt{N}$, he found that, as $\delta$ is increased, the system capacity $\alpha$ is lowered and it vanishes for $\delta \approx 0.8$. From this result, one can conclude that the HNN is relatively robust to the presence of "moderate levels" of effective synaptic noise. These findings are recovered in our investigations and suitably extended for $p>2$. Notably, this kind of noise also includes, as a special example, the diluted network, where a finite fraction of the connections is cut randomly, still retaining a giant component [2,19–21].
Before concluding we need a few more definitions. As mentioned above, we distinguish the tolerance with respect to interference among patterns (slow noise), which grows with $K$, and with respect to errors during learning or storing (synaptic noise), which grows with $\omega$. More quantitatively, we set
$$K \sim N^a, \qquad \omega \sim N^b,$$
and we introduce the tolerance $\beta_p(a)$ as the largest scaling of $\omega$ with the system size that still preserves retrieval at load exponent $a$. Finally, the Mattis magnetization, defined as
$$m_\mu = \frac{1}{N}\sum_{i=1}^N \xi_i^\mu\,\sigma_i,$$
is used to assess the retrieval of the $\mu$-th pattern.
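A retrieval experiment in terms of these quantities can be sketched as follows: starting from a corrupted version of $\xi^1$, the zero-temperature dynamics (1) should drive the Mattis magnetization $m_1$ toward 1 at low load. Sizes, seed and naming are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 200, 10
xi = rng.choice([-1, 1], size=(K, N))
J = (xi.T @ xi) / N                  # Hebbian couplings
np.fill_diagonal(J, 0.0)

def mattis(sigma, pattern):
    """Mattis magnetization m_mu = (1/N) sum_i xi^mu_i sigma_i."""
    return (pattern @ sigma) / len(sigma)

# Cue: xi^1 with 10% of its entries flipped (inside the attraction basin)
sigma = xi[0].copy()
flipped = rng.choice(N, size=N // 10, replace=False)
sigma[flipped] *= -1

# Zero-temperature sequential dynamics: sigma_i <- sign(h_i)
for _ in range(10):
    for i in rng.permutation(N):
        sigma[i] = 1 if J[i] @ sigma >= 0 else -1

print(round(mattis(sigma, xi[0]), 2))
```

At this load the dynamics is expected to converge to the stored pattern, i.e., $m_1 \simeq 1$.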

III. THE p-NEURON HOPFIELD MODEL WITH SYNAPTIC NOISE
The $p$-neuron Hopfield model is described by the Hamiltonian
$$H^{(p)}_N(\boldsymbol\sigma) = -\frac{1}{N^{p-1}} \sum_{\mu=1}^K \sum_{i_1<\dots<i_p} \xi^\mu_{i_1}\cdots\xi^\mu_{i_p}\,\sigma_{i_1}\cdots\sigma_{i_p}, \qquad (14)$$
where the sum runs over all possible $p$-plets and self-interactions are excluded. This kind of model provides an example of dense AMs, which have been intensively studied in recent years (see e.g., [7–9,11]).
For even $p$, this model is thermodynamically equivalent to an RBM equipped with a hidden layer made of $K$ Gaussian neurons $\{z_\mu\}_{\mu=1,...,K}$ and with a visible layer made of $N$ binary neurons $\{\sigma_i\}_{i=1,...,N}$, but now couplings in the RBM are $(1+p/2)$-wise and include one hidden neuron and $p/2$ visible neurons, say $(z_\mu, \sigma_{i_1},...,\sigma_{i_{p/2}})$, and the related coupling is $\xi^\mu_{i_1}\cdots\xi^\mu_{i_{p/2}}$. To see the equivalence between this RBM and the model described by (14) we look at the RBM partition function and perform the Gaussian integration to marginalize over the hidden units as
$$Z_{\mathrm{RBM}} = \sum_{\boldsymbol\sigma}\int \prod_{\mu=1}^K \frac{dz_\mu}{\sqrt{2\pi}}\, \exp\left(-\frac{z_\mu^2}{2} + \sqrt{\frac{\beta'}{N^{p-1}}}\, z_\mu \sum_{i_1<\dots<i_{p/2}} \xi^\mu_{i_1}\cdots\xi^\mu_{i_{p/2}}\,\sigma_{i_1}\cdots\sigma_{i_{p/2}}\right) \propto \sum_{\boldsymbol\sigma} e^{-\beta H^{(p)}_N(\boldsymbol\sigma)},$$
where the inverse temperature $\beta$ has been properly rescaled into $\beta'$. Let us start the study of this system in the presence of slow noise only and let us check the stability of the configuration $\xi^1$, without loss of generality. By signal-to-noise analysis we write
$$h^{(p)}_i(\boldsymbol\xi^1)\,\xi^1_i = S + R^{(0)}, \qquad S \sim 1, \qquad (15)$$
and, for large $N$, from the central limit theorem,
$$R^{(0)} \sim \sqrt{\frac{K}{N^{p-1}}}. \qquad (16)$$
Recalling that the condition for retrieval is $R^{(0)} \ll S$, the highest load corresponds to $K \sim N^{p-1}$, as previously proved in [8]. This result shows that increasing the number of interacting spins allows one to arbitrarily increase the tolerance versus slow noise. It is then natural to ask whether an analogous robustness can be obtained versus synaptic noise too. In the next subsections we address this question for the three sources of noise outlined in Sec. II.
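The scaling $R^{(0)} \sim \sqrt{K/N^{p-1}}$ can be probed numerically. The sketch below uses the mean-field form of the local field for $p=4$ (self-interaction corrections dropped, a simplification of ours) and checks that the empirical standard deviation of the interference term tracks $\sqrt{K/N^3}$ up to a constant prefactor; function names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4

def interference(N, K):
    """One sample of the slow-noise term R^(0) on neuron i = 0, using the
    mean-field local field h_i ~ (1/N^{p-1}) sum_mu xi^mu_i (xi^mu . sigma)^{p-1}
    evaluated at sigma = xi^1 (condensed pattern mu = 1 excluded)."""
    xi = rng.choice([-1, 1], size=(K, N))
    overlaps = xi[1:] @ xi[0]
    return np.sum(xi[1:, 0] * overlaps ** (p - 1)) / N ** (p - 1)

# The empirical std of R^(0) should track sqrt(K / N^(p-1)) up to a constant
# prefactor (roughly sqrt(15), from the 6th moment of the Gaussian overlap)
ratios = []
for N in (50, 100):
    K = N
    samples = [interference(N, K) for _ in range(200)]
    ratios.append(np.std(samples) / np.sqrt(K / N ** (p - 1)))
    print(N, round(ratios[-1], 1))
```

The ratio stays roughly constant across sizes, which is the advertised $\sqrt{K/N^{p-1}}$ scaling.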

A. Noisy patterns
When the noise directly affects the patterns constituting the dataset, using Eq. (6) we can write the product between the local field and a pattern, according to Eq. (2), as
$$h^{(p)}_i(\boldsymbol\xi^1)\,\xi^1_i = S + R.$$
Splitting the sum into a signal $S$ and a noise term $R = R^{(0)} + \tilde R$, the quantity $R^{(0)}$ is the standard contribution due to slow noise given by Eq. (16), while $\tilde R = \sum_{n=1}^p R^{(n)}$ derives from the presence of synaptic noise; renaming $i$ as $i_1$, the contribution $R^{(n)}$ collects the terms of order $\omega^n$, summed over all possible choices $(i_{a_1} \dots i_{a_n})$ of $n$ indices out of $i_1 \dots i_p$. Using the central limit theorem (as explained in detail for $p=4$ in Appendix A) we obtain that
$$R^{(n)} \sim \omega^n \left(1 + \sqrt{K/N^{p-1}}\right).$$
Then, at leading order, it holds
$$\tilde R \sim \omega\,(1+\omega)^{p-1}\left(1 + \sqrt{K/N^{p-1}}\right).$$
Therefore, overall, the noise $R = R^{(0)} + \tilde R$ scales as
$$R \sim \omega\,(1+\omega)^{p-1} + (1+\omega)^{p}\,\sqrt{K/N^{p-1}}.$$
Recalling that $S \sim 1$, we conclude that retrieval is possible provided that $\omega \sim 1$, independently of the number $K$ of stored patterns (up to $K \sim N^{p-1}$). This implies that a diverging synaptic noise (i.e., $\omega \sim O(N^b)$, $b>0$) cannot be handled by the system even if the number $p$ of interacting spins and, accordingly, the number of links, is arbitrarily increased. This result is checked numerically as shown in Fig. 4. In particular, we notice that, as long as $\omega$ remains finite (or vanishing) as the size $N$ is increased, i.e., as long as $b \le 0$, the Mattis magnetization corresponding to the input pattern is non-null and the system can retrieve. The transition between the retrieval and the non-retrieval regime is sharper when the network size is larger. In Fig. 5 we focus on $p=2$ and we set the ratio $K/N < \alpha(b=0) \approx 0.138$, while we perform a fine tuning by varying $\omega \in [0,3]$. As expected, even small values of $\omega$ are sufficient to break down the retrieval capabilities.
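For $p=2$ this breakdown is easy to reproduce numerically: storing the corrupted patterns $\eta = \xi + \omega\tilde\xi$ as in Eqs. (6)-(7) and cueing with the clean $\xi^1$, retrieval survives small $\omega$ but not $\omega$ of a few units. An illustrative sketch (sizes and seed ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 400, 8                         # low load, p = 2 (illustrative sizes)

def final_overlap(omega):
    """Store corrupted patterns eta = xi + omega * noise (Eqs. 6-7),
    cue with the clean xi^1 and return the final Mattis magnetization."""
    xi = rng.choice([-1, 1], size=(K, N)).astype(float)
    eta = xi + omega * rng.standard_normal((K, N))
    J = (eta.T @ eta) / N
    np.fill_diagonal(J, 0.0)
    sigma = xi[0].copy()
    for _ in range(20):               # synchronous sign dynamics
        sigma = np.where(J @ sigma >= 0, 1.0, -1.0)
    return (xi[0] @ sigma) / N

results = {}
for omega in (0.1, 3.0):
    results[omega] = final_overlap(omega)
    print(omega, round(results[omega], 2))
```

Even at this very low load, the clean pattern is recovered for $\omega = 0.1$ but not for $\omega = 3$: the network has learned the corruption along with the pattern.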

B. Noisy learning
Let us now consider the AM corresponding to imperfect learning, as depicted in Fig. 2. This amounts to saying that the noise affects the $(p/2+1)$-component tensor in such a way that the coupling between neurons is
$$J_{i_1 \dots i_p} = \frac{1}{N^{p-1}} \sum_{\mu=1}^K \eta^\mu_{i_1 \dots i_{p/2}}\,\eta^\mu_{i_{p/2+1} \dots i_p}, \qquad \eta^\mu_{i_1 \dots i_{p/2}} = \xi^\mu_{i_1}\cdots\xi^\mu_{i_{p/2}} + \omega\,\tilde\xi^\mu_{i_1 \dots i_{p/2}}. \qquad (18)$$
Notice that this picture is possible only for even $p$ and constitutes a generalization of the system studied in [12]. The product between the local field and the pattern $\xi^1$ candidate for retrieval can again be split into a signal $S$ and a noise $R = \sum_{n=0}^2 R^{(n)}$ term; the signal and the zeroth order of noise are, as already shown,
$$S \sim 1, \qquad R^{(0)} \sim \sqrt{K/N^{p-1}}.$$
For the first-order contribution, in the limit of large network size (for more details we refer to Appendix A, where calculations for $p=4$ are reported), we find
$$R^{(1)} \sim \omega\,N^{1/2-p/4}.$$
Similarly, the second-order contribution is of the form
$$R^{(2)} \sim \omega^2\,\sqrt{K/N^{p-1}}.$$
We then deduce that the noise $R$ scales as
$$R \sim \sqrt{K/N^{p-1}} + \omega\,N^{1/2-p/4} + \omega^2\,\sqrt{K/N^{p-1}},$$
and therefore, neglecting subleading contributions and setting $K \sim N^a$ and $\omega \sim N^b$, the condition for retrieval reads $N^{1/2-p/4+b} + N^{(1-p+a)/2+2b} \ll 1$. If $a \le 1$, the first term prevails, the extremal condition for retrieval is $b = (p-2)/4$, and the tolerance is $\beta_p(a) \sim N^{(p-2)/4}$. Conversely, if $a > 1$, the second term prevails and consequently the extremal condition for retrieval becomes $b = (p-1-a)/4$, and the tolerance is
$$\beta_p(a) \sim N^{(p-1-a)/4}.$$
Note that in this case the tolerance depends on $a$, that is, on the network load. This shows that storing and tolerance are intimately tangled: the larger the load, the smaller the noise that can be handled. In particular, at low load, so for $a=1$, the tolerance reads
$$\beta_p(1) \sim N^{(p-2)/4},$$
as corroborated numerically in Fig. 6. For $p=2$ this kind of noise reduces to the case discussed in Subsec. III A and consistently we get $\beta_2(1) \sim 1$.
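A quick numerical sanity check for $p=4$: writing the field-pattern product in terms of the overlaps $m_\mu$ and of two Gaussian contractions $A_\mu$, $B_\mu$ (our shorthand; self-interaction corrections dropped), one can verify that with $K \sim N$ the product stays close to the signal value 1 even when $\omega$ grows like $N^{1/2}$. All names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(5)

def field_times_pattern(N, K, omega):
    """h_i(xi^1) xi^1_i for p = 4 with noisy learning,
    eta^mu_ij = xi^mu_i xi^mu_j + omega * g^mu_ij, evaluated at i = 0
    (mean-field form, self-interaction corrections dropped)."""
    xi = rng.choice([-1, 1], size=(K, N)).astype(float)
    g = rng.standard_normal((K, N, N))
    m = xi @ xi[0]                                 # overlaps m_mu = xi^mu . xi^1
    A = g[:, 0, :] @ xi[0]                         # A_mu = sum_j g^mu_0j xi^1_j
    B = np.einsum('mjk,j,k->m', g, xi[0], xi[0])   # B_mu = sum_kl g^mu_kl xi^1_k xi^1_l
    first = xi[:, 0] * m + omega * A               # sum_j eta^mu_0j xi^1_j
    second = m ** 2 + omega * B                    # sum_kl eta^mu_kl xi^1_k xi^1_l
    return xi[0, 0] * np.sum(first * second) / N ** 3

# With K ~ N (a = 1) and omega ~ N^{1/2} (b = 1/2), the product should
# stay close to the signal value 1: the tolerance grows with N
means = {}
for N in (40, 80):
    vals = [field_times_pattern(N, K=N, omega=0.3 * np.sqrt(N)) for _ in range(50)]
    means[N] = float(np.mean(vals))
    print(N, round(means[N], 2))
```

Unlike the noisy-pattern case, here a synaptic noise diverging with $N$ is compatible with a positive field-pattern product, in line with $\beta_4(1) \sim N^{1/2}$.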

C. Noisy storing
Finally, we consider noise acting directly on the couplings,
$$J_{i_1 \dots i_p} = \frac{1}{N^{p-1}} \sum_{\mu=1}^K \eta^\mu_{i_1 \dots i_p},$$
where $\eta^\mu_{i_1 \dots i_p}$ is the $(p+1)$-component tensor
$$\eta^\mu_{i_1 \dots i_p} = \xi^\mu_{i_1}\cdots\xi^\mu_{i_p} + \omega\,\tilde\xi^\mu_{i_1 \dots i_p}.$$
Still following the prescription coded by Eq. (2), the product between the local field $h_i$ and $\xi^1_i$ is split into a signal and a noise term. The signal scales as $S \sim 1$, while the noise is composed of solely two contributions, zeroth and first order. We have already computed the former and, as for the latter, it holds
$$R^{(1)} \sim \omega\,\sqrt{K/N^{p-1}}.$$
Therefore,
$$R \sim (1+\omega)\,\sqrt{K/N^{p-1}}.$$
Setting, as before, $K \sim N^a$ and $\omega \sim N^b$, the condition for retrieval becomes
$$N^{(a+1-p)/2}\left(1+N^b\right) \ll 1. \qquad (21)$$
This implies that the tolerance versus synaptic noise is
$$\beta_p(a) \sim N^{(p-1-a)/2}.$$
This is successfully checked numerically in Fig. 7. The particular case $p=2$ is considered in Fig. 8. Again, as pointed out in the previous section, tolerance and load are two sides of the same coin: an increase of the latter results in a decrease of the former. A similar problem, for the $p=2$ Hopfield model, has been studied by Sompolinsky [19,20]. In particular, the following couplings have been considered:
$$J^s_{ij} = \frac{1}{N}\sum_{\mu=1}^K \xi^\mu_i \xi^\mu_j + \frac{\delta_{ij}}{\sqrt{N}} \equiv J^{\mathrm{Hebb}}_{ij} + \tilde J^s_{ij}.$$
Here $\delta_{ij}$ are Gaussian variables with null mean and variance $\delta^2$, while $\tilde J^s_{ij}$ represents the correction to the Hebbian couplings due to noise. Focusing on the high-load regime, that is $K \sim N$, retrieval was found to be possible provided that $\delta \lesssim 0.8$. We can easily map the noise defined by Eq. (8) into this notation; indeed, the noisy contribution to the couplings reads
$$\tilde J_{ij} = \frac{\omega}{N}\sum_{\mu=1}^K \tilde\xi^\mu_{ij} = \frac{\sqrt{K}}{N}\,\omega_{ij},$$
where the $\omega_{ij}$ are Gaussian variables with null mean and variance $\omega^2$. Considering the high-load regime we then obtain
$$\tilde J_{ij} = \frac{\omega_{ij}}{\sqrt{N}}.$$
This shows that $\omega_{ij}$ is the counterpart of $\delta_{ij}$ and, therefore, that $\omega$ plays the same role as $\delta$. Recalling Eq. (21) and setting $p=2$ and $a=1$, we conclude that retrieval is possible provided that $\omega \lesssim 1$. This result is in perfect agreement with Sompolinsky's bound $\delta \lesssim 0.8$ and also with the simulations we ran.
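This scenario is cheap to simulate for $p=2$: adding Gaussian noise to the Hebbian couplings as in Eq. (8), retrieval survives small $\omega$ and is lost when the effective Sompolinsky parameter $\omega\sqrt{K/N}$ exceeds order one. A sketch with illustrative parameters of our choosing:

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 400, 20                        # load alpha = 0.05 (illustrative)

def final_overlap(omega):
    """Hebbian couplings plus Gaussian storing noise, Eq. (8):
    J = J_Hebb + (omega sqrt(K) / N) * symmetric Gaussian matrix."""
    xi = rng.choice([-1, 1], size=(K, N)).astype(float)
    J = (xi.T @ xi) / N
    g = rng.standard_normal((N, N))
    J += omega * np.sqrt(K) / N * (g + g.T) / np.sqrt(2.0)
    np.fill_diagonal(J, 0.0)
    sigma = xi[0].copy()
    for _ in range(30):               # synchronous sign dynamics
        sigma = np.where(J @ sigma >= 0, 1.0, -1.0)
    return (xi[0] @ sigma) / N

# Effective Sompolinsky parameter: delta = omega * sqrt(K / N)
results = {}
for omega in (0.2, 10.0):
    results[omega] = final_overlap(omega)
    print(omega, round(results[omega], 2))
```

For $\omega = 0.2$ the effective $\delta \approx 0.045$ and retrieval is clean; for $\omega = 10$ the effective $\delta \approx 2.2$ exceeds the critical value and the overlap collapses.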

IV. CONCLUSIONS
[Figure: noise $\omega$ varied linearly in $[0,5]$; two loads are compared, $K/N = 0.125$ (×) and $K/N = 0.04$ (+); in both cases, as $\omega$ becomes relatively large, retrieval is lost.]

In this work we considered dense AMs and we investigated the role of density in preventing retrieval breakdown due to noise. In particular, we allow for noise stemming from pattern interference (i.e., slow noise) and for noise
stemming from uncertainties during learning or storing (i.e., synaptic noise), while fast noise is neglected. Synaptic noise ultimately affects the synaptic couplings among the neurons making up the network, and we envisage different ways to model it, mimicking different physical situations. In fact, since couplings encode the pieces of information previously learned, we can account for the following scenarios: i. information during learning is provided corrupted, ii. information is supplied correctly but is imperfectly learned, iii. information is well supplied and learned but storing is not accurate. These cases are discussed leveraging the duality between AMs and RBMs [9,13–17].
Investigations were carried out analytically (via a signal-to-noise approach) and numerically (via Monte Carlo simulations), finding that, according to the way synaptic noise is implemented, the effects on retrieval can vary qualitatively. As long as the dataset is provided correctly during learning, synaptic noise can be suppressed by increasing redundancy (i.e., by letting neurons interact in relatively large cliques or work in a low-load regime). On the other hand, if, during learning, the machine was presented with corrupted pieces of information, it will learn the noise as well, and the correct information can be retrieved only if the original corruption is non-diverging, no matter how redundant the network is.

ACKNOWLEDGMENTS
EA is grateful to Università Sapienza di Roma (Progetto Ateneo RG11715C7CC31E3D) for financial support.

APPENDIX A

In this appendix we set $p=4$ and we go through the signal-to-noise calculations in detail.
The 4-neuron Hopfield model is described by the Hamiltonian
$$H^{(4)}_N(\boldsymbol\sigma) = -\frac{1}{N^3}\sum_{\mu=1}^K \sum_{i<j<k<l} \xi^\mu_i \xi^\mu_j \xi^\mu_k \xi^\mu_l\,\sigma_i\sigma_j\sigma_k\sigma_l, \qquad \mathrm{(A1)}$$
where the sum is meant without self-interactions. Let us start the study of this system in the presence of slow noise only and let us check the stability of the configuration $\xi^1$, without loss of generality. By signal-to-noise analysis we write
$$h^{(4)}_i(\boldsymbol\xi^1)\,\xi^1_i = S + R^{(0)}, \qquad S \sim 1, \qquad R^{(0)} \sim \sqrt{\frac{K}{N^3}}, \qquad \mathrm{(A2)}$$
where the asymptotic expressions are obtained exploiting the central limit theorem. Recalling that the condition for retrieval is $R^{(0)} \ll S$, the highest load corresponds to $K \sim N^3$.

Noisy patterns
We now turn to the case in which the network is affected by pattern noise. We begin by considering a situation in which the noise arises directly from the patterns; in particular, we suppose that the network stores the vectors
$$\eta^\mu_i = \xi^\mu_i + \omega\,\tilde\xi^\mu_i,$$
where the $\xi^\mu_i$ are the patterns we would like to memorize, while the $\tilde\xi^\mu_i$ are i.i.d. Gaussian variables with null mean and unitary variance. In order to study the stability of $\xi^1$ we consider the local field acting on it,
$$h_i(\boldsymbol\xi^1) = \frac{1}{N^3}\sum_{\mu=1}^K \sum_{j,k,l} \eta^\mu_i \eta^\mu_j \eta^\mu_k \eta^\mu_l\,\xi^1_j \xi^1_k \xi^1_l.$$
We split the product $h_i(\boldsymbol\xi^1)\,\xi^1_i$ into a signal $S$ and a noise $R = \sum_{n=0}^4 R^{(n)}$, where $R^{(n)}$ collects the terms of order $\omega^n$. The signal and the zeroth order of noise are straightforward:
$$S \sim 1, \qquad R^{(0)} \sim \sqrt{\frac{K}{N^3}}.$$
Expanding the products, each order $n \geq 1$ contains both terms stemming from the condensed pattern $\mu=1$ and terms stemming from the $K-1$ non-condensed ones; applying the central limit theorem to each of them one finds
$$R^{(n)} \sim \omega^n\left(1 + \sqrt{\frac{K}{N^3}}\right), \qquad n = 1,\dots,4.$$
Combining the four contributions we obtain the following scaling for the noise:
$$R \sim R^{(0)} + \omega\,(1+\omega)^3\left(1 + \sqrt{\frac{K}{N^3}}\right).$$
Recalling that $S \sim 1$, we deduce that the network can tolerate, at most, $\omega \sim 1$. In other words, the tolerance versus pattern noise satisfies $\beta(a) \sim 1$ for $a \le 3$.

Noisy learning
At the second level we consider the following form of synaptic noise:
$$\eta^\mu_{ij} = \xi^\mu_i \xi^\mu_j + \omega\,\tilde\xi^\mu_{ij}.$$
The local field is defined as
$$h_i(\boldsymbol\sigma) = \frac{1}{N^3}\sum_{\mu=1}^K \sum_{j,k,l} \eta^\mu_{ij}\,\eta^\mu_{kl}\,\sigma_j\sigma_k\sigma_l,$$
where, even if not specified, the sum does not contain self-interactions among spins. We want to study the stability of the pattern $\xi^1$. Recalling that $\eta^\mu_{ij} = \xi^\mu_i\xi^\mu_j + \omega\tilde\xi^\mu_{ij}$ we get
$$h_i(\boldsymbol\xi^1)\,\xi^1_i = \frac{1}{N^3}\sum_{\mu=1}^K\sum_{j,k,l} \xi^1_i\xi^1_j\xi^1_k\xi^1_l \left(\xi^\mu_i\xi^\mu_j\xi^\mu_k\xi^\mu_l + \omega\,\xi^\mu_i\xi^\mu_j\,\tilde\xi^\mu_{kl} + \omega\,\xi^\mu_k\xi^\mu_l\,\tilde\xi^\mu_{ij} + \omega^2\,\tilde\xi^\mu_{ij}\,\tilde\xi^\mu_{kl}\right).$$
We can split this sum into signal $S$ and noise $R = R^{(0)} + R^{(1)} + R^{(2)}$. The signal is $S \sim 1$. The contribution to noise due to interference among patterns is
$$R^{(0)} \sim \sqrt{\frac{K}{N^3}}.$$
As expected, in the absence of pattern noise the network can store up to $N^3$ patterns. At first order the synaptic noise contributes with $R^{(1)}$, for which we obtain
$$R^{(1)} \sim \omega\,N^{-1/2}.$$
Finally, the second order of the pattern noise is
$$R^{(2)} \sim \omega^2\,\sqrt{\frac{K}{N^3}}.$$
In conclusion, the noise can be written as
$$R \sim \sqrt{\frac{K}{N^3}} + \omega\,N^{-1/2} + \omega^2\,\sqrt{\frac{K}{N^3}}.$$
We set $K \sim N^a$ and $\omega \sim N^b$; in this way we obtain, at leading order,
$$R \sim N^{(a-3)/2} + N^{b-1/2} + N^{(a-3)/2+2b}.$$
Recalling that retrieval is possible provided that $R \ll S \sim 1$, we see that there are two different regimes.
• $a \le 1$. In this case the noise is dominated by the second term and the extremal condition for retrieval reads $b = 1/2$. Therefore the tolerance versus pattern noise is $\beta(a) \sim N^{1/2}$ for $a \le 1$.
• $a > 1$. Increasing the load reduces the tolerance versus pattern noise; indeed, the third term now dominates and the extremal condition for retrieval reads $b = (3-a)/4$. It then follows that $\beta(a) \sim N^{(3-a)/4}$ for $1 < a < 3$.

Noisy storing
Finally, the least challenging noise is the one applied to the 4-tensors or, analogously, to the couplings. This is of the form
$$\eta^\mu_{ijkl} = \xi^\mu_i \xi^\mu_j \xi^\mu_k \xi^\mu_l + \omega\,\tilde\xi^\mu_{ijkl}.$$
Again we consider the product between the local field $h_i$ and $\xi^1_i$:
$$h_i(\boldsymbol\xi^1)\,\xi^1_i = \frac{1}{N^3}\sum_{\mu=1}^K\sum_{j,k,l} \xi^1_i\xi^1_j\xi^1_k\xi^1_l \left(\xi^\mu_i\xi^\mu_j\xi^\mu_k\xi^\mu_l + \omega\,\tilde\xi^\mu_{ijkl}\right).$$
The signal, as already shown, scales as $S \sim 1$, while the noise is composed of two contributions, zeroth and first order. We have already computed the former,
$$R^{(0)} \sim \sqrt{\frac{K}{N^3}}.$$
As for the first order, it holds
$$R^{(1)} \sim \omega\,\sqrt{\frac{K}{N^3}}.$$
Setting, as before, $K \sim N^a$ and $\omega \sim N^b$, the condition for retrieval becomes
$$N^{(a-3)/2}\left(1+N^b\right) \ll 1,$$
which implies that the tolerance versus pattern noise is $\beta(a) \sim N^{(3-a)/2}$ for $a \le 3$.