Some Remarks on Replicated Simulated Annealing

Recently, authors have introduced the idea of training neural networks with discrete weights using a mix between classical simulated annealing and a replica ansatz known from the statistical physics literature. Among other points, they claim their method is able to find robust configurations. In this paper, we analyze this so-called "replicated simulated annealing" algorithm. In particular, we give explicit criteria that guarantee its convergence, and study when it successfully samples robust configurations. We also perform experiments using synthetic and real databases.


Introduction
In the past few years, there has been a growing interest in finding methods to train neural networks with discrete weights. As a matter of fact, when it comes to implementations, discrete weights allow one to reach better efficiencies, as they considerably simplify the multiply-accumulate operations, with the extreme case where weights become binary and there is no need to perform any multiplication anymore. Unfortunately, training neural networks with discrete weights is complex in practice, since it basically boils down to an NP-hard optimization problem. To circumvent this difficulty, many works have introduced techniques that aim at finding reasonable approximations [5,4,17,8]. Among these works, in a recent paper Baldassi et al. [2] discuss the learning process in artificial neural networks with discrete weights and try to explain why these networks work so efficiently. They propose a new measure, called the robust ensemble, that ignores configurations with poor computational performance; those, by definition, are configurations that are in a certain sense isolated. The robust ensemble, on the other hand, amplifies the dense regions with many good configurations. In [2] the authors present various algorithms to sample from this robust ensemble, one of which is Replicated Simulated Annealing, also called Simulated Annealing with Scoping. This algorithm combines the replica approach from statistical physics with the simulated annealing algorithm that is supposed to find the minima (or maxima) of a target function. We will define it in Section 2.
To give a first impression of this algorithm, assume we have $\Sigma := \{-1,+1\}^N$ as our state space, with $N$ large. On $\Sigma$ we have a very rugged energy function $E : \Sigma \to \mathbb{R}$, and we assume that $\min_{\sigma \in \Sigma} E(\sigma) = 0$. To find the minima of $E$ one could run a Metropolis algorithm for the Gibbs measure at inverse temperature $\beta > 0$,
$$\pi_\beta(\sigma) := Z_\beta^{-1} \exp(-\beta E(\sigma)).$$
Here $Z_\beta$ is the so-called partition function of the model, a normalizing factor that makes $\pi_\beta$ a probability measure. If one carefully lowers the temperature, i.e. if one increases $\beta$ slowly enough during the process, the corresponding Markov chain will get stuck in one of the maxima of $\pi_\infty$, which are easily seen to be the minima of $E$. This is the classical Simulated Annealing algorithm, cf. [18] or [10] for the seminal papers. The question of how to choose the optimal dependence of $\beta$ on the time $t$, together with the resulting convergence properties, has been extensively discussed; we just mention [13], [15], [11], [6], [3] for a short and by far not complete list of references. The upshot is that a good "cooling schedule" is of the form $\beta_t = \frac{1}{m} \log(1 + Ct)$, where $m$ can be roughly described as the largest hill to be climbed to get from an arbitrary state to one of the global minima of the target function. However, sometimes not all the global minima are equally important; in particular, one may be interested in regions with many global minima (or almost global minima), so-called dense regions. An obstacle may be that $E$ exhibits many global minima, but only relatively few of them are in dense regions. Let us motivate this question by a central example which we will often have in mind in this context and which is also one of the central objects in [2]. Example 1.1. Assume we have patterns $\xi^1, \ldots, \xi^M$, $\xi^\mu \in \{\pm 1\}^N$, $\mu = 1, \ldots, M$, where $M = \alpha N$ for some $\alpha > 0$. Each of these patterns belongs to one of two groups, which we indicate by $\vartheta(\xi^\mu) = \vartheta^\mu \in \{\pm 1\}$. The task is to classify these patterns using a perceptron with weights $W \in \{\pm 1\}^N$. The classification is supposed to be perfect, i.e.
we want that
$$\Theta\bigl(-\vartheta^\mu \langle W, \xi^\mu \rangle\bigr) = 0 \quad \text{for all } \mu = 1, \ldots, M.$$
Here $\Theta$ denotes the Heaviside function. Hence our classification task is fulfilled if
$$\sum_{\mu=1}^M \Theta\bigl(-\vartheta^\mu \langle W, \xi^\mu \rangle\bigr) = 0 \qquad (1)$$
(where we assume that $N$ is odd to avoid the specification of tie-breaking rules) or, equivalently, if the energy $E(W) = -\sum_{\mu=1}^M \Theta\bigl(\vartheta^\mu \langle W, \xi^\mu \rangle\bigr)$ attains its minimal value. The objective now is to find weights $W$ such that (1) is true. Obviously, this optimization problem is of the above-mentioned form. However, one prefers weights $W$ in so-called dense regions, i.e. weights that are surrounded by other weights $W'$ that are also minima of $E$. The idea is that these states have good generalization properties, i.e. a small generalization error: we want to find weights that also classify correctly input patterns which we have not seen in our training set $\xi^1, \ldots, \xi^M$. It is at least plausible that weights with a small generalization error lie in dense regions of $\{\pm 1\}^N$.
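To make Example 1.1 concrete, the following minimal Python sketch implements the misclassification count and a plain Simulated Annealing loop with a logarithmic cooling schedule of the form $\beta_t = \frac{1}{m}\log(1+Ct)$ discussed above. The function names and the default values of `m` and `c` are our own illustrative choices, not part of [2]; the sketch also tracks the best state visited, which a pure Metropolis chain would not.

```python
import math
import random

def perceptron_energy(W, patterns, labels):
    """Number of misclassified patterns, i.e. sum_mu Theta(-theta_mu <W, xi_mu>).
    W and the patterns are +/-1 lists of odd length N (odd rules out ties)."""
    return sum(1 for xi, theta in zip(patterns, labels)
               if theta * sum(w * x for w, x in zip(W, xi)) < 0)

def simulated_annealing(energy, n, steps, m=2.0, c=1.0, seed=0):
    """Metropolis chain on {-1,+1}^n with cooling beta_t = log(1 + c*t)/m.
    Here m stands in for the 'largest hill' constant; we do not estimate it."""
    rng = random.Random(seed)
    sigma = [rng.choice([-1, 1]) for _ in range(n)]
    e = energy(sigma)
    best, best_e = list(sigma), e
    for t in range(1, steps + 1):
        beta = math.log(1 + c * t) / m
        i = rng.randrange(n)                 # propose a single spin flip
        sigma[i] = -sigma[i]
        e_new = energy(sigma)
        # Metropolis acceptance for the Gibbs measure pi_beta
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            e = e_new                        # accept the flip
            if e < best_e:
                best, best_e = list(sigma), e
        else:
            sigma[i] = -sigma[i]             # reject: undo the flip
    return best, best_e
```

On a small realizable instance (labels produced by a teacher vector) the chain reliably reaches a zero-energy configuration.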
In this work we are interested in making explicit the convergence properties of the Replicated Simulated Annealing algorithm. We also perform experiments using synthetic and real databases. The outline is as follows: in Section 2 we mathematically formalize and describe the algorithm of Replicated Simulated Annealing, and in Section 3 we study its convergence properties. In Section 4 we perform experiments using synthetic and real databases. Finally, Section 5 is a conclusion.

Replicated Simulated Annealing
Recall that we are searching for the minima of a function $E : \Sigma \to \mathbb{R}$, where $\Sigma := \{-1,+1\}^N$. To find minima of $E$ in dense regions of the state space, the authors in [2] propose a new measure given by
$$P_{y,\beta,\gamma}(\sigma) := Z^{-1}(y, \beta, \gamma) \exp(y \Phi_{\beta,\gamma}(\sigma)) \qquad (2)$$
which at least formally has the structure of a Gibbs measure at inverse temperature $y$. Its "energy function" is given by
$$\Phi_{\beta,\gamma}(\sigma) := \log \sum_{\sigma' \in \Sigma} \exp\bigl(-\beta E(\sigma') - \gamma\, d(\sigma, \sigma')\bigr),$$
where $d(\cdot, \cdot)$ is some monotonically increasing function of a distance on $\Sigma$. This distance will be chosen below; in any case, there are not too many reasonable, essentially different distance functions on $\Sigma$. Since $\Phi_{\beta,\gamma}$ weights each configuration $\sigma'$ by a function of its distance to $\sigma$, and the $\sigma'$ again by an exponential of their energy, it is indeed plausible that $\Phi_{\beta,\gamma}$ is much smoother than $E$ and has its maxima in dense regions. We will come back to this question in the next section. However, a serious problem is how one could simulate from the measure $P_{y,\beta,\gamma}$. Indeed, computing the "energy" $\Phi_{\beta,\gamma}(\sigma)$ of a single configuration $\sigma$ involves, among other things, computing $E(\sigma')$ for all $\sigma' \in \Sigma$. Computing these values is almost as hard as finding the minima of $E$ (even though one might not immediately be able to tell which of these minima are in dense regions). To find a promising algorithm that does not rely on computing all the values of $E(\sigma)$, Baldassi et al. [2] propose the following: first of all, assume that $y \ge 2$ is an integer. Second, take as function of the distance between two spin configurations $\sigma$ and $\sigma'$ the (negative) inner product $d(\sigma, \sigma') = -\langle \sigma, \sigma' \rangle$. As a matter of fact, this is a natural choice, since two natural distance functions, the Hamming distance and the square of the Euclidean distance, are functions of the inner product:
$$d_H(\sigma, \sigma') = \tfrac{1}{2}\bigl(N - \langle \sigma, \sigma' \rangle\bigr), \qquad \|\sigma - \sigma'\|_2^2 = 2N - 2\langle \sigma, \sigma' \rangle,$$
and the $N$-dependent terms cancel, because they also occur in $Z(y, \beta, \gamma)$.
Using the fact that $y$ is an integer, we can now compute the partition function $Z(y, \beta, \gamma)$ of this model:
$$Z(y, \beta, \gamma) = \sum_{\sigma \in \Sigma} e^{y \Phi_{\beta,\gamma}(\sigma)} = \sum_{\sigma \in \Sigma} \sum_{\sigma^1, \ldots, \sigma^y \in \Sigma} \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a) - \gamma \sum_{a=1}^y d(\sigma, \sigma^a)\Bigr).$$
Hence $Z(y, \beta, \gamma)$ can be considered as the partition function on the space of all $(\sigma, \{\sigma^a\}) \in \Sigma^{y+1}$ of the measure
$$P(\sigma, \{\sigma^a\}) := Z^{-1}(y, \beta, \gamma) \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a) - \gamma \sum_{a=1}^y d(\sigma, \sigma^a)\Bigr).$$
Its marginal with respect to the second coordinate $\{\sigma^a\}$ is given by
$$Q(\{\sigma^a\}) := Z^{-1}(y, \beta, \gamma) \sum_{\sigma \in \Sigma} \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a) - \gamma \sum_{a=1}^y d(\sigma, \sigma^a)\Bigr). \qquad (4)$$
Making use of our choice $d(\sigma, \sigma') = -\langle \sigma, \sigma' \rangle$ we obtain for the numerator in $Q$:
$$\exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a)\Bigr) \sum_{\sigma \in \Sigma} \exp\Bigl(\gamma \sum_{i=1}^N \sigma_i \sum_{a=1}^y \sigma_i^a\Bigr) = \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a)\Bigr)\, 2^N \prod_{i=1}^N \cosh\Bigl(\gamma \sum_{a=1}^y \sigma_i^a\Bigr).$$
Putting the $2^N$ into the normalizing constant we thus obtain that
$$Q(\{\sigma^a\}) = \tilde{Z}^{-1} \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a) + \sum_{i=1}^N \log\cosh\Bigl(\gamma \sum_{a=1}^y \sigma_i^a\Bigr)\Bigr). \qquad (5)$$
This form of $Q$ is now accessible to a Simulated Annealing algorithm: being in $\{\sigma^a\}$, one picks one of the $\sigma^a$ at random and one coordinate $\sigma_i^a$ of $\sigma^a$ at random and flips it to become $-\sigma_i^a$. This new configuration is then accepted with the usual Simulated Annealing probabilities.
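A single move of the resulting chain can be sketched as follows, assuming the product form of $Q$: an energy term $\exp(-\beta \sum_a E(\sigma^a))$ times the replica-coupling term $\prod_i \cosh(\gamma \sum_a \sigma_i^a)$. The helper names are hypothetical; a full annealing run would wrap this step in a loop over an increasing schedule for $\beta$ (and possibly $\gamma$).

```python
import math
import random

def replicated_sa_step(replicas, energy, beta, gamma, rng):
    """One Metropolis step for the marginal measure Q({sigma^a}) with log-weight
    -beta * sum_a E(sigma^a) + sum_i log cosh(gamma * sum_a sigma_i^a).
    Picks one replica and one coordinate at random and proposes a spin flip."""
    y, n = len(replicas), len(replicas[0])
    a, i = rng.randrange(y), rng.randrange(n)
    col = sum(r[i] for r in replicas)        # sum_a sigma_i^a before the flip
    old = replicas[a][i]
    # change in log-weight from the interaction term at coordinate i
    # (for large arguments one would use a stable log-cosh instead)
    dlog = math.log(math.cosh(gamma * (col - 2 * old))) \
         - math.log(math.cosh(gamma * col))
    # ... and from the energy term of the flipped replica
    e_old = energy(replicas[a])
    replicas[a][i] = -old
    dlog += -beta * (energy(replicas[a]) - e_old)
    if dlog >= 0 or rng.random() < math.exp(dlog):
        return True                          # accept
    replicas[a][i] = old                     # reject: undo the flip
    return False
```

With $\gamma = 0$ the replicas decouple and each evolves as an ordinary Metropolis chain for its own energy.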
This energy function, however, may be a bit unwieldy when using Simulated Annealing, since it just tells how many patterns have been classified correctly, but not whether we are moving in a "good" or "bad" direction when the proposed configuration $\{W'_a\}$ has the same energy as the old configuration $\{W_a\}$. We therefore propose (as e.g. [2]) to use
$$E(W) = \sum_{\mu=1}^M E_\mu(W), \qquad E_\mu(W) := R\bigl(-\vartheta^\mu \langle W, \xi^\mu \rangle\bigr).$$
Here $R(x) = \frac{x+1}{2}\Theta(x)$ and we again assume that $N$ is odd, otherwise we would need to take $R(x) = \frac{x}{2}\Theta(x)$. In other words, $E_\mu$ is the number of bits that we need to change in order to classify $\xi^\mu$ correctly.
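A sketch of this refined energy, with the convention $R(x) = \frac{x+1}{2}\Theta(x)$ for odd $N$ (the function name is ours):

```python
def refined_energy(W, patterns, labels):
    """E(W) = sum_mu R(-theta_mu <W, xi_mu>) with R(x) = (x+1)/2 * Theta(x).
    For each misclassified pattern this counts the number of weight flips
    needed to classify it correctly: each flip moves the inner product by 2,
    and N is assumed odd so the product is never zero."""
    total = 0
    for xi, theta in zip(patterns, labels):
        x = -theta * sum(w * v for w, v in zip(W, xi))
        if x > 0:                      # pattern is misclassified
            total += (x + 1) // 2      # integer since x is odd for odd N
    return total
```

For instance, if $\langle W, \xi \rangle = -3$ with label $+1$, two flips suffice to raise the product to $+1$, and indeed $R(3) = 2$.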

Convergence of the annealing process
In this section we want to discuss the convergence properties of the annealing procedure introduced above. The two major questions are: does the process converge to an invariant measure, and if so, does this measure have the desired property of favoring dense regions? We will distinguish two cases: the first is when $\gamma$ in the definition of the measure $Q$ in (4) and (5) does not depend on time, the second is when it does. Before analyzing these two cases, we will slightly modify the annealing procedure to make it accessible to the best results available for discrete time, see [1]. As a matter of fact, we find discrete time slightly more appropriate for computer simulations than the continuous time set-up in e.g. [15], [11], or [6]. To this end, we will study cooling schedules where the inverse temperature $\beta_n$ is fixed for $T_n$ consecutive steps of the annealing process. Denote by $\nu_n$ the distribution of the annealing process $(X_t)$ at time $t_n := \sum_{k=1}^n T_k$. Note that $\nu_n$ can be computed recursively: if $S_\beta$ denotes the transition matrix of the Metropolis-Hastings chain (see [12] or [14]) at inverse temperature $\beta$ (see (6) below), then
$$\nu_n = \nu_{n-1} S_{\beta_n}^{T_n} = \nu_0 S_{\beta_1}^{T_1} \cdots S_{\beta_n}^{T_n},$$
where $\nu_0$ is a fixed probability measure on $\Sigma^y$.
3.1. Fixed γ. If $\gamma$ is fixed, it is convenient to split the Simulated Annealing algorithm introduced above into a $\gamma$-dependent part and a $\beta$-dependent part. To this end, let us introduce the following probability measure $\mu_0$ on $\Sigma^y$:
$$\mu_0(\{\sigma^a\}) := \Gamma^{-1} \exp\Bigl(\sum_{i=1}^N \log\cosh\Bigl(\gamma \sum_{a=1}^y \sigma_i^a\Bigr)\Bigr),$$
where $\Gamma$ is the appropriate normalizing constant. Next define a transition matrix $\Pi_\gamma$ on $\Sigma^y$. $\Pi_\gamma$ will only allow transitions from $\{\sigma^a\}$ to $\{\tilde\sigma^a\}$ if there are exactly one $a \in \{1, \ldots, y\}$ and one $i \in \{1, \ldots, N\}$ such that $\tilde\sigma_i^a = -\sigma_i^a$, while for all other $b$ and $j$ we have $\tilde\sigma_j^b = \sigma_j^b$. In this case, we define
$$\Pi_\gamma(\{\sigma^a\}, \{\tilde\sigma^a\}) := \frac{1}{yN} \min\Bigl(1, \frac{\mu_0(\{\tilde\sigma^a\})}{\mu_0(\{\sigma^a\})}\Bigr).$$
For all other configurations $\{\tilde\sigma^a\} \neq \{\sigma^a\}$ we set $\Pi_\gamma(\{\sigma^a\}, \{\tilde\sigma^a\}) := 0$, and the remaining probability mass is assigned to staying in $\{\sigma^a\}$.

Note that $\Pi_\gamma$ is nothing but the Metropolis-Hastings algorithm for the measure $\mu_0$ (see [14]). In particular, $\Pi_\gamma$ is reversible with respect to the measure $\mu_0$, i.e.
$$\mu_0(\{\sigma^a\})\, \Pi_\gamma(\{\sigma^a\}, \{\tilde\sigma^a\}) = \mu_0(\{\tilde\sigma^a\})\, \Pi_\gamma(\{\tilde\sigma^a\}, \{\sigma^a\}).$$
Now consider the Metropolis-Hastings chain on $\Sigma^y$ with proposal chain $\Pi_\gamma$ and transition probabilities
$$S_\beta(\{\sigma^a\}, \{\tilde\sigma^a\}) := \Pi_\gamma(\{\sigma^a\}, \{\tilde\sigma^a\}) \min\Bigl(1, \exp\Bigl(-\beta \sum_{a=1}^y \bigl(E(\tilde\sigma^a) - E(\sigma^a)\bigr)\Bigr)\Bigr). \qquad (6)$$
For an appropriate normalizing constant $\tilde\Gamma$ this chain has as its invariant measure
$$Q_\beta(\{\sigma^a\}) := \tilde\Gamma^{-1} \exp\Bigl(-\beta \sum_{a=1}^y E(\sigma^a)\Bigr) \mu_0(\{\sigma^a\}). \qquad (7)$$
So indeed, for each fixed $\beta > 0$, $\gamma > 0$ we have found a Metropolis chain for $Q$.
If we now let $\beta = \beta_n$ depend on $n$ in the form described at the beginning of the section, we arrive at a Simulated Annealing algorithm with piecewise constant temperature. We will quickly introduce some of Azencott's notation [1]. The invariant measure of $S_{\beta_n}$ is $Q_{\beta_n}$. Recall that we assumed $\min_{\sigma \in \Sigma} E(\sigma) = 0$, and define $B := \min_{\sigma : E(\sigma) \neq 0} E(\sigma)$.

Next we bound
$$\|Q_{\beta_{n+1}} - Q_{\beta_n}\| \le \kappa_1 (\beta_{n+1} - \beta_n)$$
for some constant $\kappa_1$. Indeed, such an estimate holds for any difference of Gibbs measures with respect to the same energy function. To see this, let $\rho_{\beta_n}(x) := \exp(-\beta_n U(x))/Z_n$ be a sequence of Gibbs measures with respect to an energy function $U$ on a discrete space of size $K$. We assume $\min_x U(x) = 0$ (without loss of generality, otherwise we subtract the minimum from $U$) and $\min_{x : U(x) \neq 0} U(x) > 0$.

To describe the spectral gap of $S_{\beta_n}$, for any two configurations $x, x' \in \Sigma^y$ and any path $p$ of single spin-flip moves connecting them, let
$$\mathrm{Elev}(p) := \max_{z \in p} \sum_{a=1}^y E(z^a)$$
denote the elevation of the path, and set
$$m := \max_{x, x'} \Bigl( \min_{p : x \to x'} \mathrm{Elev}(p) - \sum_{a=1}^y E(x^a) - \sum_{a=1}^y E(x'^a) \Bigr).$$
The quantity $m$ is related to the optimal cooling schedule for Simulated Annealing as well as to the spectral gap of the associated Metropolis-Hastings algorithm $S_\beta$. To understand this, consider $S_\beta$ as an operator on functions $f : \Sigma^y \to \mathbb{R}$,
$$(S_\beta f)(x) := \sum_{x'} S_\beta(x, x') f(x').$$
Let $\mathcal{E}_\beta$ be the associated Dirichlet form, i.e. for functions $f, g : \Sigma^y \to \mathbb{R}$,
$$\mathcal{E}_\beta(f, g) := \frac{1}{2} \sum_{x, x'} \bigl(f(x) - f(x')\bigr)\bigl(g(x) - g(x')\bigr)\, Q_\beta(x) S_\beta(x, x').$$
Then with
$$\mathrm{Gap}(\beta) := \inf\Bigl\{ \frac{\mathcal{E}_\beta(f, f)}{\mathrm{Var}_{Q_\beta}(f)} : f \text{ non-constant} \Bigr\}$$
we have

Proposition 3.1. There are constants $d > 0$ and $D < \infty$ such that for all $\beta \ge 0$,
$$d\, e^{-m\beta} \le \mathrm{Gap}(\beta) \le D\, e^{-m\beta}.$$

This establishes the relation of $m$ to the spectral gap of $S_\beta$. Introduce
$$\tilde{\mathcal{N}}_0 := \bigl\{ \{\sigma^a\} \in \Sigma^y : E(\sigma^a) = 0 \text{ for all } a = 1, \ldots, y \bigr\}.$$
In particular, we need that
$$\sum_{k=1}^n T_k \exp(-\beta_k m) \to \infty. \qquad (8)$$
In this case $\nu_n$ has the same limit as $Q_{\beta_n}$, and this is given by a distribution $Q_\infty$ with $Q_\infty(\{\sigma^a\}) \asymp \mu_0(\{\sigma^a\})$ on $\tilde{\mathcal{N}}_0$ and $Q_\infty(\{\sigma^a\}) = 0$ otherwise (here $\asymp$ denotes proportionality). Of course, $Q_\infty$ is normalized in such a way that it is a probability measure on $\tilde{\mathcal{N}}_0$.
Proof. The convergence part is basically the content of [1, Section 7]. Note that the computations there are done for a proposal chain $\Pi$ that has the uniform measure as its invariant distribution; however, the proof on p. 231 of [1] carries over verbatim to our situation. After that, it is an easy matter to check that $Q_{\beta_n}$ has a limiting distribution $Q_\infty$ and that $Q_\infty$ charges every point in $\tilde{\mathcal{N}}_0$ with a probability proportional to $\mu_0(\{\sigma^a\})$.
A case where (8) holds is given by $\beta_n = bn$ for a constant $b > 0$, with $m$ as given in Proposition 3.1 and $T_n$ of the order $\alpha^n$ for some $\alpha > 1$. As Azencott [1] points out, in this case $T_n$ and $t_n = \sum_{k \le n} T_k$ have the same order of magnitude, i.e. the algorithm spends most of the time in the lowest temperature band. One also sees the logarithmic relation between $\beta$ and $T$, i.e. $\beta_n$ is of the order $\frac{1}{m} \log T_n$.
We now turn to the question whether this algorithm achieves that typical samples from it have realizations in dense regions of $\Sigma$. First of all, this needs to be defined. The authors in [2] seem to have in mind a situation close to the following caricature.

Situation 3.4. We say that a sequence of energy functions $E_N$ on $\Sigma_N := \{\pm 1\}^N$ is $(a, b, \alpha_N, \delta_N)$-regular, if $E_N$ possesses a dense region, i.e. a Hamming ball $B_{\alpha_N}(\sigma)$ of radius $\alpha_N$ around a global minimum $\sigma$ of $E_N$ that contains at least $a_N$ global minima, as well as at most $b_N$ $\delta_N$-isolated global minima, i.e. global minima at Hamming distance at least $\delta_N$ from every other global minimum.

It is now rather obvious that $Q_\infty(\{\cdot\})$ prefers such dense regions:

Proposition 3.5. Assume we are in the situation described in Situation 3.4, i.e. we have a sequence of energy functions that is $(a, b, \alpha_N, \delta_N)$-regular. Then, given $\varepsilon > 0$, for any admissible choice of these parameters, there exist $y$, $N_0$ and $\gamma$ such that for all $N \ge N_0$ the $Q_\infty$-probability of the dense region is at least $1 - \varepsilon$.

Proof. Note that $Q_\infty$ has its mass concentrated on the set $\tilde{\mathcal{N}}_0$, and the differences in the mass for the various configurations from this set stem from the factor
$$\Gamma^{-1}\exp\Bigl(\sum_{i=1}^N \log\cosh\Bigl(\gamma \sum_{a=1}^y \sigma_i^a\Bigr)\Bigr).$$
Let us just consider the numerators of these weights. Let $\sigma$ be a $\delta_N$-isolated minimum of $E_N$. If all $\sigma^1, \ldots, \sigma^y$ are located in $\sigma$, then the numerator of $\mu_0(\{\sigma^a\})$ equals $\exp(N \log\cosh(\gamma y))$. Otherwise there is at least one $\sigma^a$ that is different from $\sigma$, say in a global minimum $\tau$ of $E_N$. By assumption $d_H(\sigma, \tau) \ge \delta_N$. Thus a configuration that has at least one $\sigma^a = \tau$ has a weight of at most $\exp\bigl((N - \delta_N)\log\cosh(\gamma y) + \delta_N \log\cosh((y-2)\gamma)\bigr)$. Now there are at most $b_N$ $\delta_N$-isolated minima. Hence the sum of the numerators of the probabilities of these isolated minima can be bounded from above by
$$\exp\bigl(N \log\cosh(\gamma y)\bigr)\, b_N \Bigl(1 + b_N\, y\, \frac{e^{(N-\delta_N)\log\cosh(\gamma y) + \delta_N\log\cosh((y-2)\gamma)}}{e^{N \log\cosh(\gamma y)}}\Bigr).$$
Here $b_N$ is a bound on the number of isolated minima, $\exp(N\log\cosh(\gamma y))$ is the weight when all $\sigma^a$ are identical, $b_N y$ is an upper bound on the number of choices we have when one $\sigma^a$ equals a given isolated minimum and at least one $\sigma^b$ is different, and finally $e^{(N-\delta_N)\log\cosh(\gamma y) + \delta_N\log\cosh((y-2)\gamma)}$ is a rough upper bound on the weight in that case. Note that we will choose $\gamma$ and $y$ below in such a way that $y\gamma \to \infty$ as $N \to \infty$; we will therefore bound $\log\cosh(y\gamma) \le y\gamma$. Then, as soon as the term in brackets is at most $2$, the contribution to the numerator of the probability of the isolated minima will be at most $2\exp(N\gamma y)\, b_N$. On the other hand, for the case that all $\sigma^a$ are in the dense region $B_{\alpha_N}(\sigma)$, we have $a_N^y$ choices. For each of these choices at least $N - y\alpha_N$ of the coordinates of all $\sigma^1, \ldots, \sigma^y$ are identical. Again, since $y\gamma \to \infty$ as $N \to \infty$, given $\varepsilon' > 0$ we may bound $\log\cosh(y\gamma) \ge y\gamma(1 - \varepsilon')$. Thus the overall weight (this is again the numerator of the corresponding probability) of the dense region is at least $a_N^y e^{(N - y\alpha_N)\gamma y (1-\varepsilon')}$. To compare the two weights, let us see if we can arrange the parameters in such a way that
$$2\, e^{N\gamma y}\, b_N \le \varepsilon\, a_N^y\, e^{(N - y\alpha_N)\gamma y (1-\varepsilon')}.$$
To this end, substitute $\gamma = \frac{N \log b}{2 y \delta_N}$ (and note that indeed $\gamma = \gamma_N \to \infty$ as $N \to \infty$) and take logarithms on both sides. Since, by assumption, $\alpha_N/\delta_N \to 0$ and $y$ does not depend on $N$, the term $y^2 \frac{\alpha_N \log b}{2\delta_N}$ converges to $0$. This implies that the exponent of the ratio of the isolated to the dense contribution will eventually become negative; hence the dense region carries an arbitrarily large share of the mass.
Remark. Reading [2] carefully, one may get the impression that for them a dense region is one with an exponential number of local minima of $E_N$. However, if we take the limit $\beta \to \infty$, as in a real Simulated Annealing schedule, the local minima that are not global minima will eventually get zero probability. If one works with finite times, this is, of course, not true. In this case, however, one could equally well study a low temperature Metropolis chain, since most of the time in the annealing schedule is spent in the low temperature region anyway, as remarked above. For this Metropolis-Hastings chain a result similar to Proposition 3.5 can be shown in much the same way.

3.2. The limit γ → ∞. The situation where also $\gamma$ depends on time and converges to infinity as time becomes large seems to be the version of the algorithm that the authors in [2] have in mind. In this case, we need to modify the considerations of the previous section. Again we will assume that we keep $\beta_n, \gamma_n$ constant on an interval $T_n \le t \le T_{n+1} - 1$. For the algorithm in this fixed time interval, the invariant measure is again given by $Q_\beta$ with $\beta = \beta_n$ and $\gamma = \gamma_n$ as given in (7), because during this interval the parameters of the Metropolis chain do not change. To stress the dependence on both parameters, we will from now on denote this measure by $Q_{\beta,\gamma}$. Following the arguments in the previous subsection, we now see that there is a constant $\kappa_2$ such that
$$\|Q_{\beta_{n+1},\gamma_{n+1}} - Q_{\beta_n,\gamma_n}\| \le \kappa_2 \bigl((\beta_{n+1} - \beta_n) + (\gamma_{n+1} - \gamma_n)\bigr),$$
where the role of the energy gap $B$ is now played by
$$B' := \min\bigl(\beta B,\; \log\cosh(\gamma y) - \log\cosh(\gamma(y - 2))\bigr).$$
The maximum of $H(\{\sigma^a\}) := \sum_{i=1}^N \log\cosh(\gamma \sum_{a=1}^y \sigma_i^a)$ is realized when we take all $\sigma^a$ identical, while the second term in $B'$ stems from the fact that we obtain the second largest value of $H$ by changing one $\sigma^a$ in one spin from a maximizing configuration. Since we will consider the limit $\gamma \to \infty$, we may safely replace $\log\cosh(\gamma y)$ by $\gamma y - \log 2$ and $\log\cosh(\gamma(y-2))$ by $\gamma(y-2) - \log 2$, so that this difference is approximately $2\gamma$. To determine how the cooling schedule has to be chosen, we need to estimate the spectral gap of the Metropolis chain. Note that, if we use Proposition 3.1 to do so, we run into the problem that the constants there depend on time, because the energy function does. The solution is, of course, to include this time dependence into the definitions. Hence for a time $t$ let
$$F_t(\{\sigma^a\}) := \sum_{a=1}^y E(\sigma^a) - \frac{1}{\beta_t} \sum_{i=1}^N \log\cosh\Bigl(\gamma_t \sum_{a=1}^y \sigma_i^a\Bigr).$$
Then we can represent the Simulated Annealing chain, which we will now denote by $S_{\beta,\gamma}$ and which is still given by (6) (with the only difference that now also $\Pi_\gamma$ depends on time), as a Simulated Annealing algorithm with time-dependent energy function $F_t$, see e.g. [20], [9]. Indeed, in this case we may alter the proposal chain in (6) from $\Pi_\gamma$ to $\Pi$, where $\Pi$, being in $\{\sigma^a\}$, picks one of $a = 1, \ldots, y$ and one index $i = 1, \ldots, N$ at random and alters $\sigma_i^a$ to $-\sigma_i^a$. Then $S_{\beta,\gamma}$ can be written as
$$S_{\beta_t,\gamma_t}(\{\sigma^a\}, \{\tilde\sigma^a\}) = \Pi(\{\sigma^a\}, \{\tilde\sigma^a\}) \min\Bigl(1, \exp\bigl(-\beta_t \bigl(F_t(\{\tilde\sigma^a\}) - F_t(\{\sigma^a\})\bigr)\bigr)\Bigr).$$
Now we can use results from [20] (cf. [19] for related work) to compute the spectral gap of $S_{\beta,\gamma}$. In analogy to what we did in the previous subsection, define $m_t$ as the maximal elevation, now computed with respect to $F_t$, needed to connect two arbitrary configurations. Again, in analogy to the previous subsection, for functions $f, g : \Sigma^y \to \mathbb{R}$ let the Dirichlet form $\mathcal{E}_{\beta,\gamma}$ be given by
$$\mathcal{E}_{\beta,\gamma}(f, g) := \frac{1}{2} \sum_{x, x'} \bigl(f(x) - f(x')\bigr)\bigl(g(x) - g(x')\bigr)\, Q_{\beta,\gamma}(x) S_{\beta,\gamma}(x, x'),$$
and let $\mathrm{Gap}(\beta, \gamma)$ be the corresponding spectral gap. Then it holds:

Proposition 3.6. There are constants $d > 0$ and $D < \infty$ such that for all $\beta \ge 0$,
$$d\, e^{-m_t \beta} \le \mathrm{Gap}(\beta, \gamma) \le D\, e^{-m_t \beta}.$$
Proof. This is the content of [20, Theorem 2.1].

This gives, in analogy to [1, (7.14)], a bound on the distance between $\nu_n$ and $Q_{\beta_n,\gamma_n}$ in terms of the increments of the parameters and the spectral gaps. Hence we need to choose our parameters $\beta_n, \gamma_n, T_n$ in such a way that the right-hand side of this bound converges to zero. In this case we have shown the following theorem.

Theorem 3.7. If
$$\sum_{k=1}^n T_k \exp(-\beta_k m_k) \to \infty \qquad (11)$$
as $n \to \infty$, the distribution $\nu_n$ of $S_{\beta_n,\gamma_n}$ has the same limit as $Q_{\beta_n,\gamma_n}$ as $n \to \infty$.

Following [1, (7.17)], a necessary condition for (11) is that $T_n$ grows at least like $\exp(\beta_n m_n)$. The result of the theorem is, however, not what the authors in [2] seem to intend with their introduction of the Replicated Simulated Annealing algorithm. Indeed, when $\beta_n \to \infty$ and $\gamma_n \to \infty$, the measure $Q_{\beta_n,\gamma_n}$ converges to $Q_{\infty,\infty}$. However, the latter is nothing but the uniform distribution on
$$\tilde{\mathcal{N}} := \bigl\{\{\sigma^a\} \in \Sigma^y : \sigma^1 = \cdots = \sigma^y \text{ and } E(\sigma^1) = 0\bigr\},$$
the set of $y$-fold replications of the global minima of $E$. In particular, $Q_{\infty,\infty}$ does not put higher probability on configurations in dense regions of the state space.
In the next section, we empirically study a slightly modified version of the algorithm of Replicated Simulated Annealing, using both synthetic toy datasets and real data bases.

Experiments
Throughout this section, we present various experiments we conducted to empirically study the effectiveness of Replicated Simulated Annealing. Notice that in this simulation section we necessarily stay closer to the setting in [2]. In particular, unlike in the preceding theoretical section, we will not let β and γ tend to infinity (for the theoretical part this was necessary in order to obtain convergence results, while it is impossible in practical applications). The precise setting will be described below. We put an emphasis on studying the effect of the choice of these hyperparameters on the performance of the algorithm, as well as the robustness of the found solutions. We conduct our experiments using the MNIST dataset, also described below, and synthetic data.
MNIST is a dataset of images depicting digits between 0 and 9. With a training set of about 6'000 examples per digit, the aim is to correctly predict which digits are contained in previously unseen images (the test set). MNIST images are 28x28 pixels and grey-scale. As such, they are typically represented by a vector of 784 numbers between 0 and 255. When training a binary (weights can only be -1 or 1) logistic regression classifier on MNIST using Replicated Simulated Annealing, we typically achieve an 88% accuracy on the test set, which is on par with the performance obtained with continuous weights and gradient descent. Note that when training our models, we use the cross-entropy loss as our energy. We train the networks for a total of 300'000 iterations, starting from a random configuration. Our models contain a total of 784 · 10 = 7840 parameters, corresponding to a single matrix whose input is a raw image of 784 dimensions and whose output is a 10-sized vector in which the largest coordinate indicates the associated decision.
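As an illustration, the energy we anneal can be sketched as the cross-entropy of a linear softmax classifier with ±1 weights. Shapes, input scaling, and the function name below are assumptions of this sketch, not the exact code used in our experiments.

```python
import numpy as np

def cross_entropy_energy(W, X, y):
    """Cross-entropy of a binary-weight linear classifier, used as the
    annealing energy E. Assumed shapes: W is (784, 10) with entries in
    {-1, +1}, X is (batch, 784) images scaled to [0, 1], y is (batch,)
    integer digit labels."""
    logits = X @ W                                   # (batch, 10) class scores
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)                # softmax probabilities
    # mean negative log-likelihood of the true classes
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
```

A proposed weight flip is then accepted or rejected according to the change of this energy, exactly as for the discrete energies above.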

4.2.
Effect of the initial and final values of β. As mentioned above, in our experiments we always take β from a bounded range of values [β i , β f ] (these bounds will be used in the remainder of this work). We first explore the influence of the initial and final values for β. Throughout our experiments, we change the value of β from β i to β f following an exponential interpolation, which appeared to give the best and most consistent results among the interpolations we tried. Note that for this first series of experiments we only train one model (γ = 0). First, in Table 1, we indicate the number of active transitions (cases where a potential flip of a value has been performed). Unsurprisingly, we observe that the higher the values of β, the less likely we are to perform flips. We observe a range of two orders of magnitude with our selected parameters. In Table 2, we depict the corresponding training loss and training accuracy. Interestingly, we observe that the largest values of β are not necessarily giving the best results, suggesting that allowing flips that immediately slightly increase the loss can be beneficial in the long run. We also observe that the results do not seem to be very sensitive to the choice of initial and final values for β, as a large range of these values yields very similar performance. Together with Table 1, we can observe that β i = 100 and β f = 100'000 is a reasonable choice of parameters. This is confirmed by Table 3, where we depict the corresponding test loss and test accuracy.
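The exponential interpolation we use for β can be sketched as follows; the endpoint convention (first and last iterations pinned exactly to β_i and β_f) is our choice.

```python
def beta_schedule(beta_i, beta_f, n_steps):
    """Exponential interpolation from beta_i to beta_f over n_steps
    iterations: the k-th value is beta_i * (beta_f/beta_i)^(k/(n_steps-1)),
    so beta is multiplied by a constant factor at each step."""
    ratio = beta_f / beta_i
    return [beta_i * ratio ** (t / (n_steps - 1)) for t in range(n_steps)]
```

For instance, interpolating from 100 to 100'000 in four steps multiplies β by a factor of 10 at each step.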

Influence of γ.
To better see the influence of γ, in the next series of experiments we reduce the number of training samples to accelerate computations; namely, we use 10'000 arbitrary training samples. We perform 10 runs for each value of γ, choosing the best values of β i and β f found in the previous section. We plot the error bars (confidence intervals at 95%) for each value of γ. In Figure 1 we depict the evolution of the training accuracy and training loss, in Figure 2 the evolution of the test accuracy and test loss, and in Figure 3 the evolution of the number of active transitions. We observe that γ helps in finding better solutions, that is to say solutions with higher accuracies on both the training and the test set. That is only true for a limited range, though, as increasing γ too much leads to a dramatic decrease in overall performance. This is not surprising, as a too large γ forbids many transitions that would result in reducing the loss.

Robustness of trained models.
To study the robustness of trained models, we consider randomly perturbing a proportion p of the weights in the trained models and evaluating the impact on the test accuracy. We average each point over 1'000 runs of random perturbations, but since it takes a very long time to train the models with MNIST, we always use the same trained models (one for each value of γ). In Figure 4, we depict the results for y = 3, in Figure 5 for y = 5 and in Figure 6 for y = 7. In order to add statistically more significant results, we also plot in Figure 7 results obtained with synthetic data and y = 10. Synthetic data is created by generating 30 vectors uniformly drawn with repetition from all binary vectors of size 100. In this experiment, we average the results over 1'000 tests for each point. For this additional experiment, we found that the best values are β i = 0.1 and β f = 1'000. Interestingly, we observe that the most robust models are the ones for a balanced value of γ, typically 0.8 or 1.6. This is even true for the case of synthetic data, despite the fact that all the models start with a perfect accuracy of 100% when uncorrupted. This is in line with the claims of the authors of [2].
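The perturbation experiment can be sketched as follows; `classify` is a hypothetical stand-in for the trained model's prediction function.

```python
import random

def perturbed_accuracy(W, classify, test_set, p, trials=100, seed=0):
    """Average test accuracy after flipping each weight independently with
    probability p, averaged over `trials` random perturbations.
    `classify(W, x)` returns the model's prediction for input x."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        # flip each +/-1 weight with probability p
        Wp = [w if rng.random() >= p else -w for w in W]
        correct = sum(classify(Wp, x) == label for x, label in test_set)
        acc += correct / len(test_set)
    return acc / trials
```

Sweeping p from 0 upward then traces out robustness curves like those in Figures 4 to 7.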

Conclusion
In this work, we have proposed to mathematically and empirically study the algorithm of Replicated Simulated Annealing, which is used to find good configurations of neural networks with discrete weights. We have given explicit convergence guarantees and discussed the ability of the algorithm to find good configurations in dense, robust regions of the search space. We also performed experiments using both real datasets and synthetic data. Overall, our findings show that Replicated Simulated Annealing is able to find interesting configurations, but that the gain compared to simple Simulated Annealing is rather small, and sometimes even nonexistent in the asymptotic regime, depending on whether one lets γ → ∞ or not.