The solution of the deep Boltzmann machine on the Nishimori line

The deep Boltzmann machine on the Nishimori line with a finite number of layers is exactly solved by a theorem that expresses its pressure through a finite dimensional variational problem of min-max type. In the absence of magnetic fields the order parameter is shown to exhibit a phase transition.


Introduction
A deep (restricted) Boltzmann machine can be considered as a special case of the mean field multi-species spin glass model introduced in [11], further studied in [13,27]. Specifically the set of spins is arranged into a geometry made of consecutive layers and only interactions among spins belonging to adjacent layers are allowed. In particular intra-layer interactions are forbidden. Such architectural assumption makes it impossible to fulfill the positivity hypothesis under which the results of [11,27] were obtained. In fact the positivity property, encoded in an elliptic condition, requires dominant intra-group interaction with respect to inter-group ones. While the general deep (restricted) Boltzmann machine is still an unsolved problem (see nevertheless [1,3,4,5,12,18,23,24] for centered Gaussian interactions), we present here its exact and rigorous solution in a subregion of the phase space known as Nishimori line. In a previous paper [2] we have fully solved the elliptic multi-species model on the Nishimori line, where the property of replica symmetry, i.e. the concentration of the overlap, was shown to hold. Such property indeed is fully general on the Nishimori line, see [10] on this respect, and does not rely on any positivity assumption of the interactions. While the positivity properties carry with them the typical bounds of Guerra's method [19,20], here the technical support to control and solve the model is based on the presence, on the Nishimori line, of a set of identities relating magnetizations and overlaps expectations [16,25,26] and correlation inequalities [21]. Our work provides the first exact solution of a disordered Statistical Mechanics model in a deep architecture and describes how the relative size of the layers influences the phase transition.
The relevance of the Nishimori line is twofold. On one side it provides the possibility to investigate the replica symmetric phase of the model through an exact solution for arbitrary strength of the interactions. On the other side it represents a bridge between a class of inference problems and Statistical Physics [26]. For instance the Sherrington-Kirkpatrick model on the Nishimori line corresponds to the Wigner Spiked model in the inference Bayesian optimal setting with binary signals [8,9]. Analogously, any multi-species mean-field model on the Nishimori line can be seen as a spatially coupled Wigner spiked model first introduced and studied in [6,7]. From the inference point of view here we deal with a deep spatially coupled Wigner spiked model with K layers, which in the case K = 2 coincides with the Wishart model (rank-one non-symmetric matrix estimation [9]).
The paper is organized as follows. In Section 2 we introduce the model and we present the main results in three theorems. Section 3 is a collection of tools and preliminary results, starting form the Nishimori identities and the correlation inequalities, up to the adaptive interpolation method. The proofs are contained in Section 4 and Section 5 collects some conclusions and perspectives.

Definitions and results
Consider a set of sites with cardinality N , divide it into K disjoint subsets, called layers and denoted by {L r } r=1,...,K with cardinality |L r | = N r and K r=1 N r = N . To each site i we associate an Ising spin σ i and we denote σ = (σ 1 , . . . , σ N ) a configuration of spins belonging to the space Σ N = {+1, −1} N . The Hamiltonian of the model is defined as: where the interaction coefficients and the external fields are independent Gaussian random variables distributed as follows and the matrix µ := (µ rs ) r,s=1,...,K and the vector h := (h r ) r=1,...,K have non-negative entries. Furthermore µ has the following tridiagonal structure: and is assumed to be symmetric without loss of generality. The geometrical architecture of the model is illustrated in Figure1. We point out that the very special choice of the Gaussian distribution (2), having mean values and variances tied to be the same, is called Nishimori line in Physics literature [26]. We will recall correlation identities and inequalities holding on the Nishimori line in the next Section. We denote m(σ) := (m r (σ)) r=1,...,K , q(σ, τ ) := (q r (σ, τ )) r=1,...,K with bold characters here and below standing for vectors and σ, τ ∈ Σ N = {−1, 1} N . We also set ∆ := (α r µ rs α s ) r,s=1,...,K ,α := diag(α 1 , α 2 , . . . , α K ) , where α r = N r /N are called the form factors. ∆ is the effective interaction matrix and encodes all the information on the interactions of the system. For later convenience we introduce also the matrix M := (µ rs α s ) r,s=1,...,K We notice that ∆ and M are tridiagonal matrices too. It is useful to express the Hamiltonian (1) in terms of centered Gaussian random variables plus a deterministic term (in vector notation): where (·, ·) denotes the Euclidean inner product in R K and The random term in (8) corresponds to the Hamiltonian studied in [3], but the addition of a deterministic part changes the properties of the model. We denote the random pressure per particle by and its quenched average byp where E is the expectation with respect to all the Gaussian random variables.
Remark 1. While throughout this paper we keep the form factors α r 's constant as N → ∞, all the results hold also under the weaker hypothesis that N r /N → α r ∈ (0, 1) (see also Remark 6, Sect. 4.3 for vanishing α r 's). Indeed any vanishing correction to α r doesn't change the thermodynamic limit of the quenched pressure density (11). This can be seen proving by interpolation method that at given N the quenched pressure is a Lipschitz function of ∆ w.r.t. the entrywise matrix norm r,s≤K |∆ r,s |. The (random) Boltzmann-Gibbs average will be denoted by To help the presentation we will occasionally make explicit the dependence of the Boltzmann-Gibbs measure on further parameters by using sub and superscripts, for instance · ( ) N,t . In the previous definitions (10)- (12) we have chosen to reabsorb the inverse absolute temperature β in the parameters µ rs and h r . The first result of this paper is the computation of the random pressure (10) in the thermodynamic limit.
Theorem 1 (Solution of the model). The random pressure (10) of a K-layer deep Boltzmann machine on the Nishimori line converges almost surely in the thermodynamic limit and its value is given by a K-dimensional variational principle: where x o and x e denote the vectors of the odd and even components of the order parameter x ∈ [0, 1) K respectively, and for any x ≥ 0 Moreover, definingx as the solution of the variational problem (13), we have for every r = 1, . . . , K and for all the points of the phase space (µ,α, h) wherex is h-differentiable and the matrix ∆ is invertible.
The proof of Theorem 1 relies on the adaptive interpolation method [8] combined with a concentration result and with the Nishimori identities, that will be presented in the next section. The main difference with the model solved in [2] is that the matrix ∆ is not definite, indeed its eigenvalues have alternating signs. This entails that the remainder identified by interpolation has not a definite sign and cannot be discarded a priori at the expense of an inequality. Moreover, the concentration of the overlap strongly depends on a notion of regularity of the path followed by the adaptive interpolation. Hence one has to carefully choose a path that is regular and allows also to exploit the convexities of the two sums involved in the functional (14).
Secondly, we focus on the properties of the consistency equation obtained from the optimization problem (13) when the matrix ∆ is invertible, that is when K is even. The stability of the optimizers of (13) is a more delicate problem with respect to the convex multi-species case [2], due to the min-max nature of the variational principle. In the following, given a square matrix A we denote by ρ(A) its spectral radius and by A (eo) the submatrix of A obtained by keeping only even rows and odd columns of A. An analogous definition is given for A (oe) , A (oo) , A (ee) . Notice that, when K is even, ∆ (eo) is an upper triangular K/2 × K/2 square matrix with non-zero diagonal elements and therefore it is invertible. Similar considerations hold for the sub-matrix ∆ (oe) = [∆ (eo) ] T . We prove the following Theorem 2. Let K be even and h = 0. If ρ([M 2 ] (oo) ) < 1 then x = 0 is the unique solution to the variational problem (13). Conversely, if ρ([M 2 ] (oo) ) > 1 then the solution of (13) is a vector x =x(M ) with strictly positive components satisfying the consistency equation: where z denotes a standard Gaussian random variable.
The proof of Theorem 2 amounts to the computation of the Hessian matrix of an auxiliary function introduced later and in a check of its eigenvalues. The peculiar form of the consistency equations due to the structure (3) plays a central role. Theorem 2 implies the existence of a phase transition in our model localized at zero magnetic field and unitary spectral radius as discussed in Remark 2 below. The following Proposition further clarifies the structure of the phase transition and how the system's geometry, encoded in the form factors α r 's, can influence it.
where the sup on the l.h.s. is taken over the form factors α 1 , . . . , α K ≥ 0 , K r=1 α r = 1 and the max on the r.h.s. is taken over r = 1, . . . , K − 1 . Furthermore the sup on the l.h.s. of (18) is attained if and only if one of the following conditions is verified: b) there exists r * ∈ {2, . . . , K − 1} such that Remark 2. For even K, Proposition 3 together with Theorems 1 and 2 show that if the interaction strengths µ r,r+1 < 2 for all r = 1, . . . , K − 1, then the magnetisations and the overlaps vanish as N → ∞ for every choice of the form factors (α 1 , . . . , α K ) ∈ (0, 1) K . By Theorem 2x is not identically zero on the space of parameters (µ,α) , hence the limiting quenched pressure (13) cannot be an analytic function. Proposition 3 also shows that as soon as µ r,r+1 > 2 for some r = 1, . . . , K − 1, then, by suitably localizing only two extensive layers near the maximal interaction (condition (19)), their magnetisations and overlaps turn out to be positive in the limit N → ∞.
Finally, we prove a uniqueness result that holds for arbitrary spectral radius.

Nishimori identities and correlation inequalities
The main thermodynamic properties of the model are consequences of a family of identities and inequalities for correlation functions that are due to the specific setting (2). The identities were introduced in the original work by H. Nishimori [25] while the inequalities were proved in [21,22]. The proof of the Nishimori identities can be found in the book [15] (Paragraph 2.6). In particular, for our purposes we will use the following: From the previous relations it follows that on the Nishimori line magnetizations and overlaps moments coincide. This can be seen by where the expectations q s N and q r q s N are taken with respect to the replicated Gibbs measure. As a consequence we have: Concerning the correlation inequalities on the Nishimori line [16,21,22] (see also Theorem 2.18 in [15] for a straightforward proof) we have that: Hence both the quenched pressure per particle and the magnetizations are nondecreasing with respect to each parameter h r , r = 1, . . . , K .

One-body system on the Nishimori line
It is useful to consider the following simple Hamiltonian on the Nishimori line, where only one-body interactions are taken into account: with h > 0. It is easy to show that the pressure of this model coincides with the function ψ(h) defined by (15): Since the Boltzmann-Gibbs average of a spin in the one body system equals σ 1 the Nishimori identities entail the following identities: for every n ∈ N, n ≥ 1. Starting from expression (15) we are going to determine the sign of the first derivatives of ψ . Gaussian integration by parts and identity (31) for n = 1 show that Using again Gaussian integration by parts and identity (31) for n = 1, 2, one finds: The sign of the third derivative can be obtained avoiding Gaussian integration by parts. Indeed by setting y = z √ h + h, replacing z 2 √ h + 1 = y+h 2h in the computations and using the identities (31) for n = 2, 3, one finds: The convexity of ψ will be crucial in the proof of Theorem 1. In particular, we will use the following Proof. ψ is convex on R ≥0 by equation (33). Then, using the linearity of (M x) r , it is easy to verify that for any λ ∈ [0, 1] and x 1 , x 2 ∈ A we have: In the proof of Theorem 4 we will use the following Lemma 6. Let z be a standard Gaussian random variable. The function is strictly positive, increasing and concave for h > 0.
Remark 3. As a consequence the function F is invertible on [0, ∞) . Its inverse F −1 is non negative and increasing on [0, 1) . Moreover one has

Interpolating model
We now introduce an interpolating model that compares the original model with a one-body model with suitably tuned external field.
Definition 1 (Interpolating model). Let t ∈ [0, 1]. The Hamiltonian of the interpolating model is: with J r i iid ∼ N (0, 1) independent of all the other Gaussian random variables, and Here Q =: (Q ,r ) r=1,...,K , while q := (q ,r ) r=1,...,K denotes a vector of K non-negative functions that will be suitably chosen in the following.
Now we can write the sum rule, which is contained in the following proposition.
Proposition 7 (Sum rule). The quenched pressure of the model rewrites as: where the remainder is: Proof. We stress that the interpolating model is on the Nishimori line for any t ∈ [0, 1], as can be seen by direct inspection. This allows us to use the identities and inequalities for any value of the interpolating parameter. See Proof of Proposition 2 in [2] for the details.
The tridiagonal form of ∆ allows us to specialize the previous sum rule as follows: or better, using the notation introduced for Theorem 2, where again the subscripts o, e denote the odd or even components of a vector, 1 := (1) r=1,...,K . We also denote The sum rules (42), (43) motivate the definition of the variational pressure (14) that for future convenience can be rewritten as: Remark 4. The variational function p var is convex in the even components x e and the odd components x o separately. This is due to the fact that the two bilinear forms in (45) have vanishing second derivatives w.r.t. pure odd or even components, while the terms containing ψ are convex by Lemma 5.
The sum rule exhibits a remainder (namely (41)) to deal with. Let us first introduce the following Definition 2 (Regularity of −→ Q (·)). We will say that the map This has to be combined with Liouville's formula, a standard analysis result that we report here for the reader's convenience.
Lemma 8 (Liouville's formula). Consider two matrices whose elements depend on a real parameter: Φ(t), A(t). Suppose that Φ satisfies the Cauchy problem Then: Now, the remainder (41) can be proved to concentrate under the regularity hypothesis, as stated in the following Lemma 9 (Concentration). Suppose −→ Q (·) is a regular map. For every r = 1, . . . , K consider the quantity and introduce the -average: We have: and for every r = 1, . . . , K . Therefore the magnetization (or the overlap) concentrates in -average.
The proof is carried out by treating the thermal and disordered fluctuations of L r separately. Actually, an estimate on the L 2 −convergence speed of the random pressure to the quenched one as N → ∞ is required and it will be given in the proofs section below. See Lemma 3 and Appendix A in [2] for the details. The role of is that of a regularizing perturbation and it is crucial for the proof. Its introduction intuitively allows to avoid critical points where the limiting pressure presents singularities and concentration may not occur, thus helping us to select always the stable state of the system. Indeed, for vanishing external magnetic fields h = 0 and in absence of , the system remains stuck in a vanishing average magnetization state because of the resulting spin flip symmetry in the Hamiltonian. However, as stated in Theorem 2 in the appropriate range of parameters the latter is thermodynamically unstable, meaning that any arbitrarily small magnetic field would bring the magnetization to positive values.

Proof of Theorem 1
The almost sure equality in (13) is a standard result based on the following concentration inequality: As a consequence Proof. The random pressure p N is a Lipschitz function of the independent standard Gaussian variablesĴ = (J rs ij / √ µ rs ) i,j,r,s ,ĥ = (h r i / √ h r ) i,r . Indeed: The inequality (53) then follows by a known concentration property of the Gaussian measure (see Theorem 1.3.4 in [29]). A tail integration finally leads to (54).
Since the r.h.s. in (53) is summable the Borel-Cantellli Lemma guarantees almost sure convergence. Now we move to the proof of the variational principle, i.e. the second equality in (13) which is going to be achieved through upper and lower bounds. For what follows, we neglect all the sub and superscripts in the Boltzmann-Gibbs averages, except for the t-dependence.
Lower bound. We select a path contained in [0, 1) K by means of the following coupled ODEsQ where f (t, Q) is the velocity field of the ODE. The perturbation is here introduced as an initial condition in order to have the interpolating functions in the form (44). Notice that f e is constant, while f o is a positive Lipschitz function of Q (t) ∈ (0, ∞) K thanks to identity (28) (where N is fixed). Therefore, by Cauchy-Lipschitz's theorem, the system of ODEs (56)-(57) has a unique global solution Q (t) , t ∈ [0, 1] , whose components are positive. By (56)-(57) we have ∆ (eo) q ,o (t) = ∆ (eo) x o and ∆ (oe) q ,e (t) = ∆ (oe) E m e t , hence: and reasoning in a similar way for the other t-integrations appearing in the sum rule (43) we obtain: where the reminder is Using Cauchy-Schwartz's inequality, thus, provided that the map → Q (t) is regular, the remainder R (t) vanishes in -average as N → ∞ by Lemma 9. To show that Q is regular we introduce the following matrix fields: Applying the chain rule we have: hence, by Liouville's formula (48) the Jacobian det(Φ (t)) is Now, using equations (56)-(57) one can compute: where non-negativity is a consequence of the correlation inequality (28), since Q ,r (t) can be seen as the variance of an external field on the Nishimori line in the interpolating Hamiltonian (39). Combining (64) and (65), it follows that Q is regular, as desired. Now, averaging on and tanking the lim inf N →∞ in inequality (59) we have The last term vanishes by Fubini's theorem, dominated convergence and Lemma 9. Finally, optimizing w.r.t. x o we get: Upper bound. Now, we seṫ In (68) With a slight abuse of notation we will stress the dependence of D (oo) (x, h) and D (ee) (x, h) on the even and odd components of x respectively as follows is a positive function of Q (t) with bounded derivatives for fixed N thanks to Lemma 6, indeed This ensures the existence of a unique global solution over [0, 1] to the system of ODEs (68)-(69). Moreover, the latter implies also that the map −→ Q (·) is still regular, because F is positive as proved in Lemma 6 and ∂E me t ∂Q ,r ≥ 0 thanks again to (28). This guarantees the positivity of the trace in (64) and forces the vanishing of the remainder R in -average by Lemma 9. Using Jensen's inequality, by the convexity of ψ we have and inserting it into the sum rule (43) we get where F ,o (t) := F (M (oe) E m e t + h o ) for brevity. As far as the last equality is concerned, we used the following: This is a consequence of the convexity of p var in x e (see Remark 4). In fact, a computation of the gradient of p var w.r.t. x e evaluated at E m e t yields: where we explicitly notice that the first term comes from the derivative of ψ (32). Then, taking the sup of p var over the odd components and the -average we get: Applying Lemma 9, Fubini's theorem and dominated convergence the two bounds match after sending N → ∞.
Proof of (16). Equations (27) and (28) imply that the quenched pressure is convex in each h r . Hence it is possible to exchange the derivative w.r.t. h r in (27) with the N → ∞ limit wherex is differentiable in h r (see Lemma IV.6.3 in [17]). Since for invertible ∆ the optimal order parameter must be a critical point of p var (see Proposition 11 below) by (32) and (14) we have that: A comparison with (27) and the Nishimori identity (22) lead to the identification: Remark 5. Assume for now that K is even. Observe that the entire proof could have been carried out also by computing all the inf xe over the convex set: on which all the functions involved are still real and well defined. This freedom is essentially due to the convexity of p var in x e . Indeed, p var has always a critical point in the domain A for any fixed x o ∈ [0, 1) K/2 , that must coincide with its minimum point by convexity as can be seen by direct inspection The inequality (59), that leads to the lower bound, clearly holds also for x e ∈ A ⊇ [0, 1) K/2 . The validity of (75) is less trivial and is due to the special choice F ,o (t).
In this case in fact, the critical point falls inside [0, 1) K/2 and this lets us extend the domain of x e to A without any loss of generality thanks to the mentioned convexity in x e . We will see later that even with this extension the point that realizes the sup inf lies inside the cube [0, 1) K .

Proof of Theorem 2
For this proof we rely on Remark 5, this will ease our computations. Let us write the gradient of (14) where we have used (31). In absence of external magnetic field (h = 0) x = 0 is a critical point for p var , namely a solution to the consistency equation obtained by equating (82) to 0. First of all, by Remark 4 and Remark 5 we infer that the optimization w.r.t. the even components x e is always stable, in the sense that there is always one optimizer once the odd components x o are fixed and it belongs to A. Define now the auxiliary function: withx e defined in (81). The following proposition investigates the possibility to have boundary solutions to the variational problem.
Proposition 11. Let K be even. The points x o at which the sup xo π(x o ; µ, h) is attained fulfill the consistency equation: As a consequence the necessary condition for x to realize the sup xo inf xe p var (x o , x e ; µ, h) is to be a critical point, namely to satisfy (84).
Proof. Using (81), the gradient of π is: We start by considering the case h = 0. One can immediately rule out the possibility that the sup is attained at the right border, i.e. x 2l−1 → 1 − for some l, because thanks to (38) ∂ x 2l−1 π → −∞. Then, the necessary condition for a point x o ∈ [0, 1) K/2 to realize the sup is that: component-wise, where equality holds for those components for which x 2l−1 > 0. If we set M 0,1 = M K,K+1 = 0, the generic 2l − 1 component of the previous is given by whence we understand that if x 2l−1 = 0 the only chance for the previous to be non positive is to have also x 2l−3 = x 2l+1 = 0 because F is positive and monotonic. On the contrary, if x 2l−1 > 0 first the corresponding gradient component must vanish; second by looking at the 2l + 1 component for instance we see that the last term is strictly positive. Necessarily, x 2l+1 must be strictly positive too with the corresponding gradient component that vanishes, and so on. Similar considerations hold for x 2l−3 . Finally, iterating these arguments, we infer that the supremum is attained at a point x o such that: The first in particular implies that alsox e = 0 . In both cases we can say that (84) is satisfied. When any h r is strictly positive it is immediate to see that there is a component of (85) with a positive contribution, the corresponding component of x o must then be positive. Therefore one iterates the same arguments as above obtaining again (84). In any case, by (81) the sup inf is attained at critical points of p var .
Using property (33), the Jacobian matrix of F (M x + h) is D(x, h), defined in (70), is diagonal, positive definite, invertible and its spectral radius is bounded by 1. From (85), an application of the Inverse Function Theorem leads to the Hessian matrix Thanks to the peculiar tridiagonal form of M we also have that from which by a simple rearrangement we can write the Hessian in its final form where for brevity we have neglected all the dependencies after the second equality in (93) and used (92). (93) uses only symmetric matrices in order to make manifest the global sign of the Hessian. It remains to show that the spectral radius of S (oo) is controlled by that of [M 2 ] (oo) . S (oo) is symmetric because ∆ (oe) = [∆ (eo) ] T , thus its spectral radius coincides with the matrix norm induced by the Euclidean scalar product. Then by norms sub-multiplicativity and matrix similarity one easily gets Iterating the same arguments we get to where the last equality follows again by matrix similarity. The previous implies that the matrix [−1 + S] (oo) in (93) is negative definite, making H xo π negative definite too, and hence π is concave under the hypothesis ρ([M 2 ] (oo) ) < 1. In turn, this ensures uniqueness of the solution to the consistency equation (84) and to the variational problem (13). In particular when h = 0, x = 0 is the unique solution.
Conversely, for h = 0 and ρ([M 2 ] (oo) ) > 1 the Hessian has at least one positive eigenvalue at the origin x o = 0, but this is in general not enough to ensure x o = 0 does not realize the sup anymore. One has to check that there is a direction of increment of π that intersects the cube [0, 1) K/2 , otherwise the system could remain stuck on the border at x o = 0 due to the positivity constraints on the variables.
It is easy to see that [M 2 ] (oo) is irreducible, because its associated graph is strongly connected, and it has non negative entries. Hence, by Perron-Frobenius Theorem the eigenvector v relative to the largest eigenvalue ρ([M 2 ] (oo) ) is component-wise strictly positive, thus pointing inside the cube, and by a Taylor expansion around x o = 0 we have: that is positive form small enough > 0. Finally, by Proposition 11 the solution shifts in favour of a point x =x(M ) ∈ (0, 1) K .

Proof of Proposition 3
Proposition 3 relies on an algebraic lemma, which we write here for convenience. Its proof can be found in [3] (see Lemma 1 therein).
For convenience we set α p ≡ 0 for p / ∈ {1, . . . , K} and µ p,p+1 ≡ 0 for p / ∈ {1, . . . , K − 1}. The last inequality in (102) follows by Lemma 12, since b (r) p ≤ µ 2 and p α p = 1 . Therefore ρ ≤ µ 2 4 combining inequalities (101), (102). Now assume that ρ = µ 2 4 . In particular the inequality in (102) must be saturated, hence there exists r such that Then by Lemma 12,condition (19) or (20) must be verified. Vice-versa assume that condition (19) or (20) holds true. In this case notice that many of the α r 's are zero, since K r=1 α r = 1 . Thus the matrix [M 2 ] (oo) notably simplifies and one can check directly that µ 2 4 is (the only non-zero) eigenvalue. This proves ρ = µ 2 4 . Remark 6. It is not difficult to realize that Theorem 1 holds also when α r → 0 for some r. Indeed, by (61) and Lemma 9 we see that it is sufficient to require: The proof of Lemma 9 consists in showing that (see inequalities (A.11) and (A.24) in [2]): The previous equalities both vanish in the thermodynamic limit with the choice s N ∼ N −1/16K for instance, independently on α r . Hence the remainder of the interpolation in the proof still goes to 0 with no variation in the hypothesis. When a form factor, say α r , vanishes the related component of the order parameter x r disappears from the variational pressure (14). Moreover, if the corresponding L r is an intermediate layer one can see that the system decouples into two independent DBMs because the effective interaction matrix ∆ becomes block diagonal and the convex ψ-term related to the mentioned layer is weighed by α r . The global variational pressure is thus constant in x r in the thermodynamic limit.

Proof of Theorem 4
Uniqueness of the solution of the consistency equation for positive external fields can be proven adapting the strategy used in [3], where the replica symmetric equation of a deep Boltzmann machines was proved to admit a unique solution when the couplings and the external fields are centred Gaussian random variables. In particular the layers structure permits to "decouple" the interactions as shown in the following Remark 7. The consistency equation (21) is equivalent to the following: where we have introduced the auxiliary variables a 1 , . . . , a K−1 > 0 and the functions Indeed, using the definition of the matrix M , it can be easily verified that (M x) r = Θ r (a) x r for r = 1, . . . , K , for a satisfying the second relation in (108).
The proof of Theorem 4 relies on the following Lemma 13. Let z be a standard Gaussian random variable. For every t, h > 0 the equation has a unique positive solution that we denote by x =x(t, h) > 0 . Moreoverx is strictly increasing as a function of both t > 0 and h > 0.
of Theorem 4. Equation (110) rewrites as x = F ( t x+h) , where F (h) ≡ E tanh(z √ h+ h) . By Lemma 6, F takes values in (0,1), is strictly increasing and concave. It follows that equation (110) admits a unique solution in (0, 1) and in particular we can show that the function f (x) ≡ 1 x F ( t x + h) is strictly decreasing for x > 0 . Indeed by Lemma 6 we have: hence Now denoting byx(t, h) the unique positive solution of equation (110), we can prove its monotonicity with respect to both parameters by differentiating the self-consistent which leads to Lemma 6 ensures that (115), (116) are positive quantities, hence to conclude it suffices to show that 1 − t F (tx + h) > 0. Indeed, dividing the inequality (113) by x, evaluating it at x =x(t, h) and using the self-consistent equation (114), one finds precisely: Proof of Theorem 4. By Lemma 13, the first line of (108) is equivalent to: wherex is uniquely defined and strictly increasing with respect to both its arguments. On the other hand the second line of (108) rewrites as: It is convenient to set X 1 (a 1 ) ≡ α 1x Θ 1 (a) , h 1 = α 1x ∆ 1,2 a 1 , h 1 and for r ≥ 2 X r 1 a r−1 , a r ≡ α rx Θ r (a) , h r = α rx ∆ r,r−1 a r−1 + ∆ r,r+1 a r , h r .
Therefore equation (108) is equivalent to the following: We will show by induction on r ≥ 1 that for any given a r+1 ≥ 0 there exists a unique a r =ā r (a r+1 ) > 0 such that . . .
X 1 (a 1 ) a 1 · · · a r−1 a r = X r+1 1 a r , a r+1 and moreoverā r is a strictly increasing function with respect to a r+1 . The uniqueness of solution of (121) will follow immediately by stopping the induction at r = K − 1 and choosing a K = 0 and the Theorem will be proven thanks to Remark 7.
• Case r = 1: given a 2 ≥ 0, let's consider the equation By Lemma 13 the left-hand side of (123) is a strictly increasing function of a 1 > 0 and takes all the values in the interval (0, ∞), while the right-hand side is a decreasing function of a 1 > 0 and takes non-negative values. Therefore there exists a unique a 1 =ā 1 (a 2 ) > 0 solution of (123). Now taking derivatives on both sides of (123) and using again Lemma 13, one finds: > 0 (124) henceā 1 is a strictly increasing function of a 2 .
By inductive hypothesis and Lemma 13, the left-hand side of (125) is a strictly increasing function of a r > 0 and takes all the values in the interval (0, ∞), while the right hand-side of (125) is a decreasing function of a r > 0 and takes non-negative values. Therefore for every a r+1 ≥ 0 there exists a unique a r =ā r (a r+1 ) > 0 solution of (125). Now taking derivatives on both sides of (125) one finds: which, using again the inductive hypothesis and Lemma 13, entails thatā r is a strictly increasing function of a r+1 .

Conclusions and perspectives
In this work we have solved the K-layer deep restricted Boltzmann machine on the Nishimori line which is an instance of a non-convex multi-species model. The solution consists in the computation of the pressure in the thermodynamic limit which is expressed in terms of an ordinary min-max variational principle over K real positive numbers. The properties of the optimizer show the presence of a phase transition related to the interaction strength and to the relative size of each layer defining the geometry of the system. In particular we discovered that the geometry of the system may tune the phase transition.
A possible way to investigate the model for general values of the parameters would be to test the stability of our results when the system is in a neighborhood of the Nishimori line. We plan to perturb the distribution (2) and check under which conditions the replica symmetry property breaks down.
After the completion of this work, paper [28] was brought to our attention where the mutual information for a wide class of inference problems is solved by means of a variational principle. While it is possible to obtain our model as an instance of the one considered there, the variational principle presented has no clear correspondence to ours. We finally mention that a subsequent work [14] contains a general result that extends the one in the present paper. In particular the authors compute the limiting free energy with a Hamilton-Jacobi approach which proves to be effective also when dealing with lack of convexity in the interactions. On the other hand, the simplicity of our setting allows us to carry out a thorough study of the variational formula by locating the phase transition and investigating its dependency on the geometry of the system as in Theorem 2, Proposition 3 and Theorem 4.