1 Introduction

Boltzmann machine learning (BML) [1] has been actively studied in the fields of machine learning and statistical mechanics. In statistical mechanics, BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is equivalent to an Ising model and BML can be regarded as the corresponding inverse problem. The framework of the usual BML is as follows. Given a set of observed data points, the appropriate values of the Boltzmann machine parameters, namely the biases and couplings, are estimated through maximum likelihood (ML) estimation. Because BML involves intractable multiple summations (i.e., evaluation of the partition function), several approximations have been proposed for it from the viewpoint of statistical mechanics [2]. Examples include methods based on mean-field approximations (e.g., the Plefka expansion [3] and the cluster variation method [4]) [5,6,7,8,9,10,11] and methods based on other approximations [12,13,14].

This chapter focuses on another type of learning problem for the Boltzmann machine. Consider the prior distributions of the Boltzmann machine parameters and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected to regularized ML estimation, in which the hyperparameters can be regarded as regularization coefficients. The regularized ML estimation is important for preventing overfitting to the dataset. As mentioned above, the usual BML aims to optimize the values of the Boltzmann machine parameters using a set of observed data points. However, the aim of the problem presented in this chapter is the estimation of the appropriate values of the hyperparameters from the dataset without estimating the specific values of the Boltzmann machine parameters. From the Bayesian viewpoint, this can be potentially accomplished by the empirical Bayes method (also known as type-II ML estimation or evidence approximation) [15, 16]. The schemes of the usual BML and the problem investigated in this chapter are illustrated in Fig. 11.1.

Fig. 11.1
Illustration of the scheme of the empirical Bayes method considered in this chapter

Recently, an effective algorithm was proposed for the empirical Bayes method for the Boltzmann machine [17]. Using this method, the hyperparameter estimates can be obtained without costly operations. This chapter aims to explain this effective method.

The rest of this chapter is organized as follows. The formulations of the Boltzmann machine and its usual and regularized ML estimations are presented in Sect. 11.2. The empirical Bayes method for the Boltzmann machine is presented in Sect. 11.3. Section 11.4 describes a statistical mechanical analysis of the empirical Bayes method and an inference algorithm obtained from the analysis. Experimental results for the presented algorithm are given in Sect. 11.5. A summary and some discussion are provided in Sect. 11.6. The appendices for this chapter are given in Sect. 11.7.

2 Boltzmann Machine with Prior Distributions

Consider a fully connected Boltzmann machine with n (bipolar) variables \(\boldsymbol{S}:= \{S_i \in \{-1,+1\} \mid i = 1,2,\ldots , n\}\) [1]:

$$\begin{aligned} P(\boldsymbol{S} \mid h,\boldsymbol{J}):=\frac{1}{Z(h,\boldsymbol{J})}\exp \Big (h \sum _{i=1}^n S_i + \sum _{i<j}J_{ij}S_iS_j\Big ), \end{aligned}$$
(11.1)

where \(\sum _{i<j}\) is the sum over all distinct pairs of variables, that is, \(\sum _{i<j} = \sum _{i=1}^n\sum _{j = i+1}^n\). \(Z(h,\boldsymbol{J})\) is the partition function defined by

$$\begin{aligned} Z(h,\boldsymbol{J}):= \sum _{\boldsymbol{S}}\exp \Big (h \sum _{i=1}^n S_i + \sum _{i<j}J_{ij}S_iS_j\Big ), \end{aligned}$$

where \(\sum _{\boldsymbol{S}}\) is the sum over all possible configurations of \(\boldsymbol{S}\), that is,

$$\begin{aligned} \sum _{\boldsymbol{S}} := \prod _{i=1}^n \sum _{S_i = \pm 1}. \end{aligned}$$

The parameters \(h \in (-\infty , +\infty )\) and \(\boldsymbol{J} := \{J_{ij} \in (-\infty , +\infty ) \mid i<j\}\) denote the bias and couplings, respectively.
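To make Eq. (11.1) and the partition function concrete, the following is a minimal Python sketch (assuming NumPy; the function names are illustrative and not part of the original formulation) that evaluates \(Z(h,\boldsymbol{J})\) and \(P(\boldsymbol{S} \mid h,\boldsymbol{J})\) by brute-force enumeration, which is feasible only for small n:

```python
import itertools

import numpy as np


def energy_terms(S, h, J):
    """h * sum_i S_i + sum_{i<j} J_ij * S_i * S_j for one configuration S."""
    n = len(S)
    pair_sum = sum(J[i, j] * S[i] * S[j]
                   for i in range(n) for j in range(i + 1, n))
    return h * np.sum(S) + pair_sum


def partition_function(h, J, n):
    """Brute-force Z(h, J); the 2^n-term sum limits this to small n."""
    return sum(np.exp(energy_terms(np.array(S), h, J))
               for S in itertools.product([-1, +1], repeat=n))


def boltzmann_prob(S, h, J):
    """P(S | h, J) of Eq. (11.1) for a single configuration S."""
    S = np.asarray(S)
    return np.exp(energy_terms(S, h, J)) / partition_function(h, J, len(S))


# Tiny example: n = 4 spins with random couplings.
rng = np.random.default_rng(0)
n = 4
J = np.triu(rng.normal(0.0, 0.5, size=(n, n)), k=1)
print(boltzmann_prob([+1, -1, +1, +1], h=0.1, J=J))
```

The exhaustive sum over the \(2^n\) configurations is exactly the operation that becomes intractable for realistic n.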

Given N observed data points, \(\mathcal {D}:=\{\mathbf {S}^{(\mu )} \in \{-1,+1\}^n \mid \mu = 1,2,\ldots , N\}\), the log-likelihood function is defined as

$$\begin{aligned} L_{\mathrm {ML}}(h,\boldsymbol{J}):=\frac{1}{n N}\sum _{\mu = 1}^N \ln P(\mathbf {S}^{(\mu )} \mid h,\boldsymbol{J}). \end{aligned}$$
(11.2)

The maximization of the log-likelihood function with respect to h and \(\boldsymbol{J}\) (i.e., the ML estimation) corresponds to BML (or the inverse Ising problem), that is,

$$\begin{aligned} \{\hat{h}_{\mathrm {ML}},\hat{\boldsymbol{J}}_{\mathrm {ML}}\} = \mathop {\text {arg max}}_{h, \boldsymbol{J}}L_{\mathrm {ML}}(h,\boldsymbol{J}). \end{aligned}$$
(11.3)

However, the exact ML estimates cannot be obtained because the gradients of the log-likelihood function involve intractable sums over \(O(2^n)\) terms.
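To make the source of this intractability explicit, the gradients of Eq. (11.2) can be written out (a standard computation, reproduced here for completeness):

$$\begin{aligned} \frac{\partial L_{\mathrm {ML}}(h,\boldsymbol{J})}{\partial h} = \frac{1}{nN}\sum _{\mu = 1}^N \sum _{i=1}^n \mathrm {S}_i^{(\mu )} - \frac{1}{n}\sum _{i=1}^n \langle S_i \rangle ,\qquad \frac{\partial L_{\mathrm {ML}}(h,\boldsymbol{J})}{\partial J_{ij}} = \frac{1}{nN}\sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )}\mathrm {S}_j^{(\mu )} - \frac{1}{n}\langle S_i S_j \rangle , \end{aligned}$$

where \(\langle \cdots \rangle \) denotes the expectation with respect to \(P(\boldsymbol{S} \mid h,\boldsymbol{J})\) in Eq. (11.1). Setting these gradients to zero shows that ML estimation matches the model expectations \(\langle S_i \rangle \) and \(\langle S_i S_j \rangle \) to the corresponding sample averages; it is precisely these expectations that require the \(O(2^n)\)-term sums.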

We now introduce the prior distributions of the parameters h and \(\boldsymbol{J}\) as \(P_{\mathrm {prior}}(h\mid H)\) and

$$\begin{aligned} P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma ):= \prod _{i<j} P_{\mathrm {prior}}(J_{ij} \mid \gamma ), \end{aligned}$$
(11.4)

where H and \(\gamma \) are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is the Bayesian interpretation of the regularized ML estimation [16]. Given the observed dataset \(\mathcal {D}\), using the prior distributions, the posterior distribution of h and \(\boldsymbol{J}\) is expressed as

$$\begin{aligned} P_{\mathrm {post}}(h,\boldsymbol{J} \mid \mathcal {D}, H, \gamma ) = \frac{P(\mathcal {D} \mid h, \boldsymbol{J})P_{\mathrm {prior}}(h\mid H)P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma )}{P(\mathcal {D} \mid H, \gamma )}, \end{aligned}$$
(11.5)

where

$$\begin{aligned} P(\mathcal {D} \mid h, \boldsymbol{J}):= \prod _{\mu = 1}^N P(\mathbf {S}^{(\mu )} \mid h,\boldsymbol{J}). \end{aligned}$$

The denominator of Eq. (11.5) is sometimes referred to as evidence. Using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as

$$\begin{aligned} \{\hat{h}_{\mathrm {MAP}},\hat{\boldsymbol{J}}_{\mathrm {MAP}}\} = \mathop {\text {arg max}}_{h, \boldsymbol{J}}L_{\mathrm {MAP}}(h,\boldsymbol{J}), \end{aligned}$$
(11.6)

where

$$\begin{aligned} L_{\mathrm {MAP}}(h,\boldsymbol{J})&:= \frac{1}{nN}\ln P_{\mathrm {post}}(h,\boldsymbol{J} \mid \mathcal {D}, H, \gamma )\nonumber \\&\>=L_{\mathrm {ML}}(h,\boldsymbol{J}) + \frac{1}{nN} R_0(h) + \frac{1}{nN} R_1(\boldsymbol{J}) + \mathrm {constant}. \end{aligned}$$
(11.7)

The MAP estimation of Eq. (11.6) corresponds to the regularized ML estimation, in which \(R_0(h):=\ln P_{\mathrm {prior}}(h\mid H)\) and \(R_1(\boldsymbol{J}):=\ln P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma )\) work as penalty terms. For example, (i) when the prior distribution of \(\boldsymbol{J}\) is a Gaussian prior,

$$\begin{aligned} P_{\mathrm {prior}}(J_{ij} \mid \gamma )= \sqrt{\frac{n}{2 \pi \gamma }} \exp \Big (-\frac{n J_{ij}^2}{2 \gamma }\Big ),\quad \gamma > 0, \end{aligned}$$
(11.8)

\(R_1(\boldsymbol{J})\) corresponds to the \(L_2\) regularization term and \(\gamma \) corresponds to its coefficient; (ii) when the prior distribution of \(\boldsymbol{J}\) is a Laplace prior,

$$\begin{aligned} P_{\mathrm {prior}}(J_{ij} \mid \gamma )= \sqrt{\frac{n}{2 \gamma }} \exp \Big (-\sqrt{\frac{2n}{\gamma }} |J_{ij}|\Big ),\quad \gamma > 0, \end{aligned}$$
(11.9)

\(R_1(\boldsymbol{J})\) corresponds to the \(L_1\) regularization term and \(\gamma \) again corresponds to its coefficient. The variances of these prior distributions are identical, that is, \(\mathrm {Var}[J_{ij}]=\gamma /n\).
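As a quick numerical sanity check of the two priors (a minimal sketch assuming NumPy; the function names are illustrative), one can sample \(J_{ij}\) from Eqs. (11.8) and (11.9) and verify that both have variance \(\gamma /n\), while their negative log-densities give the \(L_2\) and \(L_1\) penalties, respectively:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, num_samples = 100, 0.5, 200_000

# Gaussian prior of Eq. (11.8): standard deviation sqrt(gamma / n).
J_gauss = rng.normal(0.0, np.sqrt(gamma / n), size=num_samples)
# Laplace prior of Eq. (11.9): scale parameter sqrt(gamma / (2 n)).
J_laplace = rng.laplace(0.0, np.sqrt(gamma / (2 * n)), size=num_samples)

# Both empirical variances should be close to gamma / n.
print(J_gauss.var(), J_laplace.var(), gamma / n)

# Negative log-priors, up to additive constants, are the penalty terms:
def l2_penalty(J):          # Gaussian prior -> L2 regularization
    return n * J ** 2 / (2 * gamma)

def l1_penalty(J):          # Laplace prior -> L1 regularization
    return np.sqrt(2 * n / gamma) * np.abs(J)
```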

In the following, the Gaussian prior in Eq. (11.8) is used for \(\boldsymbol{J}\), and, as a simple test case, the prior of h is taken to be

$$\begin{aligned} P_{\mathrm {prior}}(h\mid H) = \delta (h - H), \end{aligned}$$
(11.10)

where \(\delta (x)\) is the Dirac delta function; that is, in this test case, h is fixed to H rather than being distributed. It is noteworthy that the resultant algorithm derived from the Gaussian prior can be applied to the case of the Laplace prior without modification [17].

3 Empirical Bayes Method

Using the empirical Bayes method, the values of the hyperparameters, H and \(\gamma \), can be inferred from the observed dataset, \(\mathcal {D}\). For the empirical Bayes method, a marginal log-likelihood function is defined as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma ):=\frac{1}{nN} \ln \big [ P(\mathcal {D} \mid h, \boldsymbol{J})\big ]_{h,\boldsymbol{J}}, \end{aligned}$$
(11.11)

where \([\cdots ]_{h,\boldsymbol{J}}\) is the average over the prior distributions, that is,

$$\begin{aligned}{}[\cdots ]_{h,\boldsymbol{J}}:= \int d\boldsymbol{J}\int d h (\cdots ) P_{\mathrm {prior}}(h\mid H)P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma ). \end{aligned}$$

This marginal log-likelihood function is referred to as the empirical Bayes likelihood function in this section. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, \(\hat{H}\) and \(\hat{\gamma }\), are obtained by maximizing the empirical Bayes likelihood function, that is,

$$\begin{aligned} \{\hat{H},\hat{\gamma }\} = \mathop {\text {arg max}}_{H, \gamma } L_{\mathrm {EB}}(H,\gamma ). \end{aligned}$$
(11.12)

It is noteworthy that \([P(\mathcal {D} \mid h, \boldsymbol{J})]_{h,\boldsymbol{J}}\) in Eq. (11.11) is identified as the evidence appearing in Eq. (11.5).
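For very small n, the evidence in Eq. (11.11) can even be estimated directly by Monte Carlo averaging over the priors, which makes the objective of Eq. (11.12) concrete. The following is a rough sketch under the priors of Eqs. (11.8) and (11.10) (assuming NumPy; it is a didactic illustration, not the method developed in the next section):

```python
import itertools

import numpy as np


def log_likelihood_dataset(data, h, J):
    """sum_mu ln P(S^(mu) | h, J), with Z evaluated by brute force (small n only)."""
    data = np.asarray(data, dtype=float)
    N, n = data.shape
    configs = np.array(list(itertools.product([-1, +1], repeat=n)), dtype=float)
    Ju = np.triu(J, k=1)
    log_weight = lambda X: h * X.sum(axis=1) + np.einsum('ki,ij,kj->k', X, Ju, X)
    logZ = np.log(np.exp(log_weight(configs)).sum())
    return log_weight(data).sum() - N * logZ


def empirical_bayes_likelihood(data, H, gamma, num_samples=2000, seed=0):
    """Monte Carlo estimate of L_EB(H, gamma) in Eq. (11.11):
    average of P(D | h, J) over J drawn from Eq. (11.8), with h fixed to H (Eq. (11.10))."""
    rng = np.random.default_rng(seed)
    N, n = np.asarray(data).shape
    logs = np.array([
        log_likelihood_dataset(
            data, H, np.triu(rng.normal(0.0, np.sqrt(gamma / n), size=(n, n)), k=1))
        for _ in range(num_samples)
    ])
    # log-mean-exp for numerical stability
    log_evidence = logs.max() + np.log(np.exp(logs - logs.max()).mean())
    return log_evidence / (n * N)
```

Maximizing this estimate over a grid of \((H,\gamma )\) would realize Eq. (11.12) directly, but the cost grows exponentially with n, which is why the analysis of the next section is needed.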

The marginal log-likelihood function can be rewritten as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )=\frac{1}{nN}\ln \Big [\exp \big (n N L_{\mathrm {ML}}(h,\boldsymbol{J})\big )\Big ]_{h, \boldsymbol{J}}. \end{aligned}$$
(11.13)

Consider the case \(N\gg n\). In this case, by using the saddle point evaluation, Eq. (11.13) is reduced to

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )&\approx \frac{1}{n N} \ln P_{\mathrm {prior}}(\hat{h}_{\mathrm {ML}}\mid H) + \frac{1}{nN} \ln P_{\mathrm {prior}}(\hat{\boldsymbol{J}}_{\mathrm {ML}} \mid \gamma )+\mathrm {constant}. \end{aligned}$$

In this case, the empirical Bayes estimates \(\{\hat{H},\hat{\gamma }\}\) thus reduce to the ML estimates of the hyperparameters of the prior distributions evaluated at the ML estimates of the parameters \(\{\hat{h}_{\mathrm {ML}},\hat{\boldsymbol{J}}_{\mathrm {ML}}\}\) (i.e., the solution of BML). In other words, the parameter estimation can be performed separately from, and prior to, the hyperparameter estimation. This trivial case is not considered in this chapter; remember that the objective is to estimate the hyperparameter values without estimating the specific values of the parameters.

4 Statistical Mechanical Analysis of Empirical Bayes Likelihood

The empirical Bayes likelihood function in Eq. (11.11) involves intractable multiple integrations. This section presents an evaluation of the empirical Bayes likelihood function using statistical mechanical analysis. The outline of the evaluation is as follows. First, the intractable multiple integrations in Eq. (11.11) are evaluated using the replica method [18, 19]. This evaluation leads to a quantity that still contains an intractable multiple summation, which is then approximately evaluated using the Plefka expansion [3]. Combining these two approximations, the replica method and the Plefka expansion, yields the evaluation of the empirical Bayes likelihood function.

4.1 Replica Method

The empirical Bayes likelihood function in Eq. (11.11) can be represented as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )=\frac{1}{nN}\ln \lim _{x \rightarrow -1} \Psi _x(H,\gamma ), \end{aligned}$$
(11.14)

where

$$\begin{aligned} \Psi _x(H,\gamma ):=\Big [ Z(h,\boldsymbol{J})^{x N} \exp \Big \{N\Big (h \sum _{i=1}^n d_i + \sum _{i<j}J_{ij} d_{ij}\Big )\Big \} \Big ]_{h, \boldsymbol{J}}, \end{aligned}$$
(11.15)

and

$$\begin{aligned} d_i := \frac{1}{N} \sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )},\quad d_{ij} := \frac{1}{N} \sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )}\mathrm {S}_j^{(\mu )} \end{aligned}$$

are the sample averages of the observed data points. We now assume that \(\tau _x := x N\) is a natural number larger than zero. Accordingly, Eq. (11.15) can be expressed as

$$\begin{aligned} \Psi _x(H,\gamma )&=\Big [ \sum _{\mathcal {S}_x}\exp \Big \{h \sum _{i=1}^n\Big ( \sum _{a = 1}^{\tau _x}S_i^{\{a\}}+ N d_i\Big ) \nonumber \\&\quad + \sum _{i<j}J_{ij}\Big (\sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N d_{ij}\Big )\Big \} \Big ]_{h, \boldsymbol{J}}, \end{aligned}$$
(11.16)

where \(a ,b \in \{1,2,\ldots , \tau _x\}\) are the replica indices and \(S_i^{\{a\}}\) is the ith variable in the ath replica. \(\mathcal {S}_x:= \{S_i^{\{a\}} \mid i = 1,2,\ldots , n;\, a = 1,2,\ldots , \tau _x\}\) is the set of all variables in the replicated system (see Fig. 11.2) and \(\sum _{\mathcal {S}_x}\) is the sum over all possible configurations of \(\mathcal {S}_x\), that is,

$$\begin{aligned} \sum _{\mathcal {S}_x} := \prod _{i=1}^n\prod _{a=1}^{\tau _x} \sum _{S_i^{\{a\}} = \pm 1}. \end{aligned}$$

We evaluate \(\Psi _x(H,\gamma )\) under the assumption that \(\tau _x\) is a natural number, and then take the limit \(x \rightarrow -1\) of the result, regarded as an analytic continuation in x, to obtain the empirical Bayes likelihood function (this is the so-called replica trick).

Fig. 11.2
Illustration of the replicated system. The \(\tau _x\) replicas, \(\boldsymbol{S}^{\{1\}},\boldsymbol{S}^{\{2\}}, \ldots , \boldsymbol{S}^{\{\tau _x\}}\), arise from \(Z(h,\boldsymbol{J})^{\tau _x}\) in Eq. (11.15)

By employing the Gaussian prior in Eq. (11.8), Eq. (11.16) becomes

$$\begin{aligned} \Psi _x^{\mathrm {Gauss}}(H,\gamma ) =\exp \Big \{ n N H M + \frac{\gamma (n-1) N^2}{4}\Big (C_2 + \frac{x}{N}\Big )- F_x(H, \gamma )\Big \}, \end{aligned}$$
(11.17)

where

$$\begin{aligned} M:= \frac{1}{n}\sum _{i=1}^n d_i,\quad C_k:= \frac{2}{n(n-1)}\sum _{i<j}d_{ij}^k, \end{aligned}$$
(11.18)

and

$$\begin{aligned} F_x(H, \gamma ):=-\ln \sum _{\mathcal {S}_x}\exp \big (-E_x(\mathcal {S}_x;H,\gamma )\big ) \end{aligned}$$
(11.19)

is the replicated (Helmholtz) free energy [20,21,22,23], where

$$\begin{aligned} E_x(\mathcal {S}_x;H,\gamma ) :=&-H \sum _{i=1}^n\sum _{a=1}^{\tau _x} S_i^{\{a\}} - \frac{\gamma N}{n}\sum _{i<j}d_{ij} \sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} \nonumber \\&- \frac{\gamma }{n}\sum _{i<j}\sum _{a < b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}} \end{aligned}$$
(11.20)

is the Hamiltonian (or energy function) of the replicated system, where \(\sum _{a<b}\) is the sum over all distinct pairs of replicas, that is, \(\sum _{a<b} = \sum _{a=1}^{\tau _x}\sum _{b = a+1}^{\tau _x}\).
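For the reader's convenience, the step from Eq. (11.16) to Eqs. (11.17)–(11.20) is sketched here: under the Gaussian prior, the average over each \(J_{ij}\) is a Gaussian integral obtained by completing the square (the shorthand \(A_{ij}\) is introduced only for this display),

$$\begin{aligned} \int _{-\infty }^{+\infty } \sqrt{\frac{n}{2 \pi \gamma }} \exp \Big (-\frac{n J_{ij}^2}{2 \gamma } + A_{ij} J_{ij}\Big )\, dJ_{ij} = \exp \Big (\frac{\gamma A_{ij}^2}{2n}\Big ),\qquad A_{ij} := \sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N d_{ij}. \end{aligned}$$

Expanding \(A_{ij}^2 = \tau _x + 2\sum _{a<b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}} + 2 N d_{ij}\sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N^2 d_{ij}^2\) and summing over \(i<j\), the configuration-independent terms produce the factor \(\gamma (n-1)N^2(C_2 + x/N)/4\) in Eq. (11.17), while the remaining terms constitute the coupling part of the Hamiltonian in Eq. (11.20).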

4.2 Plefka Expansion

Because the replicated free energy in Eq. (11.19) includes intractable multiple summations, an approximation is required to proceed with the evaluation. In this section, the replicated free energy in Eq. (11.19) is approximated using the Plefka expansion [3]. In brief, the Plefka expansion is a perturbative expansion of a Gibbs free energy, which is the dual form of the corresponding Helmholtz free energy.

The Gibbs free energy is obtained as

$$\begin{aligned} G_x(m,H,\gamma ) =- n \tau _x H m + \mathop {\text {extr}}_{\lambda }\Big \{\lambda n \tau _x m -\ln \sum _{\mathcal {S}_x}\exp \big ( - E_x(\mathcal {S}_x;\lambda ,\gamma )\big )\Big \}. \end{aligned}$$
(11.21)

The derivation of this Gibbs free energy is described in Sect. 11.7.1. The summation in Eq. (11.21) can be performed when \(\gamma = 0\), which gives

$$\begin{aligned} G_x(m,H,0)&=- n \tau _x H m + n \tau _x \mathop {\text {extr}}_{\lambda }\big \{\lambda m- \ln (2\cosh \lambda ) \big \} \nonumber \\&=- n \tau _x H m + n \tau _xe(m), \end{aligned}$$
(11.22)

where e(m) is the negative mean-field entropy defined by

$$\begin{aligned} e(m):=\frac{1+m}{2} \ln \frac{1+m}{2} + \frac{1-m}{2} \ln \frac{1-m}{2}. \end{aligned}$$
(11.23)
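For completeness, the extremization in Eq. (11.22) can be carried out explicitly: the stationarity condition gives \(m = \tanh \lambda \), i.e., \(\hat{\lambda } = \mathrm {atanh}\, m\), and substituting this back yields

$$\begin{aligned} \mathop {\text {extr}}_{\lambda }\big \{\lambda m- \ln (2\cosh \lambda ) \big \} = m\, \mathrm {atanh}\, m + \frac{1}{2}\ln (1 - m^2) - \ln 2 = e(m), \end{aligned}$$

where \(\cosh (\mathrm {atanh}\, m) = 1/\sqrt{1-m^2}\) and \(\mathrm {atanh}\, m = \frac{1}{2}\ln \frac{1+m}{1-m}\) have been used.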

In the context of the Plefka expansion, the Gibbs free energy \(G_x(m,H,\gamma )\) is approximated by the perturbation from \(G_x(m,H,0)\). Expanding \(G_x(m,H,\gamma )\) around \(\gamma = 0\) gives

$$\begin{aligned} \frac{G_x(m,H,\gamma )}{nN}&=-x H m+ x e(m) + \phi _x^{(1)}(m) \gamma + \phi _x^{(2)}(m)\gamma ^2 + O(\gamma ^3), \end{aligned}$$
(11.24)

where \(\phi _x^{(1)}(m)\) and \(\phi _x^{(2)}(m)\) are the expansion coefficients defined by

$$\begin{aligned} \phi _x^{(k)}(m):=\frac{1}{n N k!} \lim _{\gamma \rightarrow 0}\frac{\partial ^k G_x(m,H,\gamma )}{\partial \gamma ^k}. \end{aligned}$$

The forms of the two coefficients are presented in Eqs. (11.34) and (11.35) in Sect. 11.7.2.

From Eqs. (11.14), (11.17), (11.24), and (11.33), the approximation of the empirical Bayes likelihood function is obtained as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )&\approx HM -\mathop {\text {extr}}_{m}\Big [ Hm - e(m)+ \Phi (m)\gamma +\phi _{-1}^{(2)}(m)\gamma ^2\Big ], \end{aligned}$$
(11.25)

where

$$\begin{aligned} \Phi (m):=\phi _{-1}^{(1)}(m) - \frac{(n-1)N}{4n}\Big (C_2 - \frac{1}{N}\Big ). \end{aligned}$$

The forms of \(\phi _{-1}^{(1)}(m)\) and \(\phi _{-1}^{(2)}(m)\) are presented in Eqs. (11.37) and (11.38) in Sect. 11.7.2.

4.3 Algorithm for Hyperparameter Estimation

As mentioned in Sect. 11.3, the empirical Bayes inference is achieved by maximizing \(L_{\mathrm {EB}}(H,\gamma )\) with respect to H and \(\gamma \) (cf. Eq. (11.12)). The extremum condition in Eq. (11.25) with respect to H leads to

$$\begin{aligned} \hat{m} = M, \end{aligned}$$
(11.26)

where \(\hat{m}\) is the value of m that satisfies the extremum condition in Eq. (11.25). By combining the extremum condition of Eq. (11.25) with respect to m with Eq. (11.26),

$$\begin{aligned} \hat{H} =\mathrm {atanh} M - \Big (\frac{\partial \phi _{-1}^{(1)}(M)}{\partial M}\gamma +\frac{\partial \phi _{-1}^{(2)}(M)}{\partial M}\gamma ^2\Big ) \end{aligned}$$
(11.27)

is obtained, where \(\mathrm {atanh} x\) is the inverse function of \(\tanh x\). From Eqs. (11.25) and (11.26), the optimal value of \(\gamma \) is obtained by

$$\begin{aligned} \hat{\gamma }&=\mathop {\text {arg max}}_{\gamma }\big [-\Phi (M)\gamma -\phi _{-1}^{(2)}(M)\gamma ^2\big ]. \end{aligned}$$
(11.28)

Since Eq. (11.28) is a univariate quadratic optimization, \(\hat{\gamma }\) is immediately obtained as follows: (i) when \(\phi _{-1}^{(2)}(M) > 0\) and \(\Phi (M) \ge 0\), or when \(\phi _{-1}^{(2)}(M) = 0\) and \(\Phi (M) > 0\), \(\hat{\gamma } = 0\); (ii) when \(\phi _{-1}^{(2)}(M) > 0\) and \(\Phi (M) < 0\), \(\hat{\gamma } = - \Phi (M) / (2 \phi _{-1}^{(2)}(M))\); and (iii) otherwise, \(\hat{\gamma } \rightarrow \infty \). The case \(\phi _{-1}^{(2)}(M) = \Phi (M) = 0\) is ignored because it is rarely observed in realistic settings. Using Eqs. (11.27) and (11.28), the solution to the empirical Bayes inference is obtained without any iterative process. The pseudocode of the presented procedure is shown in Algorithm 1. The computational complexity of the presented method is \(O(Nn^2)\); remember that that of the exact ML estimation is \(O(2^n)\).

Algorithm 1: Pseudocode of the presented hyperparameter estimation procedure
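As a rough Python sketch of the above procedure (assuming NumPy; this is not the authors' reference implementation): the sample averages M and \(C_2\) of Eq. (11.18) are computed from the data in \(O(Nn^2)\) time, \(\hat{\gamma }\) follows from the case analysis of Eq. (11.28), and \(\hat{H}\) from Eq. (11.27). The coefficients \(\Phi \), \(\phi _{-1}^{(2)}\), and their derivatives are given in Sect. 11.7.2 and are therefore passed in here as user-supplied callables (placeholders):

```python
import numpy as np


def sample_statistics(data):
    """M and C_2 of Eq. (11.18) from an (N, n) array of +-1 observations."""
    data = np.asarray(data, dtype=float)
    N, n = data.shape
    d_i = data.mean(axis=0)                    # d_i
    d_ij = data.T @ data / N                   # d_ij (diagonal unused)
    M = d_i.mean()
    iu = np.triu_indices(n, k=1)
    C2 = (d_ij[iu] ** 2).mean()                # (2 / (n(n-1))) * sum_{i<j} d_ij^2
    return M, C2


def estimate_hyperparameters(data, Phi, phi2, dphi1_dM, dphi2_dM):
    """Estimates (H_hat, gamma_hat) via Eqs. (11.27) and (11.28).

    Phi, phi2, dphi1_dM, dphi2_dM are callables of M (closing over n, N, C2
    as needed) implementing the appendix expressions; they are placeholders here.
    """
    M, C2 = sample_statistics(data)
    Phi_M, phi2_M = Phi(M), phi2(M)
    # Case analysis for the quadratic maximization in Eq. (11.28).
    if phi2_M > 0 and Phi_M < 0:
        gamma_hat = -Phi_M / (2 * phi2_M)
    elif (phi2_M > 0 and Phi_M >= 0) or (phi2_M == 0 and Phi_M > 0):
        gamma_hat = 0.0
    else:
        gamma_hat = np.inf                     # remaining cases: gamma_hat diverges
    # Eq. (11.27) for the bias hyperparameter.
    H_hat = np.arctanh(M) - (dphi1_dM(M) * gamma_hat + dphi2_dM(M) * gamma_hat ** 2)
    return H_hat, gamma_hat
```

Note that no iterative optimization appears anywhere: both estimates are obtained in closed form from the sample statistics, which is the source of the \(O(Nn^2)\) complexity.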

In the presented method, the value of \(\hat{H}\) does not affect the determination of \(\hat{\gamma }\). Several mean-field-based methods for BML (e.g., those listed in Sect. 11.1) have similar procedures, in which \(\hat{\boldsymbol{J}}_{\mathrm {ML}}\) is determined separately from \(\hat{h}_{\mathrm {ML}}\). This is a common property of mean-field-based methods for BML, including the present empirical Bayes problem.

Although the presented method is derived from the Gaussian prior in Eq. (11.8), the same procedure can be applied to the case of the Laplace prior in Eq. (11.9) [17].

5 Demonstration

This section discusses the results of numerical experiments. In these experiments, the observed dataset \(\mathcal {D}\) was generated from the generative Boltzmann machine (gBM), which has the same form as Eq. (11.1), via Gibbs sampling (with a simulated-annealing-like strategy). The parameters of the gBM were drawn from the prior distributions in Eqs. (11.4) and (11.10); that is, the model-matched case (in which the generative and learning models are identical) was considered. In the following, the notations \(\alpha := N / n\) and \(J := \sqrt{\gamma }\) are used. The standard deviations of the Gaussian prior in Eq. (11.8) and the Laplace prior in Eq. (11.9) can thus both be written as \(J / \sqrt{n}\). The hyperparameters of the gBM are denoted by \(H_{\mathrm {true}}\) and \(J_{\mathrm {true}}\).
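A minimal sketch of how such a dataset can be generated is given below (assuming NumPy). It uses a plain single-spin Gibbs sweep at fixed temperature; the simulated-annealing-like schedule and other details of the original experiments are not reproduced here, and the function name is illustrative:

```python
import numpy as np


def sample_gbm_dataset(n, N, H_true, J_true, burn_in=2000, thin=50, seed=0):
    """Draw N configurations from the gBM of Eq. (11.1) by single-spin Gibbs sampling.

    The couplings are drawn from the Gaussian prior of Eq. (11.8) with
    gamma = J_true**2, and the bias is fixed to H_true as in Eq. (11.10).
    """
    rng = np.random.default_rng(seed)
    gamma = J_true ** 2
    J = np.triu(rng.normal(0.0, np.sqrt(gamma / n), size=(n, n)), k=1)
    J = J + J.T                                  # symmetric couplings, zero diagonal
    S = rng.choice([-1, 1], size=n)
    data = []
    for sweep in range(burn_in + N * thin):
        for i in range(n):
            field = H_true + J[i] @ S            # local field on spin i (J[i, i] = 0)
            p_up = 1.0 / (1.0 + np.exp(-2.0 * field))
            S[i] = 1 if rng.random() < p_up else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            data.append(S.copy())
    return np.array(data), J


# Example with alpha = N / n = 0.4, as in the experiments below (hypothetical sizes).
# data, J_matrix = sample_gbm_dataset(n=100, N=40, H_true=0.0, J_true=0.5)
```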

5.1 Gaussian Prior Case

We now consider the case in which the prior distribution of \(\boldsymbol{J}\) is the Gaussian prior in Eq. (11.8). In this case, the Boltzmann machine corresponds to the Sherrington-Kirkpatrick (SK) model [24], and thus exhibits a spin-glass transition at \(J = 1\) when \(h = 0\) (i.e., when \(H = 0\)).

We consider the case \(H_{\mathrm {true}} = 0\). The scatter plots of the estimates \(\hat{J}\) for various \(J_{\mathrm {true}}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\) are shown in Fig. 11.3. When \(J_{\mathrm {true}} < 1\), the estimates \(\hat{J}\) are consistent with \(J_{\mathrm {true}}\), whereas they deviate from \(J_{\mathrm {true}}\) when \(J_{\mathrm {true}} > 1\). This implies that the validity of the perturbative approximation is lost in the spin-glass phase, as is often the case with mean-field approximations.

Fig. 11.3
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\): a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

Figure 11.4 shows the scatter plots for various \(\alpha \).

Fig. 11.4
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) for various \(\alpha = N / n\) when \(H_{\mathrm {true}} = 0\): a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

A smaller \(\alpha \) causes \(\hat{J}\) to be overestimated and a larger \(\alpha \) causes it to be underestimated. In our experiments, at least, the optimal value of \(\alpha \) seems to be \(\alpha _{\mathrm {opt}} \approx 0.4\) when \(H_{\mathrm {true}} = 0\). Our method can also estimate \(\hat{H}\). The results for the estimation of \(\hat{H}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\) are shown in Fig. 11.5.

Fig. 11.5
Results of estimation of \(\hat{H}\) versus \(J_{\mathrm {true}}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\): a the MAE and b standard deviation. These plots represent the average values over 300 experiments

Figure 11.5a, b shows the average of \(|H_{\mathrm {true}} - \hat{H}|\) (i.e., the mean absolute error (MAE)) and the standard deviation of \(\hat{H}\) over 300 experiments, respectively. The MAE and standard deviation increase in the region where \(J_{\mathrm {true}} > 1\).

5.2 Laplace Prior Case

We now consider the case in which the prior distribution of \(\boldsymbol{J}\) is the Laplace prior in Eq. (11.9). The scatter plots of \(J_{\mathrm {true}}\) versus \(\hat{J}\) for various \(\alpha \) when \(H_{\mathrm {true}} = 0\) are shown in Fig. 11.6.

Fig. 11.6
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) for various \(\alpha = N/n\), when \(H_{\mathrm {true}} = 0\), in the case of the Laplace prior: a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

The plots shown in Fig. 11.6 almost completely overlap with those in Fig. 11.4.

6 Summary and Discussion

This chapter has described the hyperparameter inference algorithm proposed in [17]. As evident from the numerical experiments, the presented inference method works well in both the Gaussian and Laplace prior cases, except in the spin-glass phase. However, the presented method has the drawback that it is sensitive to the value of \(\alpha = N / n\). In the experiments in Sect. 11.5, although \(\alpha \approx 0.4\) was appropriate when \(H_{\mathrm {true}} = 0\), it is known that the appropriate value decreases as \(H_{\mathrm {true}}\) increases [17]. Since the value of \(H_{\mathrm {true}}\) cannot be known in advance, the appropriate setting of \(\alpha \) is also unknown, and the estimation of \(\alpha _{\mathrm {opt}}\) remains an open problem. The existence of an optimal value of \(\alpha \) seems unnatural, because larger datasets are usually better in machine learning. This peculiar behavior can be attributed to the truncation of the Plefka expansion. A more detailed discussion of this issue is presented in [17].

Finally, we review the presented method from the perspective of sublinear computation, setting aside the aforementioned issues. The Boltzmann machine in Eq. (11.1) has p parameters, where \(p = O(n^2)\). In usual machine learning, at least \(N = O(p)\) data points are required to obtain a good ML estimate for the Boltzmann machine. Therefore, hyperparameter inference “without” the empirical Bayes method (namely, the strategy in which the hyperparameters are inferred through the ML estimate, in a manner similar to that discussed in the latter part of Sect. 11.3) requires a dataset of size O(p). In contrast, the presented method requires only \(N = O(n)= O(\sqrt{p})\), because \(\alpha = O(1)\) with respect to n.