1 Introduction

Boltzmann machine learning (BML) [1] has been actively studied in the fields of machine learning and statistical mechanics. In statistical mechanics, BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is equivalent to an Ising model and BML can be regarded as the corresponding inverse problem. The framework of the usual BML is as follows. Given a set of observed data points, the appropriate values of the Boltzmann machine parameters, namely the biases and couplings, are estimated through maximum likelihood (ML) estimation. Because BML involves intractable multiple summations (i.e., evaluation of the partition function), several approximations have been proposed for it from the viewpoint of statistical mechanics [2]. Examples include methods based on mean-field approximations (e.g., the Plefka expansion [3] and the cluster variation method [4]) [5,6,7,8,9,10,11] and methods based on other approximations [12,13,14].

This chapter focuses on another type of learning problem for the Boltzmann machine. Consider the prior distributions of the Boltzmann machine parameters and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected to regularized ML estimation, in which the hyperparameters can be regarded as regularization coefficients. The regularized ML estimation is important for preventing overfitting to the dataset. As mentioned above, the usual BML aims to optimize the values of the Boltzmann machine parameters using a set of observed data points. However, the aim of the problem presented in this chapter is the estimation of the appropriate values of the hyperparameters from the dataset without estimating the specific values of the Boltzmann machine parameters. From the Bayesian viewpoint, this can be potentially accomplished by the empirical Bayes method (also known as type-II ML estimation or evidence approximation) [15, 16]. The schemes of the usual BML and the problem investigated in this chapter are illustrated in Fig. 11.1.

Fig. 11.1
Illustration of the scheme of the empirical Bayes method considered in this chapter

Recently, an effective algorithm was proposed for the empirical Bayes method for the Boltzmann machine [17]. Using this method, the hyperparameter estimates can be obtained without costly operations. This chapter aims to explain this effective method.

The rest of this chapter is organized as follows. The formulations of the Boltzmann machine and its usual and regularized ML estimations are presented in Sect. 11.2. The empirical Bayes method for the Boltzmann machine is presented in Sect. 11.3. Section 11.4 describes a statistical mechanical analysis of the empirical Bayes method and an inference algorithm obtained from the analysis. Experimental results for the presented algorithm are given in Sect. 11.5. A summary and some discussion are provided in Sect. 11.6. The appendices for this chapter are given in Sect. 11.7.

2 Boltzmann Machine with Prior Distributions

Consider a fully connected Boltzmann machine with n (bipolar) variables \(\boldsymbol{S}:= \{S_i \in \{-1,+1\} \mid i = 1,2,\ldots , n\}\) [1]:

$$\begin{aligned} P(\boldsymbol{S} \mid h,\boldsymbol{J}):=\frac{1}{Z(h,\boldsymbol{J})}\exp \Big (h \sum _{i=1}^n S_i + \sum _{i<j}J_{ij}S_iS_j\Big ), \end{aligned}$$
(11.1)

where \(\sum _{i<j}\) is the sum over all distinct pairs of variables, that is, \(\sum _{i<j} = \sum _{i=1}^n\sum _{j = i+1}^n\). \(Z(h,\boldsymbol{J})\) is the partition function defined by

$$\begin{aligned} Z(h,\boldsymbol{J}):= \sum _{\boldsymbol{S}}\exp \Big (h \sum _{i=1}^n S_i + \sum _{i<j}J_{ij}S_iS_j\Big ), \end{aligned}$$

where \(\sum _{\boldsymbol{S}}\) is the sum over all possible configurations of \(\boldsymbol{S}\), that is,

$$\begin{aligned} \sum _{\boldsymbol{S}} := \prod _{i=1}^n \sum _{S_i = \pm 1}. \end{aligned}$$

The parameters \(h \in (-\infty , +\infty )\) and \(\boldsymbol{J} := \{J_{ij} \in (-\infty , +\infty ) \mid i<j\}\) denote the bias and couplings, respectively.
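To make Eq. (11.1) and the partition function concrete, the following is a minimal Python sketch (assuming NumPy; the function names are illustrative and not part of the original formulation) that evaluates \(Z(h,\boldsymbol{J})\) and \(P(\boldsymbol{S} \mid h,\boldsymbol{J})\) by brute-force enumeration, which is feasible only for small n:

```python
import itertools

import numpy as np


def energy_terms(S, h, J):
    """h * sum_i S_i + sum_{i<j} J_ij * S_i * S_j for one configuration S."""
    n = len(S)
    pair_sum = sum(J[i, j] * S[i] * S[j]
                   for i in range(n) for j in range(i + 1, n))
    return h * np.sum(S) + pair_sum


def partition_function(h, J, n):
    """Brute-force Z(h, J); the 2^n-term sum limits this to small n."""
    return sum(np.exp(energy_terms(np.array(S), h, J))
               for S in itertools.product([-1, +1], repeat=n))


def boltzmann_prob(S, h, J):
    """P(S | h, J) of Eq. (11.1) for a single configuration S."""
    S = np.asarray(S)
    return np.exp(energy_terms(S, h, J)) / partition_function(h, J, len(S))


# Tiny example: n = 4 spins with random couplings.
rng = np.random.default_rng(0)
n = 4
J = np.triu(rng.normal(0.0, 0.5, size=(n, n)), k=1)
print(boltzmann_prob([+1, -1, +1, +1], h=0.1, J=J))
```

The exhaustive sum over the \(2^n\) configurations is exactly the operation that becomes intractable for realistic n.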

Given N observed data points, \(\mathcal {D}:=\{\mathbf {S}^{(\mu )} \in \{-1,+1\}^n \mid \mu = 1,2,\ldots , N\}\), the log-likelihood function is defined as

$$\begin{aligned} L_{\mathrm {ML}}(h,\boldsymbol{J}):=\frac{1}{n N}\sum _{\mu = 1}^N \ln P(\mathbf {S}^{(\mu )} \mid h,\boldsymbol{J}). \end{aligned}$$
(11.2)

The maximization of the log-likelihood function with respect to h and \(\boldsymbol{J}\) (i.e., the ML estimation) corresponds to BML (or the inverse Ising problem), that is,

$$\begin{aligned} \{\hat{h}_{\mathrm {ML}},\hat{\boldsymbol{J}}_{\mathrm {ML}}\} = \mathop {\text {arg max}}_{h, \boldsymbol{J}}L_{\mathrm {ML}}(h,\boldsymbol{J}). \end{aligned}$$
(11.3)

However, the exact ML estimates cannot be obtained because the gradients of the log-likelihood function involve intractable sums over \(O(2^n)\) terms.
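To make the source of this intractability explicit, the gradients of Eq. (11.2) can be written out (a standard computation, reproduced here for completeness):

$$\begin{aligned} \frac{\partial L_{\mathrm {ML}}(h,\boldsymbol{J})}{\partial h} = \frac{1}{nN}\sum _{\mu = 1}^N \sum _{i=1}^n \mathrm {S}_i^{(\mu )} - \frac{1}{n}\sum _{i=1}^n \langle S_i \rangle ,\qquad \frac{\partial L_{\mathrm {ML}}(h,\boldsymbol{J})}{\partial J_{ij}} = \frac{1}{nN}\sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )}\mathrm {S}_j^{(\mu )} - \frac{1}{n}\langle S_i S_j \rangle , \end{aligned}$$

where \(\langle \cdots \rangle \) denotes the expectation with respect to \(P(\boldsymbol{S} \mid h,\boldsymbol{J})\) in Eq. (11.1). Setting these gradients to zero shows that ML estimation matches the model expectations \(\langle S_i \rangle \) and \(\langle S_i S_j \rangle \) to the corresponding sample averages; it is precisely these expectations that require the \(O(2^n)\)-term sums.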

We now introduce the prior distributions of the parameters h and \(\boldsymbol{J}\) as \(P_{\mathrm {prior}}(h\mid H)\) and

$$\begin{aligned} P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma ):= \prod _{i<j} P_{\mathrm {prior}}(J_{ij} \mid \gamma ), \end{aligned}$$
(11.4)

where H and \(\gamma \) are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is the Bayesian interpretation of the regularized ML estimation [16]. Given the observed dataset \(\mathcal {D}\), using the prior distributions, the posterior distribution of h and \(\boldsymbol{J}\) is expressed as

$$\begin{aligned} P_{\mathrm {post}}(h,\boldsymbol{J} \mid \mathcal {D}, H, \gamma ) = \frac{P(\mathcal {D} \mid h, \boldsymbol{J})P_{\mathrm {prior}}(h\mid H)P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma )}{P(\mathcal {D} \mid H, \gamma )}, \end{aligned}$$
(11.5)

where

$$\begin{aligned} P(\mathcal {D} \mid h, \boldsymbol{J}):= \prod _{\mu = 1}^N P(\mathbf {S}^{(\mu )} \mid h,\boldsymbol{J}). \end{aligned}$$

The denominator of Eq. (11.5) is sometimes referred to as evidence. Using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as

$$\begin{aligned} \{\hat{h}_{\mathrm {MAP}},\hat{\boldsymbol{J}}_{\mathrm {MAP}}\} = \mathop {\text {arg max}}_{h, \boldsymbol{J}}L_{\mathrm {MAP}}(h,\boldsymbol{J}), \end{aligned}$$
(11.6)

where

$$\begin{aligned} L_{\mathrm {MAP}}(h,\boldsymbol{J})&:= \frac{1}{nN}\ln P_{\mathrm {post}}(h,\boldsymbol{J} \mid \mathcal {D}, H, \gamma )\nonumber \\&\>=L_{\mathrm {ML}}(h,\boldsymbol{J}) + \frac{1}{nN} R_0(h) + \frac{1}{nN} R_1(\boldsymbol{J}) + \mathrm {constant}. \end{aligned}$$
(11.7)

The MAP estimation of Eq. (11.6) corresponds to the regularized ML estimation, in which \(R_0(h):=\ln P_{\mathrm {prior}}(h\mid H)\) and \(R_1(\boldsymbol{J}):=\ln P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma )\) work as penalty terms. For example, (i) when the prior distribution of \(\boldsymbol{J}\) is a Gaussian prior,

$$\begin{aligned} P_{\mathrm {prior}}(J_{ij} \mid \gamma )= \sqrt{\frac{n}{2 \pi \gamma }} \exp \Big (-\frac{n J_{ij}^2}{2 \gamma }\Big ),\quad \gamma > 0, \end{aligned}$$
(11.8)

\(R_1(\boldsymbol{J})\) corresponds to the \(L_2\) regularization term and \(\gamma \) corresponds to its coefficient; (ii) when the prior distribution of \(\boldsymbol{J}\) is a Laplace prior,

$$\begin{aligned} P_{\mathrm {prior}}(J_{ij} \mid \gamma )= \sqrt{\frac{n}{2 \gamma }} \exp \Big (-\sqrt{\frac{2n}{\gamma }} |J_{ij}|\Big ),\quad \gamma > 0, \end{aligned}$$
(11.9)

\(R_1(\boldsymbol{J})\) corresponds to the \(L_1\) regularization term and \(\gamma \) again corresponds to its coefficient. The variances of these prior distributions are identical, that is, \(\mathrm {Var}[J_{ij}]=\gamma /n\).
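As a quick numerical sanity check of the two priors (a minimal sketch assuming NumPy; the function names are illustrative), one can sample \(J_{ij}\) from Eqs. (11.8) and (11.9) and verify that both have variance \(\gamma /n\), while their negative log-densities give the \(L_2\) and \(L_1\) penalties, respectively:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, num_samples = 100, 0.5, 200_000

# Gaussian prior of Eq. (11.8): standard deviation sqrt(gamma / n).
J_gauss = rng.normal(0.0, np.sqrt(gamma / n), size=num_samples)
# Laplace prior of Eq. (11.9): scale parameter sqrt(gamma / (2 n)).
J_laplace = rng.laplace(0.0, np.sqrt(gamma / (2 * n)), size=num_samples)

# Both empirical variances should be close to gamma / n.
print(J_gauss.var(), J_laplace.var(), gamma / n)

# Negative log-priors, up to additive constants, are the penalty terms:
def l2_penalty(J):          # Gaussian prior -> L2 regularization
    return n * J ** 2 / (2 * gamma)

def l1_penalty(J):          # Laplace prior -> L1 regularization
    return np.sqrt(2 * n / gamma) * np.abs(J)
```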

In the following, the Gaussian prior in Eq. (11.8) is used for \(\boldsymbol{J}\), and, as a simple test case, the prior of h is taken to be

$$\begin{aligned} P_{\mathrm {prior}}(h\mid H) = \delta (h - H), \end{aligned}$$
(11.10)

where \(\delta (x)\) is the Dirac delta function; that is, in this test case, h is fixed to H rather than being distributed. It is noteworthy that the resultant algorithm derived from the Gaussian prior can be applied to the case of the Laplace prior without modification [17].

3 Empirical Bayes Method

Using the empirical Bayes method, the values of the hyperparameters, H and \(\gamma \), can be inferred from the observed dataset, \(\mathcal {D}\). For the empirical Bayes method, a marginal log-likelihood function is defined as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma ):=\frac{1}{nN} \ln \big [ P(\mathcal {D} \mid h, \boldsymbol{J})\big ]_{h,\boldsymbol{J}}, \end{aligned}$$
(11.11)

where \([\cdots ]_{h,\boldsymbol{J}}\) is the average over the prior distributions, that is,

$$\begin{aligned}{}[\cdots ]_{h,\boldsymbol{J}}:= \int d\boldsymbol{J}\int d h (\cdots ) P_{\mathrm {prior}}(h\mid H)P_{\mathrm {prior}}(\boldsymbol{J} \mid \gamma ). \end{aligned}$$

This marginal log-likelihood function is referred to as the empirical Bayes likelihood function in this section. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, \(\hat{H}\) and \(\hat{\gamma }\), are obtained by maximizing the empirical Bayes likelihood function, that is,

$$\begin{aligned} \{\hat{H},\hat{\gamma }\} = \mathop {\text {arg max}}_{H, \gamma } L_{\mathrm {EB}}(H,\gamma ). \end{aligned}$$
(11.12)

It is noteworthy that \([P(\mathcal {D} \mid h, \boldsymbol{J})]_{h,\boldsymbol{J}}\) in Eq. (11.11) is identified as the evidence appearing in Eq. (11.5).
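For very small n, the evidence in Eq. (11.11) can even be estimated directly by Monte Carlo averaging over the priors, which makes the objective of Eq. (11.12) concrete. The following is a rough sketch under the priors of Eqs. (11.8) and (11.10) (assuming NumPy; it is a didactic illustration, not the method developed in the next section):

```python
import itertools

import numpy as np


def log_likelihood_dataset(data, h, J):
    """sum_mu ln P(S^(mu) | h, J), with Z evaluated by brute force (small n only)."""
    data = np.asarray(data, dtype=float)
    N, n = data.shape
    configs = np.array(list(itertools.product([-1, +1], repeat=n)), dtype=float)
    Ju = np.triu(J, k=1)
    log_weight = lambda X: h * X.sum(axis=1) + np.einsum('ki,ij,kj->k', X, Ju, X)
    logZ = np.log(np.exp(log_weight(configs)).sum())
    return log_weight(data).sum() - N * logZ


def empirical_bayes_likelihood(data, H, gamma, num_samples=2000, seed=0):
    """Monte Carlo estimate of L_EB(H, gamma) in Eq. (11.11):
    average of P(D | h, J) over J drawn from Eq. (11.8), with h fixed to H (Eq. (11.10))."""
    rng = np.random.default_rng(seed)
    N, n = np.asarray(data).shape
    logs = np.array([
        log_likelihood_dataset(
            data, H, np.triu(rng.normal(0.0, np.sqrt(gamma / n), size=(n, n)), k=1))
        for _ in range(num_samples)
    ])
    # log-mean-exp for numerical stability
    log_evidence = logs.max() + np.log(np.exp(logs - logs.max()).mean())
    return log_evidence / (n * N)
```

Maximizing this estimate over a grid of \((H,\gamma )\) would realize Eq. (11.12) directly, but the cost grows exponentially with n, which is why the analysis of the next section is needed.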

The marginal log-likelihood function can be rewritten as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )=\frac{1}{nN}\ln \Big [\exp \big (n N L_{\mathrm {ML}}(h,\boldsymbol{J})\big )\Big ]_{h, \boldsymbol{J}}. \end{aligned}$$
(11.13)

Consider the case \(N\gg n\). In this case, by using the saddle point evaluation, Eq. (11.13) is reduced to

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )&\approx \frac{1}{n N} \ln P_{\mathrm {prior}}(\hat{h}_{\mathrm {ML}}\mid H) + \frac{1}{nN} \ln P_{\mathrm {prior}}(\hat{\boldsymbol{J}}_{\mathrm {ML}} \mid \gamma )+\mathrm {constant}. \end{aligned}$$

In this case, the empirical Bayes estimates \(\{\hat{H},\hat{\gamma }\}\) thus reduce to the ML estimates of the hyperparameters of the prior distributions evaluated at the ML estimates of the parameters \(\{\hat{h}_{\mathrm {ML}},\hat{\boldsymbol{J}}_{\mathrm {ML}}\}\) (i.e., the solution of BML). In other words, the parameter estimation can be performed separately from, and prior to, the hyperparameter estimation. This trivial case is not considered in this chapter; remember that the objective is to estimate the hyperparameter values without estimating the specific values of the parameters.

4 Statistical Mechanical Analysis of Empirical Bayes Likelihood

The empirical Bayes likelihood function in Eq. (11.11) involves intractable multiple integrations. This section presents an evaluation of the empirical Bayes likelihood function using statistical mechanical analysis. The outline of the evaluation is as follows. First, the intractable multiple integrations in Eq. (11.11) are evaluated using the replica method [18, 19]. This evaluation leads to a quantity that still contains an intractable multiple summation, which is then approximately evaluated using the Plefka expansion [3]. Combining these two approximations, the replica method and the Plefka expansion, yields the evaluation of the empirical Bayes likelihood function.

4.1 Replica Method

The empirical Bayes likelihood function in Eq. (11.11) can be represented as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )=\frac{1}{nN}\ln \lim _{x \rightarrow -1} \Psi _x(H,\gamma ), \end{aligned}$$
(11.14)

where

$$\begin{aligned} \Psi _x(H,\gamma ):=\Big [ Z(h,\boldsymbol{J})^{x N} \exp \Big \{N\Big (h \sum _{i=1}^n d_i + \sum _{i<j}J_{ij} d_{ij}\Big )\Big \} \Big ]_{h, \boldsymbol{J}}, \end{aligned}$$
(11.15)

and

$$\begin{aligned} d_i := \frac{1}{N} \sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )},\quad d_{ij} := \frac{1}{N} \sum _{\mu = 1}^N \mathrm {S}_i^{(\mu )}\mathrm {S}_j^{(\mu )} \end{aligned}$$

are the sample averages of the observed data points. We now assume that \(\tau _x := x N\) is a natural number larger than zero. Accordingly, Eq. (11.15) can be expressed as

$$\begin{aligned} \Psi _x(H,\gamma )&=\Big [ \sum _{\mathcal {S}_x}\exp \Big \{h \sum _{i=1}^n\Big ( \sum _{a = 1}^{\tau _x}S_i^{\{a\}}+ N d_i\Big ) \nonumber \\&\quad + \sum _{i<j}J_{ij}\Big (\sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N d_{ij}\Big )\Big \} \Big ]_{h, \boldsymbol{J}}, \end{aligned}$$
(11.16)

where \(a ,b \in \{1,2,\ldots , \tau _x\}\) are the replica indices and \(S_i^{\{a\}}\) is the ith variable in the ath replica. \(\mathcal {S}_x:= \{S_i^{\{a\}} \mid i = 1,2,\ldots , n;\, a = 1,2,\ldots , \tau _x\}\) is the set of all variables in the replicated system (see Fig. 11.2) and \(\sum _{\mathcal {S}_x}\) is the sum over all possible configurations of \(\mathcal {S}_x\), that is,

$$\begin{aligned} \sum _{\mathcal {S}_x} := \prod _{i=1}^n\prod _{a=1}^{\tau _x} \sum _{S_i^{\{a\}} = \pm 1}. \end{aligned}$$

We evaluate \(\Psi _x(H,\gamma )\) under the assumption that \(\tau _x\) is a natural number, and then take the limit \(x \rightarrow -1\) of the result, regarded as an analytic continuation in x, to obtain the empirical Bayes likelihood function (this is the so-called replica trick).

Fig. 11.2
Illustration of the replicated system. The \(\tau _x\) replicas, \(\boldsymbol{S}^{\{1\}},\boldsymbol{S}^{\{2\}}, \ldots , \boldsymbol{S}^{\{\tau _x\}}\), arise from \(Z(h,\boldsymbol{J})^{\tau _x}\) in Eq. (11.15)

By employing the Gaussian prior in Eq. (11.8), Eq. (11.16) becomes

$$\begin{aligned} \Psi _x^{\mathrm {Gauss}}(H,\gamma ) =\exp \Big \{ n N H M + \frac{\gamma (n-1) N^2}{4}\Big (C_2 + \frac{x}{N}\Big )- F_x(H, \gamma )\Big \}, \end{aligned}$$
(11.17)

where

$$\begin{aligned} M:= \frac{1}{n}\sum _{i=1}^n d_i,\quad C_k:= \frac{2}{n(n-1)}\sum _{i<j}d_{ij}^k, \end{aligned}$$
(11.18)

and

$$\begin{aligned} F_x(H, \gamma ):=-\ln \sum _{\mathcal {S}_x}\exp \big (-E_x(\mathcal {S}_x;H,\gamma )\big ) \end{aligned}$$
(11.19)

is the replicated (Helmholtz) free energy [20,21,22,23], where

$$\begin{aligned} E_x(\mathcal {S}_x;H,\gamma ) :=&-H \sum _{i=1}^n\sum _{a=1}^{\tau _x} S_i^{\{a\}} - \frac{\gamma N}{n}\sum _{i<j}d_{ij} \sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} \nonumber \\&- \frac{\gamma }{n}\sum _{i<j}\sum _{a < b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}} \end{aligned}$$
(11.20)

is the Hamiltonian (or energy function) of the replicated system, where \(\sum _{a<b}\) is the sum over all distinct pairs of replicas, that is, \(\sum _{a<b} = \sum _{a=1}^{\tau _x}\sum _{b = a+1}^{\tau _x}\).
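For the reader's convenience, the step from Eq. (11.16) to Eqs. (11.17)–(11.20) is sketched here: under the Gaussian prior, the average over each \(J_{ij}\) is a Gaussian integral obtained by completing the square (the shorthand \(A_{ij}\) is introduced only for this display),

$$\begin{aligned} \int _{-\infty }^{+\infty } \sqrt{\frac{n}{2 \pi \gamma }} \exp \Big (-\frac{n J_{ij}^2}{2 \gamma } + A_{ij} J_{ij}\Big )\, dJ_{ij} = \exp \Big (\frac{\gamma A_{ij}^2}{2n}\Big ),\qquad A_{ij} := \sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N d_{ij}. \end{aligned}$$

Expanding \(A_{ij}^2 = \tau _x + 2\sum _{a<b}S_i^{\{a\}}S_j^{\{a\}}S_i^{\{b\}}S_j^{\{b\}} + 2 N d_{ij}\sum _{a = 1}^{\tau _x}S_i^{\{a\}}S_j^{\{a\}} + N^2 d_{ij}^2\) and summing over \(i<j\), the configuration-independent terms produce the factor \(\gamma (n-1)N^2(C_2 + x/N)/4\) in Eq. (11.17), while the remaining terms constitute the coupling part of the Hamiltonian in Eq. (11.20).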

4.2 Plefka Expansion

Because the replicated free energy in Eq. (11.19) includes intractable multiple summations, an approximation is required to proceed with the evaluation. In this section, the replicated free energy in Eq. (11.19) is approximated using the Plefka expansion [3]. In brief, the Plefka expansion is a perturbative expansion of a Gibbs free energy, which is the dual form of the corresponding Helmholtz free energy.

The Gibbs free energy is obtained as

$$\begin{aligned} G_x(m,H,\gamma ) =- n \tau _x H m + \mathop {\text {extr}}_{\lambda }\Big \{\lambda n \tau _x m -\ln \sum _{\mathcal {S}_x}\exp \big ( - E_x(\mathcal {S}_x;\lambda ,\gamma )\big )\Big \}. \end{aligned}$$
(11.21)

The derivation of this Gibbs free energy is described in Sect. 11.7.1. The summation in Eq. (11.21) can be performed when \(\gamma = 0\), which gives

$$\begin{aligned} G_x(m,H,0)&=- n \tau _x H m + n \tau _x \mathop {\text {extr}}_{\lambda }\big \{\lambda m- \ln (2\cosh \lambda ) \big \} \nonumber \\&=- n \tau _x H m + n \tau _xe(m), \end{aligned}$$
(11.22)

where e(m) is the negative mean-field entropy defined by

$$\begin{aligned} e(m):=\frac{1+m}{2} \ln \frac{1+m}{2} + \frac{1-m}{2} \ln \frac{1-m}{2}. \end{aligned}$$
(11.23)
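For completeness, the extremization in Eq. (11.22) can be carried out explicitly: the stationarity condition gives \(m = \tanh \lambda \), i.e., \(\hat{\lambda } = \mathrm {atanh}\, m\), and substituting this back yields

$$\begin{aligned} \mathop {\text {extr}}_{\lambda }\big \{\lambda m- \ln (2\cosh \lambda ) \big \} = m\, \mathrm {atanh}\, m + \frac{1}{2}\ln (1 - m^2) - \ln 2 = e(m), \end{aligned}$$

where \(\cosh (\mathrm {atanh}\, m) = 1/\sqrt{1-m^2}\) and \(\mathrm {atanh}\, m = \frac{1}{2}\ln \frac{1+m}{1-m}\) have been used.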

In the context of the Plefka expansion, the Gibbs free energy \(G_x(m,H,\gamma )\) is approximated by the perturbation from \(G_x(m,H,0)\). Expanding \(G_x(m,H,\gamma )\) around \(\gamma = 0\) gives

$$\begin{aligned} \frac{G_x(m,H,\gamma )}{nN}&=-x H m+ x e(m) + \phi _x^{(1)}(m) \gamma + \phi _x^{(2)}(m)\gamma ^2 + O(\gamma ^3), \end{aligned}$$
(11.24)

where \(\phi _x^{(1)}(m)\) and \(\phi _x^{(2)}(m)\) are the expansion coefficients defined by

$$\begin{aligned} \phi _x^{(k)}(m):=\frac{1}{n N k!} \lim _{\gamma \rightarrow 0}\frac{\partial ^k G_x(m,H,\gamma )}{\partial \gamma ^k}. \end{aligned}$$

The forms of the two coefficients are presented in Eqs. (11.34) and (11.35) in Sect. 11.7.2.

From Eqs. (11.14), (11.17), (11.24), and (11.33), the approximation of the empirical Bayes likelihood function is obtained as

$$\begin{aligned} L_{\mathrm {EB}}(H,\gamma )&\approx HM -\mathop {\text {extr}}_{m}\Big [ Hm - e(m)+ \Phi (m)\gamma +\phi _{-1}^{(2)}(m)\gamma ^2\Big ], \end{aligned}$$
(11.25)

where

$$\begin{aligned} \Phi (m):=\phi _{-1}^{(1)}(m) - \frac{(n-1)N}{4n}\Big (C_2 - \frac{1}{N}\Big ). \end{aligned}$$

The forms of \(\phi _{-1}^{(1)}(m)\) and \(\phi _{-1}^{(2)}(m)\) are presented in Eqs. (11.37) and (11.38) in Sect. 11.7.2.

4.3 Algorithm for Hyperparameter Estimation

As mentioned in Sect. 11.3, the empirical Bayes inference is achieved by maximizing \(L_{\mathrm {EB}}(H,\gamma )\) with respect to H and \(\gamma \) (cf. Eq. (11.12)). The extremum condition in Eq. (11.25) with respect to H leads to

$$\begin{aligned} \hat{m} = M, \end{aligned}$$
(11.26)

where \(\hat{m}\) is the value of m that satisfies the extremum condition in Eq. (11.25). By combining the extremum condition of Eq. (11.25) with respect to m with Eq. (11.26),

$$\begin{aligned} \hat{H} =\mathrm {atanh} M - \Big (\frac{\partial \phi _{-1}^{(1)}(M)}{\partial M}\gamma +\frac{\partial \phi _{-1}^{(2)}(M)}{\partial M}\gamma ^2\Big ) \end{aligned}$$
(11.27)

is obtained, where \(\mathrm {atanh} x\) is the inverse function of \(\tanh x\). From Eqs. (11.25) and (11.26), the optimal value of \(\gamma \) is obtained by

$$\begin{aligned} \hat{\gamma }&=\mathop {\text {arg max}}_{\gamma }\big [-\Phi (M)\gamma -\phi _{-1}^{(2)}(M)\gamma ^2\big ]. \end{aligned}$$
(11.28)

Since Eq. (11.28) is a univariate quadratic optimization, \(\hat{\gamma }\) is immediately obtained as follows: (i) when \(\phi _{-1}^{(2)}(M) > 0\) and \(\Phi (M) \ge 0\), or when \(\phi _{-1}^{(2)}(M) = 0\) and \(\Phi (M) > 0\), \(\hat{\gamma } = 0\); (ii) when \(\phi _{-1}^{(2)}(M) > 0\) and \(\Phi (M) < 0\), \(\hat{\gamma } = - \Phi (M) / (2 \phi _{-1}^{(2)}(M))\); and (iii) otherwise, \(\hat{\gamma } \rightarrow \infty \). The case \(\phi _{-1}^{(2)}(M) = \Phi (M) = 0\) is ignored because it is rarely observed in realistic settings. Using Eqs. (11.27) and (11.28), the solution to the empirical Bayes inference is obtained without any iterative process. The pseudocode of the presented procedure is shown in Algorithm 1. The computational complexity of the presented method is \(O(Nn^2)\); remember that that of the exact ML estimation is \(O(2^n)\).

Algorithm 1: Pseudocode of the presented hyperparameter estimation procedure
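As a rough Python sketch of the above procedure (assuming NumPy; this is not the authors' reference implementation): the sample averages M and \(C_2\) of Eq. (11.18) are computed from the data in \(O(Nn^2)\) time, \(\hat{\gamma }\) follows from the case analysis of Eq. (11.28), and \(\hat{H}\) from Eq. (11.27). The coefficients \(\Phi \), \(\phi _{-1}^{(2)}\), and their derivatives are given in Sect. 11.7.2 and are therefore passed in here as user-supplied callables (placeholders):

```python
import numpy as np


def sample_statistics(data):
    """M and C_2 of Eq. (11.18) from an (N, n) array of +-1 observations."""
    data = np.asarray(data, dtype=float)
    N, n = data.shape
    d_i = data.mean(axis=0)                    # d_i
    d_ij = data.T @ data / N                   # d_ij (diagonal unused)
    M = d_i.mean()
    iu = np.triu_indices(n, k=1)
    C2 = (d_ij[iu] ** 2).mean()                # (2 / (n(n-1))) * sum_{i<j} d_ij^2
    return M, C2


def estimate_hyperparameters(data, Phi, phi2, dphi1_dM, dphi2_dM):
    """Estimates (H_hat, gamma_hat) via Eqs. (11.27) and (11.28).

    Phi, phi2, dphi1_dM, dphi2_dM are callables of M (closing over n, N, C2
    as needed) implementing the appendix expressions; they are placeholders here.
    """
    M, C2 = sample_statistics(data)
    Phi_M, phi2_M = Phi(M), phi2(M)
    # Case analysis for the quadratic maximization in Eq. (11.28).
    if phi2_M > 0 and Phi_M < 0:
        gamma_hat = -Phi_M / (2 * phi2_M)
    elif (phi2_M > 0 and Phi_M >= 0) or (phi2_M == 0 and Phi_M > 0):
        gamma_hat = 0.0
    else:
        gamma_hat = np.inf                     # remaining cases: gamma_hat diverges
    # Eq. (11.27) for the bias hyperparameter.
    H_hat = np.arctanh(M) - (dphi1_dM(M) * gamma_hat + dphi2_dM(M) * gamma_hat ** 2)
    return H_hat, gamma_hat
```

Note that no iterative optimization appears anywhere: both estimates are obtained in closed form from the sample statistics, which is the source of the \(O(Nn^2)\) complexity.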

In the presented method, the value of \(\hat{H}\) does not affect the determination of \(\hat{\gamma }\). Several mean-field-based methods for BML (e.g., those listed in Sect. 11.1) have similar procedures, in which \(\hat{\boldsymbol{J}}_{\mathrm {ML}}\) is determined separately from \(\hat{h}_{\mathrm {ML}}\). This is a common property of mean-field-based methods for BML, including the present empirical Bayes problem.

Although the presented method is derived from the Gaussian prior in Eq. (11.8), the same procedure can be applied to the case of the Laplace prior in Eq. (11.9) [17].

5 Demonstration

This section discusses the results of numerical experiments. In these experiments, the observed dataset \(\mathcal {D}\) was generated from the generative Boltzmann machine (gBM), which has the same form as Eq. (11.1), via Gibbs sampling (with a simulated-annealing-like strategy). The parameters of the gBM were drawn from the prior distributions in Eqs. (11.4) and (11.10); that is, the model-matched case (in which the generative and learning models are identical) was considered. In the following, the notations \(\alpha := N / n\) and \(J := \sqrt{\gamma }\) are used. The standard deviations of the Gaussian prior in Eq. (11.8) and the Laplace prior in Eq. (11.9) can thus both be written as \(J / \sqrt{n}\). The hyperparameters of the gBM are denoted by \(H_{\mathrm {true}}\) and \(J_{\mathrm {true}}\).
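A minimal sketch of how such a dataset can be generated is given below (assuming NumPy). It uses a plain single-spin Gibbs sweep at fixed temperature; the simulated-annealing-like schedule and other details of the original experiments are not reproduced here, and the function name is illustrative:

```python
import numpy as np


def sample_gbm_dataset(n, N, H_true, J_true, burn_in=2000, thin=50, seed=0):
    """Draw N configurations from the gBM of Eq. (11.1) by single-spin Gibbs sampling.

    The couplings are drawn from the Gaussian prior of Eq. (11.8) with
    gamma = J_true**2, and the bias is fixed to H_true as in Eq. (11.10).
    """
    rng = np.random.default_rng(seed)
    gamma = J_true ** 2
    J = np.triu(rng.normal(0.0, np.sqrt(gamma / n), size=(n, n)), k=1)
    J = J + J.T                                  # symmetric couplings, zero diagonal
    S = rng.choice([-1, 1], size=n)
    data = []
    for sweep in range(burn_in + N * thin):
        for i in range(n):
            field = H_true + J[i] @ S            # local field on spin i (J[i, i] = 0)
            p_up = 1.0 / (1.0 + np.exp(-2.0 * field))
            S[i] = 1 if rng.random() < p_up else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            data.append(S.copy())
    return np.array(data), J


# Example with alpha = N / n = 0.4, as in the experiments below (hypothetical sizes).
# data, J_matrix = sample_gbm_dataset(n=100, N=40, H_true=0.0, J_true=0.5)
```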

5.1 Gaussian Prior Case

We now consider the case in which the prior distribution of \(\boldsymbol{J}\) is the Gaussian prior in Eq. (11.8). In this case, the Boltzmann machine corresponds to the Sherrington-Kirkpatrick (SK) model [24], and thus exhibits a spin-glass transition at \(J = 1\) when \(h = 0\) (i.e., when \(H = 0\)).

We consider the case \(H_{\mathrm {true}} = 0\). The scatter plots of the estimates \(\hat{J}\) for various \(J_{\mathrm {true}}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\) are shown in Fig. 11.3. When \(J_{\mathrm {true}} < 1\), the estimates \(\hat{J}\) are consistent with \(J_{\mathrm {true}}\), whereas they deviate from \(J_{\mathrm {true}}\) when \(J_{\mathrm {true}} > 1\). This implies that the validity of the perturbative approximation is lost in the spin-glass phase, as is often the case with mean-field approximations.

Fig. 11.3
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\): a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

Figure 11.4 shows the scatter plots for various \(\alpha \).

Fig. 11.4
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) for various \(\alpha = N / n\) when \(H_{\mathrm {true}} = 0\): a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

A smaller \(\alpha \) causes \(\hat{J}\) to be overestimated and a larger \(\alpha \) causes it to be underestimated. In our experiments, at least, the optimal value of \(\alpha \) seems to be \(\alpha _{\mathrm {opt}} \approx 0.4\) when \(H_{\mathrm {true}} = 0\). Our method can also estimate \(\hat{H}\). The results for the estimation of \(\hat{H}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\) are shown in Fig. 11.5.

Fig. 11.5
Results of estimation of \(\hat{H}\) versus \(J_{\mathrm {true}}\) when \(H_{\mathrm {true}} = 0\) and \(\alpha = 0.4\): a the MAE and b standard deviation. These plots represent the average values over 300 experiments

Figure 11.5a, b shows the average of \(|H_{\mathrm {true}} - \hat{H}|\) (i.e., the mean absolute error (MAE)) and the standard deviation of \(\hat{H}\) over 300 experiments, respectively. The MAE and standard deviation increase in the region where \(J_{\mathrm {true}} > 1\).

5.2 Laplace Prior Case

We now consider the case in which the prior distribution of \(\boldsymbol{J}\) is the Laplace prior in Eq. (11.9). The scatter plots of \(J_{\mathrm {true}}\) versus \(\hat{J}\) for various \(\alpha \) when \(H_{\mathrm {true}} = 0\) are shown in Fig. 11.6.

Fig. 11.6
Scatter plots of \(J_{\mathrm {true}}\) (horizontal axis) versus \(\hat{J}\) (vertical axis) for various \(\alpha = N/n\), when \(H_{\mathrm {true}} = 0\), in the case of the Laplace prior: a \(n = 300\) and b \(n = 500\). These plots represent the average values over 300 experiments

The plots shown in Fig. 11.6 almost completely overlap with those in Fig. 11.4.

6 Summary and Discussion

This chapter has described the hyperparameter inference algorithm proposed in [17]. As evident from the numerical experiments, the presented inference method works well in both the Gaussian and Laplace prior cases, except in the spin-glass phase. However, the presented method has the drawback that it is sensitive to the value of \(\alpha = N / n\). In the experiments in Sect. 11.5, although \(\alpha \approx 0.4\) was appropriate when \(H_{\mathrm {true}} = 0\), it is known that the appropriate value decreases as \(H_{\mathrm {true}}\) increases [17]. Since the value of \(H_{\mathrm {true}}\) cannot be known in advance, the appropriate setting of \(\alpha \) is also unknown, and the estimation of \(\alpha _{\mathrm {opt}}\) remains an open problem. The existence of an optimal value of \(\alpha \) seems unnatural, because larger datasets are usually better in machine learning. This peculiar behavior can be attributed to the truncation of the Plefka expansion. A more detailed discussion of this issue is presented in [17].

Finally, we review the presented method from the perspective of sublinear computation, setting aside the aforementioned issues. The Boltzmann machine in Eq. (11.1) has p parameters, where \(p = O(n^2)\). In usual machine learning, at least \(N = O(p)\) data points are required to obtain a good ML estimate for the Boltzmann machine. Therefore, hyperparameter inference “without” the empirical Bayes method (namely, the strategy in which the hyperparameters are inferred through the ML estimate, in a manner similar to that discussed in the latter part of Sect. 11.3) requires a dataset of size O(p). In contrast, the presented method requires only \(N = O(n)= O(\sqrt{p})\), because \(\alpha = O(1)\) with respect to n.