1 Introduction

Overfitting is a critical issue in machine learning, and it becomes more severe as the representation power of the learning model increases and the size of the training dataset decreases. Regularization techniques are the most popular methods for mitigating overfitting (Bishop 2006). In standard regularization methods, such as \(L_1\) or \(L_2\) regularization, penalties are imposed on the learning parameters by adding penalty terms to the objective function, such as a loss or log-likelihood function. These standard regularization methods often involve hyperparameters (e.g., regularization coefficients) that control the strength of the penalties, and the values of the hyperparameters are fixed during training.

The discriminative restricted Boltzmann machine (dRBM) is a probabilistic three-layered neural network, consisting of input, hidden, and output layers, designed for solving classification problems (Larochelle and Bengio 2008; Larochelle et al. 2012). The dRBM is constructed based on the restricted Boltzmann machine (RBM) (Smolensky 1986; Hinton 2002). The representational capacity of the dRBM is governed by the size of the hidden layer and grows as that size increases. A regularization method for the dRBM, the sparse dRBM (S-dRBM), was previously proposed (Yasuda and Katsumata 2023). In this regularization method, a Laplace-type sparse prior is placed on the hidden layer, which encourages sparse representations of the hidden layer. An advantage of this regularization method is that the regularization strength (i.e., the strength of the prior) is trainable; in other words, the regularization strength can be adaptively tuned based on dataset complexity, within the standard scenario of maximum likelihood (ML) learning.

The Gaussian–Bernoulli RBM (GBRBM) is a variant of the RBM that can handle continuous data points (Hinton and Salakhutdinov 2006; Cho et al. 2011), and the canonicalized GBRBM is a reparameterized version of the GBRBM (Yasuda and Xiong 2023). RBMs are also actively investigated in the field of physics (Decelle and Furtlehner 2021; Chen et al. 2018; Nomura and Imada 2021; Torlai et al. 2018; Carleo and Troyer 2017). In standard GBRBMs, the hidden variables take binary values, for example, \(\{0,1\}\) or \(\{-1,1\}\). In this study, we consider the (canonicalized) Gaussian–Discrete RBM (GDRBM), in which the hidden variables can take multiple discrete values. When the hidden variables are binary, the GDRBM is equivalent to the canonicalized GBRBM. This study proposes a sparse-regularized GDRBM, referred to as the sparse GDRBM (S-GDRBM), by applying the regularization method successfully employed in the S-dRBM.

The remainder of this paper is organized as follows. The GDRBM is defined in section 2. Section 3 presents the S-GDRBM, which is obtained by combining a Laplace-type sparse prior for the hidden layer with the GDRBM. The details of the S-GDRBM are discussed in section 3.1, and the maximum-likelihood learning of the S-GDRBM based on the spatial Monte Carlo integration (SMCI) method (Yasuda 2015; Yasuda and Uchizawa 2021) is discussed in section 3.2. In section 4, we present learning experiments using artificial datasets, which show that the proposed S-GDRBM effectively suppresses overfitting. Section 5 concludes the paper and presents future research directions.

2 Gaussian-discrete restricted Boltzmann machine

We consider a GDRBM defined on a complete bipartite graph consisting of two layers: visible and hidden layers. The visible layer consists of continuous visible variables \(\varvec{v}:= \{v_i \in \mathbb {R}\mid i \in V\}\), and the hidden layer consists of discrete hidden variables \(\varvec{h}:= \{h_j \in \mathcal {X} _H \mid j \in H\}\), where V and H are the sets of indices of the visible and hidden variables, respectively; \(\mathcal {X} _H\) is a discrete sample space. The sizes of the visible and hidden layers are denoted by n and m, respectively (i.e., \(|V| = n\) and \(|H| = m\)). The energy function of the GDRBM is defined by

$$\begin{aligned} E_{\theta }(\varvec{v}, \varvec{h}):=\sum _{i \in V} \frac{v_i^2}{2 {{\,\textrm{sfp}\,}}\sigma _i} -\sum _{i \in V}b_i v_i- \sum _{j \in H}c_j h_j - \sum _{i\in V}\sum _{j \in H}w_{i,j}v_ih_j , \end{aligned}$$
(1)

where \({{\,\textrm{sfp}\,}}z:= \ln (1 + e^z)\) is the softplus function; here, the learning parameters, \(\{b_i, \sigma _i, c_j, w_{i,j}\}\), are collectively denoted by \(\theta\). The GDRBM is a joint distribution expressed as

$$\begin{aligned} P_{\theta }(\varvec{v}, \varvec{h}):=\frac{1}{Z_{\theta }} \exp \big ( - E_{\theta }(\varvec{v}, \varvec{h})\big ), \end{aligned}$$
(2)

where

$$\begin{aligned} Z_{\theta }:= \int _{-\infty }^{+\infty } \Big (\sum _{\varvec{h} }\exp \big ( - E_{\theta }(\varvec{v}, \varvec{h})\big ) \Big )d\varvec{v} \end{aligned}$$

is the normalization constant (or the partition function); here, \(\sum _{\varvec{h}}\) denotes the multiple summation over \(\varvec{h} \in \mathcal {X} _H^m\), and \(\int _{-\infty }^{+\infty } d\varvec{v}\) denotes the multiple integration over \(\varvec{v} \in \mathbb {R}^n\). The GDRBM in equation (2) is a generalized model of the canonicalized GBRBM (Yasuda and Xiong 2023). When the hidden variables are binary, \(\mathcal {X} _H = \{0,1\}\), the GDRBM is equivalent to the canonicalized GBRBM. The softplus function in the first term of equation (1) is employed for learning stability (Yasuda and Xiong 2023).
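As an illustrative sketch (not the implementation used in this study), the energy in equation (1) can be evaluated with NumPy as follows. The function names `softplus` and `gdrbm_energy` and the array layout (\(\varvec{W}\) stored as an \(n \times m\) array) are assumptions made for this example.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def gdrbm_energy(v, h, b, sigma, c, W):
    """Energy E_theta(v, h) of equation (1).

    v, b, sigma: arrays of shape (n,); h, c: arrays of shape (m,);
    W: array of shape (n, m) holding w_{i,j}.
    """
    s = softplus(sigma)                      # sfp(sigma_i), positive by construction
    quadratic = np.sum(v ** 2 / (2.0 * s))   # sum_i v_i^2 / (2 sfp sigma_i)
    return quadratic - b @ v - c @ h - v @ W @ h
```

Because \({{\,\textrm{sfp}\,}}\sigma _i > 0\) for any real \(\sigma _i\), the quadratic term stays positive without constraining \(\sigma _i\), which is consistent with the stability argument given above.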

3 Proposed model: sparse GDRBM

The representation power of the GDRBM increases with an increase in m (i.e., the size of the hidden layer), and the overfitting problem increases in severity as the representation power increases. Therefore, the optimization of m is critical to prevent overfitting. However, in the standard scenario, m is a hyperparameter and is not trainable. Numerous studies have addressed this issue, introducing various approaches such as sparse RBM (S-RBM) (Lee et al. 2007), sparse group RBM (SG-RBM) (Luo et al. 2011), Gaussian cardinality RBM (GC-RBM) (Wan et al. 2015), and energy-function-constraint sparse RBM (ES-RBM) (Wei et al. 2019). The S-RBM, SG-RBM, GC-RBM, and ES-RBM introduce regularizers to encourage sparse representations of the hidden layer; however, they have hyperparameters related to the strength of the regularizers.

For the dRBM, an alternative sparse regularization, the S-dRBM, was proposed (Yasuda and Katsumata 2023); this model introduces a regularizer that penalizes the activations of hidden variables in its energy function, aiming to encourage sparse representations of the hidden layer. The concept of the S-dRBM is similar to that of the ES-RBM. However, the S-dRBM has no regularization hyperparameters: the strength of its regularization is adaptively tuned to the complexity of the dataset within the standard scenario of ML learning, and the S-dRBM was confirmed to be superior to the ES-RBM (Yasuda and Katsumata 2023). This section presents the S-GDRBM, which is based on the S-dRBM.

3.1 Model definition

The key concept of the proposed sparse regularization is simple: if \(h_j\) always takes the value zero (i.e., \(h_j\) is always in the off-state), the influence of the variable is effectively eliminated from the model. Based on this, a sparsity assumption, similar to that in \(L_1\) regularization, is imposed on the values of the hidden variables. In the Bayesian interpretation, \(L_1\) regularization can be viewed as a Laplace prior (Bishop 2006; Rish and Grabarnik 2014). Here, we assume that \(\mathcal {X} _H\) is a discrete sample space defined by

$$\begin{aligned} \mathcal {X} _H = \mathcal {X} _H(R) := \{ -1 + r / R \mid r = 0, 1,2,\ldots , 2R\}, \end{aligned}$$
(3)

where R is a positive integer; for example, \(\mathcal {X} _H(1)= \{-1,0,1\}\) and \(\mathcal {X} _H(2) = \{-1,-1/2,0,1/2,1\}\). We consider a (discrete-type) Laplace distribution over \(\varvec{h}\):

$$\begin{aligned} P_{ \textrm{lap} }(\varvec{h} \mid \varvec{\alpha }) \propto \prod _{j \in H} \exp \big ( - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\big ). \end{aligned}$$
(4)

In this distribution, the hidden variables are zero with high probabilities, and \(\varvec{\alpha }:= \{\alpha _j \in \mathbb {R}\mid j \in H\}\) controls the probabilities. By combining the Laplace distribution with the GDRBM, a new model can be defined as \(P_{\phi }(\varvec{v},\varvec{h})\propto P_{\theta }(\varvec{v}, \varvec{h})P_{ \textrm{lap} }(\varvec{h} \mid \varvec{\alpha })\); thus, the resultant model is expressed as

$$\begin{aligned} P_{\phi }(\varvec{v},\varvec{h})= \frac{1}{Z_{\phi }} \exp \Big ( -E_{\theta }(\varvec{v}, \varvec{h}) - \sum _{j \in H} ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\Big ), \end{aligned}$$
(5)

where

$$\begin{aligned} Z_{\phi }:= \int _{-\infty }^{+\infty } \left\{ \sum _{\varvec{h} }\exp \Big ( -E_{\theta }(\varvec{v}, \varvec{h}) - \sum _{j \in H} ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\Big ) \right\} d\varvec{v} \end{aligned}$$

is the normalization constant, and \(\phi\) denotes the set of parameters comprising \(\theta\) and \(\varvec{\alpha }\). The second term in the exponent of equation (5) functions as the penalties for non-zero hidden variables, and the strength of the penalties is controlled by \(\varvec{\alpha }\). Equation (5) is the S-GDRBM. The S-GDRBM is identical to the GDRBM when \(\alpha _j \rightarrow -\infty\) (i.e., \({{\,\textrm{sfp}\,}}\alpha _j = 0\)) for all \(j \in H\).
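To make the sample space in equation (3) and the prior in equation (4) concrete, the following minimal sketch builds \(\mathcal {X} _H(R)\) and the normalized single-unit factor of the Laplace-type prior. The helper names `sample_space` and `laplace_prior` are hypothetical and are introduced only for illustration.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def sample_space(R):
    # X_H(R) = {-1 + r/R : r = 0, 1, ..., 2R}, equation (3).
    return -1.0 + np.arange(2 * R + 1) / R


def laplace_prior(alpha_j, R):
    """Normalized single-unit factor of equation (4) over X_H(R)."""
    x = sample_space(R)
    logits = -softplus(alpha_j) * np.abs(x)
    p = np.exp(logits - logits.max())        # subtract the max for stability
    return x, p / p.sum()
```

With a large \(\alpha _j\) (for example, \(\alpha _j = 10\)), almost all of the probability mass sits on \(h_j = 0\); as \(\alpha _j \rightarrow -\infty\), the factor approaches the uniform distribution over \(\mathcal {X} _H(R)\), which corresponds to the GDRBM limit noted above.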

The layer-wise conditional distributions of the S-GDRBM are as follows:

$$\begin{aligned} P_{\phi }(\varvec{v} \mid \varvec{h})&=\prod _{i \in V}\frac{1}{\sqrt{2 \pi {{\,\textrm{sfp}\,}}\sigma _i}} \exp \left\{ - \frac{( v_i -\lambda _i(\varvec{h}))^2}{2 {{\,\textrm{sfp}\,}}\sigma _i}\right\} ,\end{aligned}$$
(6)
$$\begin{aligned} P_{\phi }(\varvec{h} \mid \varvec{v})&=\prod _{j \in H} \frac{\exp (\tau _j(\varvec{v}) h_j - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j| )}{G_j(\varvec{v})}, \end{aligned}$$
(7)

where

$$\begin{aligned} \tau _j(\varvec{v}):=c_j + \sum _{i \in V} w_{i,j} v_i, \quad \lambda _i(\varvec{h}):=({{\,\textrm{sfp}\,}}\sigma _i)\left( b_i + \sum _{j \in H}w_{i,j}h_j\right) , \end{aligned}$$
(8)

and

$$\begin{aligned} G_j(\varvec{v}):= \sum _{h_j} \exp \big (\tau _j(\varvec{v}) h_j - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j| \big ) \end{aligned}$$
(9)

is the normalization constant of \(P_{\phi }(h_j \mid \varvec{v})\). The marginal distribution over \(\varvec{v}\) is obtained as

$$\begin{aligned} P_{\phi }(\varvec{v})=\frac{1}{Z_{\phi }} \exp \left( -\sum _{i \in V} \frac{v_i^2}{2 {{\,\textrm{sfp}\,}}\sigma _i} +\sum _{i \in V}b_i v_i + \sum _{j \in H} \ln G_j(\varvec{v})\right) . \end{aligned}$$
(10)
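The conditional distributions in equations (6) to (9) are cheap to evaluate because each factorizes over units. The sketch below computes the table \(P_{\phi }(h_j \mid \varvec{v})\) for every hidden unit and the conditional mean \(\lambda _i(\varvec{h})\); the function names and the \(n \times m\) layout of \(\varvec{W}\) are assumptions of this example, not part of the original formulation.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def hidden_conditional(v, c, W, alpha, R):
    """Return (x, P) with P[j, k] = P(h_j = x[k] | v), equations (7)-(9)."""
    x = -1.0 + np.arange(2 * R + 1) / R            # sample space X_H(R)
    tau = c + W.T @ v                              # tau_j(v), shape (m,)
    logits = np.outer(tau, x) - np.outer(softplus(alpha), np.abs(x))
    logits -= logits.max(axis=1, keepdims=True)    # stabilize before exponentiating
    P = np.exp(logits)
    return x, P / P.sum(axis=1, keepdims=True)     # division by G_j(v)


def visible_mean(h, b, sigma, W):
    # lambda_i(h) of equation (8); the conditional variance is sfp(sigma_i).
    return softplus(sigma) * (b + W @ h)
```

Given the table `P`, the quantities \(\sum _{h_j} h_j P_{\phi }(h_j \mid \varvec{v})\) and \(\sum _{h_j} |h_j| P_{\phi }(h_j \mid \varvec{v})\) are obtained as `P @ x` and `P @ np.abs(x)`, respectively.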

The marginal distribution over \(\varvec{h}\) is obtained through the multivariate Gaussian integral, which leads to

$$\begin{aligned} P_{\phi }(\varvec{h})=\frac{1}{ \mathcal {Z} _{\phi }}\exp \left( \varvec{\beta }^{ \textrm{t} }\varvec{h} + \frac{1}{2} \varvec{h}^{ \textrm{t} }\varvec{J}\varvec{h} - \sum _{j \in H} ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\right) , \end{aligned}$$
(11)

where \(\varvec{\beta } \in \mathbb {R}^{m}\) and \(\varvec{J} \in \mathbb {R}^{m \times m}\) are defined as

$$\begin{aligned} \varvec{\beta }:=\varvec{c} + \varvec{W}^{ \textrm{t} }\varvec{S}\varvec{b},\quad \varvec{J}:=\varvec{W}^{ \textrm{t} }\varvec{S}\varvec{W}, \end{aligned}$$
(12)

where \(\varvec{S} \in \mathbb {R}^{n \times n}\) is a diagonal matrix whose (ii)-element is \({{\,\textrm{sfp}\,}}\sigma _i\), \(\varvec{b} \in \mathbb {R}^{n}\) and \(\varvec{c} \in \mathbb {R}^{m}\) are the vectors of \(b_i\) and \(c_j\), respectively, and \(\varvec{W} \in \mathbb {R}^{n \times m}\) is the matrix of \(w_{i,j}\); \(\mathcal {Z} _{\phi }\) is the normalization constant defined by

$$\begin{aligned} \mathcal {Z} _{\phi }:=\sum _{\varvec{h}}\exp \left( \varvec{\beta }^{ \textrm{t} }\varvec{h} + \frac{1}{2} \varvec{h}^{ \textrm{t} }\varvec{J}\varvec{h} - \sum _{j \in H} ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\right) , \end{aligned}$$
(13)

which is expressed in terms of \(Z_{\phi }\) as

$$\begin{aligned} \mathcal {Z} _{\phi }=Z_{\phi }\exp \left( -\frac{1}{2}\sum _{i \in V} \ln (2 \pi {{\,\textrm{sfp}\,}}\sigma _i) - \frac{1}{2} \varvec{b}^{ \textrm{t} }\varvec{S}\varvec{b}\right) . \end{aligned}$$
(14)
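For reference, equations (11) and (14) can be checked by completing the square in each \(v_i\). The following worked step is a sketch that uses only equations (5), (8), and (12) and introduces no new symbols. For a fixed \(\varvec{h}\), the integral over \(v_i\) is

$$\begin{aligned} \int _{-\infty }^{+\infty } \exp \left( -\frac{v_i^2}{2 {{\,\textrm{sfp}\,}}\sigma _i} + \frac{\lambda _i(\varvec{h})}{{{\,\textrm{sfp}\,}}\sigma _i} v_i\right) dv_i = \sqrt{2 \pi {{\,\textrm{sfp}\,}}\sigma _i}\, \exp \left( \frac{\lambda _i(\varvec{h})^2}{2 {{\,\textrm{sfp}\,}}\sigma _i}\right) , \end{aligned}$$

and the resulting exponents combine with the hidden bias term as

$$\begin{aligned} \sum _{i \in V} \frac{\lambda _i(\varvec{h})^2}{2 {{\,\textrm{sfp}\,}}\sigma _i} + \varvec{c}^{ \textrm{t} }\varvec{h} = \frac{1}{2}(\varvec{b} + \varvec{W}\varvec{h})^{ \textrm{t} }\varvec{S}(\varvec{b} + \varvec{W}\varvec{h}) + \varvec{c}^{ \textrm{t} }\varvec{h} = \frac{1}{2}\varvec{b}^{ \textrm{t} }\varvec{S}\varvec{b} + \varvec{\beta }^{ \textrm{t} }\varvec{h} + \frac{1}{2}\varvec{h}^{ \textrm{t} }\varvec{J}\varvec{h}. \end{aligned}$$

The factors that do not depend on \(\varvec{h}\), namely \(\prod _{i \in V}\sqrt{2 \pi {{\,\textrm{sfp}\,}}\sigma _i}\) and \(\exp (\varvec{b}^{ \textrm{t} }\varvec{S}\varvec{b}/2)\), are absorbed into the normalization constant, which gives the relation in equation (14).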

Equation (11) can be regarded as a Boltzmann machine defined on a fully connected graph:

$$\begin{aligned} P_{\phi }(\varvec{h})\propto \exp \left( \sum _{j \in H} q_j(h_j) + \sum _{i < j \in H}J_{i,j}h_i h_j \right) , \end{aligned}$$

where \(q_j(h_j):= \beta _j h_j + J_{j,j}h_j^2/2 - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|\) is the potential on the hidden variable \(h_j\).

The marginal distribution of the S-GDRBM (as well as the GDRBM) over \(\varvec{v}\) can be viewed as a Gaussian mixture model. The marginal distribution is expressed as

$$\begin{aligned} P_{\phi }(\varvec{v}) = \sum _{\varvec{h}}P_{\phi }(\varvec{v} \mid \varvec{h}) P_{\phi }(\varvec{h}). \end{aligned}$$

Here, the conditional distribution, \(P_{\phi }(\varvec{v} \mid \varvec{h})\), is the Gaussian distribution (cf. equation (6)); thus, this expression can be considered a Gaussian mixture model with \(| \mathcal {X} _H|^m\) Gaussian components in which \(P_{\phi }(\varvec{h})\) functions as the mixture weight. Because \(| \mathcal {X} _H|^m = (2R + 1)^m\), the number of Gaussian components grows exponentially with m and polynomially with R. Therefore, although its effect is small compared with that of increasing m, increasing R may also increase the representation power of the S-GDRBM.
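For very small m, the mixture weights \(P_{\phi }(\varvec{h})\) in equation (11) can be enumerated directly. The following brute-force sketch does so; the function name `hidden_marginal` is hypothetical, and the enumeration over \((2R+1)^m\) states is an assumption suited only to toy sizes.

```python
import itertools

import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def hidden_marginal(b, sigma, c, W, alpha, R):
    """Enumerate all states h in X_H(R)^m and return (states, P(h))."""
    x = -1.0 + np.arange(2 * R + 1) / R
    s = softplus(sigma)                       # diagonal of S
    beta = c + W.T @ (s * b)                  # beta = c + W^t S b, equation (12)
    J = W.T @ (s[:, None] * W)                # J = W^t S W, equation (12)
    states = np.array(list(itertools.product(x, repeat=len(c))))
    log_w = (states @ beta
             + 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
             - np.abs(states) @ softplus(alpha))
    log_w -= log_w.max()                      # stabilize before exponentiating
    w = np.exp(log_w)
    return states, w / w.sum()                # division by the constant in (13)
```

The number of returned rows equals \((2R+1)^m\), the number of Gaussian components, so this routine becomes impractical quickly as m grows.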

3.2 Maximum-likelihood learning based on spatial Monte Carlo integration

We assume that a training dataset consisting of N data points, \(D:=\{ \textbf{v} ^{(\mu )} \}_{\mu =1}^N\), is obtained. The learning of the S-GDRBM is achieved by maximizing the log likelihood,

$$\begin{aligned} \ell (\phi ):= \frac{1}{N}\sum _{\mu =1}^N \ln P_{\phi }( \textbf{v} ^{(\mu )}), \end{aligned}$$
(15)

with respect to \(\phi\). From equation (10), the log likelihood is expressed as

$$\begin{aligned} \ell (\phi )=-\sum _{i \in V} \frac{1}{2 {{\,\textrm{sfp}\,}}\sigma _i}\mathbb {E}_D[v_i^2] +\sum _{i \in V}b_i \mathbb {E}_D[v_i] + \sum _{j \in H} \mathbb {E}_D\big [\ln G_j(\varvec{v})\big ] - \ln Z_{\phi }, \end{aligned}$$

where \(\mathbb {E}_D[\cdots ]\) denotes the sample average over the training dataset, that is,

$$\begin{aligned} \mathbb {E}_D[f(\varvec{v})] = \frac{1}{N}\sum _{\mu =1}^N f \big ( \textbf{v} ^{(\mu )} \big ). \end{aligned}$$

Therefore, the gradients of the log likelihood are obtained as follows. The gradients for \(b_i\) and \(\sigma _i\) are

$$\begin{aligned} \frac{\partial \ell (\phi )}{\partial b_i} = \mathbb {E}_D[v_i] - \mathbb {E}_{\phi }[v_i] \end{aligned}$$
(16)

and

$$\begin{aligned} \frac{\partial \ell (\phi )}{\partial \sigma _i} = \frac{{{\,\textrm{sig}\,}}\sigma _i}{2 ({{\,\textrm{sfp}\,}}\sigma _i)^2}\big (\mathbb {E}_D[v_i^2] - \mathbb {E}_{\phi }[v_i^2]\big ), \end{aligned}$$
(17)

respectively, where \({{\,\textrm{sig}\,}}z:=1/(1 + e^{-z})\) is the sigmoid function, and \(\mathbb {E}_{\phi }[\cdots ]\) denotes the model expectation of the S-GDRBM, that is,

$$\begin{aligned} \mathbb {E}_{\phi }[\cdots ]:= \int _{-\infty }^{+\infty }\sum _{\varvec{h}}(\cdots )P_{\phi }(\varvec{v},\varvec{h}) d\varvec{v}. \end{aligned}$$

Next, the gradients for \(c_j\) and \(w_{i,j}\) are

$$\begin{aligned} \frac{\partial \ell (\phi )}{\partial c_j} =\mathbb {E}_D\big [H_j(\varvec{v})\big ] -\mathbb {E}_{\phi }[h_j] \end{aligned}$$
(18)

and

$$\begin{aligned} \frac{\partial \ell (\phi )}{\partial w_{i,j}} =\mathbb {E}_D\big [v_i H_j(\varvec{v})\big ] -\mathbb {E}_{\phi }[v_i h_j], \end{aligned}$$
(19)

respectively, where

$$\begin{aligned} H_j(\varvec{v}):= \sum _{h_j}h_j P_{\phi }(h_j \mid \varvec{v}) =\frac{\sum _{h_j}h_j \exp (\tau _j(\varvec{v}) h_j - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j| )}{G_j(\varvec{v})}. \end{aligned}$$
(20)

Finally, the gradient for \(\alpha _j\) is

$$\begin{aligned} \frac{\partial \ell (\phi )}{\partial \alpha _j} = ({{\,\textrm{sig}\,}}\alpha _j) \big (-\mathbb {E}_D \big [Q_j(\varvec{v}) \big ] + \mathbb {E}_{\phi } \big [|h_j| \big ]\big ), \end{aligned}$$
(21)

where

$$\begin{aligned} Q_j(\varvec{v}):=\sum _{h_j}|h_j| P_{\phi }(h_j \mid \varvec{v}) =\frac{\sum _{h_j}|h_j| \exp (\tau _j(\varvec{v}) h_j - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j| )}{G_j(\varvec{v})}. \end{aligned}$$
(22)

The ML learning is conducted using a gradient ascent method based on the gradients in equations (16), (17), (18), (19), and (21), which implies that the sparsity parameters \(\varvec{\alpha }\) and the other learning parameters \(\theta\) are simultaneously tuned within the ML learning. To encourage sparsity, relatively large values are preferred for the initial values of \(\varvec{\alpha }\), for example \(\alpha _j \approx 10\), as recommended in reference (Yasuda and Katsumata 2023). However, these gradients include the intractable model expectations, the computational costs of which exponentially grow with the size of the model (the model expectations can be computed when m is sufficiently small; see Appendix A for the details).
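The gradients in equations (16), (17), (18), (19), and (21) share the same structure: a data-side average minus a model expectation. The sketch below assembles them for a data matrix, taking the model expectations as inputs (whether exact or approximated). The function name, the dictionary keys, and the array shapes are assumptions of this example.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood_gradients(V, params, model_exp, R):
    """Gradients of the log likelihood for data V of shape (N, n).

    params: dict with keys "b", "sigma", "c", "W", "alpha".
    model_exp: dict with keys "v", "v2", "h", "abs_h", "vh" holding the
    model expectations of v_i, v_i^2, h_j, |h_j|, and v_i h_j.
    """
    b, sigma, c, W, alpha = (params[k] for k in ("b", "sigma", "c", "W", "alpha"))
    x = -1.0 + np.arange(2 * R + 1) / R
    tau = c + V @ W                                           # (N, m)
    logits = tau[:, :, None] * x - softplus(alpha)[None, :, None] * np.abs(x)
    logits -= logits.max(axis=2, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=2, keepdims=True)                         # P(h_j | v), (N, m, |X_H|)
    H = P @ x                                                 # H_j(v), equation (20)
    Q = P @ np.abs(x)                                         # Q_j(v), equation (22)
    s = softplus(sigma)
    return {
        "b": V.mean(axis=0) - model_exp["v"],                                   # (16)
        "sigma": sigmoid(sigma) / (2.0 * s ** 2)
                 * ((V ** 2).mean(axis=0) - model_exp["v2"]),                   # (17)
        "c": H.mean(axis=0) - model_exp["h"],                                   # (18)
        "W": V.T @ H / V.shape[0] - model_exp["vh"],                            # (19)
        "alpha": sigmoid(alpha) * (-Q.mean(axis=0) + model_exp["abs_h"]),       # (21)
    }
```

A gradient ascent step then adds a multiple of each returned array to the corresponding parameter; any optimizer that accepts these gradients can be substituted.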

In the following, an approximation of the model expectations based on the first-order SMCI method (Yasuda 2015; Yasuda and Uchizawa 2021) (which can be viewed as a Rao-Blackwellization) is considered; SMCI-based evaluation has outperformed the evaluation based on the standard Monte Carlo integration (MCI) in Bernoulli–Bernoulli RBMs (Sekimoto and Yasuda 2023) and deep Boltzmann machines (Katsumata and Yasuda 2021). We assume that we have K sample points, \(S:=\{ \textbf{v} ^{(\nu )}, \textbf{h} ^{(\nu )}\}_{\nu =1}^K\), drawn from the S-GDRBM. Here, the first-order SMCI method is briefly introduced. The visible and hidden variables are collectively denoted by \(\varvec{x} = \varvec{v} \cup \varvec{h}\), and the \(\nu\)th sample point is denoted by \(\textbf{x} ^{(\nu )} = \textbf{v} ^{(\nu )} \cup \textbf{h} ^{(\nu )}\). Based on the first-order SMCI method, the model expectation for a function of \(\varvec{x}_t \subseteq \varvec{x}\) is evaluated as

$$\begin{aligned} \mathbb {E}_{\phi }[f(\varvec{x}_t)]\approx \mathbb {E}_S\left[ \sum _{\varvec{x}_t}f(\varvec{x}_t) P_{\phi }(\varvec{x}_t \mid \varvec{x}_{\partial t})\right] =\frac{1}{K}\sum _{\nu =1}^K \sum _{\varvec{x}_t}f(\varvec{x}_t) P_{\phi }(\varvec{x}_t \mid \textbf{x} _{\partial t}^{(\nu )}), \end{aligned}$$
(23)

where \(\varvec{x}_{\partial t} \subseteq \varvec{x}\) is the nearest-neighbor variables of \(\varvec{x}_{t}\), for example, \(\varvec{x}_{\partial t} =\varvec{h}\) when \(\varvec{x}_t = \{v_i\}\) and \(\varvec{x}_{\partial t} =\varvec{x} \setminus \{v_i, h_j\}\) when \(\varvec{x}_t = \{v_i, h_j\}\). Here, \(\mathbb {E}_S[\cdots ]\) denotes the sample average over the sample set S. In equation (23), the sum over \(x_i \in \varvec{x}_{t}\) is replaced with the integration over \(x_i\) when \(x_i\) is continuous. Based on equation (23), the model expectations, \(\mathbb {E}_{\phi }[v_i]\) and \(\mathbb {E}_{\phi }[v_i^2]\), are approximated as

$$\begin{aligned} \mathbb {E}_{\phi }[v_i] \approx \frac{1}{K}\sum _{\nu =1}^K\int _{-\infty }^{+\infty } v_i P_{\phi }(v_i \mid \textbf{h} ^{(\nu )}) dv_i =\frac{1}{K}\sum _{\nu =1}^K\lambda _i( \textbf{h} ^{(\nu )}) \end{aligned}$$
(24)

and

$$\begin{aligned} \mathbb {E}_{\phi }[v_i^2] \approx \frac{1}{K}\sum _{\nu =1}^K\int _{-\infty }^{+\infty } v_i ^2P_{\phi }(v_i \mid \textbf{h} ^{(\nu )}) dv_i ={{\,\textrm{sfp}\,}}\sigma _i + \frac{1}{K}\sum _{\nu =1}^K\lambda _i( \textbf{h} ^{(\nu )})^2, \end{aligned}$$
(25)

respectively, where equation (6) is used. Similarly, using equation (7), the model expectations, \(\mathbb {E}_{\phi }[h_j]\) and \(\mathbb {E}_{\phi }[|h_j|]\), are approximated as

$$\begin{aligned} \mathbb {E}_{\phi }[h_j] \approx \frac{1}{K}\sum _{\nu =1}^K\sum _{h_j} h_j P_{\phi }(h_j \mid \textbf{v} ^{(\nu )}) =\frac{1}{K}\sum _{\nu =1}^KH_j( \textbf{v} ^{(\nu )}) \end{aligned}$$
(26)

and

$$\begin{aligned} \mathbb {E}_{\phi }\big [|h_j| \big ] \approx \frac{1}{K}\sum _{\nu =1}^K\sum _{h_j} |h_j| P_{\phi }(h_j \mid \textbf{v} ^{(\nu )}) =\frac{1}{K}\sum _{\nu =1}^K Q_j( \textbf{v} ^{(\nu )}), \end{aligned}$$
(27)

respectively. Finally, the approximation of \(\mathbb {E}_{\phi }[v_ih_j]\) is considered. Based on equation (23), it is approximated as

$$\begin{aligned} \mathbb {E}_{\phi }[v_ih_j] \approx \frac{1}{K}\sum _{\nu =1}^K\int _{-\infty }^{+\infty }\sum _{h_j} v_i h_j P_{\phi }(v_i ,h_j \mid \textbf{v} _{-i}^{(\nu )}, \textbf{h} _{-j}^{(\nu )}) dv_i, \end{aligned}$$
(28)

where \(\varvec{v}_{-i}:=\varvec{v} \setminus \{v_i\}\) and \(\varvec{h}_{-j}:=\varvec{h} \setminus \{h_j\}\). The conditional distribution on the right-hand side of equation (28) is

$$\begin{aligned}&P_{\phi }(v_i ,h_j \mid \varvec{v}_{-i}, \varvec{h}_{-j})\nonumber \\&\propto \exp \left( -\frac{v_i^2}{2 {{\,\textrm{sfp}\,}}\sigma _i} + b_{i,j}(\varvec{h}_{-j}) v_i + c_{j,i}(\varvec{v}_{-i}) h_j - ({{\,\textrm{sfp}\,}}\alpha _j) |h_j| + w_{i,j}v_i h_j\right) , \end{aligned}$$
(29)

where

$$\begin{aligned} b_{i,j}(\varvec{h}_{-j})&:= b_i + \sum _{\ell \in H \setminus \{j\}}w_{i,\ell }h_{\ell } =\frac{\lambda _i(\varvec{h})}{{{\,\textrm{sfp}\,}}\sigma _i} - w_{i,j}h_j, \nonumber \\ c_{j,i}(\varvec{v}_{-i})&:= c_j + \sum _{k \in V \setminus \{i\}}w_{k,j}v_k =\tau _j(\varvec{v}) - w_{i,j}v_i. \end{aligned}$$

From equations (28) and (29), we obtain

$$\begin{aligned} \mathbb {E}_{\phi }[v_ih_j] \approx \frac{{{\,\textrm{sfp}\,}}\sigma _i}{K}\sum _{\nu =1}^K\frac{\sum _{h_j}(w_{i,j}h_j + b_{i,j}( \textbf{h} _{-j}^{(\nu )}))h_j \exp (-e_j^{(\nu )}(h_j))}{\sum _{h_j} \exp (-e_j^{(\nu )}(h_j))}, \end{aligned}$$
(30)

where

$$\begin{aligned} e_j^{(\nu )}(h_j):=-\frac{{{\,\textrm{sfp}\,}}\sigma _i}{2}w_{i,j}^2 h_j^2 - \big (c_{j,i}( \textbf{v} _{-i}^{(\nu )}) + ({{\,\textrm{sfp}\,}}\sigma _i) b_{i,j}( \textbf{h} _{-j}^{(\nu )}) w_{i,j}\big ) h_j + ({{\,\textrm{sfp}\,}}\alpha _j) |h_j|. \end{aligned}$$

By substituting the model expectations, \(\mathbb {E}_{\phi }[\cdots ]\), in the gradients in equations (16), (17), (18), (19), and (21) with the corresponding approximations provided in equations (24), (25), (26), (27), and (30), respectively, the approximated gradients are obtained. The cost of the computation of the SMCI-based expectations is O(Knm); thus, they can be computed even when the size of the S-GDRBM is large. Although the aforementioned SMCI-based approximations are formulated for the S-GDRBM, they can be directly applied to the GDRBM by setting \({{\,\textrm{sfp}\,}}\alpha _j = 0\) (i.e., \(\alpha _j \rightarrow - \infty\)).
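The approximations in equations (24), (25), (26), (27), and (30) can be written in vectorized form. The following is a minimal NumPy sketch under stated assumptions: the samples are stored as arrays `Vs` of shape (K, n) and `Hs` of shape (K, m), \(\varvec{W}\) is stored as an \(n \times m\) array, and the function name `smci_expectations` is hypothetical. The intermediate arrays have \(K n m |\mathcal {X} _H|\) entries, which trades memory for clarity.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def smci_expectations(Vs, Hs, b, sigma, c, W, alpha, R):
    """First-order SMCI estimates of E[v_i], E[v_i^2], E[h_j], E[|h_j|], E[v_i h_j]."""
    x = -1.0 + np.arange(2 * R + 1) / R               # sample space X_H(R)
    s = softplus(sigma)                               # (n,)
    pen = softplus(alpha)                             # (m,)
    lam = s * (b + Hs @ W.T)                          # lambda_i(h^(nu)), (K, n)
    tau = c + Vs @ W                                  # tau_j(v^(nu)), (K, m)

    E_v = lam.mean(axis=0)                            # equation (24)
    E_v2 = s + (lam ** 2).mean(axis=0)                # equation (25)

    logits = tau[:, :, None] * x - pen[None, :, None] * np.abs(x)
    logits -= logits.max(axis=2, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=2, keepdims=True)                 # P(h_j | v^(nu))
    E_h = (P @ x).mean(axis=0)                        # equation (26)
    E_abs_h = (P @ np.abs(x)).mean(axis=0)            # equation (27)

    # Equation (30): pairwise conditional with v_i integrated out analytically.
    b_ij = lam[:, :, None] / s[None, :, None] - W[None] * Hs[:, None, :]   # (K, n, m)
    c_ji = tau[:, None, :] - W[None] * Vs[:, :, None]                      # (K, n, m)
    linear = c_ji + (s[:, None] * W)[None] * b_ij
    quad = 0.5 * s[:, None] * W ** 2                                       # (n, m)
    neg_e = (quad[None, :, :, None] * x ** 2
             + linear[..., None] * x
             - pen[None, None, :, None] * np.abs(x))                       # -e_j(h_j)
    neg_e -= neg_e.max(axis=3, keepdims=True)
    weight = np.exp(neg_e)
    weight /= weight.sum(axis=3, keepdims=True)
    inner = (W[None, :, :, None] * x + b_ij[..., None]) * x
    E_vh = (s[:, None] * (weight * inner).sum(axis=3)).mean(axis=0)        # (n, m)
    return E_v, E_v2, E_h, E_abs_h, E_vh
```

Two sanity checks follow from the formulas themselves: when \(\varvec{W} = 0\) the estimates are exact and \(\mathbb {E}_{\phi }[v_ih_j]\) factorizes into \(\mathbb {E}_{\phi }[v_i]\mathbb {E}_{\phi }[h_j]\), and when \(n = m = 1\) the estimate of \(\mathbb {E}_{\phi }[v_ih_j]\) does not depend on the samples because the conditioning set is empty.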

To demonstrate the validity of the SMCI-based evaluation, we numerically compared its approximation accuracy with that of the MCI-based evaluation on small-sized S-GDRBMs with \(n=m=10\). The parameter setup of the S-GDRBMs was as follows: the bias parameters, \(\varvec{b}\) and \(\varvec{c}\), and weight parameters, \(\varvec{W}\), were drawn from a uniform distribution in the interval \([-\beta , \beta ]\), \(\varvec{\alpha }\) were drawn from a uniform distribution in the interval \([-10, 10]\), and \(\{\sigma _i \}\) were fixed at \(\sigma _i = \ln (e - 1)\) (i.e., \({{\,\textrm{sfp}\,}}\sigma _i= 1\)). The sample set with \(K = 1000\) was generated using layer-wise blocked Gibbs sampling on the S-GDRBM. Figure 1 depicts the mean absolute errors (MAEs) between the exact model expectations and their approximations. Because the S-GDRBMs are small, the exact model expectations can be evaluated (see Appendix A). The SMCI-based evaluation outperforms the MCI-based evaluation in terms of MAE.

Fig. 1 MAEs between exact expectations and their approximations for various \(\beta\): (a) \(\mathbb {E}_{\phi }[v_i]\), (b) \(\mathbb {E}_{\phi }[v_i^2]\), (c) \(\mathbb {E}_{\phi }[h_j]\), (d) \(\mathbb {E}_{\phi }[|h_j|]\), and (e) \(\mathbb {E}_{\phi }[v_ih_j]\). The plots present the average values of 3000 experiments

4 Numerical experiment

Fig. 2 Log likelihoods and negative cross-entropies obtained based on the exact learning. The sizes of the learning models are \(n = 5\) and \(m = 3\)

Fig. 3 Log likelihoods and negative cross-entropies obtained based on the exact learning. The sizes of the learning models are \(n = 5\) and \(m = 7\)

Fig. 4 Log likelihoods and negative cross-entropies obtained based on the SMCI-based learning. The sizes of the learning models are \(n = 5\) and \(m = 3\)

Fig. 5 Log likelihoods and negative cross-entropies obtained based on the SMCI-based learning. The sizes of the learning models are \(n = 5\) and \(m = 7\)

In this section, we demonstrate the ML learning of the S-GDRBM and compare it with that of the GDRBM using artificial training datasets generated from the GBRBM (Yasuda and Xiong 2023).

First, we demonstrate numerical experiments on small-sized models. The size of the data-generative GBRBM, \(P_{ \textrm{gen} }(\varvec{v}, \varvec{h})\), was \(n = 5\) and \(m = 3\), in which the bias parameters, \(\varvec{b}\) and \(\varvec{c}\), and weight parameters, \(\varvec{W}\), were drawn from a Gaussian distribution with zero mean and variance 0.01, and \(\{\sigma _i \}\) were fixed at \(\sigma _i = \ln (e - 1)\). Using the data-generative GBRBM, artificial datasets of size N were generated based on layer-wise blocked Gibbs sampling. The use of artificial datasets is appropriate for our purpose because their complexities can be controlled and, moreover, the degree of generalization can be monitored (using the negative cross-entropy described below).

For the artificial datasets, ML learning was conducted using the GDRBM and S-GDRBM (with \(R = 1,2\)) in which the sizes of the visible layers were \(n = 5\) and the sizes of the hidden layers were \(m= 3\) or \(m=7\). The bias parameters were initialized to zero, the weight parameters were initialized using (Gaussian-type) Xavier’s initialization (Glorot and Bengio 2010), and \(\{\sigma _i\}\) were initialized to \(\sigma _i = -3\) for all \(i \in V\). In the S-GDRBM, the initial values of \(\varvec{\alpha }\) were set to \(\alpha _j = 10\) for all \(j \in H\). The adamax optimizer (Kingma and Ba 2015) with full-batch training was used in the gradient ascent. The log likelihood,

$$\begin{aligned} \frac{1}{N}\sum _{\mu =1}^N \ln P_{ \textrm{tr} }( \textbf{v} ^{(\mu )}), \end{aligned}$$

and the negative cross-entropy,

$$\begin{aligned} \int _{-\infty }^{+\infty }P_{ \textrm{gen} }(\varvec{v}) \ln P_{ \textrm{tr} }(\varvec{v}) d\varvec{v}, \end{aligned}$$

were used as measures to assess the quality of the learning process, where \(P_{ \textrm{tr} }(\varvec{v}) = P_{\theta }(\varvec{v})\) when the learning model is the GDRBM and \(P_{ \textrm{tr} }(\varvec{v}) = P_{\phi }(\varvec{v})\) when it is the S-GDRBM. The log likelihood represents the fitness to the training dataset, and the negative cross-entropy represents the degree of generalization. While learning proceeds without overfitting, both the log likelihood and the negative cross-entropy increase monotonically, whereas the negative cross-entropy decreases once overfitting begins to appear. The log likelihood and negative cross-entropy were computed exactly because the sizes of the data-generative and learning models were sufficiently small (see Appendix A). Both the exact learning and the SMCI-based learning presented in section 3.2 were conducted. In the SMCI-based learning, the sample points, S, required to evaluate the model expectations in equations (24), (25), (26), (27), and (30) were obtained by 10-step layer-wise blocked Gibbs sampling starting from the training data points (i.e., the sampling procedure used in \(\text {CD}_{10}\) (Hinton 2002)).
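The sampling procedure described above, k sweeps of layer-wise blocked Gibbs sampling started from the training data, can be sketched as follows. This is an illustration based on equations (6) and (7), not the code used in the experiments; the function name `blocked_gibbs`, the use of inverse-CDF sampling for the discrete hidden units, and the convention of returning the last sampled pair are assumptions of this example.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def blocked_gibbs(V0, b, sigma, c, W, alpha, R, k, rng):
    """Run k sweeps of h ~ P(h | v), v ~ P(v | h), starting from V0 of shape (N, n)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    x = -1.0 + np.arange(2 * R + 1) / R
    s = softplus(sigma)
    pen = softplus(alpha)
    V = V0.copy()
    for _ in range(k):
        # Hidden layer: all h_j are conditionally independent given v, equation (7).
        tau = c + V @ W
        logits = tau[:, :, None] * x - pen[None, :, None] * np.abs(x)
        logits -= logits.max(axis=2, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=2, keepdims=True)
        u = rng.random(P.shape[:2] + (1,))
        idx = (P.cumsum(axis=2) < u).sum(axis=2)      # inverse-CDF draw per unit
        idx = np.minimum(idx, len(x) - 1)             # guard against round-off
        H = x[idx]
        # Visible layer: independent Gaussians with mean lambda_i(h), equation (6).
        lam = s * (b + H @ W.T)
        V = lam + np.sqrt(s) * rng.standard_normal(V.shape)
    return V, H
```

For example, `blocked_gibbs(V_train, b, sigma, c, W, alpha, R, k=10, rng=np.random.default_rng(0))` would correspond to the \(\text {CD}_{10}\)-style chain, where `V_train` is a hypothetical array holding the training data points.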

Figures 2–5 depict the values of the log likelihoods and negative cross-entropies against the training epoch; the upper plots, labeled (a) and (b), in the figures represent the results obtained when \(N = 10\), while the lower plots, labeled (c) and (d), display the results for the case \(N = 100\). The plots in the figures present the average values obtained from 100 experiments. The results in figures 2 and 3 were obtained based on the exact learning and those in figures 4 and 5 were based on the SMCI-based learning. Overfitting is particularly observed in figures 3(b) and 5(b). We can observe that the S-GDRBMs successfully reduce overfitting. In contrast, in the experiments where overfitting is not a significant issue, the S-GDRBMs yield results similar to those of the GDRBMs.

Fig. 6 Negative cross-entropies obtained based on the SMCI-based learning: (a) \(m = 50\) and \(N = 1000\) (\(B=100\)) and (b) \(m = 100\) and \(N = 150\) (\(B=30\)). (c) and (d) are the enlarged plots of (a) and (b), respectively. The plots in these figures present the average values of 30 experiments

Next, we demonstrate numerical experiments on larger models in which the size of the data-generative GBRBM was \(n = m = 50\). The parameter setup of the data-generative GBRBM was as follows: the bias and weight parameters were drawn from Gaussian distributions with zero mean and variances 0.05 and 0.002, respectively, and \(\{\sigma _i \}\) were the same as in the aforementioned experiments. For the artificial datasets generated from the data-generative GBRBM, SMCI-based learning (with \(\text {CD}_{50}\)) was conducted using the GDRBM and S-GDRBM (with \(R = 1,2\)) in which the sizes of the visible layers were \(n = 50\) and the sizes of the hidden layers were \(m= 50\) or \(m=100\). The initialization of the learning parameters was the same as in the aforementioned experiments, and the adamax optimizer with mini-batch training was used, in which the mini-batch size was B. Figure 6 depicts the values of negative cross-entropies against the training epoch; (a) displays the results for the learning models with \(m = 50\) when \(N = 1000\) and \(B = 100\), and (b) displays the results for the learning models with \(m = 100\) when \(N = 150\) and \(B = 30\). The negative cross-entropy was evaluated using a sampling-based approximation (see footnote 1). More pronounced overfitting is observed in figure 6(b). The S-GDRBM successfully reduces overfitting; moreover, the peak negative cross-entropies of the S-GDRBMs are higher than those of the GDRBMs (similar behaviors can be observed in figures 2(b) and 3(b)).

Table 1 Values of \(\rho\): (a) the trained models obtained in the experiments in figure 6(a) and (b) the trained models obtained in the experiments in figure 6(b). The values of the table are the average values obtained from 30 experiments

On the trained models obtained in the experiments in figure 6, we evaluate \(\rho :=\sum _{j \in H}\mathbb {E}_{ \textrm{tr} }[|h_j|] / m\), where \(\mathbb {E}_{ \textrm{tr} }[\cdots ]\) denotes the expectation on the trained models. The quantity \(\rho \in [0,1]\) can be read as the statistical activation ratio of the hidden layer; \(\rho = 1\) when all hidden variables always take \(\pm 1\) and \(\rho = 0\) when all hidden variables always take zero. In the models with \(R = 2\), the hidden variables can take two kinds of activations, \(|h_j| = 1\) and \(|h_j| = 1/2\), and we regard the former as the strong activation and the latter as the weak activation. If the sparse regularization functions as expected, the values of \(\rho\) are suppressed. Table 1 presents the \(\rho\)-values on the trained models; here, \(\mathbb {E}_{ \textrm{tr} }[|h_j|]\) was computed based on the SMCI-based evaluation. The \(\rho\)-values of the S-GDRBMs are considerably lower than those of the GDRBMs, which means that the proposed regularization functions as intended. From (a) to (b) in table 1, the \(\rho\)-values of the GDRBMs increase and approach one; this implies that redundant hidden-variable activations cause overfitting. Conversely, the \(\rho\)-values of the S-GDRBMs decrease, which implies that the S-GDRBMs shrink the effect of redundant hidden-variable activations to suppress overfitting.
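Under the first-order SMCI approximation in equation (27), \(\rho\) reduces to an average of \(Q_j(\varvec{v})\) over the sample points and the hidden units. A minimal sketch is given below; the function name `activation_ratio` and the sample array `Vs` of shape (K, n) are assumptions of this example.

```python
import numpy as np


def softplus(z):
    # Numerically stable sfp(z) = ln(1 + e^z).
    return np.logaddexp(0.0, z)


def activation_ratio(Vs, c, W, alpha, R):
    """Estimate rho = (1/m) sum_j E[|h_j|] using equation (27)."""
    x = -1.0 + np.arange(2 * R + 1) / R
    tau = c + Vs @ W                                   # tau_j(v^(nu)), (K, m)
    logits = tau[:, :, None] * x - softplus(alpha)[None, :, None] * np.abs(x)
    logits -= logits.max(axis=2, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=2, keepdims=True)                  # P(h_j | v^(nu))
    Q = P @ np.abs(x)                                  # Q_j(v^(nu)), (K, m)
    return float(Q.mean())                             # average over samples and units
```

For the GDRBM, the same routine applies with \({{\,\textrm{sfp}\,}}\alpha _j = 0\), which can be emulated numerically by passing a large negative value for every \(\alpha _j\).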

The \(\rho\)-values of the S-GDRBMs with \(R= 1\) are lower than those with \(R=2\) in table 1, which implies that the effect of sparse regularization is stronger in the case \(R=1\). This can be understood intuitively as follows: when \(\varvec{\alpha }\) is the same in both S-GDRBMs, the penalties for hidden variables taking non-zero values tend to be larger for \(R=1\). In addition, compared with the S-GDRBM with \(R=1\), the S-GDRBM with \(R= 2\) decreases the negative cross-entropies more significantly in figures 3(b) and 6(d). However, we do not consider that these results immediately indicate that the S-GDRBM with \(R=1\) is superior to that with \(R = 2\). As mentioned in section 3.1, the representation power of the S-GDRBM can be increased by increasing the value of \(R\). It is possible that S-GDRBMs with \(R \ge 2\) are more suitable than the S-GDRBM with \(R=1\) for more complex training datasets.

The aforementioned experimental results confirm that the proposed sparse regularization functions as intended, i.e., the strength of regularization is adaptively tuned during training. This might seem counterintuitive because ML learning aims to achieve a good fit to the training data and does not inherently prioritize the suppression of overfitting. From the ML perspective, the strength of regularization should ideally decrease to zero (i.e., \(\alpha _j\) goes to \(- \infty\)) because the overfitted solution is the global optimum. This matter might be explained as follows. The model learns the coarse structure of the data distribution in the early stage of learning; both the log likelihood and the negative cross-entropy grow in this stage. After the early stage, the model starts to be finely tuned to the details of the data distribution to further increase the log likelihood. This fine tuning causes overfitting. The sparse regularization counteracts the fine tuning by shrinking the hidden-variable representations and tends to keep the model at a local optimum near the point reached in the early stage.

Figure 7 depicts the long-term learning version of the experiments in figures 3(a) and 3(b). The S-GDRBMs converge to better solutions in terms of the negative cross-entropy. However, the S-GDRBMs also exhibit a tendency to overfit, although to a lesser extent than the GDRBMs. If early stopping could be properly conducted, a learning solution with a high negative cross-entropy could be obtained. However, appropriate early stopping in terms of the negative cross-entropy is not practical because the true data-generative model is unknown. The log likelihood for a separate test dataset may be used as an alternative criterion for early stopping. However, the log likelihood involves the intractable normalization constant, a precise evaluation of which is expensive in large systems even if a sampling-based approximation is employed.
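For reference, the early-stopping rule discussed above can be sketched as a patience-based selection over a sequence of per-epoch scores. This is a generic illustration rather than a procedure used in this study: the function name `early_stopping_epoch` and the `patience` parameter are hypothetical, and the scores (e.g., approximate held-out log likelihoods, higher being better) are assumed to be given, which is precisely the expensive part noted above.

```python
def early_stopping_epoch(scores, patience=5):
    """Return the index of the best epoch under patience-based early stopping.

    scores: per-epoch validation scores, where higher is better.
    Training is considered stopped once `patience` consecutive epochs
    have passed without improving on the best score seen so far.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch

# The score rises, then degrades as overfitting sets in.
scores = [-3.0, -2.0, -1.5, -1.6, -1.7, -1.8, -1.0]
print(early_stopping_epoch(scores, patience=2))   # 2
print(early_stopping_epoch(scores, patience=10))  # 6
```

A small `patience` stops near the first peak, whereas a large one tolerates temporary degradation at the cost of more score evaluations.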

Fig. 7 Long-term learning version of the experiments in figures 3(a) and 3(b)

5 Conclusion and future studies

In this study, a sparse-regularized GDRBM, the S-GDRBM, is proposed by imposing a Laplace-like prior on the hidden layer. In the S-GDRBM, the strength of sparse regularization (in other words, the strength of the prior) is trainable, in contrast to that in conventional sparse regularizations. The results of our numerical experiments in section 4 show that the proposed regularization method functions as expected. The regularization becomes strong for training datasets in which overfitting is severe and weak for datasets in which overfitting is not severe, which implies that the strength of regularization is adaptively tuned during training.

As mentioned in section 3, there are several related works that aim to promote sparse representations of the hidden layer (Lee et al. 2007; Luo et al. 2011; Wan et al. 2015; Wei et al. 2019). Another relevant work is the infinite RBM (iRBM) (Côté and Larochelle 2016), which treats \(m\) as a random variable and tunes its distribution during training (note that the iRBM and its hybrid-type learning algorithm (Peng et al. 2018) have hyperparameters). In the iRBM, the effective size of \(m\) is optimized according to the complexity of the training dataset. The objective of the iRBM study is similar to that of the present study. The S-GDRBM and the related works (excluding the ES-RBM; see footnote 2) are not in direct competition, which suggests that the S-GDRBM can potentially be used in conjunction with them for further developments. The combination with the iRBM is particularly important and will be addressed in our future studies. Future studies will also explore applications of the S-GDRBM in various contexts, such as its use as a feature extractor (Yasuda and Xiong 2023) or as an input converter for classification systems (Kanno and Yasuda 2021).