Restricted Boltzmann Machine with Multivalued Hidden Variables

Generalization is one of the most important issues in machine learning problems. In this study, we consider generalization in restricted Boltzmann machines (RBMs). We propose an RBM with multivalued hidden variables, which is a simple extension of conventional RBMs. We demonstrate that the proposed model is better than the conventional model via numerical experiments for contrastive divergence learning with artificial data and a classification problem with MNIST.


Introduction
Generalization is one of the most important goals in statistical machine learning problems [1]. In various standard machine learning techniques, given a particular data set, we fit our probabilistic learning model to the empirical distribution (or the data distribution) of the data set. When our learning model is sufficiently flexible, it can fit the empirical distribution exactly via an appropriate learning method. A learning model that is too close to the empirical distribution frequently gives poor results for new data points. This situation is known as over-fitting. Over-fitting impedes generalization; therefore, techniques that can suppress over-fitting are needed to achieve good generalization. Regularizations, such as $L_1$ and $L_2$ regularizations or their combination (the elastic net) [18], are popular techniques used for this purpose.
Here, we focus on a restricted Boltzmann machine (RBM) [5,15]. RBMs have a wide range of applications such as collaborative filtering [14], classification [8], and deep learning [6,12,13]. The suppression of over-fitting is also important in RBMs. An RBM is a probabilistic neural network defined on a bipartite undirected graph comprising two different layers: a visible layer and a hidden layer. The visible layer, which consists of visible (random) variables, directly corresponds to the data points, while the hidden layer, which consists of hidden (random) variables, does not. The hidden layer creates complex correlations among the visible variables. The sample space of the visible variables is determined by the range of data elements, whereas the sample space of the hidden variables can be set freely. Typically, the hidden variables are given binary values ({0, 1} or {−1, +1}).
In this study, we propose an RBM with multivalued hidden variables. The proposed RBM is a very simple extension of the conventional RBM with binary hidden variables (referred to as the binary-RBM in this paper). However, we demonstrate that the proposed RBM is better than the binary-RBM in terms of suppressing over-fitting. The remainder of this paper is organized as follows. We define the proposed RBM in Sect. 2 and explain its maximum likelihood estimation in Sect. 2.1. In Sect. 2.2, we demonstrate the validity of the proposed RBM using numerical experiments for contrastive divergence (CD) learning [5] with artificial data. We give an insight into the effect of our extension (i.e., the effect of multivalued hidden variables) using a toy example in Sect. 2.3. In Sect. 3, we apply the proposed RBM to a classification problem and show that it is also effective in this type of problem. Finally, the conclusion is given in Sect. 4.

Restricted Boltzmann Machine with Multivalued Hidden Variables
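Let us consider a bipartite graph consisting of two different layers: the visible layer and the hidden layer, as shown in Fig. 1. Binary (or bipolar) visible variables, $v := \{v_i \in \{-1, +1\} \mid i \in V\}$, are assigned to the corresponding nodes in the visible layer. The corresponding hidden variables, $h := \{h_j \in X(s) \mid j \in H\}$, are assigned to the nodes in the hidden layer, where $X(s)$ is the sample space defined by

$$X(s) := \{(2k - s)/s \mid k = 0, 1, \ldots, s\}, \quad s \in \mathbb{N}, \tag{1}$$

where $\mathbb{N}$ is the set of all natural numbers. For example, $X(1) = \{-1, +1\}$, $X(2) = \{-1, 0, +1\}$, and $X(3) = \{-1, -1/3, +1/3, +1\}$. Namely, $X(s)$ is the set of $s + 1$ evenly spaced values that partition the interval $[-1, +1]$ into $s$ equal parts. We define that in the limit $s \to \infty$, $X(s)$ becomes the continuous interval $[-1, +1]$.

Fig. 1 Bipartite graph consisting of two layers: the visible layer and the hidden layer. V and H are the sets of indices of the nodes in the visible layer and hidden layer, respectively

The Review of Socionetwork Strategies (2019) 13:253-266

On the bipartite graph, we define the energy function for $s \in \mathbb{N}$ as

$$E_s(v, h; \theta) := -\sum_{i \in V} b_i v_i - \sum_{j \in H} c_j h_j - \sum_{i \in V} \sum_{j \in H} w_{i,j} v_i h_j, \tag{2}$$

where $\{b_i\}$, $\{c_j\}$, and $\{w_{i,j}\}$ are the learning parameters of the energy function, and they are collectively denoted by $\theta$. Specifically, $\{b_i\}$ and $\{c_j\}$ are the biases for the visible and hidden variables, respectively, and $\{w_{i,j}\}$ are the couplings between the visible and hidden variables. Our RBM is defined in the form of a Boltzmann distribution in terms of the energy function given in Eq. (2):

$$P_s(v, h \mid \theta) := \frac{\omega(s)}{Z_s(\theta)} \exp\big(-E_s(v, h; \theta)\big), \tag{3}$$

where

$$Z_s(\theta) := \omega(s) \sum_{v} \sum_{h} \exp\big(-E_s(v, h; \theta)\big) \tag{4}$$

is the partition function. The multiple summations in Eq. (4) mean

$$\sum_{v} = \prod_{i \in V} \sum_{v_i \in \{-1, +1\}}, \qquad \sum_{h} = \prod_{j \in H} \sum_{h_j \in X(s)}.$$

The factor $\omega(s) := \{2/(s + 1)\}^{|H|}$ appearing in Eqs. (3) and (4) is a constant unrelated to $v$ and $h$. Although it vanishes by reducing the fraction in Eq. (3), we leave it for the sake of the subsequent analysis. The factor lets the summation over $h_j$ be a Riemann sum, so that $2(s + 1)^{-1} \sum_{h_j \in X(s)}$ tends to $\int_{-1}^{+1} dh_j$, and prevents the divergence of the partition function in the limit $s \to \infty$. It is noteworthy that when $s = 1$, Eq. (3) is equivalent to the binary-RBM. The marginal distribution of the RBM is expressed as

$$P_s(v \mid \theta) = \sum_{h} P_s(v, h \mid \theta) = \frac{1}{Z_s(\theta)} \exp\Big( \sum_{i \in V} b_i v_i + \sum_{j \in H} \ln \phi_s\big(\lambda_j(v, \theta)\big) \Big), \tag{5}$$

where $\lambda_j(v, \theta) := c_j + \sum_{i \in V} w_{i,j} v_i$ and $\phi_s(x) := \sum_{h \in X(s)} 2(s + 1)^{-1} e^{xh}$. It is noteworthy that the factor $2(s + 1)^{-1}$ in the definition of $\phi_s(x)$ comes from $\omega(s)$. Using the formula of geometric series, we obtain

$$\phi_s(x) = \begin{cases} \dfrac{2 \sinh\{(s + 1)x/s\}}{(s + 1) \sinh(x/s)} & (s < \infty), \\[2ex] \dfrac{2 \sinh x}{x} & (s \to \infty). \end{cases} \tag{6}$$

The factor $2(s + 1)^{-1}$ ensures that $\lim_{s \to \infty} \phi_s(x) = \phi_\infty(x)$. The conditional distributions are

$$P_s(v \mid h, \theta) = \prod_{i \in V} \frac{\exp\big(\xi_i(h, \theta) v_i\big)}{2 \cosh \xi_i(h, \theta)}, \tag{8}$$

$$P_s(h \mid v, \theta) = \prod_{j \in H} \frac{\exp\big(\lambda_j(v, \theta) h_j\big)}{\sum_{h' \in X(s)} \exp\big(\lambda_j(v, \theta) h'\big)}, \tag{9}$$

where $\xi_i(h, \theta) := b_i + \sum_{j \in H} w_{i,j} h_j$. We can easily sample $v$ from a given $h$ using Eq. (8) and sample $h$ from a given $v$ using Eq. (9). Alternately repeating these two kinds of conditional samplings yields a (blocked) Gibbs sampling on the RBM. It is noteworthy that when $s \to \infty$, the conditional sampling of $h$ using Eq. (9) can be implemented using inverse transform sampling. Writing $\lambda_j$ for $\lambda_j(v, \theta)$, the cumulative distribution function of $P_\infty(h_j \mid v, \theta)$ is

$$F_j(x) := \int_{-1}^{x} P_\infty(h_j \mid v, \theta)\, dh_j = \frac{e^{\lambda_j x} - e^{-\lambda_j}}{2 \sinh \lambda_j},$$

and therefore, its inverse function is

$$F_j^{-1}(u) = \frac{1}{\lambda_j} \ln\big( e^{-\lambda_j} + 2u \sinh \lambda_j \big),$$

where $u$ is a sample point from the uniform distribution over [0, 1].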
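The sample space, the function $\phi_s$, and the conditional sampling of a hidden variable can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_space`, `phi_sum`, `phi_closed`, `sample_hidden`) are hypothetical, `s=None` is used here to stand for the limit $s \to \infty$, and the closed form and the inverse CDF are the ones obtained from the geometric series and from integrating $e^{\lambda h}$ over $[-1, +1]$.

```python
import math
import random


def sample_space(s):
    """X(s) = {(2k - s)/s | k = 0..s}: s + 1 evenly spaced values in [-1, +1]."""
    return [(2 * k - s) / s for k in range(s + 1)]


def phi_sum(x, s):
    """phi_s(x) by direct summation over X(s), including the 2/(s+1) weight."""
    return sum(2.0 / (s + 1) * math.exp(x * h) for h in sample_space(s))


def phi_closed(x, s=None):
    """Closed form of phi_s(x); s=None stands for the limit s -> infinity."""
    if abs(x) < 1e-12:
        return 2.0  # lim_{x -> 0} phi_s(x) = 2 for every s
    if s is None:
        return 2.0 * math.sinh(x) / x
    return 2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))


def sample_hidden(lam, s=None, rng=random):
    """Draw one h_j with probability (density) proportional to exp(lam * h_j)."""
    if s is None:
        # Inverse transform sampling on the continuous interval [-1, +1].
        u = rng.random()
        if abs(lam) < 1e-8:
            return 2.0 * u - 1.0  # the density is uniform when lam -> 0
        return math.log(math.exp(-lam) + 2.0 * u * math.sinh(lam)) / lam
    values = sample_space(s)
    weights = [math.exp(lam * h) for h in values]
    return rng.choices(values, weights=weights, k=1)[0]
```

For very large $|\lambda_j|$ the exponentials above can overflow; a production version would shift the exponents first. The sketch keeps the formulas in their textbook form so that they can be compared with the equations directly.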

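Given $N$ training data points for the visible layer, $D_V := \{v^{(\mu)} \in \{-1, +1\}^{|V|} \mid \mu = 1, 2, \ldots, N\}$, the learning of the RBM is done by maximizing the log-likelihood function (or the negative cross-entropy loss function), defined by

$$l_s(\theta) := \frac{1}{N} \sum_{\mu = 1}^{N} \ln P_s(v^{(\mu)} \mid \theta), \tag{10}$$

with respect to $\theta$ (namely, the maximum likelihood estimation). The distribution in the logarithmic function in Eq. (10) is the marginal distribution obtained in Eq. (5). The log-likelihood function is regarded as the negative training error. Usually, the log-likelihood function is maximized using a gradient ascent method. The gradients of the log-likelihood function with respect to the learning parameters are as follows:

$$\frac{\partial l_s(\theta)}{\partial b_i} = \frac{1}{N} \sum_{\mu = 1}^{N} v_i^{(\mu)} - \langle v_i \rangle_s, \tag{11}$$

$$\frac{\partial l_s(\theta)}{\partial c_j} = \frac{1}{N} \sum_{\mu = 1}^{N} \psi_s\big(\lambda_j(v^{(\mu)}, \theta)\big) - \langle h_j \rangle_s, \tag{12}$$

$$\frac{\partial l_s(\theta)}{\partial w_{i,j}} = \frac{1}{N} \sum_{\mu = 1}^{N} v_i^{(\mu)} \psi_s\big(\lambda_j(v^{(\mu)}, \theta)\big) - \langle v_i h_j \rangle_s, \tag{13}$$

where $\langle \cdots \rangle_s$ is the expectation of the RBM, i.e.,

$$\langle A(v, h) \rangle_s := \sum_{v} \sum_{h} A(v, h) P_s(v, h \mid \theta),$$

and

$$\psi_s(x) := \frac{\partial}{\partial x} \ln \phi_s(x) = \begin{cases} \dfrac{s + 1}{s} \coth\Big\{\dfrac{(s + 1)x}{s}\Big\} - \dfrac{1}{s} \coth\Big(\dfrac{x}{s}\Big) & (s < \infty), \\[2ex] \coth x - \dfrac{1}{x} & (s \to \infty). \end{cases} \tag{14}$$

The log-likelihood function can be maximized by a gradient ascent method with the gradients expressed in Eqs. (11)-(13). However, the evaluation of the expectations, $\langle \cdots \rangle_s$, included in the above gradients is computationally hard. The computation time of the evaluation grows exponentially as the number of variables increases. Therefore, in practice, an approximate approach is used, for example, CD [5], pseudo-likelihood [10], composite likelihood [16], Kullback-Leibler importance estimation procedure (KLIEP) [17], and Thouless-Anderson-Palmer (TAP) approximation [3].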
Among these, the CD method is the most popular. In the CD method, the intractable expectations in Eqs. (11)-(13) are approximated by sample averages, where each sampled point is generated by one step of Gibbs sampling using Eqs. (8) and (9), starting from each data point $v^{(\mu)}$.
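One CD step for an RBM with finite $s$ might look like the following sketch. It is an illustration under assumptions rather than the authors' code: all function names are hypothetical, the data term uses the conditional mean of $h_j$ given the data point (which equals $\partial \ln \phi_s / \partial x$ evaluated at $\lambda_j$), and the model term uses the sampled pair obtained after one Gibbs step $v^{(\mu)} \to h \to v \to h$. Other CD variants replace the final sampled $h$ by its conditional mean.

```python
import math
import random


def sample_space(s):
    """X(s): s + 1 evenly spaced values in [-1, +1]."""
    return [(2 * k - s) / s for k in range(s + 1)]


def hidden_weights(lam, s):
    """Unnormalized P_s(h_j | v) over X(s), shifted by the max exponent for stability."""
    values = sample_space(s)
    shift = max(lam * h for h in values)
    return values, [math.exp(lam * h - shift) for h in values]


def mean_hidden(lam, s):
    """E[h_j | v], i.e. the derivative of ln phi_s evaluated at lam."""
    values, weights = hidden_weights(lam, s)
    return sum(h * p for h, p in zip(values, weights)) / sum(weights)


def sample_hidden(lam, s, rng):
    values, weights = hidden_weights(lam, s)
    return rng.choices(values, weights=weights, k=1)[0]


def sample_visible(xi, rng):
    """v_i in {-1, +1} with P(v_i = +1 | h) = exp(xi) / (2 cosh(xi))."""
    p_plus = 0.5 * (1.0 + math.tanh(xi))
    return 1 if rng.random() < p_plus else -1


def cd1_gradients(data, b, c, w, s, rng):
    """CD-1 estimate of the gradients with respect to b, c, and w[i][j]."""
    nv, nh, n = len(b), len(c), len(data)

    def lam(v):
        return [c[j] + sum(w[i][j] * v[i] for i in range(nv)) for j in range(nh)]

    def xi(h):
        return [b[i] + sum(w[i][j] * h[j] for j in range(nh)) for i in range(nv)]

    gb = [0.0] * nv
    gc = [0.0] * nh
    gw = [[0.0] * nh for _ in range(nv)]
    for v0 in data:
        lam0 = lam(v0)
        m0 = [mean_hidden(x, s) for x in lam0]          # data term
        h0 = [sample_hidden(x, s, rng) for x in lam0]   # h ~ P(h | v0)
        v1 = [sample_visible(x, rng) for x in xi(h0)]   # v ~ P(v | h0)
        h1 = [sample_hidden(x, s, rng) for x in lam(v1)]
        for i in range(nv):
            gb[i] += (v0[i] - v1[i]) / n
            for j in range(nh):
                gw[i][j] += (v0[i] * m0[j] - v1[i] * h1[j]) / n
        for j in range(nh):
            gc[j] += (m0[j] - h1[j]) / n
    return gb, gc, gw
```

A gradient ascent update then adds a step size times these estimates to the parameters. For $s = 1$ the conditional mean reduces to $\tanh(\lambda_j)$, which recovers the familiar binary-RBM update.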

Numerical Experiment Using Artificial Data
In the numerical experiments in this section, we used two RBMs: the generative RBM (gRBM), $P^{\mathrm{gen}}_1$, and the training RBM (tRBM), $P^{\mathrm{train}}_s$. We obtained $N = 200$ artificial training data points, $D_V$, from the gRBM using Gibbs sampling, and subsequently, we trained the tRBM using the data points. The sizes of the visible layers of both RBMs were the same, namely, $|V| = 8$. The sizes of the hidden layers of the gRBM and tRBM were set to $|H| = 4$ and $|H| = 4 + R$, respectively. The sample space of the hidden variables in the gRBM was $X(1) = \{-1, +1\}$, implying that the gRBM is the binary-RBM. The parameters of the gRBM were randomly drawn: $b_i, c_j \sim G(0, 0.1^2)$ and

$$w_{i,j} \sim U\Big[ -\sqrt{\tfrac{6}{|V| + |H|}},\; +\sqrt{\tfrac{6}{|V| + |H|}} \Big] \tag{15}$$

(Xavier's initialization [4]), where $G(m, \sigma^2)$ is the Gaussian distribution with mean $m$ and variance $\sigma^2$, and $U[\min, \max]$ is the uniform distribution. We trained the tRBM using the CD method. In the training, the parameters of the tRBM were initialized by $b_i = c_j = 0$ and Eq. (15). In the gradient ascent method, we used full-batch learning with the Adam method [7]. The quality of learning was measured using the Kullback-Leibler divergence (KLD) between the gRBM and tRBM:

$$D_{\mathrm{KL}}\big(P^{\mathrm{gen}}_1 \,\|\, P^{\mathrm{train}}_s\big) := \sum_{v} P^{\mathrm{gen}}_1(v \mid \theta_{\mathrm{gen}}) \ln \frac{P^{\mathrm{gen}}_1(v \mid \theta_{\mathrm{gen}})}{P^{\mathrm{train}}_s(v \mid \theta_{\mathrm{train}})}. \tag{16}$$

The KLD is regarded as the (pseudo) distance between the gRBM and tRBM. Thus, it is a type of generalization error. We can evaluate the KLD (the generalization error) and the log-likelihood function in Eq. (10) (the negative training error) exactly because the sizes of the RBMs are not large.
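Because $|V| = 8$ gives only $2^8 = 256$ visible states, the marginal distribution and the KLD can be evaluated by brute-force enumeration. The sketch below shows one way to do that, assuming the marginal has the form $\exp(\sum_i b_i v_i + \sum_j \ln \phi_s(\lambda_j))$ up to normalization. The helper names (`ln_phi`, `marginal`, `kld`, `xavier_uniform`) are hypothetical, and `s=None` stands for $s \to \infty$.

```python
import itertools
import math


def ln_phi(x, s=None):
    """ln phi_s(x); s=None denotes the limit s -> infinity."""
    if abs(x) < 1e-8:
        return math.log(2.0)  # phi_s(0) = 2
    if s is None:
        return math.log(2.0 * math.sinh(x) / x)
    return math.log(2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s)))


def marginal(b, c, w, s=None):
    """Exact P_s(v) for all 2^|V| visible states (feasible only for small |V|)."""
    nv, nh = len(b), len(c)
    states = list(itertools.product((-1, 1), repeat=nv))
    log_p = []
    for v in states:
        value = sum(b[i] * v[i] for i in range(nv))
        for j in range(nh):
            lam = c[j] + sum(w[i][j] * v[i] for i in range(nv))
            value += ln_phi(lam, s)
        log_p.append(value)
    shift = max(log_p)  # log-sum-exp shift for numerical stability
    z = sum(math.exp(x - shift) for x in log_p)
    return {v: math.exp(x - shift) / z for v, x in zip(states, log_p)}


def kld(p, q):
    """Kullback-Leibler divergence between two distributions on the same states."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0.0)


def xavier_uniform(nv, nh, rng):
    """Couplings drawn uniformly from [-r, +r] with r = sqrt(6 / (nv + nh))."""
    r = math.sqrt(6.0 / (nv + nh))
    return [[rng.uniform(-r, r) for _ in range(nh)] for _ in range(nv)]
```

In an experiment of this kind, `marginal` would be called once for the fixed gRBM and once per evaluation point for the tRBM, and `kld` would be applied to the two results.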
Figure 2a, b show the KLDs against the number of parameter updates (i.e., the number of gradient ascent updates). We observe that all KLDs increase as learning proceeds owing to the effect of over-fitting. In Fig. 2a, because the gRBM and tRBM have the same structure (in other words, there is no model error), the effect of over-fitting is not severe. In contrast, in Fig. 2b, because the tRBM is more flexible than the gRBM, the effect of over-fitting tends to become severe. In fact, in Fig. 2b, the KLDs increase more rapidly as learning proceeds. The increase in the KLD for higher s is evidently slower. Figure 3a, b show the log-likelihood functions divided by |V| against the number of parameter updates. We observe that the log-likelihood function with lower s grows more rapidly. In other words, the training error in the tRBM with lower s decreases more rapidly. These results indicate that the multivalued hidden variables suppress over-fitting. In these experiments, the tRBM with s = ∞ is the best in terms of generalization.

Effect of Multivalued Hidden Variables
In the numerical experiments described in the previous section, we demonstrated that the multivalued hidden variables suppress over-fitting. In this section, we provide an insight into the effect of multivalued hidden variables using a toy example. Although the simple RBM considered below is significantly different from practical RBMs, it is expected to provide an insight into this effect. First, let us consider a simple RBM with two visible variables:

$$P_s(v_1, v_2, h \mid w) := \frac{\omega(s)}{Z_s(w)} \exp\Big( w (v_1 + v_2) \sum_{j \in H} h_j \Big). \tag{17}$$

The marginal distribution of Eq. (17) is

$$P_s(v_1, v_2 \mid w) = \frac{\phi_s\big(w (v_1 + v_2)\big)^{|H|}}{2 \phi_s(2w)^{|H|} + 2^{|H| + 1}}, \tag{18}$$

where we used $\lim_{x \to 0} \phi_s(x) = 2$ and $\phi_s(x) = \phi_s(-x)$. Because $v_1, v_2 \in \{-1, +1\}$, Eq. (18) can be expanded as

$$P_s(v_1, v_2 \mid w) = \frac{1 + \rho_s(w) v_1 v_2}{4}, \tag{19}$$

where

$$\rho_s(w) := \frac{\phi_s(2w)^{|H|} - 2^{|H|}}{\phi_s(2w)^{|H|} + 2^{|H|}}.$$

The empirical distribution of the data is $Q_D(v_1, v_2) := N^{-1} \sum_{\mu = 1}^{N} \delta(v_1, v_1^{(\mu)}) \delta(v_2, v_2^{(\mu)})$, where $\delta(x, y)$ is the Kronecker delta function. Similar to Eq. (19), the empirical distribution is expanded as

$$Q_D(v_1, v_2) = \frac{1 + d_1 v_1 + d_2 v_2 + \alpha v_1 v_2}{4}, \tag{20}$$

where $d_i := N^{-1} \sum_{\mu} v_i^{(\mu)}$ and $\alpha := N^{-1} \sum_{\mu} v_1^{(\mu)} v_2^{(\mu)}$. For simplicity, in the following discussion, we assume that $d_1 = d_2 = 0$ and $\alpha \geq 0$. Under this assumption, using the expanded forms in Eqs. (19) and (20), the log-likelihood function of the simple RBM is expressed by

$$l_s(w) = \frac{1 + \alpha}{2} \ln\big(1 + \rho_s(w)\big) + \frac{1 - \alpha}{2} \ln\big(1 - \rho_s(w)\big) - 2 \ln 2. \tag{21}$$

Ultimately, the aim of the maximum likelihood estimation is to find the value of $w$ that realizes $P_s(v_1, v_2 \mid w) = Q_D(v_1, v_2)$, or in other words, to find a value $w^*_s$ that satisfies $\rho_s(w^*_s) = \alpha$. The log-likelihood function in Eq. (21) is globally maximized at $w = w^*_s$, and the RBM with $w^*_s$ over-fits the data distribution. It can be shown that the function $\rho_s(w)$ has the following three properties: (i) it is symmetric with respect to $w$, (ii) it monotonically increases with an increase in $w \geq 0$, and (iii) it monotonically decreases with an increase in $s$ when $|w| \neq 0$. The function $\rho_s(w)$ with $|H| = 2$ is shown in Fig. 4a as an example. These three properties lead to the inequality $|w^*_s| < |w^*_{s+1}|$ for a certain $\alpha > 0$, which implies that the global maximum point of the log-likelihood function in Eq. (21) moves away from the origin, $w = 0$, as $s$ increases (see Fig. 4b).
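The location of the global maximum can be checked numerically. The sketch below solves $\rho_s(w) = \alpha$ by bisection, assuming $\rho_s(w) = (\phi_s(2w)^{|H|} - 2^{|H|}) / (\phi_s(2w)^{|H|} + 2^{|H|})$ and the closed form of $\phi_s$. The names `phi`, `rho`, and `solve_w_star` are hypothetical, and `s=None` stands for $s \to \infty$. With $|H| = 2$ and $\alpha = 0.6$, it can be compared against the values of $|w^*_s|$ quoted in the caption of Fig. 4.

```python
import math


def phi(x, s=None):
    """Closed form of phi_s(x); s=None denotes the limit s -> infinity."""
    if abs(x) < 1e-12:
        return 2.0
    if s is None:
        return 2.0 * math.sinh(x) / x
    return 2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s))


def rho(w, s=None, n_hidden=2):
    """Correlation <v1 v2> of the two-visible-variable toy RBM with coupling w."""
    a = phi(2.0 * w, s) ** n_hidden
    return (a - 2.0 ** n_hidden) / (a + 2.0 ** n_hidden)


def solve_w_star(alpha, s=None, n_hidden=2, upper=20.0, tol=1e-10):
    """Non-negative root of rho_s(w) = alpha, found by bisection.

    Relies on rho_s being monotonically increasing for w >= 0 and assumes
    that rho_s(upper) >= alpha.
    """
    lower = 0.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if rho(mid, s, n_hidden) < alpha:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


if __name__ == "__main__":
    for s in (1, 2, 4, None):
        print(s, round(solve_w_star(0.6, s), 4))
```

Bisection is used here only because property (ii) guarantees a unique non-negative root; any one-dimensional root finder would do.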
Usually, the initial value of w is set to a value around the origin. As shown in Fig. 4b, the global maximum point moves closer to the origin and the peak becomes sharper (in other words, the global maximum point becomes a stronger attractor) as s decreases. This implies that with a gradient ascent type of algorithm, the RBM with a lower s can reach the global maximum point more rapidly and thus over-fits during an early stage of the learning. In contrast, the RBM with a higher s converges to the global maximum point more slowly, which prevents over-fitting during an early stage of the learning. In fact, the increases in the generalization error (the KLD) and negative training error (the log-likelihood function) become faster as s decreases in the numerical results obtained in the previous section (cf. Figs. 2, 3).
From the above analysis, we found that the global maximum point moves away from the origin and becomes a weaker attractor as s increases. This could lead to some expectations, for example: (i) in a more practical RBM, its log-likelihood function usually has several local maximum points, and thus, the RBM with a higher s is more easily trapped by one of the local maximum points before converging to the global maximum point (namely, the over-fitting point) and (ii) some regularization methods, such as early stopping or $L_2$ regularization, are more effective in the RBM with a higher s.

Application to Classification Problem
Let us consider a classification (or pattern recognition) problem in which an n-dimensional input vector $x = (x_1, x_2, \ldots, x_n)^{\mathrm{T}}$ is classified into one of K different classes, $C_1, C_2, \ldots, C_K$. It is convenient to use a 1-of-K representation (or a 1-of-K coding) to identify each class [1]. In the 1-of-K representation, each class corresponds to the K-dimensional vector $t = (t_1, t_2, \ldots, t_K)^{\mathrm{T}}$, where $t_k \in \{0, 1\}$ and $\sum_{k=1}^{K} t_k = 1$, i.e., t is a vector in which the value of only one element is one and the remaining elements are zero. When $t_k = 1$, t indicates class $C_k$. For simplicity of notation, we denote the 1-of-K vector whose kth element is one by $1_k$. In the following section, we consider the application of the proposed RBM to the classification problem.

Discriminative Restricted Boltzmann Machine
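A discriminative restricted Boltzmann machine (DRBM) was proposed to solve the classification problem [8,9]; it is a conditional distribution of the output 1-of-K vector t conditioned on a continuous input vector x. The conventional DRBM can be obtained by a simple extension to the binary-RBM, through the following process. The visible variables in the RBM are divided into two layers, the input and output layers. The K visible variables assigned to the output layer, t, are redefined as the 1-of-K vector with $1_k$ as its realization (i.e., $t \in \{1_k \mid k = 1, 2, \ldots, K\}$), and the n visible variables assigned to the input layer, x, are redefined as the continuous input vector (see Fig. 5). Subsequently, we make a conditional distribution conditioned on the variables in the input layer: $P(t, h \mid x)$. Finally, by marginalizing the hidden variables out, we obtain the DRBM: $P(t \mid x) = \sum_{h} P(t, h \mid x)$.

Fig. 5 The discriminative restricted Boltzmann machine is obtained as an extension of the RBM. Because the output layer corresponds to the 1-of-K vector, it takes only K different states. For distinction, the couplings between the input and hidden layers are represented by $w^{(1)}$ and those between the hidden and output layers are represented by $w^{(2)}$

By using the proposed RBM instead of the binary-RBM, we obtain an extension to the conventional DRBM, i.e., we obtain a DRBM with multivalued hidden variables. The proposed DRBM for $s \in \mathbb{N}$ is obtained by

$$P_s(t \mid x, \theta) := \frac{1}{Z_s(x, \theta)} \exp\Big( \sum_{k=1}^{K} b_k t_k + \sum_{j \in H} \ln \phi_s\big(\zeta_j(t, x, \theta)\big) \Big), \tag{22}$$

where $\zeta_j(t, x, \theta) := c_j + \sum_{k=1}^{K} w^{(2)}_{j,k} t_k + \sum_{i=1}^{n} w^{(1)}_{i,j} x_i$ and

$$Z_s(x, \theta) := \sum_{k=1}^{K} \exp\Big( b_k + \sum_{j \in H} \ln \phi_s\big(\zeta_j(1_k, x, \theta)\big) \Big). \tag{23}$$

The function $\phi_s(x)$ appearing in Eqs. (22) and (23) is already defined in Eq. (6). It is noteworthy that when $s = 1$, Eq. (22) is equivalent to the conventional DRBM proposed in Ref. [8]. Eq. (22) is regarded as the class probability, indicating that $P_s(t = 1_k \mid x, \theta)$ is the probability of the input x belonging to class $C_k$. The input x should be assigned to the class that gives the maximum class probability. Given N supervised training data points, $D := \{(t^{(\mu)}, x^{(\mu)}) \mid \mu = 1, 2, \ldots, N\}$, the log-likelihood function of the proposed DRBM in Eq. (22) is defined as

$$l^{\dagger}_s(\theta) := \frac{1}{N} \sum_{\mu = 1}^{N} \ln P_s(t^{(\mu)} \mid x^{(\mu)}, \theta). \tag{24}$$

The gradients of the log-likelihood function with respect to the parameters are obtained as follows:

$$\frac{\partial l^{\dagger}_s(\theta)}{\partial w^{(1)}_{i,j}} = \frac{1}{N} \sum_{\mu = 1}^{N} x_i^{(\mu)} \Big( \psi_s\big(\zeta_j(t^{(\mu)}, x^{(\mu)}, \theta)\big) - \big\langle \psi_s\big(\zeta_j(t, x^{(\mu)}, \theta)\big) \big\rangle^{(\mu, s)}_t \Big), \tag{25}$$

$$\frac{\partial l^{\dagger}_s(\theta)}{\partial w^{(2)}_{j,k}} = \frac{1}{N} \sum_{\mu = 1}^{N} \Big( t_k^{(\mu)} \psi_s\big(\zeta_j(t^{(\mu)}, x^{(\mu)}, \theta)\big) - \big\langle t_k \psi_s\big(\zeta_j(t, x^{(\mu)}, \theta)\big) \big\rangle^{(\mu, s)}_t \Big), \tag{26}$$

$$\frac{\partial l^{\dagger}_s(\theta)}{\partial b_k} = \frac{1}{N} \sum_{\mu = 1}^{N} \Big( t_k^{(\mu)} - \langle t_k \rangle^{(\mu, s)}_t \Big), \tag{27}$$

$$\frac{\partial l^{\dagger}_s(\theta)}{\partial c_j} = \frac{1}{N} \sum_{\mu = 1}^{N} \Big( \psi_s\big(\zeta_j(t^{(\mu)}, x^{(\mu)}, \theta)\big) - \big\langle \psi_s\big(\zeta_j(t, x^{(\mu)}, \theta)\big) \big\rangle^{(\mu, s)}_t \Big), \tag{28}$$

where $\langle \cdots \rangle^{(\mu, s)}_t$ denotes the expectation defined by

$$\langle A(t) \rangle^{(\mu, s)}_t := \sum_{k=1}^{K} A(1_k) P_s(t = 1_k \mid x^{(\mu)}, \theta).$$

The function $\psi_s(x)$ appearing in the above gradients is already defined in Eq. (14). It is noteworthy that the gradients expressed in Eqs. (25)-(28) are computed without an approximation, unlike those in the RBM, owing to the special structure of the DRBM: the expectation runs over only K states. In the training, we maximize $l^{\dagger}_s(\theta)$ with respect to $\theta$ using a gradient ascent method with Eqs. (25)-(28).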
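The class probability of the DRBM is a softmax over K scores, so it can be evaluated exactly. The following sketch computes it with a log-sum-exp shift. It is an illustration under assumptions: the function names are hypothetical, the parameters are stored as nested lists `w1[i][j]` (input to hidden) and `w2[j][k]` (hidden to output), and `s=None` stands for $s \to \infty$.

```python
import math


def ln_phi(x, s=None):
    """ln phi_s(x); s=None denotes the limit s -> infinity."""
    if abs(x) < 1e-8:
        return math.log(2.0)  # phi_s(0) = 2
    if s is None:
        return math.log(2.0 * math.sinh(x) / x)
    return math.log(2.0 * math.sinh((s + 1) * x / s) / ((s + 1) * math.sinh(x / s)))


def class_probabilities(x, b, c, w1, w2, s=None):
    """P_s(t = 1_k | x) for every class k, following the form of Eqs. (22)-(23)."""
    n_hidden, n_class = len(c), len(b)
    # Part of the hidden-unit input that does not depend on the class.
    base = [c[j] + sum(w1[i][j] * x[i] for i in range(len(x))) for j in range(n_hidden)]
    # Setting t = 1_k adds w2[j][k] to the input of hidden unit j.
    scores = [
        b[k] + sum(ln_phi(base[j] + w2[j][k], s) for j in range(n_hidden))
        for k in range(n_class)
    ]
    shift = max(scores)  # log-sum-exp shift for numerical stability
    z = sum(math.exp(a - shift) for a in scores)
    return [math.exp(a - shift) / z for a in scores]


def predict(x, b, c, w1, w2, s=None):
    """Index of the class with the maximum class probability."""
    probs = class_probabilities(x, b, c, w1, w2, s)
    return probs.index(max(probs))
```

The cost of one evaluation is proportional to $K |H|$ plus $n |H|$, which is why the gradients of the DRBM need no sampling. For very large arguments `math.sinh` overflows, so a production version would work with an asymptotic form of $\ln \phi_s$ instead.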

Numerical Experiment Using MNIST Data Set
In this section, we show the results of the numerical experiment using MNIST. MNIST is a data set of 10 different handwritten digits, 0, 1, …, and 9, and is composed of 60,000 training data points and 10,000 test data points. Each data point includes the input data, a 28 × 28 digit (8-bit) image, and the corresponding target digit label. Therefore, for the data set, we set n = 784 and K = 10. All input images were normalized by dividing by 255 during preprocessing.
We trained the proposed DRBM with |H| = 200 using N = 1000 training data points in MNIST and tested it using 10,000 test data points. In the training, we used stochastic gradient ascent (SGA) with a mini-batch size of B = 100 and the AdaMax optimizer [7]. All coupling parameters were initialized by the Xavier method [4], and all bias parameters were initialized to zero. Figure 6 shows the misclassification rates for (a) the training data set and (b) the test data set versus the number of parameter updates. All input images in the test data set were corrupted by Gaussian noise with σ = 120 before the normalization. We observe that the DRBM with s = ∞ is better in terms of generalization because it shows a higher training error and a lower test error. This indicates that the multivalued hidden variables are also effective in the DRBM.

Conclusion
In this paper, we proposed an RBM with multivalued hidden variables, which is a simple extension to the conventional binary-RBM, and showed that the proposed RBM is better than the binary-RBM in terms of the generalization property via numerical experiments conducted on CD learning with artificial data (in Sect. 2.2) and a classification problem with MNIST (in Sect. 3.2).
It is important to understand the reason why the multivalued hidden variables are effective against over-fitting. We provided a basic insight into it by analyzing a simple example in Sect. 2.3. However, practical RBMs are much more complex than the simple example used in this study. Therefore, we need to perform further analysis to clarify this reason. We think that a mean-field analysis [11] can be used for this purpose. Moreover, a criterion for over-fitting was provided in Ref. [2]. The relationship between this criterion and our multivalued hidden variables is also interesting. These issues will be addressed in our future studies.

Fig. 2 KLDs against the number of parameter updates (epochs) when (a) R = 0 and (b) R = 5. We used the tRBM with s = 1, 2, 4, ∞. These plots show the average over 300 experiments

Fig. 4 (a) Plot of $\rho_s(w)$ versus w for various s when |H| = 2. For α = 0.6, the values of $|w^*_s|$ for s = 1, 2, 4, and ∞ are approximately 0.6585, 0.7834, 0.8941, and 1.0887, respectively. (b) Plot of the log-likelihood function versus w for various s when |H| = 2 and α = 0.6. The shape of the peak around the global maximum point becomes sharper as s decreases

Fig. 6 Misclassification errors against the number of parameter updates (epochs): (a) training error and (b) test error. Here, one epoch consists of one full update cycle over the training data set, implying that one epoch involves N/B = 10 updates by the SGA in this case. We used the DRBM with s = 1, ∞. These plots show the average over 120 experiments