# On better training the infinite restricted Boltzmann machines


## Abstract

The infinite restricted Boltzmann machine (iRBM) is an extension of the classic RBM that automatically decides the size of its hidden layer according to the training data. With sufficient training, the iRBM can achieve performance competitive with that of the classic RBM. However, learning in the iRBM converges slowly because the model is sensitive to the ordering of its hidden units: the learned filters change slowly from the left-most hidden unit to the right. To break this dependency between neighboring hidden units and speed up the convergence of training, a novel training strategy is proposed. The key idea is to randomly regroup the hidden units before each gradient descent step. Potentially, this learning method achieves a mixture of infinitely many iRBMs with different permutations of the hidden units, which, like dropout, helps prevent the model from over-fitting. The original iRBM is also modified so that it can be trained discriminatively. To evaluate the impact of our method on the convergence speed of learning and on the model’s generalization ability, several experiments have been performed on the binarized MNIST and CalTech101 Silhouettes datasets. Experimental results indicate that the proposed training strategy can greatly accelerate learning and enhance the generalization ability of iRBMs.

## Keywords

Infinite restricted Boltzmann machines · Model averaging · Regularization · Discriminative and generative training objective

## 1 Introduction

Boltzmann machines are stochastic neural networks consisting of symmetrically coupled binary stochastic units (Ackley et al. 1985). They were proposed to find statistical regularities in data. One of the most popular subsets of Boltzmann machines is the restricted Boltzmann machine (RBM) (Smolensky 1986). The RBM and its various extensions have enjoyed much popularity for pattern analysis and generation. The generality and flexibility of RBMs enable them to be used in a wide range of applications, e.g., image and audio classification (Hinton and Salakhutdinov 2006; Mohamed and Hinton 2010), generation (Mohamed et al. 2011), collaborative filtering (Salakhutdinov et al. 2007; Tran et al. 2014), and motion modeling (Taylor et al. 2007). More specifically, an RBM is a bipartite graphical model that uses a layer of hidden binary variables, or units, to model the probability distribution of a layer of visible variables. With enough hidden units, an RBM is able to represent a binary probability distribution that fits the training data as well as possible.

In general, adding more hidden units can improve the representational power of the model (Fischer and Igel 2010). However, as the number of hidden units increases, many learned features become strongly correlated (Srivastava et al. 2014; Tomczak and Gonczarek 2015), which increases the risk of over-fitting. Choosing a proper number of hidden units requires a model selection procedure, which is time-consuming. To deal with this issue, Welling et al. (2002) propose a boosting algorithm in the feature space of the RBM, in which a new feature is added and greedily learned at each iteration. Nair and Hinton (2010) conceptually tie the weights of an infinite number of binary hidden units, and connect these sigmoid units with noisy rectified linear units (ReLUs) for better feature learning. More recently, Côté and Larochelle (2016) have proposed a non-parametric model called the iRBM. By letting the effective number of hidden units participating in the energy function change freely during training, the iRBM can automatically adjust the effective number of hidden units according to the data. The implicitly infinite capacity of the iRBM also makes it capable of lifelong learning. Despite that, the iRBM has a major drawback: learning converges slowly, which offsets its advantages to a certain degree. The reason for this drawback is that the hidden units are correlated with each other given the visible variables. The learned filters, or feature detectors, change slowly from the left-most hidden unit to the right, which is called the “ordering effect” in Côté and Larochelle (2016).

The fixed order of hidden units is the reason for the strong dependency between filters learned by neighboring hidden units. A newly added hidden unit always begins to learn features jointly with the previous hidden units. In fact, the hidden units do not have to be constrained to a fixed order; all possible permutations can be considered and evaluated. To achieve this, a random permutation of the hidden units is sampled from a certain distribution before each gradient descent step. Thus, the neighbors of each hidden unit change continuously as training progresses, which encourages each hidden unit to learn features on its own. By doing this, a different iRBM is trained at each gradient descent step. As there are infinitely many hidden units in the iRBM, a mixture of infinitely many iRBMs with different permutations of hidden units but shared parameters can theoretically be achieved. From this point of view, the proposed training strategy provides an effective way of preventing the model from over-fitting, as averaging different models nearly always improves generalization performance (Dietterich 2000). A similar effect can be achieved by dropout (Srivastava et al. 2014), which randomly drops units of the neural network during training, is equivalent to combining exponentially many different sub-networks, and also serves as regularization through adaptive weight decay and sparse representation (Baldi and Sadowski 2014). In fact, a more general iRBM is defined by treating the permutation of hidden units as a model parameter. Besides the new training strategy, another contribution of our work is extending the iRBM so that it can perform supervised learning, which allows us to compare the generalization performance of our training strategy with that of other training methods and models on discrimination tasks.

*Related work* Ping and Liu (2016) have proposed an alternative definition of the infinite RBM, together with a Frank–Wolfe algorithm to learn it. We name their model FW-iRBM to avoid confusion with the model studied in this paper. The definition of the FW-iRBM is motivated by the marginal log-likelihood of classic RBMs. The weight matrix **W** is treated as *N* samples of a weight vector \({\varvec{\upomega }}\). The marginal log-likelihood of the RBM is thus a sampling approximation to the marginal log-likelihood of the so-called “fractional RBM”. Adding a hidden unit is equivalent to drawing a new sample of \({\varvec{\upomega }}\) from the fractional RBM. The training objective of the fractional RBM tries to learn the distribution of \({\varvec{\upomega }}\), \(q\left( {\varvec{\upomega }}\right) \). They propose a greedy training procedure which adds a hidden unit at each iteration and updates the weights of the newly added hidden unit. The advantage of the FW-iRBM is that it is a more general model that can be extended to an RBM with an uncountable number of hidden units, as long as \(q\left( {\varvec{\upomega }}\right) \) is a continuous distribution. However, even though the model’s marginal log-likelihood is invariant to the order of hidden units, the training procedure of the FW-iRBM indicates that the order of hidden units still has an effect on the final performance. The reason is that at each step only the parameters of the newly added hidden unit are updated, conditioned on all the previously added hidden units. This greedy optimization algorithm is more likely to result in a sub-optimal solution, which is why the performance does not improve monotonically as the model size gets larger. The iRBM does not encounter this problem, as it simultaneously updates all the non-zero parameters and automatically decides whether to add a new hidden unit.

The remainder of this paper is organized as follows. Sect. 2 gives a brief review of the iRBM model, after which the discriminative iRBM is introduced and the cause of the ordering effect is briefly analyzed. In Sect. 3, the proposed training strategy is formally presented, and a condition under which the model is invariant to the order of hidden units is proposed, with a proof in the appendix. In Sect. 4, several experiments are performed to empirically evaluate our training strategy. Finally, we conclude our work in Sect. 5.

## 2 Infinite restricted Boltzmann machines

In this section, the original iRBM is first introduced briefly. After that, some modifications are made to the energy function of the iRBM, which lead to the discriminative iRBM. Finally, we briefly analyze the cause of the ordering effect in iRBMs.

### 2.1 iRBM

The iRBM (Côté and Larochelle 2016) was proposed to address the difficulty of choosing a proper size for the hidden layer of the RBM; it can effectively adapt its capacity as training progresses.

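The energy function of the iRBM is defined as:

\[E\left( {\mathbf{v},\mathbf{h},z} \right) =-\mathbf{v}^{\mathrm{T}}\mathbf{b}^{v}-\sum _{i=1}^z {\left( {h_i \left( {\mathbf{W}_{i\cdot }\mathbf{v}+b_i^h } \right) -\beta _i } \right) }, \qquad (1)\]

where \(\mathbf{v}\in \left\{ {0,1} \right\} ^{D}\) is the *D*-dimensional visible vector representing the observable data. \(h_i \in \left\{ {0,1} \right\} \) is the *i*th element of the infinite-dimensional hidden vector \(\mathbf{h}\). The random variable \(z\in \mathbb {N}\) can be understood as the total number of hidden units being selected to participate in the energy function. \(\beta _i \) is the penalty for each selected hidden unit \(h_i \). \(\mathbf{W}_{i\cdot }\) is the *i*th row of the weight matrix \(\mathbf{W}\) connecting the visible units and the hidden units. \(\mathbf{b}^{v}\) is the visible units bias vector. \(b_i^h\) is the *i*th hidden unit bias.

The joint probability distribution over \(\mathbf{v}\), \(\mathbf{h}\) and *z* is

\[p\left( {\mathbf{v},\mathbf{h},z} \right) =\frac{1}{Z}e^{-E\left( {\mathbf{v},\mathbf{h},z} \right) },\quad Z=\sum _{{z}'=1}^\infty {\sum _{\mathbf{{v}'}} {\sum _{\mathbf{{h}'}\in H_{{z}'} } {e^{-E\left( {\mathbf{{v}'},\mathbf{{h}'},{z}'} \right) }} } },\quad H_z =\left\{ {\mathbf{h}\in H\,|\,h_k =0\;\forall k>z} \right\} ,\]

where *H* is the set of all possible values \(\mathbf{h}\) takes. Thus \(H_z\) defines the legal values of \(\mathbf{h}\) given *z*. The graphical model of the iRBM is shown in Fig. 1.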

It should be noted that, for a given *z*, the value of the energy function does not depend on the dimensions of \(\mathbf{h}\) from \(z+1\) to \(\infty \), which means that \(h_i \) with \(i>z\) will never be activated. Thus, (1) has the same form as the energy function of the classic RBM with *z* hidden units, except for the penalty \(\beta _i \), which the latter does not have.
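As an illustration of this property, the following is a minimal sketch, assuming the energy of Côté and Larochelle (2016), \(E\left( {\mathbf{v},\mathbf{h},z} \right) =-\mathbf{v}^{\mathrm{T}}\mathbf{b}^{v}-\sum _{i=1}^z {\left( {h_i \left( {\mathbf{W}_{i\cdot }\mathbf{v}+b_i^h } \right) -\beta _i } \right) }\). The function name `irbm_energy` and the toy parameter values are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def irbm_energy(v: Sequence[int], h: Sequence[int], z: int,
                W: Sequence[Sequence[float]], b_v: Sequence[float],
                b_h: Sequence[float], beta: Sequence[float]) -> float:
    """Energy of (v, h, z); only the first z hidden units contribute."""
    energy = -sum(vj * bj for vj, bj in zip(v, b_v))
    for i in range(z):  # hidden units with index >= z are ignored
        pre_activation = sum(w * vj for w, vj in zip(W[i], v)) + b_h[i]
        energy -= h[i] * pre_activation - beta[i]
    return energy


# Toy model (assumed values): D = 2 visible units, 3 allocated hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]]
b_v = [0.1, -0.1]
b_h = [0.0, 0.2, 0.0]
beta = [0.7, 0.7, 0.7]
v = [1, 0]

# With z = 2, switching the third hidden unit on or off changes nothing.
e_off = irbm_energy(v, [1, 1, 0], 2, W, b_v, b_h, beta)
e_on = irbm_energy(v, [1, 1, 1], 2, W, b_v, b_h, beta)
```

Both calls return the same value because the loop stops at *z*, which mirrors the statement that \(h_i\) with \(i>z\) plays no role in the energy.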

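Marginalizing out \(\mathbf{h}\) and *z*, the marginal distribution \(p\left( \mathbf{v} \right) \) is derived as follows:

\[p\left( \mathbf{v} \right) =\sum _{z=1}^\infty {\sum _{\mathbf{h}\in H_z } {p\left( {\mathbf{v},\mathbf{h},z} \right) } } =\frac{1}{Z}\sum _{z=1}^\infty {e^{-F\left( {\mathbf{v},z} \right) }},\quad F\left( {\mathbf{v},z} \right) =-\mathbf{v}^{\mathrm{T}}\mathbf{b}^{v}-\sum _{i=1}^z {\left( {\ln \left( {1+e^{\mathbf{W}_{i\cdot }\mathbf{v}+b_i^h }} \right) -\beta _i } \right) }.\]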

For nearly all types of BMs, computing the partition function *Z* is intractable; therefore, the gradients of \(f\left( {\Theta ,D_{\mathrm{train}} } \right) \) cannot be computed exactly. This is also the case for iRBMs, as there are infinitely many hidden units. Côté and Larochelle (2016) suggest using the Contrastive Divergence (CD) and the Persistent Contrastive Divergence (PCD) (Hinton 2002; Tieleman 2008) algorithms to approximate the gradients, i.e., Gibbs sampling is used to draw samples \(\left( {z,\mathbf{v}} \right) \sim p\left( {z,\mathbf{v}} \right) \), and the second expectation in (11) is then estimated with these samples. The *k*th Gibbs step is done in the following order: \(z^{\left( k \right) }\sim p\left( {z|\mathbf{v}^{\left( {k-1} \right) }} \right) \rightarrow \mathbf{h}^{\left( k \right) }\sim p\left( {\mathbf{h}|\mathbf{v}^{\left( {k-1} \right) },z^{\left( k \right) }} \right) \rightarrow \mathbf{v}^{\left( k \right) }\sim p\left( {\mathbf{v}|\mathbf{h}^{\left( k \right) },z^{\left( k \right) }} \right) \). The details of learning are provided in Côté and Larochelle (2016).
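The order of the three sampling stages can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes the conditionals implied by the energy (1), namely \(p(z|\mathbf{v})\) proportional to \(\exp \left( \sum _{i\le z} (\ln (1+e^{\mathbf{W}_{i\cdot }\mathbf{v}+b_i^h })-\beta _i ) \right) \) and sigmoid conditionals for \(\mathbf{h}\) and \(\mathbf{v}\), and it truncates *z* to the allocated units, ignoring the infinite tail. All names (`gibbs_step`, `softplus`, `sigmoid`) and the toy parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gibbs_step(v: Sequence[int], W: Sequence[Sequence[float]],
               b_v: Sequence[float], b_h: Sequence[float],
               beta: Sequence[float],
               rng: random.Random) -> Tuple[int, List[int], List[int]]:
    """One step z ~ p(z|v), h ~ p(h|v,z), v ~ p(v|h,z), truncated to len(W)."""
    n_hidden = len(W)
    a = [sum(w * vj for w, vj in zip(W[i], v)) + b_h[i] for i in range(n_hidden)]
    # Log of the unnormalised p(z | v) for z = 1..n_hidden (tail is ignored).
    log_w, running = [], 0.0
    for i in range(n_hidden):
        running += softplus(a[i]) - beta[i]
        log_w.append(running)
    top = max(log_w)
    weights = [math.exp(x - top) for x in log_w]
    z = rng.choices(range(1, n_hidden + 1), weights=weights)[0]
    # Only the first z hidden units may switch on.
    h = [1 if (i < z and rng.random() < sigmoid(a[i])) else 0
         for i in range(n_hidden)]
    v_new = []
    for j in range(len(v)):
        act = b_v[j] + sum(h[i] * W[i][j] for i in range(z))
        v_new.append(1 if rng.random() < sigmoid(act) else 0)
    return z, h, v_new


rng = random.Random(0)
W = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]]
b_v = [0.1, -0.1]
b_h = [0.0, 0.2, 0.0]
beta = [1.01 * softplus(b) for b in b_h]  # beta_i = beta * softplus(b_i^h)
z, h, v_new = gibbs_step([1, 0], W, b_v, b_h, beta, rng)
```

Because the term \(-\mathbf{v}^{\mathrm{T}}\mathbf{b}^{v}\) is common to every *z*, it cancels in \(p(z|\mathbf{v})\) and is omitted from the log-weights.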

The hidden units of the iRBM are selected in sequence as *z* takes values from 1 to \(\infty \), and if the penalty \(\beta _i \) is chosen properly, an infinite pool of hidden units can be achieved. Côté and Larochelle (2016) suggest the parameterization \(\beta _i =\beta \ln \left( {1+e^{b_i^h }} \right) \). This ensures that the infinite sum in the partition function *Z* converges as long as \(\beta >1\) and the number of hidden units having non-zero weights and biases is always finite.
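The role of \(\beta >1\) can be checked numerically. The sketch below assumes that every hidden unit beyond the trained ones has zero weights and zero bias, so each extra unit multiplies the unnormalised probability by \(\exp \left( \ln 2-\beta \ln 2 \right) =2^{1-\beta }\); the sum over *z* is then a geometric series. The function names are hypothetical.

```python
import math


def tail_ratio(beta: float) -> float:
    """Factor by which one extra untrained unit scales exp(-F(v, z))."""
    softplus_zero = math.log(2.0)   # ln(1 + e^0) for zero weights and bias
    beta_i = beta * softplus_zero   # penalty of an untrained unit
    return math.exp(softplus_zero - beta_i)


def tail_mass(beta: float) -> float:
    """Sum of ratio^k for k >= 1: mass of z > l relative to z = l."""
    r = tail_ratio(beta)
    if r >= 1.0:
        raise ValueError("the series diverges unless beta > 1")
    return r / (1.0 - r)


ratio = tail_ratio(1.01)   # equals 2 ** -0.01, slightly below 1
mass = tail_mass(1.01)     # finite, so the partition function converges
```

With \(\beta =1\) the ratio is exactly 1 and the series diverges, which is why the condition \(\beta >1\) is needed. The value \(\beta =1.01\) corresponds to the penalty \(\beta _i =1.01\times \ln 2\) used in the experiments of this paper.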

### 2.2 Discriminative iRBM

*i*th row of the weight matrix \(\mathbf{U}\) connecting \(\mathbf{h}\) and \(\mathbf{e}_y \).

*y* and *z* is given below:

*z*, we get the marginal distribution \(p\left( {\mathbf{v},y} \right) \) as follows:

### 2.3 The ordering effect in iRBMs

The reason for the slow convergence of learning in iRBMs is that newly added hidden units are correlated with the previous hidden units, so it takes a long time for the filters to become diverse.

*i*th hidden unit:

*t*. Now,

Equation (17) indicates that the newly added hidden unit is always influenced by all the previously added hidden units. It starts to learn its feature jointly with the previous features rather than independently.

## 3 The proposed training strategy

In this section, the proposed training strategy is formally presented, which consists of two parts, the dynamic training objective and the approximated gradient descent algorithm for optimizing the objective.

### 3.1 The dynamic training objective

Stochastic gradient descent or mini-batch gradient descent method can be used to minimize the objective function (18). However, as mentioned above, convergence of learning is slow for iRBMs.

*i*th hidden unit is independent of the other hidden units. This indicates that, if \(p\left( {z<M\,|\mathbf{v}} \right) \rightarrow 0\), then the first *M* hidden units are independent of each other and act like a classic RBM, and any order of the first *M* hidden units does not influence the performance of the model.

In Côté and Larochelle (2016), the iRBM is trained in a fixed order, which leads to an order-biased model. If we assume that the order of hidden units is changeable and jointly train iRBMs with all possible orders, the bias might be alleviated, and the learned features will become closer to (19). This inspires us to propose an alternative training objective as follows.

At gradient descent step *t*, we first draw a sample permutation \(\tilde{\mathbf{o}}_t\) of the left-most \(M_t\) hidden units from a distribution \(p_t \left( {\mathbf{o}_t } \right) \):

*t*. To stabilize the growing of selected hidden units, only part of them are regrouped \((M_t<l_t)\).

*t*, its gradient (20) is in fact a sampling approximation to the gradient of the following marginalized objective function:

From the dynamic objective function (21), we can see that a mixture of \(M_t !\) iRBMs is trained at gradient descent step *t*. All the iRBMs share the same set of hidden units but with different permutations. When the number of hidden units with non-zero weights \(l_t \) grows during training, the number of mixed iRBMs grows accordingly. Theoretically, \(l_t \) is allowed to take an arbitrary positive integer value. Thus, the proposed training objective can potentially allow a mixture of infinitely many iRBMs. For convenience, we name our training strategy “random permutation (RP)”.
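The regrouping of the left-most \(M_t\) hidden units before a gradient descent step can be sketched as below. The distribution \(p_t \left( {\mathbf{o}_t } \right) \) is taken to be uniform over the \(M_t !\) permutations, which is an assumption of this sketch, and the helper names `sample_regrouping` and `apply_order` are hypothetical.

```python
import random
from typing import List, Sequence


def sample_regrouping(l_t: int, m_t: int, rng: random.Random) -> List[int]:
    """Order of the l_t active units: first m_t shuffled, the rest fixed."""
    if not 0 <= m_t < l_t:
        raise ValueError("m_t must satisfy 0 <= m_t < l_t")
    head = list(range(m_t))
    rng.shuffle(head)  # uniform over the m_t! permutations (assumed)
    return head + list(range(m_t, l_t))


def apply_order(rows: Sequence, order: Sequence[int]) -> list:
    """Reorder per-unit parameters (rows of W, entries of b^h) consistently."""
    return [rows[i] for i in order]


rng = random.Random(0)
W = [[0.1 * i, -0.1 * i] for i in range(6)]   # toy weights, one row per unit
b_h = [0.01 * i for i in range(6)]            # toy hidden biases
order = sample_regrouping(l_t=6, m_t=4, rng=rng)
W_perm = apply_order(W, order)
b_h_perm = apply_order(b_h, order)
```

The same index list is applied to every per-unit parameter so that each hidden unit keeps its own weights and bias; only its position in the sequence selected by *z* changes.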

\(M_t \) controls the proportion of hidden units being regrouped. If \(M_t \) is too large, the model will grow explosively. If \(M_t\) is too small, the boosting effect is minor. We prefer to regroup as many hidden units as possible while keeping the model from growing too rapidly. The strategy for choosing a proper \(M_t \) is discussed in Sect. 4.

### 3.2 The approximated gradient descent algorithm

In the previous subsection we have defined the objective function (21) together with the approximated gradient (20) for parameter learning. In this subsection, we will provide a way of computing the gradient (20).

*t*, the dynamic gradient (20) is identical to the gradient of the objective function (18), except that the probabilities \(p\left( {y_n |\mathbf{v}_n } \right) \) and \(p\left( {\mathbf{v}_n } \right) \) are replaced by \(p\left( {y_n |\mathbf{v}_n;{\tilde{\mathbf{o}}}_t } \right) \) and \(p\left( {\mathbf{v}_n |{\tilde{\mathbf{o}}}_t } \right) \). Once \({\tilde{\mathbf{o}}}_t \) is sampled, the training objective aims at learning an iRBM with permutation \({\tilde{\mathbf{o}}}_t \) of the hidden units. The learning of the generative part \(f_{\mathrm{gen}} \left( {\Theta _t,D_{\mathrm{train}},t} \right) \) is identical to the learning of the iRBM, which is briefly introduced in Sect. 2. CD or PCD can be directly used to compute the gradients. The approximated gradient for the generative part is given below:

The pseudo code of model parameter update for generative training objective is summarized in Algorithm 1, which is shown in the “Appendix”.

*t*, the complexity of computing (32) is \(\hbox {O}\left( {l_t^2 D+l_t^2 C} \right) \).

The gradients \({\partial F\left( {y_t |\mathbf{v}_t ;{\tilde{\mathbf{o}}}_t } \right) }/{\partial \theta ^{t}}\) can be computed exactly, as shown in the “Appendix”. However, this involves computing gradients for infinitely many parameters. To avoid this issue, we only compute the gradients for the parameters of the first \(l_t\) hidden units, and leave all the remaining parameters at 0. This operation is equivalent to using (30) to compute the gradients for non-zero parameters and (29) for the remaining parameters.

The maximum number of activated hidden units \(l_t \) changes gradually during training. In practice, if the Gibbs sampling chain ever samples a value of *z* larger than \(l_t \), we clamp it to \(l_t+1\). This avoids allocating a large amount of memory for a large (but rare) value of *z*.
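A minimal sketch of the clamping rule is given below. Allocating one zero-initialized hidden unit at the moment *z* is clamped to \(l_t+1\) is an assumption made for illustration; the function name `clamp_and_grow` is hypothetical, and the model is assumed to hold at least one hidden unit.

```python
from typing import List


def clamp_and_grow(z: int, W: List[List[float]], b_h: List[float]) -> int:
    """Clamp a sampled z to l_t + 1 and allocate one zero unit if needed."""
    l_t = len(W)
    if z > l_t:
        z = l_t + 1
        n_visible = len(W[0])
        W.append([0.0] * n_visible)  # new unit starts with zero weights
        b_h.append(0.0)              # and zero bias
    return z


W = [[1.0, 2.0]]
b_h = [0.5]
z_rare = clamp_and_grow(7, W, b_h)    # a rare, large sample of z
z_small = clamp_and_grow(1, W, b_h)   # within range, nothing changes
```

However large the sampled value is, memory grows by at most one unit per step, which is the point of the clamping rule.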

### 3.3 Evaluation of the models

After training is done (with *T* steps), we can simply treat \(l_T\) as the number of hidden units that have been trained. However, this results in many “redundant” hidden units, as many of the weight vectors are fairly small.

*z* giving the highest \(P\left( {z|\mathbf{v}_n } \right) \) in each mini-batch.

*y* conditioned on \(\mathbf{{v}'}\), \(p\left( {y|\mathbf{{v}'}} \right) \) for Dis-iRBM are computed as follows:

However, if \(p\left( {z\le M_T |\mathbf{v},{\tilde{\mathbf{o}}}} \right) \rightarrow 0\) holds for an arbitrary order of the first \(M_T \) hidden units, the likelihood \(p\left( \mathbf{v} \right) \) is invariant to the order of the first \(M_T \) hidden units. Any order then gives the same result, so \(p\left( \mathbf{v} \right) =p\left( {\mathbf{{v}'}|{\tilde{\mathbf{o}}}_T \left( {1:M_T } \right) ,{\tilde{\mathbf{o}}}_0 \left( {M_T +1,N_h } \right) } \right) \), where \({\tilde{\mathbf{o}}}_T \left( {1:M_T } \right) \) is an arbitrary order of the first \(M_T \) hidden units and \({\tilde{\mathbf{o}}}_0 \left( {M_T +1,N_h } \right) \) is the original order of the remaining hidden units. This conclusion is formally stated as follows:

### Proposition 3.1

The proof is given in the “Appendix”. Experimental results have shown that RP training can make the condition in Proposition 3.1 approximately satisfied, so that any ordering gives a nearly identical result. Thus a small *N* (e.g. \(N=5\)) is enough to give a good estimate of (34) and (35).

## 4 Experiments and discussions

In this section, we evaluate our training strategy empirically according to the convergence speed and the final generalization performance. The datasets used for evaluation are binarized MNIST (Salakhutdinov and Murray 2008) and CalTech101 Silhouettes (Marlin et al. 2010). The MNIST dataset is composed of 70,000 images of size \(28\times 28\) pixels representing handwritten digits (0–9), among which 60,000 images are used for training, and 10,000 images for testing. Each pixel of the image has been stochastically binarized according to its intensity as in Salakhutdinov and Murray (2008). The CalTech101 Silhouettes dataset is composed of 8671 images of size \(28\times 28\) binary pixels, representing object silhouettes of 101 classes. The dataset is divided into three parts: 4100 examples for training, 2264 for validation and 2307 for testing. We reshape each image of both datasets into a 784-dimensional vector by concatenating the adjacent rows one by one.
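The preprocessing described above can be sketched as follows, assuming pixel intensities scaled to \([0,1]\) and a row-major \(28\times 28\) layout. The function name `binarize_and_flatten` and the synthetic image are hypothetical.

```python
import random
from typing import List, Sequence


def binarize_and_flatten(image: Sequence[Sequence[float]],
                         rng: random.Random) -> List[int]:
    """Sample each pixel as Bernoulli(intensity) and concatenate the rows."""
    return [1 if rng.random() < p else 0 for row in image for p in row]


rng = random.Random(0)
# Synthetic 28 x 28 image whose intensity rises from 0.0 to 1.0.
image = [[(r + c) / 54.0 for c in range(28)] for r in range(28)]
x = binarize_and_flatten(image, rng)
```

A pixel with intensity 0 is never switched on and a pixel with intensity 1 always is, since `rng.random()` lies in \([0,1)\).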

We have designed several experiments for different purposes. In Sect. 4.1, the principle of choosing a proper regrouping rate \(M_t\) is experimentally investigated. In Sect. 4.2, we evaluate the generalization performance of the iRBM trained with RP according to its log-likelihood on the test sets of binarized MNIST and CalTech101 Silhouettes. In Sect. 4.3, we evaluate the generalization performance of Dis-iRBMs trained with RP on classification tasks. For all the experiments, the mini-batch size is 100 and (P)CD is used to compute the gradients. Max-norm regularization (Srivastava et al. 2014) is also used to suppress very large weights; the bounds for each \(\mathbf{W}_{i\cdot } \) and \(\mathbf{U}_{i\cdot }\) are 10 and 5 respectively. Côté and Larochelle (2016) claim that the results of learning are robust to the value of the hidden unit penalty \(\beta _i \). We have tried several different values of \(\beta _i \) and find that a smaller \(\beta _i \) enables the model to grow to a proper size faster at the beginning of learning. However, due to the ordering effect, it takes a long time for the hidden units to learn diverse filters. RP training also prefers a small \(\beta _i \), which allows as many hidden units to be mixed as possible. But too small a \(\beta _i\) (coupled with a large \(M_t\)) is more likely to cause the model to grow explosively. Based on the above arguments, and for convenience of comparison, we use the same \(\beta _i =1.01\times \ln 2\) for all the models in this paper, which is identical to that in Côté and Larochelle (2016). We also use L1 and L2 regularization to regularize the models. The code to reproduce the results of this paper is available on GitHub.
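Max-norm regularization of the rows \(\mathbf{W}_{i\cdot }\) can be sketched as a projection applied after each parameter update. The sketch assumes that the constraint is on the Euclidean norm of each row, as in Srivastava et al. (2014); the function name `max_norm_project` and the toy weights are hypothetical.

```python
import math
from typing import List


def max_norm_project(rows: List[List[float]], bound: float) -> List[List[float]]:
    """Rescale every row whose L2 norm exceeds `bound` back onto the bound."""
    projected = []
    for row in rows:
        norm = math.sqrt(sum(w * w for w in row))
        scale = bound / norm if norm > bound else 1.0
        projected.append([w * scale for w in row])
    return projected


W = [[12.0, 16.0], [0.3, 0.4]]               # row norms: 20 and 0.5
W_clipped = max_norm_project(W, bound=10.0)  # bound used for W in the text
```

Rows that already satisfy the bound are left unchanged, so the projection only acts on hidden units whose weights have grown very large.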

We would like to mention that there exist a number of sophisticated techniques that improve the performance of classic RBMs through sampling strategies (Cho et al. 2010; Sohl-Dickstein et al. 2009), model architectures (Welling et al. 2005; Salakhutdinov and Hinton 2009), etc. However, the aim of this paper is to propose an alternative training strategy for faster convergence and better generalization of the original iRBMs in general. Combining these techniques can benefit both training strategies; here we focus on comparing them under basic parameter settings.

### 4.1 The principle of choosing the regrouping rate \(M_t \)

As mentioned in Sect. 3, choosing a proper \(M_t\) is essential for RP training. In this subsection, we experimentally investigated the influence of \(M_t\) on the growing of the model, and a principle of choosing it is proposed based on the experimental results.

*z* that maximize \(p\left( {z|\mathbf{v}_n ;{\tilde{\mathbf{o}}}_T } \right) \) above \(M_T \). Similar numbers of hidden units are activated even though the inputs are quite different. The number of activated hidden units ranges from 500 to 520 on binarized MNIST and from 700 to 750 on CalTech101 Silhouettes.

At the beginning of training, \(M_t\) is set between \(0.7l_t\) and \(0.8l_t\) to allow a greedy mixing of iRBMs. After that, a more adaptable \(M_t \) defined by (36) is used to stabilize the growth.

### 4.2 Evaluating the generalization performance of RP trained iRBMs as density models

The best results of average log-likelihood on the test sets of binarized MNIST and CalTech101 Silhouettes for different models

| Model (binarized MNIST) | Size | Avg. LL | Model (CalTech101 Silhouettes) | Size | Avg. LL |
|---|---|---|---|---|---|
| RBM (Salakhutdinov and Murray 2008) | 500 | \(-\) 86.34 | RBM (Côté and Larochelle 2016) | 500 | \(-\) 119.05 |
| iRBM (Côté and Larochelle 2016) | 1208 | \(-\) 85.65 | iRBM (Côté and Larochelle 2016) | 915 | \(-\) 121.47 |
| FW-iRBM (Ping and Liu 2016) | 460 | \(\approx \) \(-\) 85 | FW-iRBM (Ping and Liu 2016) | 550 | \(\approx \) \(-\) 127 |
| iRBM, RP | | \(-\) | iRBM, RP | | \(-\) |

*z* and \(\mathbf{v}\). We computed the histogram of \(z_m =\arg \mathop {\max }\limits _z p\left( {z|\mathbf{v}_t } \right) \approx \sum _{n=1}^5 {p\left( {z|\mathbf{v}_t ;{\tilde{\mathbf{o}}}_T^n } \right) }\Big /5 \) on the two test sets. The results are shown in Fig. 10, which reveals two facts: (a) all \(z_m \) are larger than \(M_t \); (b) all the inputs have similar numbers of activated hidden units, as all \(z_m \) are close to each other. The number of example-specific filters has been greatly reduced.

To further validate that the filters are independent of each other, we use all the learned filters to compose a classic RBM, i.e. *z* is clamped to \(N_h\), giving \(p(\mathbf{v}|N_h)\). The average log-likelihoods on the two test sets are \(-\,88.13\) (binarized MNIST) and \(-\,115.90\) (CalTech101 Silhouettes) for the converted RBMs.

### 4.3 Evaluating the generalization performance of RP trained Dis-iRBMs on classification tasks

The best results of classification error on the test sets of MNIST and CalTech101 Silhouettes achieved by different models

| Model (binarized MNIST) | Size | Training objective | Test error (%) | Model (CalTech101 Silhouettes) | Size | Training objective | Test error (%) |
|---|---|---|---|---|---|---|---|
| ClassRBM (Hinton 2002) | 500 | Dis. | 1.81 | Dis-iRBM | 235 | Dis. | 37.10 |
| | 1500 | Hybrid (\(\alpha =0.01\)) | 1.28 | | 373 | Hybrid (\(\alpha =0.01\)) | 34.59 |
| Dis-iRBM | 382 | Dis. | 2.20 | FW-iRBM+SR (Ping and Liu 2016) | 600 | – | 34.5 |
| | 416 | Hybrid (\(\alpha =0.005\)) | 1.71 | Dis-iRBM, RP | | Dis. | |
| FW-iRBM+SR (Ping and Liu 2016) | 600 | – | 2.2 | | | Hybrid (\(\alpha =0.01\)) | 30.95 |
| Dis-iRBM, RP | | Dis. | | | | | |
| | | Hybrid (\(\alpha =0.005\)) | 1.42 | | | | |

The best Dis-iRBM trained using the hybrid training objective (\(\alpha =0.005)\) achieves a test error of 1.42% on MNIST, which is better than the 1.71% achieved by the normally trained Dis-iRBM. Its global learning rate is 0.1, and the L1 regularization weight is \(1\times 10^{-4}\). The Dis-iRBM performs slightly worse than the ClassRBM on MNIST, but the difference between the two best results is smaller than 0.2%, which is commonly regarded as statistically insignificant for MNIST. The best result on CalTech101 Silhouettes is also achieved by a Dis-iRBM trained using the hybrid training objective (\(\alpha =0.01)\) and RP training. The test error is 30.95%, which is better than the 34.59% achieved by the normally trained Dis-iRBM. The global learning rate for the best model is 0.01, and the L1 regularization weight is \(1\times 10^{-3}\). Interestingly, the size of the iRBM is smaller when using the discriminative training objective. This makes sense, as fewer features are often needed if the model only needs to discriminate objects from each other instead of modeling all the examples well.

The FW-iRBM (Ping and Liu 2016) has also been used for classification by taking the hidden units’ activation vectors and using them as input for a softmax regression (SR). As the FW-iRBM cannot perform discriminative training directly due to its training objective, learning the FW-iRBM and learning the SR are two separate procedures, i.e., after training the FW-iRBM for *T* iterations, its parameters are fixed and its hidden units’ activation vectors are used to train an SR. The results of the FW-iRBM are also listed in the table.

A “trick” to make learning a bit faster is to use momentum: we use different momentum values for the parameters of different hidden units according to how long they have been trained. When a hidden unit is added to training, its starting momentum value is 0.5, which then gradually increases to 0.9.
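A per-unit momentum schedule of this kind might look as follows. Only the end points 0.5 and 0.9 come from the text; the linear ramp and the `ramp_steps` parameter are assumptions of this sketch, and the function name `unit_momentum` is hypothetical.

```python
def unit_momentum(age: int, ramp_steps: int = 1000,
                  start: float = 0.5, end: float = 0.9) -> float:
    """Momentum for a hidden unit that has been trained for `age` steps."""
    if ramp_steps <= 0:
        return end
    fraction = min(max(age, 0) / ramp_steps, 1.0)
    return start + (end - start) * fraction  # linear ramp (assumed shape)


m_new = unit_momentum(0)       # freshly added unit
m_half = unit_momentum(500)    # halfway through the assumed ramp
m_old = unit_momentum(5000)    # long-trained unit
```

Keeping one age counter per hidden unit lets newly added units start with a small momentum while long-trained units use the larger value.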

## 5 Conclusion and future work

In this paper, we have proposed a novel training strategy (RP training) for infinite RBMs, which aims at achieving better convergence and generalization. The core concept of RP training is a dynamic training objective that allows a different model to be optimized at each gradient descent step. More specifically, an iRBM with a random grouping of hidden units is sampled before each gradient descent step. An implicit mixture of infinitely many iRBMs with different permutations of hidden units is achieved with RP training. Experiments on binarized MNIST and CalTech101 Silhouettes have shown that RP can train the hidden units more efficiently, which results in a smaller hidden layer and better generalization performance. Compared with the FW-iRBM, which greedily trains one unit at each update step, the iRBM trains all the hidden units jointly; the FW-iRBM is therefore more likely to reach a sub-optimal solution. In the future, more datasets, especially real-valued ones, will be used to further evaluate the performance of our training strategy. Meanwhile, we are exploring a multi-layer extension of the iRBM; the idea of RP training can also be applied to this new architecture, combined with greedy layer-wise pre-training.

## References

- Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. *Cognitive Science*, *9*, 147–169.
- Baldi, P., & Sadowski, P. (2014). The dropout learning algorithm. *Artificial Intelligence*, *210*, 78–122.
- Cho, K., Raiko, T., & Ilin, A. (2010). Parallel tempering is efficient for learning restricted Boltzmann machines. In *Proceedings of the International Joint Conference on Neural Networks (IJCNN)* (pp. 3246–3253). IEEE Press.
- Côté, M. A., & Larochelle, H. (2016). An infinite restricted Boltzmann machine. *Neural Computation*, *28*, 1265–1289.
- Dietterich, T. G. (2000). Ensemble methods in machine learning. In *Multiple classifier systems* (pp. 1–15). Springer.
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. *Journal of Machine Learning Research*, *12*, 2121–2159.
- Fischer, A., & Igel, C. (2010). Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines. In *Artificial Neural Networks – ICANN 2010* (pp. 208–217). Berlin: Springer.
- Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. *Neural Computation*, *14*, 1771–1800.
- Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. *Science*, *313*, 504–507.
- Larochelle, H., et al. (2012). Learning algorithms for the classification restricted Boltzmann machine. *Journal of Machine Learning Research*, *13*, 643–669.
- Marlin, B. M., Swersky, K., Chen, B., & de Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In *Proceedings of the international conference on artificial intelligence and statistics* (pp. 305–306).
- Mohamed, A., & Hinton, G. E. (2010). Phone recognition using restricted Boltzmann machines. In *IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP)* (pp. 4354–4357).
- Mohamed, A., Dahl, G. E., & Hinton, G. (2011). Acoustic modeling using deep belief networks. *IEEE Transactions on Audio, Speech, and Language Processing*, *20*, 14–22.
- Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In *ICML*.
- Ping, W., Liu, Q., & Ihler, A. T. (2016). Learning infinite RBMs with Frank–Wolfe. In *NIPS*.
- Salakhutdinov, R., & Hinton, G. E. (2009). Deep Boltzmann machines. In *AISTATS*.
- Salakhutdinov, R., & Murray, I. (2008). On the quantitative analysis of deep belief networks. In *Proceedings of the 25th Annual International Conference on Machine Learning (ICML)* (pp. 872–879).
- Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In Z. Ghahramani (Ed.), *Proceedings of the 24th International Conference on Machine Learning (ICML)* (pp. 791–798). ACM.
- Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In *Parallel distributed processing: Explorations in the microstructure of cognition, foundations* (Vol. 1, pp. 194–281).
- Sohl-Dickstein, J., Battaglino, P., & DeWeese, M. R. (2009). Minimum probability flow learning. In *ICML*.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. *Journal of Machine Learning Research*, *15*, 1929–1958.
- Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2007). Modeling human motion using binary latent variables. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), *Advances in neural information processing systems (NIPS 19)* (pp. 1345–1352). Cambridge: MIT Press.
- Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In *International Conference on Machine Learning (ICML)* (pp. 1064–1071).
- Tomczak, J. M., & Gonczarek, A. (2015). Sparse hidden units activation in restricted Boltzmann machine. In *Progress in systems engineering* (pp. 181–185). Springer International Publishing.
- Tran, T., Phung, D., & Venkatesh, S. (2014). Mixed-variate restricted Boltzmann machines. Eprint arXiv:1408.1160v1.
- Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. In *NIPS*.
- Welling, M., Zemel, R. S., & Hinton, G. E. (2002). Self supervised boosting. In *NIPS*.