LowDimensional PerturbandMAP Approach for Learning Restricted Boltzmann Machines
Abstract
This paper introduces a new approach to maximum likelihood learning of the parameters of a restricted Boltzmann machine (RBM). The proposed method is based on the PerturbandMAP (PM) paradigm that enables sampling from the Gibbs distribution. PM is a two step process: (i) perturb the model using Gumbel perturbations, then (ii) find the maximum a posteriori (MAP) assignment of the perturbed model. We show that under certain conditions the resulting MAP configuration of the perturbed model is an unbiased sample from the original distribution. However, this approach requires an exponential number of perturbations, which is computationally intractable. Here, we apply an approximate approach based on the first order (lowdimensional) PM to calculate the gradient of the loglikelihood in binary RBM. Our approach relies on optimizing the energy function with respect to observable and hidden variables using a greedy procedure. First, for each variable we determine whether flipping this value will decrease the energy, and then we utilize the new local maximum to approximate the gradient. Moreover, we show that in some cases our approach works better than the standard coordinatedescent procedure for finding the MAP assignment and compare it with the Contrastive Divergence algorithm. We investigate the quality of our approach empirically, first on toy problems, then on various image datasets and a text dataset.
Keywords
Unsupervised deep learning Gumbel perturbation Restricted Boltzmann machine Greedy optimization1 Introduction
The commonly used procedure for learning parameters of Markov random fields (MRFs) is to maximize the loglikelihood function for observed data, by updating the model parameters along the gradient of the objective. This gradient step requires inference on the current model, which is often approximated using a deterministic or Markov Chain Monte Carlo (MCMC) procedure [10]. In general, the gradient step attempts to update the parameters to increase the unnormalized probability of the observations (clamped or positive phase), while decreasing the sum of unnormalized probabilities over all states, i.e., the partition function (unclamped or negative phase). The positive phase is rather straightforward, but the negative phase is difficult to perform, mainly because of the complexity of computing the partition function. Therefore, in order to overcome the issue, alternative approaches have been proposed, such as Contrastive Divergence (CD) [7] or Pseudolikelihood or ratio matching (see [19] for a review). In the context of Restricted Boltzmann Machines (RBMs), the widelyused training procedure is CD, which utilizes the MCMC in the negative phase to decrease the probability of the configurations that are in the vicinity of training data.
Recently, perturbation methods combined with efficient maximum a posteriori (MAP) solvers (PerturbandMAP, PM) were used to efficiently sample from the MRF [3, 5, 6, 17, 23, 24]. The main idea behind PM is based on extreme value theory, which states that the MAP assignments for particular perturbations of any Gibbs distribution can produce unbiased samples from the unperturbed model [23]. In practice, however, models cannot be perturbed in the ideal form and kthorder approximations are used. In [5] these order approximations are used to bound from above the partition function, suggesting that PMbased sampling procedures can be used in the negative phase to maximize a lower bound on the loglikelihood of the data. However, this is feasible only if efficient MAP solvers are accessible, e.g., MRF with submodular potentials [11], and even so, repeated MAP estimation at each step of learning could be prohibitively expensive. Nonetheless, this PM approach has been successfully applied to different problems, e.g., computer vision [24], feature learning using Cardinality RBM [15], structured prediction [1, 30], boundary of object annotation in images [18], image segmentation [8, 20]. The idea of PM was also used in a preliminary study on learning RBMs [26].
In [26] an approach closely related to CD and PerturbandMAP for sampling from RBM in the negative phase of learning was proposed. The basic idea was to perturb the parameters of the model, then starting from the training data, find the local optima of the energy function using blockcoordinate descent method. This produces samples from the joint distribution (over both the hidden and the visible variables) in the RBM. We call this approach perturbandcoordinate descent (P&CD).
In this work, we rely on a different strategy to find MAP assignments, namely, greedy search for local optimal assignments for both visible and hidden variables. Since learning RBM using the PM approach requires performing optimization in each minibatch, it is critical that we approximate MAP assignments quickly. That is why we exploit the greedy optimization technique, hoping it will produce reasonable local optimal assignments. Moreover, we show a close relationship of our proposed greedybased method to the coordinate descent method in the context of the RBM.
Using three experiments, we show that using the PM approach for learning the RBM is promising: (1) a toy problem in which the exact loglikelihood can be computed; (2) several blackandwhite image benchmark datasets: letters, MNIST, Omniglot, Frey Face, Handwritten character recognition; and (3) a text analysis of blog posts. Our empirical studies show that our proposed PMbased learning procedure for RBM in general performs similarly to CD, but in some cases can be better.

We propose to use a greedy optimization method for finding MAP solutions in the PM approach for learning RBMs.

We perform an empirical comparison of the PM approach against the Contrastive Divergence. We compare the methods both on unsupervised and supervised data. Given that [26] do not report any experimental results, this is the first empirical evaluation of the PM approach utilized for learning RBMs.
2 Background
2.1 Restricted Boltzmann Machine
2.2 PerturbandMAP Approach
Sampling from a MRF, including a RBM, is problematic due to the difficulty of calculating the partition function. However, assuming that a MAP assignment in the MRF can be found efficiently, it is possible to take advantage of random perturbation methods to obtain unbiased samples [23]. Further, the unbiased samples can be utilized in the standard stochastic approximation of the loglikelihood gradient to calculate the second sum in Eq. 8.
Let us consider a system described by variables \({\mathbf {z}}\), and an energy \({\mathcal {E}}({\mathbf {z}})\).^{3} It turns out that a probability distribution of MAP assignments of the perturbed energy using Gumbeldistributed random variables is equivalent to the Gibbs distribution. The following theorem clarifies the connection between the Gumbel distribution and the Gibbs distribution [23]:
Theorem 1
 (Perturb step)

Add a random perturbation \(\gamma ({\mathbf {z}})\) to the energy \({\mathcal {E}}({\mathbf {z}})\).
 (MAP step)
 Find the maximum of the perturbed energy:$$\begin{aligned} {\mathbf {z}} = \arg \max _{\hat{{\mathbf {z}}}} \{ {\mathcal {E}}(\hat{{\mathbf {z}}}) + \gamma (\hat{{\mathbf {z}}}) \}. \end{aligned}$$(10)
Since the domain of \({\mathbf {z}}\) grows exponentially with the number of variables, it is troublesome to find the MAP assignment of the perturbed energy efficiently. Therefore, first order (lowdimensional) Gumbel perturbations are often employed [5]. Here, for the first order perturbation, the joint perturbation is fully decomposable \(\gamma ({\mathbf {z}}) = \sum _{i} \gamma (z_i)\) and it corresponds to perturbing unary potentials (i.e., biases) only [5, 6, 23].
3 Learning RBM Using PM
In order to apply the PM approach for learning RBMs we need to specify both steps of the PM process, namely, Perturb and MAP. Here, we consider lowdimensional Gumbel perturbations and two optimization methods for finding MAP assignments: coordinatedescent method [26] and greedy optimization.
3.1 Perturb Step
Recently it has been shown theoretically that the lowdimensional perturbations can be used to draw approximate sample from the Gibbs distribution:
Theorem 2
The proof of this theorem takes advantage of approximating a sum of standard Gumbel distributions with single Gumbel distribution using the momentmatching method [22]. The approximation error between the distribution of the sum of Gumbel variables and the distribution for single Gumbel variable was bounded from above using the BerryEsseen inequality and it was shown to be small enough to be used as a valid approximation [33]. Therefore, theoretically we can apply the lowdimensional perturbations for sampling from the RBM. However, the crucial part of the PM approach lies in finding MAP solutions of the perturbed energy, that is, in formulating fast and efficient optimization procedure in the MAP step.
3.2 MAP Step
As we have presented, a set of S solutions of the MAP step in Eq. 10 can be used in the stochastic approximation of the gradient in Eq. 8. Since the lowdimensional distributions lead to the almost exact Gibbs distribution according to the Theorem 2, the feasibility of the PM approach strongly depends on the efficiency of the optimization procedure. In general, MAP estimation in MRF is NPhard [28] and only a limited class of MRFs allow efficient energy minimization [11]. In the case of RBM, the problem of finding a solution that maximizes the energy function can be cast as the unconstrained binary quadratic programming problem (BQP) that is known to be NPhard as well [21]. In the following subsections we present two alternatives to MAP step that are suitable when inference is employed within the context of learning RBMs.
3.2.1 Perturb and Coordinate Descent (P&CD) Learning
The procedure starts from any \({\mathbf {x}} \in {\mathcal {D}}\), where \({\mathbf {h}}\) and \({\mathbf {x}}\) are repeatedly updated for K steps and the procedure is run S times. As a result, S final configurations of visibles are used to calculate the second sum in Eq. 8 while the hiddens can be ”effectively” discarded.^{4}
Interestingly, the procedure utilizing the coordinate descent method resembles the Contrastive Divergence algorithm in which the consecutive steps in the forloop are performed according to the Gibbs sampler. The main difference between these two approaches is that, in the P&CD approach, the stochasticity is injected to biases and the solution of the optimization problem (a sample) is found in a deterministic manner. On the contrary, in the Contrastive Divergence the ”solution” (a sample) is found in a stochastic fashion.
3.2.2 Perturb and Greedy Energy Optimization (P&GEO) Learning
Interestingly, the greedy approach for finding MAP solutions is almost identical to the coordinate descent method (see steps 6 and 7 in Algorithms 1 and 2). The key difference lies in updating visible and hidden variables at once. This property may be profitable in case of training deep Boltzmann machines (DBM) because it is easy to apply the greedy method to optimize the energy function, while application of the coordinate descent for DBM is rather indirect. However, we leave investigating this issue for future research.
4 Experiments
In the empirical study we compare learning binary RBM using stochastic gradient descent (SGD) with P&CD and P&GEO, against the wellknown Contrastive Divergence (CD). In the experiments we consider a toy problem in which exact loglikelihood can be calculated, as well as various image benchmark datasets and a text benchmark dataset.
We use the loglikelihood of the test data as the evaluation metric. For highdimensional problems exact calculation of \(\log (Z)\) is intractable, therefore, we apply the Annealed Importance Sampling (AIS) procedure [27] for approximations. The AIS is performed 100 times with 10, 000 temperature scales evenly spaced between 0 and 1 and 100 particles in each run. The base distribution in AIS is set to independent binary draws at the mean of the observations. Moreover, we evaluate all methods in a discriminative manner using k Nearest Neighbor classifier.
In all experiments we used the following values of the hyperparameters. The learning rate of SGD was chosen from the set \(\{0.001, 0.01, 0.1\}\). CD used \(K \in \{1,5,10\}\) steps of the block Gibbs sampling, while the number of optimization iterations of P&CD and P&GEO procedures were in the set \(\{1, 5, 10\}\). Moreover, we performed the SGD procedure with the momentum coefficient in \(\{0, 0.9\}\) and we penalized the loglikelihood objective with the weight decay (the regularization coefficient in \(\{0, 10^{5}, 10^{4}, 10^{3}\}\)). The gradient approximation for all considered learning methods was calculated using \(S = 1\) samples.
The optimal hyperparameters were selected using the evaluation metric. The number of iterations over the training set was determined using early stopping according to the validation dataset loglikelihood, with a look ahead of 30 epochs.
All experiments were performed on Nvidia GeForce GTX970. The code for the paper is available online: https://github.com/szymonzareba/perturb_and_map_rbm.
4.1 Toy Problem
We generated 10,000 training images, 10,000 validation images and 10,000 test images. For training, we performed at most 500,000 weight updates for an RBM with 10 hidden units, and stochastic gradient descent with momentum term. We use the small number of hidden units because for 10 hidden units it is possible to calculate the exact value of the loglikelihood.
Average test loglikelihood with one standard deviation calculated using AIS
Toy \(p=0.01\)  Toy \(p=0.1\)  

Avg. LL  Avg. Std  Avg. Iters  Avg. LL  Avg. Std  Avg. Iters  
CD1  \(\,4.45\)  1.218  48  \(\,6.63\)  0.011  79 
P&GEO1  \(\,3.70\)  0.697  59  \(\,6.86\)  0.221  87 
P&CD1  \(\,2.99\)  0.418  48  \(\,6.81\)  0.233  71 
CD5  \(\,2.43\)  0.053  67  \(\,6.61\)  0.008  85 
P&GEO5  \(\,2.67\)  0.359  42  \(\,6.62\)  0.018  91 
P&CD5  \(\,2.47\)  0.355  54  \(\,6.62\)  0.003  86 
CD10  \(\,2.56\)  0.093  80  \(\,6.62\)  0.005  93 
P&GEO10  \(\,2.36\)  0.089  59  \(\,6.62\)  0.053  133 
P&CD10  \(\,2.78\)  0.173  105  \(\,6.62\)  0.004  98 
Discussion When the modes are close (\(p = 0.1\)), mixing is fast and all training techniques perform similarly. However, for \(K=1\) the PM approach converges much faster than the CD method but eventually obtains slightly worst result (see Fig. 2b). For \(K=5\) and \(K=10\), the CD converges faster than the PM by about 10 and 30 iterations (epochs), however, all of the considered methods converges to the same result. In the case of \(p=0.01\), where good mixing is crucial, both the CD and the PM perform similarly. Nevertheless, the CD requires about 20 more iterations than P&GEO to converge (see Fig. 2c, e). Interestingly, the P&GEO seems to be more stable than the P&CD because it requires less iterations to converge and has smaller variance (see especially Fig. 2e). Moreover, for \(K=10\), P&CD appears to struggle with poor mixing.
A possible explanation for unstable behavior of the P&CD is the way the update is performed: first visibles are calculated and then new hiddens are computed for given new visibles. We presume that the algorithms ”jumps” between poor (local) optima and that is why it is unable to reach a better solution. On the other hand, the greedy optimization performs an update at once, for both visibles and latents, and, therefore, its performance is stable and the variance is much lower.
4.2 Image Datasets
4.2.1 Unsupervised Evaluation
Average test loglikelihood with one standard deviation calculated using AIS
LETTERS  MNIST  OMNIGLOT  FREY FACE  HCR  

CD1  \(\,38.63\pm 0.18\)  \(\,108.10\pm 0.08\)  \(\,126.02\pm 0.21\)  \(\,119.54\pm 1.79\)  \(\,160.04\pm 1.09\) 
CD5  \(\,37.11\pm 0.06\)  \(\,92.30\pm 0.31\)  \(\,117.02\pm 1.27\)  \(\,112.07\pm 0.88\)  \(\,124.88\pm 1.48\) 
CD10  \(\,36.94\pm 0.05\)  \(\,88.94\pm 0.09\)  \(\,114.76\pm 0.36\)  \(\,111.33\pm 1.99\)  \(\,114.57\pm 0.74\) 
P&GEO1  \(\,39.17\pm 0.25\)  \(\,103.46\pm 1.80\)  \(\,126.00\pm 1.66\)  \(\,113.38\pm 1.40\)  \(\,122.11\pm 0.60\) 
P&GEO5  \(\,37.63\pm 0.01\)  \(\,88.38\pm 1.48\)  \(\,112.08\pm 0.82\)  \(\,108.49\pm 0.24\)  \(\,114.69\pm 3.52\) 
P&GEO10  \(\,37.64\pm 0.13\)  \(\,95.31\pm 1.30\)  \(\,110.24\pm 0.76\)  \(\,108.48\pm 0.64\)  \(\,120.40\pm 0.91\) 
P&CD1  \(\,39.11\pm 0.12\)  \(\,102.28\pm 0.70\)  \(\,127.12\pm 2.28\)  \(\,113.95\pm 0.70\)  \(\,119.66\pm 0.49\) 
P&CD5  \(\,37.60\pm 0.19\)  \(\,90.79\pm 0.77\)  \(\,110.19\pm 0.21\)  \(\,108.65\pm 0.69\)  \(\,114.65\pm 1.39\) 
P&CD10  \(\,37.66\pm 0.10\)  \(\,90.59\pm 0.51\)  \(\,110.20\pm 0.34\)  \(\,108.17\pm 0.70\)  \(\,114.98\pm 0.31\) 
Average time in milliseconds per minibatch for different datasets (in all cases size of images are provided)
Method  CD  P&GEO  P&CD  

K  1  5  10  1  5  10  1  5  10 
MNIST (\(28\times 28)\)  2.2  3.6  5.6  2.5  3.9  5.8  2.6  3.8  5.4 
FREY FACE (\(20\times 28)\)  1.1  1.7  2.6  1.6  2.0  2.7  1.5  2.0  2.6 
LETTERS \((16 \times 10)\)  0.9  1.2  1.6  0.9  1.2  1.5  1.0  1.2  1.5 
TOY \((4 \times 4)\)  0.4  0.5  0.7  0.5  0.6  0.7  0.5  0.6  0.7 
Discussion In Table 2 we see that the PM approach performs slightly better than CD. The wallclock times of the considered methods are almost the same (see Table 3), i.e., the number of Gibbs sampler steps and the number of optimization steps take almost exactly the same amount of time, therefore we focused on the results of the same time complexity for K equal 1, 5 and 10. We found that, in three out of five datasets, the PM approach performs better than the CD method (see Fig. 4b, c, d), the CD slightly dominates in one case (see Fig. 4a) and there is one draw (see Fig. 4e). Hence, the PM approach seems to be favorable to the CD since it obtains better results within similar wallclock time.
An interesting result was obtained in the case of the Omniglot dataset where CD performed worst than P&GEO and P&CD by about 5 nats. We hypothesize that the reason for this significant difference lies in the data itself. Notice that Omniglot contains images of 1623 different characters and in the training data we have only about 12 images per character. Moreover, there are 1623 modes in this data, which is drastically different from MNIST (10 modes), Letters (26 modes) or HCR (36 modes). Possibly, the PM approach better handles cases where there is less data for a mode than the CD. In other words, the PM approach may, to some extent, generate samples from a mode for which only a few examples are given, while the Gibbs sampler may ”jump” between modes and so generate false samples. Similarly, in the case of Frey Face the difference in favor of the PM approach is about 4.5 nats. However, it is difficult to estimate a number of modes for Frey Face because this dataset contains faces representing various emotions that are shown from different angles. Nonetheless, we can say that once again this dataset is characterized by a multimodality property and this gives another evidence for our presumption. We leave thorough investigation of this hypothesis for future research.
Overall, we found that the proposed P&GEO and the P&CD perform very similarly. In fact, the P&GEO obtains better results than the P&CD only on MNIST by about 2 nats. Therefore, we cannot clearly conclude that the proposed optimization technique is indeed favorable. Nonetheless, we see two potential qualitative advantages over the coordinate descent method. First, an application of the greedy strategy to DBM is straightforward since we are interested in calculating all variables at once, while the coordinate descent technique requires us to determine the order of updates, which could be cumbersome for DBMs. Second, there are other variants and similar heuristics for finding a solution for unconstrained BQP problem [21]. Here we have shown that the greedy method gives competitive results, which suggests following this line of thinking to develop new methods for the MAP step in the PerturbandMAP framework for learning RBMs.
4.2.2 Supervised Evaluation
Setting In the third experiment we aim at evaluating latent representation given by \(p({\mathbf {h}}{\mathbf {x}})\) using labeled data. For this purpose we used two out of the considered image datasets that contain also labels, namely, MNIST and Omniglot. Notice that MNIST consists 10 labels while Omniglot possesses 1622 classes. We used exactly the same training procedure as described in Sect. 4.2.
We utilized knearest neighbor classifier (kNN) with \(k \in \{1, 3, 5, 7, 9\}\) and the Euclidean metric. The application of the kNN classifier, which is a nonparametric method, does not introduce any bias of a parametric classifer. This approach is a common practice in assessing unsupervised methods [25].
In order to compare the three considered learning algorithms we used three metrics, namely, average precision score, classification accuracy and normalized mutual information score. The average precision score (AvgPrec) corresponds to the area under the precisionrecall curve in oneversusall classification setting. The classification accuracy (ClassAcc) represents the total number of correctly classified instances divided by the number of all examples. Finally, the normalized mutual information score (NMI) that is a normalized version of the a measure of the similarity between two labels of the same data, which is extensively used in clustering. Obtaining a value of NMI closer to 0 results in a random assignment of an example to a class and 1 corresponds to an opposite case. In the experiment we used the implementation of the considered statistics in scikitlearn package^{8}. The experiment was repeated 3 times and the best results are reported.
The average precision score (AvgPrec), the classification accuracu (ClassAcc) and the normalized mutual information score (NMI) for the three considered learning algorithms and two datasets
Method  MNIST  OMNIGLOT  

AvgPrec  ClassAcc  NMI  AvgPrec  ClassAcc  NMI  
CD  0.476  0.969  0.921  0.052  0.203  0.794 
P&GEO  0.481  0.969  0.923  0.052  0.205  0.796 
P&CD  0.483  0.969  0.922  0.051  0.201  0.794 
In Fig. 5 the tSNE 2D visualizations are presented. As expected, all representations are similar and each cluster is almost ideally associated with a single class label. Moreover, we notice that some classes tend to be close to each other, e.g., 3s with 8s and 6s that are depicted by red, yellow and pink color, respectively, in Fig. 5. We notice, however, that for the CD and the P&CD zeros denoted by the blue color in Fig. 5 are grouped in two separate clusters while for the P&GEO they form one cluster. This result may indicate that the P&GEO indeed produces good latent representation. Nevertheless, this result may follow from limiting capabilities of the tSNE method and in high dimensions the latent representation are grouped properly.
4.3 Text Dataset
4.3.1 Unsupervised Evaluation
Setting In this experiment, we evaluate the proposed approach on the 20newsgroups dataset [13], 20Newsgroups^{9} for short, that contains 8,500 training, 1245 validation, and 6,497 test text documents of blog posts. Each text document is represented as a binary vector of 100 most frequent words among all documents. The problem is the text analysis and the document classification to one of four newsgroup metatopics (classes).
Average test loglikelihood with one standard deviation calculated using AIS on 20Newsgroups
\(K=1\)  \(K=5\)  \(K=10\)  

CD  \(\,13.72\pm 0.02\)  \(\,13.73 \pm 0.03\)  \(\,13.70 \pm 0.02\) 
P&GEO  \(\,18.19 \pm 0.19\)  \(\,13.70 \pm 0.01\)  \(\,13.72\pm 0.02\) 
P&CD  \(\,18.38\pm 0.18\)  \(\,13.71 \pm 0.02\)  \(\,13.72 \pm 0.02 \) 
Discussion In Table 5 we notice that there is no significant difference between the CD and the PerturbandMAP approach. However, interestingly for the CD method there was almost no difference in the generative performance with respect to a varying number sampling steps while for the PM method it mattered a lot. Nevertheless, in this experiments the CD and the PM performed on a par.
4.3.2 Supervised Evaluation
The average precision score (AvgPrec), the classification accuracu (ClassAcc) and the normalized mutual information score (NMI) for the three considered learning algorithms on 20Newsgroups
Method  AvgPrec  ClassAcc  NMI 

CD  0.497  0.799  0.485 
P&GEO  0.503  0.801  0.490 
P&CD  0.502  0.801  0.491 
Discussion Contrary to the unsupervised evaluation, we notice that the PM approach allows to obtain better values of all performance metrics than the CD method, see Table 6. Moreover, the P&GEO method slightly outperforms P&CD in terms of the average precision but achieves a slightly worse result in terms of the normalized mutual information. These results suggest that the PM method allows to train better discriminative representation while maintaining similar generative capabilities as the CD method.
In Fig. 6 we present a twodimensional visualization of the latent representations using tSNE. The CD method tend to learn a representation that in some regions seems to be hard to assign a single class label. There are consistent clusters of a single class, however, in many regions classes are completely mixed. The PM approach, on the other hand, tends to group objects of the same class into more coherent clusters.
We also notice that there is an interesting effect of aggregating many latent representations in very small regions, see Fig. 6. A possible explanation for this phenomenon is that multiple documents are expressed by the same or almost the same sparse binary vectors. Therefore, they start formulating extremely dense clusters of points.
5 Conclusion
The current work focuses on the lowdimensional perturbations in the Perturb Step. Although our results indicate that this approach is efficient, it is worth considering higherorder perturbations. Naturally, while computational complexity grows exponentially with the order of the perturbations, this future direction might lead to much better results. We are currently investigating other optimization methods to find the ground state of the perturbed energy that may fit to the considered framework for learning RBMs. In the near future we would like to utilize our approach to deep models, where the much more complex form of the energy function poses a greater challenge.
In this paper, we introduced a novel application of the PerturbandMAP approach in the context of learning parameters of RBM. Since theoretical considerations indicate that samples obtained within the lowdimensional PM framework are approximate samples from the RBM (see Theorem2), our method works by perturbing unary potentials (i.e., bias terms) and further finding configurations of visible variables that minimize the perturbed energy. For this purpose we proposed a greedy optimization technique to find MAP solutions of the perturbed energy. Using both toy datasets, real image benchmark datasets and a text analysis dataset, we showed empirically that our method is competitive to the wellknown Contrastive Divergence algorithm. Moreover, we indicated some advantages of the proposed greedy method over the coordinate descent.
Footnotes
 1.
We use the following notation: for given matrix \({\mathbf {A}}\), \({\mathbf {A}}_{ij}\) is its element at location (i, j), \({\mathbf {A}}_{\cdot j}\) denotes its \(j{\text {th}}\) column , \({\mathbf {A}}_{i\cdot }\) denotes its \(i{\text {th}}\) row, and for given vector \({\mathbf {a}}\), \({\mathbf {a}}_i\) is its \(i{\text {th}}\) element.
 2.
\(\sigma ({\mathbf {x}}) = \big [ \frac{1}{1+\mathrm {exp}(x_1)},\ldots ,\frac{1}{1+\mathrm {exp}(x_D)} \big ]^{\mathrm {T}}\) is the elementwise sigmoid function.
 3.
Here we use a general notation for variables and energy as it refers to visibles and hiddens, and the energy function Eq. 1 in RBM. Notice that in order to be consistent with the literature of the PerturbandMAP we use an energy without the minus sign to solve the maximization problem instead of the minimization problem.
 4.
This corresponds to using meanfield approximation for the hidden variables (rather than using samples) in the parameter update.
 5.
 6.
The dataset was binarized according to [27].
 7.
The dataset was binarized using a fixed threshold 0.55.
 8.
 9.
Notes
Acknowledgements
The work conducted by J.M. Tomczak was partially cofinanced within the Project: Young Staff 2015.
References
 1.Bertasius G, Liu Q, Torresani L, Shi J (2016) Local PerturbandMAP for structured prediction. arXiv preprint arXiv:1605.07686
 2.Desjardins G, Courville A, Bengio Y, Vincent P, Delalleau O (2010) Tempered markov chain monte carlo for training of restricted boltzmann machines. In: AISTATS, pp 145–152Google Scholar
 3.Gane A, Hazan T, Jaakkola T (2014) Learning with maximum aposteriori perturbation models. In: AISTATS, pp 247–256Google Scholar
 4.Gumbel E (1954) Statistical theory of extreme values and some practical applications: a series of lectures. Applied mathematics series. U. S. Govt. Print Office, Tinker Air Force BasezbMATHGoogle Scholar
 5.Hazan T, Jaakkola T (2012) On the partition function and random maximum aposteriori perturbations. In: ICMLGoogle Scholar
 6.Hazan T, Maji S, Jaakkola T (2013) On sampling from the gibbs distribution with random maximum aposteriori perturbations. In: NIPS, pp 1268–1276Google Scholar
 7.Hinton G (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800CrossRefGoogle Scholar
 8.Kappes JH, Swoboda P, Savchynskyy B, Hazan T, Schnörr C (2016) Multicuts and perturb & MAP for probabilistic graph clustering. J Math Imaging Vis 56:1–17MathSciNetCrossRefGoogle Scholar
 9.Kochenberger G, Hao JK, Glover F, Lewis M, Lü Z, Wang H, Wang Y (2014) The unconstrained binary quadratic programming problem: a survey. J Comb Optim 28(1):58–81MathSciNetCrossRefGoogle Scholar
 10.Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, CambridgezbMATHGoogle Scholar
 11.Kolmogorov V, Zabin R (2004) What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell 26(2):147–159CrossRefGoogle Scholar
 12.Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Humanlevel concept learning through probabilistic program induction. Science 350(6266):1332–1338MathSciNetCrossRefGoogle Scholar
 13.Lang K (1995) Newsweeder: learning to filter netnews. In: Machine learning proceedings, pp 331–339Google Scholar
 14.LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradientbased learning applied to document recognition. Proc IEEE 86(11):2278–2324CrossRefGoogle Scholar
 15.Li K, Swersky K, Zemel R (2013) Efficient feature learning using PerturbandMAP. In: NIPS 2013 workshop on perturbationsGoogle Scholar
 16.Lvd Maaten, Hinton G (2008) Visualizing data using tSNE. J Mach Learn Res 9:2579–2605zbMATHGoogle Scholar
 17.Maddison C, Tarlow D, Minka T (2014) A* Sampling. In: NIPS, pp 3086–3094Google Scholar
 18.Maji S, Hazan T, Jaakkola T (2014) Active boundary annotation using random map perturbations. In: AISTATS, pp 604–613Google Scholar
 19.Marlin B, Swersky K, Chen B, Freitas N.De (2010) Inductive principles for restricted boltzmann machine learning. In: AISTATS, pp 509–516Google Scholar
 20.Meier R, Knecht U, Jungo A, Wiest R, Reyes M (2017) PerturbandMPM: quantifying segmentation uncertainty in dense multilabel CRFs. arXiv preprint arXiv:1703.00312
 21.Merz P, Freisleben B (2002) Greedy and local search heuristics for unconstrained binary quadratic programming. J Heuristics 8(2):197–213CrossRefGoogle Scholar
 22.Nadarajah S (2008) Exact distribution of the linear combination of p Gumbel random variables. Int J Comput Math 85(9):1355–1362MathSciNetCrossRefGoogle Scholar
 23.Papandreou G, Yuille A (2011) Perturbandmap random fields: Using discrete optimization to learn and sample from energy models. In: ICCV, pp 193–200Google Scholar
 24.Papandreou G, Yuille A (2014) PerturbandMAP random fields: reducing random sampling to optimization, with applications in computer vision. In: Nowozin S, Gehler P, Jancsary J, Lampert C (eds) Advanced structured prediction. MIT Press, Cambridge, pp 159–185Google Scholar
 25.Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
 26.Ravanbakhsh S, Greiner R, Frey B (2014) Training restricted boltzmann machine by perturbation. arXiv preprint arXiv:1405.1436
 27.Salakhutdinov R, Murray I (2008) On the quantitative analysis of deep belief networks. In: ICML, pp 872–879Google Scholar
 28.Shimony S (1994) Finding MAPs for belief networks is NPhard. Artif Intell 68(2):399–410CrossRefGoogle Scholar
 29.Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. In: Rumelhart DE, McClelland JL (eds) Parallel Distributed processing: explorations in the microstructure of cognition, vol. 1, pp 194–281. MIT PressGoogle Scholar
 30.Tarlow D, Adams R, Zemel R (2012) Randomized optimum models for structured prediction. In: AISTATS, pp 1221–1229Google Scholar
 31.Taskar B, Guestrin C, Koller D (2003) Maxmargin Markov networks. Adv Neural Inf Process Syst 16:25Google Scholar
 32.Tieleman T (2008) Training restricted Boltzmann machines using approximations to the likelihood gradient. In: ICML, pp 1064–1071Google Scholar
 33.Tomczak J (2016) On some properties of the lowdimensional Gumbel perturbations in the PerturbandMAP model. Stat Probab Lett. https://doi.org/10.1016/j.spl.2016.03.019 MathSciNetCrossRefzbMATHGoogle Scholar
 34.Van der Maaten L (2009) A new benchmark dataset for handwritten character recognition. Tilburg University, Tilburg, pp 2–5Google Scholar
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.