
1 Introduction

Restricted Boltzmann machines (RBMs) are popular probabilistic graphical models for representing dependency structure between random variables [1]. It is well known that RBMs are energy-based models and powerful tools for representation learning. By modifying their energy functions, RBMs can be widely used in artificial intelligence and machine learning [2]. RBMs have been developed for real-valued data modelling [3], sequential data modelling [4], noisy data modelling [5], document modelling [6], multimodal learning [7], and other applications. RBMs are also the basic building blocks for creating deep belief networks (DBNs) [1] and deep Boltzmann machines (DBMs) [8]. Compared with RBMs, these two deep networks show better representation learning and classification abilities.

The general RBM and many RBM variants are only suitable for single-view data. In practice, however, much data comes from multiple views, where each view may be a feature vector or a domain. Therefore, many researchers focus on the multi-view learning task [9]. Recently, several efficient multi-view classification algorithms have been proposed, such as the multi-view Gaussian process with posterior consistency (MvGP) [10] and consensus and complementarity based maximum entropy discrimination (MED-2C) [11]. The MvGP and the MED-2C are posterior-consistency style and margin-consistency style algorithms, respectively. Both algorithms balance the relationship between the multi-view data and the model and achieve state-of-the-art classification accuracy. Although RBMs are powerful tools in machine learning, they have few applications in multi-view learning. Our work focuses on the consistency among view-specific hidden layers and balances the relationship between the multi-view data and the model for classification.

In this paper, we first propose a new RBM model, named the RBM with posterior consistency (PCRBM), for multi-view classification. The PCRBM models each view as a separate RBM. The weights of the original RBM are optimized by maximizing the log likelihood function. Unlike the general RBM, the PCRBM updates its weights by maximizing the log likelihood function on each view together with the consistency among the view-specific hidden layer conditional distributions. In addition, because original RBMs only deal with binary data, we extend PCRBMs via exponential family RBMs (Exp-RBMs) [12] and propose exponential family RBMs with posterior consistency (Exp-PCRBMs). In the Exp-PCRBM, the activation function of each visible or hidden unit can be any smooth monotonic non-linearity, such as the Gaussian or ReLU function.

The remainder of the paper is organized as follows. Section 2 details PCRBMs, including the inference and learning procedures. Section 3 extends PCRBMs to multi-view and real-valued data. In Sect. 4, experimental results demonstrate the feasibility of the proposed methods. Finally, conclusions and future work are given in the last section.

2 Restricted Boltzmann Machines with Posterior Consistency for Two-View Classification

2.1 Restricted Boltzmann Machines with Posterior Consistency for Two-View Data

It is well known that the general RBM is only suitable for single-view data. We propose a new RBM model to deal with two-view data and call it the RBM with posterior consistency (PCRBM). The PCRBM first uses a general RBM to model each view of the data; that is, the conditional probabilities of visible or hidden units in the PCRBM are the same as in the general RBM. Then, the PCRBM uses a stochastic approximation method to update the network weights by maximizing the log likelihood function on each view and the consistency between the conditional probabilities of the hidden units given the visible data on each view. In the PCRBM, the negative of the distance between the two conditional probabilities is used to measure their consistency. The PCRBM is also a generative model; it contains two layers of visible units \( {\mathbf{v}}^{1} = \{ v_{i}^{1} \}_{i = 1}^{D1} ,{\mathbf{v}}^{2} = \{ v_{i}^{2} \}_{i = 1}^{D2} \) and two layers of hidden units \( {\mathbf{h}}^{1} = \{ h_{j}^{1} \}_{j = 1}^{J} ,{\mathbf{h}}^{2} = \{ h_{j}^{2} \}_{j = 1}^{J} \) corresponding to the two views, with connection weights \( \theta = \{ {\mathbf{W}}^{1} ,{\mathbf{b}}^{1} ,{\mathbf{c}}^{1} ,{\mathbf{W}}^{2} ,{\mathbf{b}}^{2} ,{\mathbf{c}}^{2} \} \). The energy function of the PCRBM is composed of two general RBM energies, so the conditional probabilities on the two views are given by:

$$ P(h_{j}^{1} = 1|{\mathbf{v}}^{1} ) = \sigma \left( {\sum\nolimits_{i} {v_{i}^{1} W_{ij}^{1} } + b_{j}^{1} } \right),\,\,\,\,\,P(v_{i}^{1} = 1|{\mathbf{h}}^{1} ) = \sigma \left( {\sum\nolimits_{j} {W_{ij}^{1} h_{j}^{1} } + c_{i}^{1} } \right), $$
(1)
$$ P(h_{j}^{2} = 1|{\mathbf{v}}^{2} ) = \sigma \left( {\sum\nolimits_{i} {v_{i}^{2} W_{ij}^{2} } + b_{j}^{2} } \right), \, P(v_{i}^{2} = 1|{\mathbf{h}}^{2} ) = \sigma \left( {\sum\nolimits_{j} {W_{ij}^{2} h_{j}^{2} } + c_{i}^{2} } \right), $$
(2)

where \( \sigma (x) = {1 \mathord{\left/ {\vphantom {1 {\left( {1 + \exp ( - x)} \right)}}} \right. \kern-0pt} {\left( {1 + \exp ( - x)} \right)}} \).
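To make the notation concrete, the following is a minimal NumPy sketch of the conditional probabilities in Eqs. (1) and (2). All names (sigmoid, hidden_probs, the toy shapes, and so on) are illustrative assumptions; the paper's own experiments were implemented in MATLAB (Sect. 4).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(V, W, b):
    """P(h_j = 1 | v) for a batch of visible vectors V (N x D), as in Eqs. (1)-(2)."""
    return sigmoid(V @ W + b)           # N x J

def visible_probs(H, W, c):
    """P(v_i = 1 | h) for a batch of hidden vectors H (N x J), as in Eqs. (1)-(2)."""
    return sigmoid(H @ W.T + c)         # N x D

# Toy two-view example: D1 = 5, D2 = 7 visible units, J = 3 hidden units per view.
rng = np.random.default_rng(0)
V1 = rng.integers(0, 2, (10, 5)).astype(float)
V2 = rng.integers(0, 2, (10, 7)).astype(float)
W1, b1, c1 = rng.normal(0, 0.1, (5, 3)), np.zeros(3), np.zeros(5)
W2, b2, c2 = rng.normal(0, 0.1, (7, 3)), np.zeros(3), np.zeros(7)
H1, H2 = hidden_probs(V1, W1, b1), hidden_probs(V2, W2, b2)   # posteriors H^1, H^2
```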

Assume a two-view training sample set \( {\mathbf{X}}1 = \left\{ {{\mathbf{v}}^{1 \, (n)} } \right\}_{n = 1}^{N} ,{\mathbf{X}}2 = \left\{ {{\mathbf{v}}^{{ 2 { }(n)}} } \right\}_{n = 1}^{N} , \) \( {\mathbf{Y}} = \left\{ {{\mathbf{Y}}^{(n)} } \right\}_{n = 1}^{N} \), where \( {\mathbf{X}}1 \) and \( {\mathbf{X}}2 \) are the two-view data and \( {\mathbf{Y}} \) contains the corresponding labels. In order to maximize the consistency between the hidden layer conditional probabilities of the two views, the objective of the PCRBM can be expressed as:

$$ \begin{aligned} \mathop {\hbox{max} }\limits_{\theta } \quad & \sum\nolimits_{n} {\ln P\left( {{\mathbf{v}}^{1 \, (n)} ;\theta } \right)} \,\, + \sum\nolimits_{n} {\ln P\left( {{\mathbf{v}}^{{ 2 { }(n)}} ;\theta } \right)} \\ & +\, \lambda consistency\left( {\sum\nolimits_{n} {P\left( {{\mathbf{h}}^{{ 1 { }(n)}} |{\mathbf{v}}^{{ 1 { }(n)}} ;\theta } \right)} ,\sum\nolimits_{n} {P\left( {{\mathbf{h}}^{{ 2 { }(n)}} |{\mathbf{v}}^{{ 2 { }(n)}} ;\theta } \right)} } \right), \\ \end{aligned} $$
(3)

where \( \lambda \) is a parameter that balances the log likelihood terms against the consistency term. We can use a stochastic approximation algorithm and the derivative of the posterior consistency to maximize this objective function; the details are given in the next section. After pre-training, the PCRBM uses the labelled data and gradient descent to fine-tune the weights for classification. In the general RBM, the weights connecting visible units and hidden units are also fine-tuned. In the PCRBM, however, these weights encode the posterior consistency between the two views, so the conditional probabilities over hidden units given visible units should remain unchanged. Define \( {\mathbf{H}}^{{ 1 { }(n)}} = P\left( {{\mathbf{h}}^{{ 1 { }(n)}} |{\mathbf{v}}^{{ 1 { }(n)}} ;\theta } \right) \) and \( {\mathbf{H}}^{{ 2 { }(n)}} = P\left( {{\mathbf{h}}^{{ 2 { }(n)}} |{\mathbf{v}}^{{ 2 { }(n)}} ;\theta } \right) \) (\( {\mathbf{H}}^{1} ,{\mathbf{H}}^{2} \in \Re^{N \times J} \)); then the objective function of the classification model can be expressed as:

$$ \begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{{\theta^{\prime}}} } & {\frac{a}{2}\sum\nolimits_{n} {\left\| {{\mathbf{Y}}^{(n)} - P\left( {{\hat{\mathbf{Y}}}^{(n)} |{\mathbf{H}}^{1 \, (n)} ;\theta^{\prime}} \right)} \right\|^{2} } } \\ \end{array} + \frac{{\left( {1 - a} \right)}}{2}\sum\nolimits_{n} {\left\| {{\mathbf{Y}}^{(n)} - P\left( {{\hat{\mathbf{Y}}}^{(n)} |{\mathbf{H}}^{{ 2 { }(n)}} ;\theta^{\prime}} \right)} \right\|^{2} } , $$
(4)

where \( a \in \left[ {0,1} \right] \) is a parameter that balances the two views, and

$$ \begin{aligned} P\left( {\hat{Y}_{l}^{(n)} |{\mathbf{H}}^{1 \, (n)} ;\theta^{\prime}} \right) = {{\exp \left( {\sum\nolimits_{j} {H_{j}^{1 \, (n)} W2_{jl}^{1} } + b2_{l}^{1} } \right)} \mathord{\left/ {\vphantom {{\exp \left( {\sum\nolimits_{j} {H_{j}^{1 \, (n)} W2_{jl}^{1} } + b2_{l}^{1} } \right)} {\sum\nolimits_{l} {\exp \left( {\sum\nolimits_{j} {H_{j}^{1 \, (n)} W2_{jl}^{1} } + b2_{l}^{1} } \right)} }}} \right. \kern-0pt} {\sum\nolimits_{l} {\exp \left( {\sum\nolimits_{j} {H_{j}^{1 \, (n)} W2_{jl}^{1} } + b2_{l}^{1} } \right)} }}, \hfill \\ P\left( {\hat{Y}_{l}^{(n)} |{\mathbf{H}}^{{ 2 { }(n)}} ;\theta^{\prime}} \right) = {{\exp \left( {\sum\nolimits_{j} {H_{j}^{{ 2 { }(n)}} W2_{jl}^{2} } + b2_{l}^{2} } \right)} \mathord{\left/ {\vphantom {{\exp \left( {\sum\nolimits_{j} {H_{j}^{{ 2 { }(n)}} W2_{jl}^{2} } + b2_{l}^{2} } \right)} {\sum\nolimits_{l} {\exp \left( {\sum\nolimits_{j} {H_{j}^{{ 2 { }(n)}} W2_{jl}^{2} } + b2_{l}^{2} } \right)} }}} \right. \kern-0pt} {\sum\nolimits_{l} {\exp \left( {\sum\nolimits_{j} {H_{j}^{{ 2 { }(n)}} W2_{jl}^{2} } + b2_{l}^{2} } \right)} }}. \hfill \\ \end{aligned} $$
(5)

Therefore, we use gradient descent to fine-tune the weights connecting the hidden units and the label units [13]. The PCRBM is suitable not only for two-class but also for multi-class classification.
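As a rough illustration of this fine-tuning stage, the sketch below implements the softmax output of Eq. (5) and the gradient of one view's squared-error term in Eq. (4) with respect to the classification weights. It continues the toy example above; the helper names and the plain gradient-descent step are assumptions, not the authors' implementation.

```python
def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def head_loss_and_grads(H, Y, W2v, b2v, weight):
    """One view's term of Eq. (4): weight/2 * sum_n ||Y - P(Yhat|H)||^2,
    with P(Yhat|H) = softmax(H W2 + b2) as in Eq. (5)."""
    P = softmax(H @ W2v + b2v)                               # N x L
    loss = 0.5 * weight * np.sum((Y - P) ** 2)
    dP = -weight * (Y - P)                                   # dLoss/dP
    dZ = P * dP - P * np.sum(P * dP, axis=1, keepdims=True)  # softmax Jacobian
    return loss, H.T @ dZ, dZ.sum(axis=0)                    # loss, dW2, db2

# One gradient-descent step on the first view's head (a = 0.5, toy one-hot labels).
Y = np.eye(2)[rng.integers(0, 2, 10)]
W2_1, b2_1 = rng.normal(0, 0.1, (3, 2)), np.zeros(2)
loss, dW, db = head_loss_and_grads(H1, Y, W2_1, b2_1, weight=0.5)
W2_1 -= 0.1 * dW
b2_1 -= 0.1 * db
```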

2.2 Inference and Learning Procedure for Two-View Data

For each view, the gradient with respect to a weight can be divided into two parts: the gradient of the posterior consistency and the gradient of the log likelihood function. The consistency between \( {\mathbf{H}}^{1} \) and \( {\mathbf{H}}^{2} \) is defined as the negative of the distance between the two conditional probabilities:

$$ consistency\left( {{\mathbf{H}}^{1} ,{\mathbf{H}}^{2} } \right) = \frac{1}{N}\sum\nolimits_{n} {\left( { - \frac{1}{2}\frac{{\left\| {{\mathbf{H}}^{{ 1 { }(n)}} - {\mathbf{H}}^{{ 2 { }(n)}} } \right\|^{2} }}{{\left\| {{\mathbf{H}}^{{ 1 { }(n)}} } \right\|^{2} + \left\| {{\mathbf{H}}^{{ 2 { }(n)}} } \right\|^{2} }}} \right)} = \frac{1}{N}\sum\nolimits_{n} {\left( {\frac{{\sum\nolimits_{j} {\left( {{\mathbf{H}}^{{ 1 { }(n)}} \odot {\mathbf{H}}^{{ 2 { }(n)}} } \right)_{j} } }}{{\left\| {{\mathbf{H}}^{{ 1 { }(n)}} } \right\|^{2} + \left\| {{\mathbf{H}}^{{ 2 { }(n)}} } \right\|^{2} }}} \right)} - \frac{1}{2}, $$
(6)

where \( \odot \) denotes element-wise multiplication. We use the mean-field variational inference method to obtain \( {\mathbf{H}}^{{ 1 { }(n)}} = P\left( {{\mathbf{h}}^{{ 1 { }(n)}} |{\mathbf{v}}^{{ 1 { }(n)}} ;\theta } \right) \) and \( {\mathbf{H}}^{{ 2 { }(n)}} = P\left( {{\mathbf{h}}^{{ 2 { }(n)}} |{\mathbf{v}}^{{ 2 { }(n)}} ;\theta } \right) \). To compute the gradient of the consistency with respect to a weight, we compute the gradient of the consistency with respect to \( {\mathbf{H}}^{1} \) and \( {\mathbf{H}}^{2} \) and then use backpropagation. Taking the first view as an example, the gradient of the posterior consistency with respect to \( {\mathbf{W}}^{1} \) is given by:

$$ \Delta {\mathbf{W}}_{{^{consistency} }}^{1} = \frac{1}{N}{\mathbf{X}}1^{T} \left( {{\mathbf{H}}^{1} \odot \left( {1 - {\mathbf{H}}^{1} } \right) \odot \left( {\frac{{{\mathbf{H}}^{ 2} }}{{\left\| {{\mathbf{H}}^{ 1} } \right\|^{2} + \left\| {{\mathbf{H}}^{2} } \right\|^{2} }} - \frac{{2{\mathbf{H}}^{ 1} \odot \left( {{\mathbf{H}}^{ 1} \odot {\mathbf{H}}^{ 2} } \right)}}{{\left( {\left\| {{\mathbf{H}}^{ 1} } \right\|^{2} + \left\| {{\mathbf{H}}^{ 2} } \right\|^{2} } \right)^{2} }}} \right)} \right). $$
(7)
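The sketch below computes the consistency of Eq. (6) and its gradient with respect to \( {\mathbf{W}}^{1} \) by differentiating the per-sample ratio and back-propagating through the sigmoid, which is what Eq. (7) expresses in matrix form. It continues the toy example above; all function and variable names are assumptions.

```python
def consistency(Ha, Hb):
    """Eq. (6): negative normalized distance between two posterior matrices (N x J each)."""
    num = np.sum(Ha * Hb, axis=1)                                 # per-sample inner product
    den = np.sum(Ha ** 2, axis=1) + np.sum(Hb ** 2, axis=1)
    return np.mean(num / den) - 0.5

def consistency_grad_W(X, Ha, Hb):
    """Gradient of Eq. (6) w.r.t. the weights of the view providing Ha (cf. Eq. (7))."""
    N = X.shape[0]
    num = np.sum(Ha * Hb, axis=1, keepdims=True)
    den = np.sum(Ha ** 2, axis=1, keepdims=True) + np.sum(Hb ** 2, axis=1, keepdims=True)
    dHa = Hb / den - 2.0 * Ha * num / den ** 2                    # d consistency / d Ha
    return X.T @ (Ha * (1.0 - Ha) * dHa) / N                      # chain rule through sigmoid
```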

In addition, the gradient of the log likelihood function with respect to a weight can be simplified to the difference between a data-dependent statistic and a model-dependent statistic. Moreover, CD-k or other stochastic approximation algorithms provide an effective way to estimate the model-dependent statistic. The gradient of the log likelihood function with respect to \( {\mathbf{W}}^{1} \) can be given by:

$$ \Delta {\mathbf{W}}_{{^{{{\text{log-likelihood}}}}}}^{1} = {{\left({{\mathbf{E}}_{{P_{\text{data}} }} \left[{{\mathbf{X}}1^{T}{\mathbf{H}}^{1} } \right] - {\mathbf{E}}_{{P_{\text{model}} }}\left[ {{\mathbf{X}}1^{T} {\mathbf{H}}^{1} } \right]} \right)}\mathord{\left/ {\vphantom{{\left({{\mathbf{E}}_{{P_{\text{data}} }} \left[{{\mathbf{X}}1^{T}{\mathbf{H}}^{1} } \right] - {\mathbf{E}}_{{P_{\text{model}} }}\left[ {{\mathbf{X}}1^{T}{\mathbf{H}}^{1} } \right]} \right)} N}}\right.\kern-0pt} N}, $$
(8)

Thus, the gradient of the objective function with respect to \( {\mathbf{W}}^{1} \) can be given by:

$$ \begin{aligned} \Delta {\mathbf{W}}^{1} = \Delta {\mathbf{W}}_{{^{{{\text{log-likelihood}}}} }}^{1} + \lambda \Delta {\mathbf{W}}_{{^{consistency} }}^{1} = \frac{1}{N}\left( {{\mathbf{E}}_{{P_{\text{data}} }} \left[ {{\mathbf{X}}1^{T} {\mathbf{H}}^{1} } \right] - {\mathbf{E}}_{{P_{\text{model}} }} \left[ {{\mathbf{X}}1^{T} {\mathbf{H}}^{1} } \right]} \right) \\ + \frac{\lambda }{N}{\mathbf{X}}1^{T} \left( {{\mathbf{H}}^{1} \odot \left( {1 - {\mathbf{H}}^{1} } \right) \odot \left( {\frac{{{\mathbf{H}}^{ 2} }}{{\left\| {{\mathbf{H}}^{ 1} } \right\|^{2} + \left\| {{\mathbf{H}}^{2} } \right\|^{2} }} - \frac{{2{\mathbf{H}}^{ 1} \odot \left( {{\mathbf{H}}^{ 1} \odot {\mathbf{H}}^{ 2} } \right)}}{{\left( {\left\| {{\mathbf{H}}^{ 1} } \right\|^{2} + \left\| {{\mathbf{H}}^{ 2} } \right\|^{2} } \right)^{2} }}} \right)} \right). \\ \end{aligned} $$
(9)

Likewise, the gradients of the objective function with respect to the other weights can be computed in a similar way.
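Putting the two parts together, a hedged sketch of one pre-training update for view 1 is given below: a CD-1 estimate of the log likelihood gradient in Eq. (8) plus the weighted consistency gradient, as in Eq. (9). It reuses the helpers from the sketches above; the learning rate eta and the weight lam are assumed values, not the paper's settings.

```python
def cd1_loglik_grad_W(V, W, b, c, rng):
    """One-step contrastive divergence estimate of the log likelihood gradient, Eq. (8)."""
    Hpos = hidden_probs(V, W, b)
    Hsample = (rng.random(Hpos.shape) < Hpos).astype(float)   # sample hidden states
    Vneg = visible_probs(Hsample, W, c)                       # mean-field reconstruction
    Hneg = hidden_probs(Vneg, W, b)
    return (V.T @ Hpos - Vneg.T @ Hneg) / len(V)

eta, lam = 0.05, 0.1
H1, H2 = hidden_probs(V1, W1, b1), hidden_probs(V2, W2, b2)
W1 += eta * (cd1_loglik_grad_W(V1, W1, b1, c1, rng)           # Eq. (8)
             + lam * consistency_grad_W(V1, H1, H2))          # Eq. (9)
```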

3 Extensions of Restricted Boltzmann Machines with Posterior Consistency

3.1 Extensions for Multi-view Data

Taking two views as an example, we detailed the PCRBM model in the previous section. The PCRBM has two objective functions corresponding to a two-stage task: the objective that maximizes the log likelihood function and the posterior consistency, and the objective for classification. The PCRBM can be extended to multi-view data because each objective function can be expressed in an elegant formulation. In the first stage, the objective for multiple views can also be divided into two parts: the log likelihood function on each view and the posterior consistency among the views. The PCRBM still models each view of the data as a general RBM, so the conditional probabilities of hidden units given visible units are easily computed. Moreover, the posterior consistency between two conditional probabilities is computed as the negative of the distance between them. For a multi-view training set of N samples \( {\mathbf{X}}1 = \{ {\mathbf{v}}^{1 \, (n)} \}_{n = 1}^{N} , \cdots ,{\mathbf{X}}K = \{ {\mathbf{v}}^{K \, (n)} \}_{n = 1}^{N} , \) \( {\mathbf{Y}} = \left\{ {{\mathbf{Y}}^{(n)} } \right\}_{n = 1}^{N} \), the objective for maximizing the log likelihood function and the posterior consistency over multiple views can be expressed as:

$$ \begin{aligned} \mathop {\hbox{max} }\limits_{\theta } \quad & \sum\nolimits_{k} {\sum\nolimits_{n} {\ln P\left( {{\mathbf{v}}^{k \, (n)} ;\theta } \right)} } \\ & + \sum\nolimits_{i = 1}^{K} {\sum\nolimits_{j > i}^{K} {\sum\nolimits_{n} {\lambda_{ij} consistency\left( {P\left( {{\mathbf{h}}^{i \, (n)} |{\mathbf{v}}^{i \, (n)} ;\theta } \right),P\left( {{\mathbf{h}}^{j \, (n)} |{\mathbf{v}}^{j \, (n)} ;\theta } \right)} \right)} } } . \\ \end{aligned} $$
(10)

We find that the objective for the k-th view is

$$ \begin{aligned} \mathop {\hbox{max} }\limits_{\theta } \quad & \sum\nolimits_{n} {\ln P\left( {{\mathbf{v}}^{k \, (n)} ;\theta } \right)} \\ & + \sum\nolimits_{i \ne k}^{K} {\sum\nolimits_{n} {\lambda_{ik} consistency\left( {P\left( {{\mathbf{h}}^{i \, (n)} |{\mathbf{v}}^{i \, (n)} ;\theta } \right),P\left( {{\mathbf{h}}^{k \, (n)} |{\mathbf{v}}^{k \, (n)} ;\theta } \right)} \right)} } , \\ \end{aligned} $$
(11)

We can utilize the stochastic approximation algorithm and the derivative of the consistency term to maximize this objective for the k-th view. In the second stage, we use the labelled data and gradient descent to fine-tune the weights connecting the hidden units and the label units. The objective function for classification with multiple views can be expressed as:

$$ \begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{{\theta^{\prime}}} } & {\frac{1}{2}\sum\nolimits_{k} {a_{k} \sum\nolimits_{n} {\left\| {{\mathbf{Y}}^{(n)} - P\left( {{\hat{\mathbf{Y}}}^{(n)} |{\mathbf{H}}^{k \, (n)} ;\theta^{\prime}} \right)} \right\|^{2} } } } \\ \end{array} . $$
(12)
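For more than two views, the pairwise consistency term of Eq. (10) is simply a weighted sum over all view pairs. A brief sketch follows, reusing the consistency helper above; Hs and lambdas are assumed containers for the K posterior matrices and the pair weights \( \lambda_{ij} \).

```python
def multiview_consistency(Hs, lambdas):
    """Sum of lambda_ij * consistency(H^i, H^j) over all pairs i < j, as in Eq. (10)."""
    total = 0.0
    K = len(Hs)
    for i in range(K):
        for j in range(i + 1, K):
            total += lambdas[i][j] * consistency(Hs[i], Hs[j])
    return total

# The two-view case reduces to lambda_12 * consistency(H1, H2).
print(multiview_consistency([H1, H2], [[0.0, 0.1], [0.0, 0.0]]))
```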

3.2 Exponential Family Restricted Boltzmann Machines with Posterior Consistency for Real Data

Ravanbakhsh et al. [12] proposed exponential family RBMs (Exp-RBMs), in which each unit can use any smooth monotonic non-linearity as its activation function. Regardless of the activation function, each visible (hidden) unit receives an input \( \upsilon_{i} = \sum\nolimits_{j} {W_{ij} h_{j} } + c_{i} \) (\( \eta_{j} = \sum\nolimits_{i} {v_{i} W_{ij} } + b_{j} \)). Consider an Exp-RBM with variables \( \{ {\mathbf{v}},{\mathbf{h}}\} \); its energy function is defined as:

$$ E({\mathbf{v}},{\mathbf{h}};\theta ) = - \sum\limits_{i = 1}^{D} {\sum\limits_{j = 1}^{J} {v_{i} W_{ij} h_{j} } } - \sum\limits_{j = 1}^{J} {b_{j} h_{j} } - \sum\limits_{i = 1}^{D} {c_{i} v_{i} } + \sum\limits_{j = 1}^{J} {\left( {R^{*} (h_{j} ) + s(h_{j} )} \right)} + \sum\limits_{i = 1}^{D} {\left( {F^{*} (v_{i} ) + g(v_{i} )} \right)} , $$
(13)

where \( F^{*} \) and \( g \) are functions of \( v_{i} \), the derivative of \( F^{*} \) is \( f^{ - 1} \) (\( f^{ - 1} \) is the inverse function of \( f \) and the anti-derivative of \( f \) is \( F \)), and similarly \( R^{*} \) and \( s \) are functions of \( h_{j} \).

Like the general RBM, the proposed PCRBM is only suitable for binary data, whereas each unit of the Exp-RBM can use any smooth monotonic non-linearity as its activation function. Therefore, we propose the exponential family RBM with posterior consistency (Exp-PCRBM) for multi-view learning. The proposed Exp-PCRBM is suitable for both binary and real-valued data, and the activation function of each unit can be any smooth monotonic non-linearity, not just the sigmoid function. In this paper, we choose the sigmoid function as the activation function of each hidden unit in the Exp-PCRBM.

Assuming that all hidden units of the Exp-PCRBM are binary, the two objective functions of the Exp-PCRBM are solved in the same way as in the PCRBM. For each binary visible unit, the conditional probability is exactly the sigmoid function, and we have \( F\left( {\upsilon_{i} } \right) = \log \left( {1 + \exp (\upsilon_{i} )} \right) \), \( F^{*} \left( {v_{i} } \right) = \left( {1 - v_{i} } \right)\log \left( {1 - v_{i} } \right) + v_{i} \log \left( {v_{i} } \right) = 0 \), and \( g(v_{i} ) \) is a constant. Thus, if each visible unit is binary, the energy of the Exp-PCRBM is the same as that of the PCRBM. For each visible unit obeying a Gaussian conditional distribution, this distribution can be expressed as a Gaussian approximation \( \left( {f(\upsilon_{i} ),f^{\prime}(\upsilon_{i} )} \right) \), where \( f(\upsilon_{i} ) = \sigma_{i}^{2} \upsilon_{i} \) is the mean and \( f^{\prime}(\upsilon_{i} ) = \sigma_{i}^{2} \) is the variance. Then \( F\left( {\upsilon_{i} } \right) = {{\left( {\sigma_{i}^{2} \upsilon_{i}^{2} } \right)} \mathord{\left/ {\vphantom {{\left( {\sigma_{i}^{2} \upsilon_{i}^{2} } \right)} 2}} \right. \kern-0pt} 2} \), \( F^{*} \left( {v_{i} } \right) = {{v_{i}^{2} } \mathord{\left/ {\vphantom {{v_{i}^{2} } {\left( {2\sigma_{i}^{2} } \right)}}} \right. \kern-0pt} {\left( {2\sigma_{i}^{2} } \right)}} \), and \( g(v_{i} ) \) is a constant. Thus, if each visible unit obeys a Gaussian conditional distribution, the Exp-PCRBM is the same as the PCRBM except for the conditional distributions over the visible units. Therefore, in this paper, we choose the activation function of each visible unit in the Exp-PCRBM according to the input data of each view.
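The sketch below illustrates the two visible-unit types discussed above: binary units keep the sigmoid conditional of the PCRBM, while Gaussian units use the approximation with mean \( \sigma_{i}^{2} \upsilon_{i} \) and variance \( \sigma_{i}^{2} \). It reuses the sigmoid helper from the earlier sketch; the function name and the sigma2 argument are assumptions.

```python
def visible_conditional(H, W, c, unit_type="binary", sigma2=1.0):
    """Conditional over visible units given hidden activities H (N x J)."""
    ups = H @ W.T + c                      # input upsilon_i to each visible unit
    if unit_type == "binary":
        return sigmoid(ups), None          # Bernoulli mean, as in the PCRBM
    if unit_type == "gaussian":
        return sigma2 * ups, sigma2        # Gaussian mean and variance
    raise ValueError("unsupported unit type")
```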

4 Experiments

In order to test the performance of the algorithms, the proposed algorithms are compared with two state-of-the-art classification algorithms: the multi-view Gaussian process with posterior consistency (MvGP) and consensus and complementarity based maximum entropy discrimination (MED-2C). All algorithms are run on a workstation with an Intel Core i7 3.6 GHz processor and 18 GB RAM using MATLAB 2017a.

4.1 Learning Results on Two-Class Data Sets

Advertisement:

The Advertisement data set is binary and contains 3279 examples (459 ads and 2820 non-ads). The first view describes the image itself, while the other view contains all other features [11]. The dimensions of the two views are 587 and 967, respectively.

WDBC:

The WDBC data set contains 569 examples (357 benign and 212 malignant). The first view contains 10 features computed for each cell nucleus, while the other view contains the other 20 features, which are the mean and the standard error of the features in the first view.

Z-Alizadeh sani:

The Z-Alizadeh sani data set contains 303 examples (216 CAD and 87 normal). The first view contains the patients' demographic characteristics and symptoms, while the other view contains the results of physical examinations, electrocardiography, echocardiography, and laboratory tests. The dimensions of the two views are 31 and 24, respectively.

We use 5-fold cross-validation to evaluate the proposed methods on the two-class data sets, where three folds are used for training and the remaining two folds for testing. In addition, we further divide the training set into a training set and a validation set, where each fold is used as the validation set once (10-fold cross-validation). In the MvGP, the values of the parameters a and b are determined by cross-validation from \( \{ 0,0.1, \ldots ,1\} \) and \( \{ 2^{ - 18} ,2^{ - 12} ,2^{ - 8} ,2,2^{3} ,2^{8} \} \), respectively [10]. In the MED-2C, the value of the parameter c is determined by cross-validation from \( \{ 2^{ - 5} ,2^{ - 4} , \ldots ,2^{5} \} \) [11]. Accordingly, in the Exp-PCRBM, the values of the parameters a and \( \lambda \) are determined by cross-validation from \( \{ 0,0.1, \ldots ,1\} \) and \( \{ 2^{ - 18} ,2^{ - 12} ,2^{ - 8} ,2,2^{3} ,2^{8} \} \), respectively. In the Exp-PCRBM, the number of hidden units corresponding to each view is set to 100. We also run an Exp-RBM on each single view, where Exp-RBM1 and Exp-RBM2 correspond to the first and second view, respectively. Moreover, the Exp-RBM1, the Exp-RBM2, and the Exp-PCRBM use mini-batch learning, with 100 samples randomly selected in each iteration.
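For reference, the parameter grids above can be enumerated as follows; only the grid values come from the text, and the enumeration itself is an illustrative sketch rather than the authors' selection script.

```python
from itertools import product

a_grid = [i / 10 for i in range(11)]                     # {0, 0.1, ..., 1}
lam_grid = [2.0 ** p for p in (-18, -12, -8, 1, 3, 8)]   # {2^-18, 2^-12, 2^-8, 2, 2^3, 2^8}
candidate_settings = list(product(a_grid, lam_grid))     # 66 (a, lambda) pairs to validate
```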

The average accuracies and standard deviations of all the algorithms are given in Table 1. The Exp-PCRBM outperforms the other algorithms on all the data sets. From Table 1, we can also find that: (1) the Exp-PCRBM outperforms the MvGP and the MED-2C on all the data sets, which demonstrates the effectiveness of the Exp-PCRBM; (2) the MvGP performs worst on all the data sets, because its point selection scheme is not used here; this scheme could also be applied to the other algorithms; (3) the Exp-PCRBM outperforms the Exp-RBM1 and the Exp-RBM2 on all the data sets, which demonstrates that the representations from the two views are effectively exploited for classification in the Exp-PCRBM. We conclude that the Exp-PCRBM is an effective classification method for multi-view two-class data sets.

Table 1. Performance comparison of proposed algorithms on two-class data sets

4.2 Results and Evaluation

The multi-class data sets used in this paper are two UCI data sets: Dermatology and ForestTypes.

Dermatology:

The Dermatology data set contains 358 examples (111 psoriasis, 60 seboreic dermatitis, 71 lichen planus, 48 pityriasis rosea, 48 chronic dermatitis, and 20 pityriasis rubra pilaris). The first view describes clinical features, while the other view contains histopathological features. The dimensions of the two views are 12 and 22, respectively.

ForestTypes:

The ForestTypes data set contains 523 examples (195 Sugi, 83 Hinoki, 159 Mixed deciduous, and 86 Other). The first view describes ASTER image bands, while the other view contains all other features. The dimensions of the two views are 9 and 18, respectively.

We also use 5-fold cross-validation to evaluate the proposed methods on the multi-class data sets. Like one-versus-rest support vector machines (OvR SVMs) [14], we extend the MvGP and the MED-2C to deal with multi-class data and name them the one-versus-rest MvGP (OvR MvGP) and the one-versus-rest MED-2C (OvR MED-2C). The parameters of the OvR MvGP, the OvR MED-2C, the Exp-RBM1, the Exp-RBM2, and the Exp-PCRBM are also determined by cross-validation from the same ranges as above.

Table 2 shows the average accuracies and standard deviations of all the algorithms on the multi-class data sets. The Exp-PCRBM outperforms the other algorithms on all the data sets. From Table 2, we can also find that: (1) the Exp-PCRBM outperforms the OvR MvGP and the OvR MED-2C on all the data sets, which demonstrates the effectiveness of the Exp-PCRBM on multi-class data sets; (2) the Exp-PCRBM also outperforms the Exp-RBM1 and the Exp-RBM2 on all the data sets, which demonstrates that the representations from the two views are effectively exploited for classification in the Exp-PCRBM. We conclude that the Exp-PCRBM is an effective classification method for multi-view multi-class data sets.

Table 2. Performance comparison of proposed algorithms on multi-class data sets

5 Conclusions

Restricted Boltzmann machines (RBMs) are effective probabilistic graphical models for representation learning. On this basis, this paper extends RBMs to multi-view learning and names the resulting models RBMs with posterior consistency (PCRBMs). PCRBMs use the negative of the distance between two conditional probabilities to measure the posterior consistency between two views and maximize this posterior consistency. This paper then proposes exponential family RBMs with posterior consistency (Exp-PCRBMs), which are suitable for both binary and real-valued data. In addition, the activation functions of Exp-PCRBMs can be any smooth monotonic non-linearity. Finally, experimental results show that the Exp-PCRBM is an effective multi-view classification method for two-class and multi-class data.