Fig. 1.

Example of the multiview distribution hierarchy with 3 views. For each view \(v\in \{1,2,3\}\), we have a set of voters \(\mathcal{H}_v=\{h_1^v,\ldots ,h_{n_v}^v\}\) over which we consider a view-specific prior distribution \(P_v\) (in blue), and we consider a hyper-prior distribution \(\pi \) (in green) over the set of 3 views. The objective is to learn view-specific posterior distributions \(Q_v\) (in red) and a hyper-posterior distribution \(\rho \) (in orange) leading to a good model. The length of a rectangle represents the weight (or probability) assigned to a voter or a view. (Color figure online)

1 Introduction

With the ever-increasing number of observations produced by more than one source, multiview learning has been expanding over the past decade, spurred by the seminal work of Blum and Mitchell [4] on co-training. Most existing methods try to combine multimodal information, either by directly merging the views or by combining models learned from the different viewsFootnote 1 [28], in order to produce a model that is more reliable for the considered task. Our goal is to propose a theoretically grounded criterion to “correctly” combine the views. With this in mind, we propose to study multiview learning through the PAC-Bayesian framework (introduced in [21]), which allows one to derive generalization bounds for models expressed as a combination over a set of voters. When faced with learning from one view, PAC-Bayesian theory assumes a prior distribution over the voters involved in the combination, and aims at learning—from the learning sample—a posterior distribution that leads to a well-performing combination expressed as a weighted majority vote. In this paper we extend the PAC-Bayesian theory to multiview learning with more than two views. Concretely, given a set of view-specific classifiers, we define a hierarchy of posterior and prior distributions over the views, such that (i) for each view v, we consider prior \(P_v\) and posterior \(Q_v\) distributions over each view-specific set of voters, and (ii) a prior \(\pi \) and a posterior \(\rho \) distribution over the set of views (see Fig. 1), respectively called the hyper-prior and the hyper-posteriorFootnote 2. In this way, our approach encompasses that of Amini et al. [1], which considered a uniform distribution to combine the view-specific classifiers’ predictions. Moreover, compared with the PAC-Bayesian work of Sun et al. [29], we are interested here in the more general and natural case of multiview learning with more than two views.
Note also that Lecué and Rigollet [18] proposed a non-PAC-Bayesian theoretical analysis of a combination of voters (called Q-Aggregation) that is able to take into account a prior and a posterior distribution but in a single-view setting.

Our theoretical study also includes a notion of disagreement between all the voters, allowing us to take into account a notion of diversity between them, which is known to be a key element in multiview learning [1, 6, 13, 20]. Finally, we empirically evaluate a two-level learning approach on the Reuters RCV1/RCV2 corpus to show that our analysis is sound.

In the next section, we recall the general PAC-Bayesian setup, and present PAC-Bayesian expectation bounds—while most of the usual PAC-Bayesian bounds are probabilistic bounds. In Sect. 3, we then discuss the problem of multiview learning, adapting the PAC-Bayesian expectation bounds to the specificity of the two-level multiview approach. In Sect. 4, we discuss the relation between our analysis and previous works. Before concluding in Sect. 6, we present experimental results obtained on a collection of the Reuters RCV1/RCV2 corpus in Sect. 5.

2 The Single-View PAC-Bayesian Theorem

In this section, we state a new general mono-view PAC-Bayesian theorem, inspired by the work of Germain et al. [10], that we extend to multiview learning in Sect. 3.

2.1 Notations and Setting

We consider binary classification tasks on data drawn from a fixed yet unknown distribution \(\mathcal{D}\) over \(\mathcal{X}\times \mathcal{Y}\), where \(\mathcal {X} \subseteq \mathbb {R}^d\) is a d-dimensional input space and \(\mathcal {Y} = \{-1,+1\}\) the label/output set. A learning algorithm is provided with a training sample of m examples denoted by \(S=\{ (x_i,y_i ) \}_{i=1}^{m} \in (\mathcal{X}\times \mathcal{Y})^m\), which is assumed to be independently and identically distributed (i.i.d.) according to \(\mathcal{D}\). The notation \(\mathcal{D}^m\) stands for the distribution of such an m-sample, and \(\mathcal{D}_\mathcal{X}\) for the marginal distribution on \(\mathcal{X}\). We consider a set \(\mathcal{H}\) of classifiers or voters such that \(\forall h\in \mathcal{H},\ h:\mathcal{X}\rightarrow \mathcal{Y}\). In addition, the PAC-Bayesian approach requires a prior distribution \(P\) over \(\mathcal{H}\) that models an a priori belief on the voters of \(\mathcal{H}\) before the observation of the learning sample S. Given \(S\sim \mathcal{D}^m\), the learner aims at finding a posterior distribution \(Q\) over \(\mathcal{H}\) leading to an accurate \(Q\)-weighted majority vote \(B_Q(x)\) defined as

$$\begin{aligned} B_Q(x) = {\text {sign}}\left[ \mathop {\mathbb {E}}\limits _{ h \sim Q} h(x)\right] . \end{aligned}$$

In other words, one wants to learn \(Q\) over \(\mathcal{H}\) such that it minimizes the true risk \(R_{\mathcal{D}}(B_{Q})\) of \(B_Q(x)\):

$$\begin{aligned} R_{\mathcal {D}}(B_{Q}) = \mathop {\mathbb {E}}\limits _{(x,y) \sim \mathcal {D}} \mathbbm {1}_{[B_{Q}(x) \ne y]}, \end{aligned}$$

where \(\mathbbm {1}_{[\pi ]} =1\) if predicate \(\pi \) holds, and 0 otherwise. However, a PAC-Bayesian generalization bound does not directly focus on the risk of the deterministic \(Q\)-weighted majority vote \(B_{Q}\). Instead, it upper-bounds the risk of the stochastic Gibbs classifier \(G_{Q}\), which predicts the label of an example x by first drawing a voter h from \(\mathcal{H}\) according to the posterior distribution \(Q\), and then returning h(x). The true risk \(R_{\mathcal{D}}(G_Q)\) of the Gibbs classifier on a data distribution \(\mathcal{D}\), and its empirical risk \(R_{S}(G_Q)\) estimated on a sample \(S \sim \mathcal{D}^m\), are respectively given by

$$\begin{aligned} R_{\mathcal{D}}(G_Q) \ =\ \mathop {\mathbb {E}}\limits _{(x,y) \sim \mathcal {D}} \mathop {\mathbb {E}}\limits _{h \sim Q} \mathbbm {1}_{[h(x) \ne y]}\,,\,\text {and}\,\, R_{S}(G_Q) \ =\ \frac{1}{m} \sum _{i=1}^m \mathop {\mathbb {E}}\limits _{h \sim Q} \mathbbm {1}_{[h(x_i) \ne y_i]}\,. \end{aligned}$$

The above Gibbs classifier is closely related to the \(Q\)-weighted majority vote \(B_{Q}\). Indeed, if \(B_{Q}\) misclassifies \(x \in \mathcal{X}\), then at least half of the classifiers (under measure \(Q\)) make an error on x. Therefore, we have

$$\begin{aligned} R_{\mathcal {D}}(B_Q) \le 2R_{\mathcal {D}}(G_Q). \end{aligned}$$
(1)

Thus, an upper bound on \(R_\mathcal{D}(G_Q)\) gives rise to an upper bound on \(R_\mathcal{D}(B_Q)\). Other tighter relations exist [10, 14, 16], such as the so-called C-Bound [14], which involves the expected disagreement \(d_{\mathcal{D}}(Q)\) between all pairs of voters, and which can be expressed as follows (when \(R_{\mathcal{D}}(G_Q)\le \frac{1}{2}\)):

$$\begin{aligned} R_{\mathcal {D}}(B_{Q}) \le 1 - \frac{\displaystyle \left( 1-2R_{\mathcal{D}}(G_Q)\right) ^2}{\displaystyle 1-2 d_{\mathcal{D}}(Q)}\,,\text {where}\ d_{\mathcal{D}}(Q)= \mathop {\mathbb {E}}\limits _{x \sim \mathcal{D}_\mathcal{X}}\, \mathop {\mathbb {E}}\limits _{(h,h') \sim Q^2} \mathbbm {1}_{[h(x) {\ne } h'(x)]}\,. \end{aligned}$$
(2)
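As an aside that is not part of the original analysis, the factor-2 relation of Eq. (1) and the C-bound of Eq. (2) can be checked numerically by treating a toy sample as the data distribution itself; the sample size, the number of voters, the Dirichlet posterior, and the 0.3 label-noise rate below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 5                                 # examples, voters (arbitrary)

# Toy labels; each voter flips the true label with probability 0.3,
# so the Gibbs risk stays below 1/2 and the C-bound applies.
y = rng.choice([-1, 1], size=m)
votes = y[:, None] * rng.choice([-1, 1], size=(m, n), p=[0.3, 0.7])
Q = rng.dirichlet(np.ones(n))                 # posterior weights over voters

margin = votes @ Q                            # E_{h~Q} h(x_i) for each example
errors = (votes != y[:, None]).astype(float)  # 1[h_j(x_i) != y_i]

risk_bq = np.mean(np.sign(margin) != y)       # risk of the majority vote B_Q
risk_gq = np.mean(errors @ Q)                 # risk of the Gibbs classifier G_Q
disag = np.mean((1.0 - margin**2) / 2.0)      # expected disagreement d(Q)

# Eq. (1): R(B_Q) <= 2 R(G_Q); this actually holds example by example.
assert risk_bq <= 2.0 * risk_gq + 1e-12

# Eq. (2): the C-bound, valid here since R(G_Q) < 1/2.
c_bound = 1.0 - (1.0 - 2.0 * risk_gq) ** 2 / (1.0 - 2.0 * disag)
assert risk_bq <= c_bound + 1e-12
```

Since Eq. (1) holds example by example and the C-bound proof relies only on the Cantelli-Chebyshev inequality, both inequalities hold exactly on the empirical distribution whenever \(R_S(G_Q)\le \frac{1}{2}\).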

Moreover, Germain et al. [10] have shown that the Gibbs classifier’s risk can be rewritten in terms of \(d_{\mathcal{D}}(Q)\) and of the expected joint error \(e_{\mathcal{D}}(Q)\) between all pairs of voters as

$$\begin{aligned} R_{\mathcal {D}}(G_Q) \ =\ \frac{1}{2} d_{\mathcal{D}}(Q)+e_{\mathcal{D}}(Q),\\ \nonumber \text {where}\ e_{\mathcal{D}}(Q)\ =\ \mathop {\mathbb {E}}\limits _{(x,y) \sim \mathcal {D}}\ \mathop {\mathbb {E}}\limits _{(h,h') \sim Q^2}\ \mathbbm {1}_{[h(x) {\ne } y]}\, \mathbbm {1}_{[h'(x){\ne } y]}\,. \end{aligned}$$
(3)
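Because a pair \((h,h')\sim Q^2\) is drawn i.i.d., the per-example disagreement equals \(2p(1-p)\) and the per-example joint error equals \(p^2\), where p denotes the per-example Gibbs error; the identity of Eq. (3) then reduces to \(p(1-p)+p^2=p\). A minimal numerical check (toy data, arbitrary seed and weights):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 300, 4                                  # examples, voters (arbitrary)
y = rng.choice([-1, 1], size=m)
votes = rng.choice([-1, 1], size=(m, n))
Q = rng.dirichlet(np.ones(n))                  # posterior weights over voters

errors = (votes != y[:, None]).astype(float)   # 1[h_j(x_i) != y_i]
p = errors @ Q                                 # per-example Gibbs error E_{h~Q} 1[h(x)!=y]

risk_gq = np.mean(p)
disag = np.mean(2.0 * p * (1.0 - p))   # a pair disagrees iff exactly one voter errs
joint = np.mean(p**2)                  # both voters of an i.i.d. pair err

# Eq. (3): R(G_Q) = d(Q)/2 + e(Q), an exact identity.
assert abs(risk_gq - (0.5 * disag + joint)) < 1e-12
```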

It is worth noting that, from a multiview learning standpoint where the notion of diversity among voters is known to be important [1, 2, 13, 20, 29], Eqs. (2) and (3) directly capture the trade-off between diversity and accuracy. Indeed, \(d_{\mathcal{D}}(Q)\) involves the diversity between voters [23], while \(e_{\mathcal{D}}(Q)\) takes into account their errors. Note that the principle of controlling the trade-off between diversity and accuracy through the C-bound of Eq. (2) has been exploited by Laviolette et al. [17] and Roy et al. [26] to derive well-performing PAC-Bayesian algorithms that aim at minimizing it. For our experiments in Sect. 5, we make use of CqBoost [26]—one of these algorithms—for multiview learning.

Last but not least, PAC-Bayesian generalization bounds take into account the given prior distribution \(P\) on \(\mathcal{H}\) through the Kullback-Leibler divergence between the learned posterior distribution \(Q\) and \(P\):

$$\begin{aligned} {\text {KL}}(Q\Vert P)\ =\ \mathop {\mathbb {E}}\limits _{h \sim Q} \ln \frac{Q(h)}{P(h)}\,. \end{aligned}$$
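For a finite set of voters, this divergence is a simple sum; a minimal sketch (the prior, the posterior, and the voter count are arbitrary illustrations):

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions over the same finite voter set."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                       # convention: 0 * ln(0/p) = 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

P = np.ones(4) / 4                     # uniform prior over four voters
Q = np.array([0.7, 0.1, 0.1, 0.1])     # posterior concentrated on one voter

assert abs(kl(P, P)) < 1e-12           # KL vanishes iff the distributions match
assert kl(Q, P) > 0.0                  # deviating from the prior has a KL cost
```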

2.2 A New PAC-Bayesian Theorem as an Expected Risk Bound

In the following we introduce a new variation of the general PAC-Bayesian theorem of Germain et al. [9, 10]; it takes the form of an upper bound on the “deviation” between the true risk \( R_{\mathcal{D}}(G_Q)\) and the empirical risk \( R_{S}(G_Q) \) of the Gibbs classifier, according to a convex function \(D {:} [0, 1] \times [0, 1]{\rightarrow }\mathbb {R}\). While most PAC-Bayesian bounds are probabilistic bounds, we state here an expected risk bound. More specifically, Theorem 1 below is a tool to upper-bound \(\mathbb E_{{S\sim \mathcal{D}^m}} R_{\mathcal{D}}(G_{Q_S})\)—where \({Q_S}\) is the posterior distribution output by a given learning algorithm after observing the learning sample S—while PAC-Bayesian bounds usually hold for \(R_{\mathcal{D}}(G_Q)\) uniformly for all distributions \(Q\), with high probability over the draw of \(S \sim \mathcal{D}^m\). Since posterior distributions are data dependent by definition, this different point of view on PAC-Bayesian analysis has the advantage of involving an expectation over all possible learning samples (of a given size) in the bound itself.

Theorem 1

For any distribution \(\mathcal {D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of voters \(\mathcal{H}\), for any prior distribution \(P\) on \(\mathcal {H}\), for any convex function \(D: [0, 1] \times [0, 1] \rightarrow \mathbb {R}\), we have

$$\begin{aligned} m\, D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} R_{S}(G_{Q_S})\, ,\, \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m}R_{\mathcal{D}}(G_{Q_S}) \right) \ \le \ \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} {\text {KL}}({Q_S}\Vert P) + \ln \bigg (\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{h \sim P} e^{m\,D\left( R_S(h), R_\mathcal{D}(h) \right) } \bigg ), \end{aligned}$$

where \(R_\mathcal{D}(h)\) and \(R_S(h)\) are respectively the true and the empirical risks of individual voters.

Similarly to Germain et al. [9, 10], by selecting a well-suited deviation function D and by upper-bounding \(\mathbb E_{S}\, \mathbb E_{h} e^{m\,D(R_S(h),R_\mathcal{D}(h))}\), we can prove the expected bound counterparts of the classical PAC-Bayesian theorems of Catoni [5], McAllester [21], and Seeger [27]. The proof presented below borrows the straightforward proof technique of Bégin et al. [3]. Interestingly, this approach highlights that the expectation bounds are obtained simply by replacing the Markov inequality with the Jensen inequality (respectively Theorems 5 and 6, in Appendix).

Proof of Theorem 1

The last three inequalities below are obtained by applying Jensen’s inequality on the convex function D, the change of measure inequality [as stated by [3], Lemma 3], and Jensen’s inequality on the concave function \(\ln \).

$$\begin{aligned} m D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} R_{S}(G_{Q_S}) , \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m}R_{\mathcal{D}}(G_{Q_S}) \right) \qquad \qquad \qquad \qquad \\ = m D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{h\sim {Q_S}} R_{S}(h) , \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m} \mathop {\mathbb {E}}\limits _{h\sim {Q_S}}R_{\mathcal{D}}(h) \right) \qquad \qquad \qquad \\ \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{h\sim {Q_S}}m D \left( R_{S}(h) , R_{\mathcal{D}}(h) \right) \qquad \qquad \qquad \qquad \qquad \qquad \\ \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}}\left[ {\text {KL}}({Q_S}\Vert P) + \ln \bigg (\mathop {\mathbb {E}}\limits _{h \sim P} e^{m\,D\left( R_S(h), R_\mathcal{D}(h) \right) } \bigg ) \right] \qquad \quad \,\,\,\,\\ \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} {\text {KL}}({Q_S}\Vert P) + \ln \bigg (\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{h \sim P} e^{m\,D\left( R_S(h), R_\mathcal{D}(h) \right) } \bigg ).\qquad \end{aligned}$$

   \(\square \)

Since the C-bound of Eq. (2) involves the expected disagreement \(d_{\mathcal{D}}(Q)\), we also derive below an expected bound that upper-bounds the deviation between \(\mathbb {E}_{{S\sim \mathcal{D}^m}} d_S(Q_S)\) and \(\mathbb {E}_{S\sim \mathcal{D}^m}d_\mathcal{D}(Q_S)\) under a convex function D. Theorem 2 can be seen as the expectation version of the probabilistic bounds over \(d_S(Q_S)\) proposed by Germain et al. [10] and Lacasse et al. [14].

Theorem 2

For any distribution \(\mathcal {D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of voters \(\mathcal {H}\), for any prior distribution \(P\) on \(\mathcal {H}\), for any convex function \(D : [0, 1] \times [0, 1] \rightarrow \mathbb {R}\), we have

$$\begin{aligned} m\, D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} d_{S}(Q_S)\, ,\, \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m} d_{\mathcal{D}}(Q_S) \right) \ \le \ 2\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} {\text {KL}}({Q_S}\Vert P) + \ln \bigg (\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{(h,h') \sim P^2} e^{m\,D\left( d_S(h,h'), d_\mathcal{D}(h,h') \right) } \bigg ), \end{aligned}$$

where \(d_\mathcal{D}(h,h') = \mathbb {E}_{x \sim \mathcal{D}_\mathcal{X}}\, \mathbbm {1}_{[h(x) {\ne } h'(x)]}\) is the disagreement of voters h and \(h'\) on the distribution \(\mathcal{D}\), and \(d_S(h,h')\) is its empirical counterpart.

Proof

First, we apply the exact same steps as in the proof of Theorem 1:

$$\begin{aligned}&m D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} d_S(Q_S) , \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m}d_\mathcal{D}(Q_S) \right) \\ =&m D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}}\ \mathop {\mathbb {E}}\limits _{(h,h')\sim {Q_S^2}}\ d_S(h,h'), \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m}\ \mathop {\mathbb {E}}\limits _{(h,h')\sim {Q_S^2}}\ d_\mathcal{D}(h,h') \right) \\ \quad \vdots \\ \le&\ \ \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m} {\text {KL}}({Q_S^2}\Vert P^2) + \ln \mathop {\mathbb {E}}\limits _{S \sim \mathcal{D}^m}\mathop {\mathbb {E}}\limits _{(h,h') \sim P^2} e^{mD\left( d_S(h,h'), d_\mathcal{D}(h,h') \right) }. \end{aligned}$$

Then, we use the fact that \({\text {KL}}({Q_S^2}\Vert P^2) = 2{\text {KL}}({Q_S}\Vert P) \) [see [10], Theorem 25].    \(\square \)
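The identity \({\text {KL}}(Q^2\Vert P^2) = 2{\text {KL}}(Q\Vert P)\) used in the last step follows from the product structure of the pair distributions, and can be checked directly on a small arbitrary example:

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions on the same finite support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

P = np.array([0.25, 0.25, 0.25, 0.25])   # arbitrary prior over four voters
Q = np.array([0.55, 0.25, 0.15, 0.05])   # arbitrary posterior

# Product distributions over pairs of voters: Q2(h, h') = Q(h) Q(h').
Q2 = np.outer(Q, Q).ravel()
P2 = np.outer(P, P).ravel()

assert abs(kl(Q2, P2) - 2.0 * kl(Q, P)) < 1e-12
```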

In the following we provide an extension of this PAC-Bayesian framework to multiview learning with more than two views.

3 Multiview PAC-Bayesian Approach

3.1 Notations and Setting

We consider binary classification problems where the multiview observations \(\mathbf{x}= (x^1,\ldots ,x^V)\) belong to a multiview input set \(\mathcal{X} = \mathcal {X}_1\times \ldots \times \mathcal {X}_V\), where \(V\ge 2\) is the number of views; the view spaces do not necessarily have the same dimension. We denote by \(\mathcal{V}\) the set of the V views. In binary classification, we assume that examples are pairs \((\mathbf{x}, y)\), with \(y\in \mathcal{Y} = \{-1,+1\}\), drawn according to an unknown distribution \(\mathcal{D}\) over \(\mathcal{X}\times \mathcal {Y}\). To model the two-level multiview approach, we consider the following setting. For each view \(v \in \mathcal{V}\), we consider a view-specific set \(\mathcal{H}_v\) of voters \(h: \mathcal {X}_v \rightarrow \mathcal{Y}\), and a prior distribution \(P_v\) on \(\mathcal{H}_v\). Given a hyper-prior distribution \(\pi \) over the views \(\mathcal{V}\), and a multiview learning sample \(S = \{(\mathbf{x}_i,y_i)\}_{i=1}^m{\sim }(\mathcal{D})^m\), the objective of our PAC-Bayesian learner is twofold: (i) finding a posterior distribution \(Q_v\) over \(\mathcal {H}_v\) for each view \(v \in \mathcal{V}\); (ii) finding a hyper-posterior distribution \(\rho \) on the set of views \(\mathcal{V}\). This hierarchy of distributions is illustrated by Fig. 1. The learned distributions define a multiview weighted majority vote \(B_{\rho }^{\textsc {mv}}\) given by

$$B_{\rho }^{\textsc {mv}}(\mathbf {x}) = {\text {sign}}\left[ \mathop {\mathbb {E}}\limits _{ v \sim \rho }\ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} h(x^v) \right] .$$

Thus, the learner aims at constructing the posterior and hyper-posterior distributions that minimize the true risk \(R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}})\) of the multiview weighted majority vote:

$$ R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}}) = \mathop {\mathbb {E}}\limits _{(\mathbf {x},y) \sim \mathcal {D}} \mathbbm {1}_{[B_{\rho }^{\textsc {mv}}(\mathbf {x}) \ne y]}. $$

As pointed out in Sect. 2, the PAC-Bayesian approach deals with the risk of the stochastic Gibbs classifier \(G_{\rho }^{\textsc {mv}}\), defined as follows in our multiview setting, and which can be rewritten in terms of the expected disagreement \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) and the expected joint error \(e_{\mathcal{D}}^{\textsc {mv}}(\rho )\):

$$\begin{aligned} R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})&= \mathop {\mathbb {E}}\limits _{(\mathbf{x},y) \sim \mathcal{D}} \ \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathbbm {1}_{[h(x^v) \ne y]} \nonumber \\&= \ \tfrac{1}{2}\, d_{\mathcal{D}}^{\textsc {mv}}(\rho )+ e_{\mathcal{D}}^{\textsc {mv}}(\rho ), \\ \text {where}\ \nonumber d_{\mathcal{D}}^{\textsc {mv}}(\rho )&= \mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}} \mathop {\mathbb {E}}\limits _ {v \sim \rho } \mathop {\mathbb {E}}\limits _{v' \sim \rho } \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathop {\mathbb {E}}\limits _{h' \sim Q_{v'}} \mathbbm {1}_{[ h(x^v) {\ne } h'(x^{v'})]}, \\ \text {and}\ \nonumber e_{\mathcal{D}}^{\textsc {mv}}(\rho )&= \mathop {\mathbb {E}}\limits _{(\mathbf{x},y) \sim \mathcal {D}} \mathop {\mathbb {E}}\limits _ {v \sim \rho } \mathop {\mathbb {E}}\limits _{v' \sim \rho } \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathop {\mathbb {E}}\limits _{h' \sim Q_{v'}} \mathbbm {1}_{[ h(x^v) {\ne } y ]} \mathbbm {1}_{[ h'(x^{v'}) {\ne } y ]}. \end{aligned}$$
(4)

Obviously, the empirical counterpart of the Gibbs classifier’s risk \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) is

$$\begin{aligned} R_{S}(G_{\rho }^{\textsc {mv}})&= \frac{1}{m} \sum _{i=1}^m \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \mathbbm {1}_{[h(x_i^v) \ne y_i]}\\&= \frac{1}{2}d_{S}^{\textsc {mv}}(\rho )+ e_{S}^{\textsc {mv}}(\rho )\, \end{aligned}$$

where \(d_{S}^{\textsc {mv}}(\rho )\) and \(e_{S}^{\textsc {mv}}(\rho )\) are respectively the empirical estimations of \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) and \(e_{\mathcal{D}}^{\textsc {mv}}(\rho )\) on the learning sample S. As in the single-view PAC-Bayesian setting, the multiview weighted majority vote \(B_{\rho }^{\textsc {mv}}\) is closely related to the stochastic multiview Gibbs classifier \(G_{\rho }^{\textsc {mv}}\), and a generalization bound for \(G_{\rho }^{\textsc {mv}}\) gives rise to a generalization bound for \(B_{\rho }^{\textsc {mv}}\). Indeed, it is easy to show that \(R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}})\le 2R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\), meaning that an upper bound on \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) gives an upper bound on the majority vote. Moreover, the C-Bound of Eq. (2) can be extended to our multiview setting through Lemma 1 below. Equation (5) is a straightforward generalization of the single-view C-bound of Eq. (2). Equation (6) is then obtained by rewriting \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) as the \(\rho \)-average of the risks associated with the views, and by lower-bounding \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) by the \(\rho \)-average of the disagreements associated with the views.
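The decomposition \(R_{S}(G_{\rho }^{\textsc {mv}}) = \frac{1}{2}d_{S}^{\textsc {mv}}(\rho )+ e_{S}^{\textsc {mv}}(\rho )\) holds exactly because the hierarchy \((\rho , \{Q_v\}_v)\) simply induces a mixture distribution over the union of all view-specific voters. An illustrative sketch, not part of the original analysis (the number of views, the voters per view, and all weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, V = 200, 3
n_v = [4, 6, 5]                                  # voters per view (arbitrary)
y = rng.choice([-1, 1], size=m)

rho = rng.dirichlet(np.ones(V))                  # hyper-posterior over views
Qs = [rng.dirichlet(np.ones(n)) for n in n_v]    # posterior per view
votes = [rng.choice([-1, 1], size=(m, n)) for n in n_v]

# Per-example Gibbs error under the two-level hierarchy:
# p_i = E_{v~rho} E_{h~Q_v} 1[h(x_i^v) != y_i].
p = sum(rho[v] * ((votes[v] != y[:, None]).astype(float) @ Qs[v])
        for v in range(V))

risk_mv = np.mean(p)
disag_mv = np.mean(2.0 * p * (1.0 - p))          # empirical d^mv(rho)
joint_mv = np.mean(p**2)                         # empirical e^mv(rho)

# R_S(G^mv) = d_S^mv / 2 + e_S^mv, an exact identity.
assert abs(risk_mv - (0.5 * disag_mv + joint_mv)) < 1e-12
```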

Lemma 1

Let \(V \ge 2\) be the number of views. For all posterior \(\{Q_v\}_{v=1}^V\) and hyper-posterior \(\rho \) distributions, if \(R_{\mathcal{D}}(G_{\rho }^{\textsc {mv}}) \le \frac{1}{2}\), then we have

$$\begin{aligned} R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}})\le & {} 1- \frac{\displaystyle \big (1-2R_{\mathcal{D}}(G_{\rho }^{\textsc {mv}})\big )^2}{\displaystyle 1-2 d_{\mathcal{D}}^{\textsc {mv}}(\rho )} \end{aligned}$$
(5)
$$\begin{aligned}\le & {} 1- \frac{ \Big (1-2 \mathbb {E}_ {v \sim \rho } R_{\mathcal{D}}(G_{Q_v})\Big )^2}{ 1-2 \mathbb {E}_ {v \sim \rho } d_{\mathcal{D}}(Q_v) } \,. \end{aligned}$$
(6)

Proof

Equation (5) follows from the Cantelli-Chebyshev’s inequality (Theorem 7, in Appendix). To prove Eq. (6), we first notice that in the binary setting where \(y\in \{-1,1\}\) and \(h:\mathcal{X}\rightarrow \{-1,1\}\), we have \(\mathbbm {1}_{[h(x^v) \ne y]} = \frac{1}{2} (1-y\,h(x^v))\), and

$$\begin{aligned} R_{\mathcal {D}}(G_{\rho }^\textsc {mv})&=\mathop {\mathbb {E}}\limits _{(\mathbf {x},y) \sim \mathcal {D}} \ \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathbbm {1}_{[h(x^v) \ne y]} \nonumber \\&= \frac{1}{2}\bigg (1- \mathop {\mathbb {E}}\limits _{(\mathbf{x},y) \sim \mathcal {D}} \ \mathop {\mathbb {E}}\limits _{v \sim \rho } \ \mathop {\mathbb {E}}\limits _{h \sim Q_v} y\,h(x^v) \bigg ) \nonumber \\&=\mathop {\mathbb {E}}\limits _{v \sim \rho } R_{\mathcal {D}}(G_{Q_v})\,.\nonumber \end{aligned}$$

Moreover, we have

$$\begin{aligned} d_{\mathcal{D}}^{\textsc {mv}}(\rho )&= \mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}} \mathop {\mathbb {E}}\limits _ {v \sim \rho } \mathop {\mathbb {E}}\limits _{v' \sim \rho } \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathop {\mathbb {E}}\limits _{h' \sim Q_{v'}} \mathbbm {1}_{[ h(x^v) {\ne } h'(x^{v'})]} \\&=\frac{1}{2} \bigg ( 1- \mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}} \mathop {\mathbb {E}}\limits _ {v \sim \rho } \mathop {\mathbb {E}}\limits _{v' \sim \rho } \mathop {\mathbb {E}}\limits _{h \sim Q_v} \mathop {\mathbb {E}}\limits _{h' \sim Q_{v'}} h(x^v) \, h'(x^{v'}) \bigg ) \\&= \frac{1}{2} \bigg ( 1-\mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}}\bigg [ \mathop {\mathbb {E}}\limits _{v \sim \rho } \, \mathop {\mathbb {E}}\limits _{h \sim Q_v} h(x^v) \bigg ]^2 \bigg ) \,. \end{aligned}$$

From Jensen’s inequality (Theorem 6, in Appendix) it comes

$$\begin{aligned} d_{\mathcal{D}}^{\textsc {mv}}(\rho )&\ge \frac{1}{2} \bigg ( 1- \mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}} \, \mathop {\mathbb {E}}\limits _{v \sim \rho }\bigg [ \mathop {\mathbb {E}}\limits _{h \sim Q_v} h(x^v) \bigg ]^2 \bigg ) \\&= \mathop {\mathbb {E}}\limits _{v \sim \rho } \Bigg [\frac{1}{2} \bigg ( 1- \mathop {\mathbb {E}}\limits _{\mathbf{x}\sim \mathcal {D}_{\mathcal {X}}}\bigg [ \mathop {\mathbb {E}}\limits _{h \sim Q_v} h(x^v) \bigg ]^2 \bigg ) \Bigg ]\\&= \mathop {\mathbb {E}}\limits _{v \sim \rho } d_{\mathcal{D}}(Q_v) \,. \end{aligned}$$

By replacing \(R_{\mathcal {D}}(G_{\rho }^\textsc {mv})\) and \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) with these two expressions in Eq. (5), we obtain Eq. (6).

   \(\square \)

As in the mono-view setting, Eqs. (4) and (5) suggest that a good trade-off between the risk of the Gibbs classifier and the disagreement \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) between pairs of voters will lead to a well-performing majority vote. Equation (6) exhibits the role of diversity among the views through the expectation of the disagreement over the views \(\mathbb {E}_ {v \sim \rho } d_{\mathcal{D}}(Q_v)\).
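The Jensen step in the proof of Lemma 1, \(d_{\mathcal{D}}^{\textsc {mv}}(\rho ) \ge \mathbb {E}_{v \sim \rho } d_{\mathcal{D}}(Q_v)\), holds pointwise and can be observed on toy data using the margin-based expressions of the disagreement from the proof (the views, voters, and weights below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, V = 200, 3
n_v = [4, 6, 5]                                  # voters per view (arbitrary)
rho = rng.dirichlet(np.ones(V))                  # hyper-posterior over views
Qs = [rng.dirichlet(np.ones(n)) for n in n_v]    # posterior per view
votes = [rng.choice([-1, 1], size=(m, n)).astype(float) for n in n_v]

# Per-view margins E_{h~Q_v} h(x^v), and the rho-mixture margin.
margins = np.stack([votes[v] @ Qs[v] for v in range(V)])     # shape (V, m)
mix_margin = rho @ margins                                   # E_v E_h h(x^v)

d_mv = np.mean((1.0 - mix_margin**2) / 2.0)                  # d^mv(rho)
d_views = np.array([np.mean((1.0 - margins[v] ** 2) / 2.0) for v in range(V)])

# Jensen: the squared mixture margin is at most the rho-average of the
# squared per-view margins, hence d^mv(rho) >= E_{v~rho} d(Q_v).
assert d_mv >= float(rho @ d_views) - 1e-12
```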

3.2 General Multiview PAC-Bayesian Theorems

Now we state our general PAC-Bayesian theorem suitable for the above multiview learning setting with a two-level hierarchy of distributions over views (or voters). A key step in PAC-Bayesian proofs is the use of a change of measure inequality [22], based on the Donsker-Varadhan inequality [8]. Lemma 2 below extends this tool to our multiview setting.

Lemma 2

For any set of priors \(\{P_v\}_{v=1}^V\) and any set of posteriors \(\{Q_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) on views \(\mathcal{V}\) and hyper-posterior distribution \(\rho \) on \(\mathcal{V}\), and for any measurable function \(\phi : \mathcal {H}_v \rightarrow \mathbb {R}\), we have

$$\begin{aligned}&\mathop {\mathbb {E}}\limits _{ v \sim \rho }\ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \phi (h) \le \ \mathop {\mathbb {E}}\limits _{ v \sim \rho } {\text {KL}}(Q_v \Vert P_v) + {\text {KL}}(\rho \Vert \pi ) + \ln \left( \mathop {\mathbb {E}}\limits _{ v \sim \pi }\ \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{\phi (h)} \right) . \end{aligned}$$

Proof

We have

$$\begin{aligned} \mathop {\mathbb {E}}\limits _{ v \sim \rho } \, \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \phi (h)&= \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \ln e^{\phi (h)} \\&=\mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \ln \bigg ( \frac{Q_v(h)}{P_v(h)} \frac{P_v(h)}{Q_v(h)} e^{\phi (h)} \bigg ) \\&=\mathop {\mathbb {E}}\limits _{ v \sim \rho }\ \bigg [ \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \ln \bigg ( \frac{Q_v(h)}{P_v(h)} \bigg ) + \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \ln \bigg ( \frac{P_v(h)}{Q_v(h)} e^{\phi (h)} \bigg ) \bigg ]. \end{aligned}$$

According to the Kullback-Leibler definition, we have

$$\begin{aligned} \mathop {\mathbb {E}}\limits _{ v \sim \rho } \, \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \phi (h) = \mathop {\mathbb {E}}\limits _{ v \sim \rho } \bigg [ {\text {KL}}(Q_v \Vert P_v) + \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \ln \bigg ( \frac{P_v(h)}{Q_v(h)} e^{\phi (h)} \bigg ) \bigg ]. \end{aligned}$$

By applying Jensen’s inequality (Theorem 6, in Appendix) on the concave function \(\ln \), we have

$$\begin{aligned} \mathop {\mathbb {E}}\limits _{ v \sim \rho } \, \mathop {\mathbb {E}}\limits _{ h \sim Q_v} \phi (h)&\le \ \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ \bigg [ {\text {KL}}(Q_v \Vert P_v) + \ln \bigg ( \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{\phi (h)} \bigg ) \bigg ] \\&=\mathop {\mathbb {E}}\limits _{ v \sim \rho } {\text {KL}}(Q_v \Vert P_v) + \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ln \bigg ( \frac{\rho (v)}{\pi (v)} \frac{\pi (v)}{\rho (v)} \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{\phi (h)} \bigg ) \\&= \mathop {\mathbb {E}}\limits _{ v \sim \rho } {\text {KL}}(Q_v \Vert P_v) + {\text {KL}}(\rho \Vert \pi ) + \mathop {\mathbb {E}}\limits _{ v \sim \rho } \ln \bigg ( \frac{\pi (v)}{\rho (v)} \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{\phi (h)} \bigg ). \end{aligned}$$

Finally, we apply again the Jensen inequality (Theorem 6) on \(\ln \) to obtain the lemma.    \(\square \)
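Since Lemma 2 holds for every choice of the distributions and of \(\phi \), it can be sanity-checked numerically on a small random instance (the sizes, distributions, and \(\phi \) values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
V, n = 3, 5                                     # views, voters per view

pi = rng.dirichlet(np.ones(V))                  # hyper-prior over views
rho = rng.dirichlet(np.ones(V))                 # hyper-posterior over views
Ps = rng.dirichlet(np.ones(n), size=V)          # one prior row per view
Qs = rng.dirichlet(np.ones(n), size=V)          # one posterior row per view
phi = rng.normal(size=(V, n))                   # arbitrary values of phi(h)

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

# Left-hand side: E_{v~rho} E_{h~Q_v} phi(h).
lhs = float(np.sum(rho[:, None] * Qs * phi))

# Right-hand side of Lemma 2.
rhs = (sum(rho[v] * kl(Qs[v], Ps[v]) for v in range(V))
       + kl(rho, pi)
       + float(np.log(np.sum(pi[:, None] * Ps * np.exp(phi)))))

assert lhs <= rhs + 1e-9
```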

Based on Lemma 2, the following theorem can be seen as a generalization of Theorem 1 to multiview. Note that we still rely on a general convex function \(D: [0,1] \times [0,1] \rightarrow \mathbb {R}\), that measures the “deviation” between the empirical disagreement/joint error and the true risk of the Gibbs classifier.

Theorem 3

Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), for any convex function \(D : [0,1] \times [0,1] \rightarrow \mathbb {R}\), we have

$$\begin{aligned} m\, D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \Big [ \tfrac{1}{2}\, d_{S}^{\textsc {mv}}(\rho _S)+ e_{S}^{\textsc {mv}}(\rho _S) \Big ] , \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m} R_{\mathcal{D}}(G_{\rho _S}^{\textsc {mv}}) \right) \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \Big [ \mathop {\mathbb {E}}\limits _{v \sim \rho _S} {\text {KL}}({Q_{v,S}}\Vert P_v) + {\text {KL}}(\rho _S\Vert \pi ) \Big ] + \ln \bigg (\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{v \sim \pi } \mathop {\mathbb {E}}\limits _{h \sim P_v} e^{m\,D\left( R_S(h), R_\mathcal{D}(h) \right) } \bigg ). \end{aligned}$$
Proof

We follow the same steps as in the proof of Theorem 1.

$$\begin{aligned} \begin{array}{l} \quad m D\Big (\mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m} R_S(G_{\rho _S}^{\textsc {mv}}), \mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m} R_{\mathcal {D}}(G_{\rho _S}^{\textsc {mv}})\Big )\\ = m D\Big (\mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m} \mathop {\mathbb {E}}\limits _{ v \sim \rho _S} \mathop {\mathbb {E}}\limits _{ h \sim {Q_{v,S}}} R_S(h), \mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m} \mathop {\mathbb {E}}\limits _{ v \sim \rho _S} \mathop {\mathbb {E}}\limits _{ h \sim {Q_{v,S}}} R_{\mathcal {D}}(h)\Big )\\ \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{ v \sim \rho _S} \mathop {\mathbb {E}}\limits _{ h \sim {Q_{v,S}}}m D \left( R_{S}(h), R_{\mathcal{D}}(h) \right) \\ \le \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \bigg [ \mathop {\mathbb {E}}\limits _{ v \sim \rho _S} \, {\text {KL}}({Q_{v,S}}\Vert P_v) + {\text {KL}}(\rho _S\Vert \pi ) + \ln \left( \mathop {\mathbb {E}}\limits _{ v \sim \pi } \, \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m D\left( R_S(h), R_\mathcal{D}(h)\right) } \right) \bigg ], \end{array} \end{aligned}$$

where the last inequality is obtained using Lemma 2. After distributing the expectation over \(S\sim \mathcal{D}^m\), the final statement follows from Jensen’s inequality (Theorem 6)

$$\begin{aligned} \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}}&\ln \left( \mathop {\mathbb {E}}\limits _{ v \sim \pi } \, \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m D\left( R_S(h), R_\mathcal{D}(h)\right) } \right)&\le \ln \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \,\mathop {\mathbb {E}}\limits _{ v \sim \pi } \, \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m D\left( R_S(h), R_\mathcal{D}(h)\right) } \right) , \end{aligned}$$

and from Eq. (3): \(R_S(G_{\rho _S}^{\textsc {mv}}) = \tfrac{1}{2} d_{S}^{\textsc {mv}}(\rho _S)+ e_{S}^{\textsc {mv}}(\rho _S)\).    \(\square \)

It is interesting to compare this generalization bound with Theorem 1. The main difference lies in the introduction of view-specific prior and posterior distributions, which mainly leads to an additional term \(\mathbb {E}_{v \sim \rho } {\text {KL}}(Q_v \Vert P_v)\), expressed as the expectation, over the views \(\mathcal{V}\) according to the hyper-posterior distribution \(\rho \), of the view-specific Kullback-Leibler divergences. We also introduce the empirical disagreement, allowing us to directly highlight the presence of diversity between voters and between views. Like Theorem 1, Theorem 3 provides a tool to derive PAC-Bayesian generalization bounds for a multiview supervised learning setting. Indeed, by making use of the same trick as Germain et al. [9, 10], generalization bounds can be derived from Theorem 3 by choosing a suitable convex function D and upper-bounding \(\mathbb {E}_S \mathbb {E}_v \mathbb {E}_h e^{ m\, D(R_S(h), R_\mathcal{D}(h))} \). We provide an example of such a specialization in Sect. 3.3, following McAllester’s [21] point of view. Note that we provide the specialization to the other classical PAC-Bayesian approaches of Catoni [5], Langford [15], and Seeger [27] in our research report Goyal et al. [11, Sect. 3.3].

Following the same approach, we can obtain a multiview bound for the expected disagreement.

Theorem 4

Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), for any convex function \(D : [0,1] \times [0,1] \rightarrow \mathbb {R}\), we have

$$\begin{aligned} m\, D \left( \mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} d_{S}^{\textsc {mv}}(\rho _S) , \mathop {\mathbb {E}}\limits _{S\sim \mathcal{D}^m} d_{\mathcal{D}}^{\textsc {mv}}(\rho _S) \right) \le 2\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \Big [ \mathop {\mathbb {E}}\limits _{v \sim \rho _S} {\text {KL}}({Q_{v,S}}\Vert P_v) + {\text {KL}}(\rho _S\Vert \pi ) \Big ] + \ln \bigg (\mathop {\mathbb {E}}\limits _{{S\sim \mathcal{D}^m}} \mathop {\mathbb {E}}\limits _{(v,v') \sim \pi ^2} \mathop {\mathbb {E}}\limits _{(h,h') \sim P_v \times P_{v'}} e^{m\,D\left( d_S(h,h'), d_\mathcal{D}(h,h') \right) } \bigg ). \end{aligned}$$

Proof

The result is obtained straightforwardly by following the proof steps of Theorem 3, using the disagreement instead of the Gibbs risk. Then, similarly to what we did to obtain Theorem 2, we substitute \({\text {KL}}({Q_{v,S}^2}\Vert P^2_v)\) with \(2{\text {KL}}({Q_{v,S}}\Vert P_v)\), and \({\text {KL}}(\rho _S^2 \Vert \pi ^2)\) with \(2{\text {KL}}(\rho _S\Vert \pi )\).    \(\square \)

3.3 Specialization of Our Theorem to the McAllester’s Approach

We derive here the specialization of our multiview PAC-Bayesian theorem to McAllester’s point of view [22]. To do so, we follow the same principle as Germain et al. [9, 10] to obtain Corollary 1.

Corollary 1

Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_{v}\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), we have

Proof

To prove the above result, we apply Theorem 3 with \(D(a,b)\, =\, 2(a-b)^2\), and we upper-bound \(\mathop {\mathbb {E}}\limits _{S \sim \mathcal{D}^m} \mathop {\mathbb {E}}\limits _{ v \sim \pi }\ \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m\, D(R_S(h), R_\mathcal{D}(h))}\). According to Pinsker’s inequality, we have \(D(a,b) \le {\text {kl}}(a,b) =\ a\ln \frac{a}{b}+(1-a)\ln \frac{1-a}{1-b}\). Then, by considering \(m R_S(h)\) as a random variable that follows a binomial distribution of m trials with probability of success \(R_\mathcal{D}(h)\), we obtain

$$\begin{aligned}&\mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m}\, \mathop {\mathbb {E}}\limits _{ v \sim \pi }\, \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m\, D(R_S(h) , R_\mathcal{D}(h))}\ \le \ \mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m}\, \mathop {\mathbb {E}}\limits _{ v \sim \pi }\, \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m\, {\text {kl}}(R_S(h) , R_\mathcal{D}(h))} \\ =&\mathop {\mathbb {E}}\limits _{ v \sim \pi } \, \mathop {\mathbb {E}}\limits _{ h \sim P_v} \, \mathop {\mathbb {E}}\limits _{ S \sim \mathcal{D}^m} \left[ \frac{R_S(h)}{R_\mathcal{D}(h)} \right] ^{mR_S(h)} \left[ \frac{1 - R_S(h)}{1 - R_\mathcal{D}(h)} \right] ^{m(1 - R_S(h))}\\ =&\mathop {\mathbb {E}}\limits _{ v \sim \pi }\, \mathop {\mathbb {E}}\limits _{ h \sim P_v} \sum _{k=0}^{m} \underset{S \sim \mathcal{D}^m}{\Pr } \big [ \,R_S(h) = \tfrac{k}{m} \big ] \left[ \frac{k/m}{R_\mathcal{D}(h)} \right] ^{ k} \left[ \frac{1 - k/m}{1 - R_\mathcal{D}(h)} \right] ^{ m - k}\\ =&\sum _{k=0}^m \left( {\begin{array}{c}m\\ k\end{array}}\right) \left[ \frac{k}{m} \right] ^k \left[ 1- \frac{k}{m} \right] ^{m-k}\\ \le \;&\;2\sqrt{m}. \end{aligned}$$

   \(\square \)
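
The final inequality, \(\sum _{k=0}^m \binom{m}{k} \left[ \frac{k}{m}\right] ^k \left[ 1-\frac{k}{m}\right] ^{m-k} \le 2\sqrt{m}\), can be checked numerically. The following is a minimal sketch, using the convention \(0^0 = 1\) (which Python follows for floating-point powers):

```python
from math import comb, sqrt

def binomial_sum(m):
    """Compute sum_{k=0}^m C(m,k) (k/m)^k (1 - k/m)^(m-k), with 0^0 = 1."""
    total = 0.0
    for k in range(m + 1):
        total += comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
    return total

# The sum stays below 2*sqrt(m) for every m checked.
for m in (1, 10, 100, 1000):
    assert binomial_sum(m) <= 2 * sqrt(m)
```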

4 Discussion on Related Work

In this section, we discuss two theoretical studies of multiview learning related to the notion of Gibbs classifier.

Amini et al. [1] proposed a Rademacher analysis of the risk of the stochastic Gibbs classifier over the view-specific models (for more than two views), where the distribution over the views is restricted to the uniform distribution. In their work, each view-specific model is found by minimizing the empirical risk: \(\displaystyle h_v^* \ = \mathop {\text{ argmin }}\limits _{h \in \mathcal {H}_v} \frac{1}{m}\sum _{(\mathbf {x},y) \in S} \mathbbm {1}_{[h(x^v) \ne y]}.\) The prediction for a multiview example \(\mathbf {x}\) is then based on the stochastic Gibbs classifier defined according to the uniform distribution, i.e., \(\forall v\in \mathcal{V},\ \rho (v)=\frac{1}{V}\). The risk of the multiview Gibbs classifier is hence given by

$$\begin{aligned} R_{\mathcal {D}}(G_{\rho ={1/V}}^{\textsc {mv}}) = \mathop {\mathbb {E}}\limits _{(\mathbf {x},y) \sim \mathcal {D}} \ \frac{1}{V} \sum _{v=1}^V \mathbbm {1}_{[h_v^*(x^v) \ne y]}. \end{aligned}$$
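
In other words, the uniform Gibbs risk is simply the average of the per-view 0/1 error rates. A hypothetical sketch on toy predictions (labels and predictions in \(\{-1,+1\}\), names illustrative):

```python
import numpy as np

def gibbs_risk_uniform(view_predictions, y):
    """Empirical risk of the uniform stochastic Gibbs classifier:
    the average, over views, of each view-specific classifier's 0/1 error."""
    view_predictions = np.asarray(view_predictions)        # shape (V, n_examples)
    errors = (view_predictions != np.asarray(y)).mean(axis=1)  # per-view risks
    return float(errors.mean())                            # uniform average over views

# Toy example: V = 3 views, 4 examples; each view makes exactly one error.
y = np.array([1, -1, 1, 1])
preds = np.array([
    [1, -1, 1, -1],   # view 1: wrong on example 4
    [1,  1, 1,  1],   # view 2: wrong on example 2
    [-1, -1, 1, 1],   # view 3: wrong on example 1
])
print(gibbs_risk_uniform(preds, y))  # 0.25
```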

Moreover, Sun et al. [29] proposed a PAC-Bayesian analysis for multiview learning over the concatenation of the views, where the number of views is set to two, and deduced an SVM-like learning algorithm from this framework. The key idea of their approach is to define a prior distribution that promotes similar classification among the two views; the notion of diversity among the views is thus handled by a different strategy than ours. We believe that the two approaches are complementary: in the general case of more than two views that we consider in our work, we could also use an informative prior similar to the one proposed by Sun et al. [29] for learning.

Table 1. Accuracy and F1-score averages for all the classes over 20 random sets. Note that the results are obtained for different sizes m of the learning sample and are averaged over the six one-vs-all classification problems. Along the columns, best results are in bold. \(^{\downarrow }\) indicates statistically significantly worse performance than the best result, according to Wilcoxon rank sum test (\(p < 0.02\)) [19].

5 Experiments

In this section, we present experiments that highlight the usefulness of our theoretical analysis by following a two-level hierarchical strategy. To do so, we learn a multiview model in two stages, following a classifier late fusion approach [28] (sometimes referred to as stacking [30]). Concretely, at the base level of the hierarchy, we first learn view-specific classifiers for each view, each expressed as a majority vote of kernel functions. Then, we learn a weighted combination based on the predictions of the view-specific classifiers. It is worth noting that this is the procedure followed by Morvant et al. [23] in a PAC-Bayesian fashion, but without any theoretical justification and in a ranking setting.

We consider a publicly available multilingual multiview text categorization corpus extracted from the Reuters RCV1/RCV2 corpus [1]Footnote 3, which contains more than 110,000 documents from five different languages (English, German, French, Italian, Spanish) distributed over six classes. To transform the dataset into binary classification tasks, we consider six one-versus-all classification problems: for each class, we learn a multiview binary classification model by considering all documents from that class as positive examples and all others as negative examples. We then split the dataset into training and testing sets: we reserve a test sample containing \(30 \%\) of the total documents. In order to highlight the benefits of the information brought by multiple views, we train the models with small learning sets by randomly choosing the learning sample S from the remaining documents; the learning sample sizes m considered are 150, 200, 250, and 300. For each fusion-based approach, we split the learning sample S into two parts: \(S_1\) for learning the view-specific classifiers at the first level and \(S_2\) for learning the final multiview model at the second level, with \(|S_1|=\frac{3}{5} m\) and \(|S_2|=\frac{2}{5} m\) (where \(m=|S|\)). In addition, the reported results are averaged over 20 runs of experiments, each run being done with a new random learning sample. Since the classes are highly unbalanced, we report in Table 1 the accuracy along with the F1-measure, the harmonic mean of precision and recall, computed on the test sample.
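
The splitting protocol can be sketched as follows. This is a sketch with synthetic placeholder data (random features stand in for the actual document representations); only the sample sizes and ratios come from the protocol above:

```python
import numpy as np

rng = np.random.RandomState(0)
n_docs = 1000
X = rng.randn(n_docs, 20)           # placeholder feature vectors
y = rng.randint(0, 2, size=n_docs)  # placeholder one-vs-all labels

# Reserve 30% of the documents as the test sample.
perm = rng.permutation(n_docs)
n_test = int(0.3 * n_docs)
test_idx, rest_idx = perm[:n_test], perm[n_test:]

# Draw a small learning sample S of size m from the remaining documents,
# then split it into S1 (3/5 of m, first level) and S2 (2/5 of m, second level).
m = 300
S_idx = rng.choice(rest_idx, size=m, replace=False)
S1_idx, S2_idx = S_idx[: 3 * m // 5], S_idx[3 * m // 5 :]

assert len(test_idx) == 300 and len(S1_idx) == 180 and len(S2_idx) == 120
```

In the actual experiments, this draw of S is repeated 20 times with fresh random samples.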

To assess that multiview learning with late fusion makes sense for our task, we consider as baselines the four following one-step learning algorithms (provided with the learning sample S). First, we learn a view-specific model on each view and report, as \(\texttt {Mono}_v\), their average performance. We also follow an early fusion procedure, referred to as \(\texttt {Concat}_{\texttt {SVM}}\), consisting of learning a single SVM model [7] over the simple concatenation of the features of the five views. Moreover, we look at two simple voters’ combinations, denoted by \(\texttt {Aggreg}_{\texttt {P}}\) and \(\texttt {Aggreg}_{\texttt {L}}\) respectively, for which the weights associated with each view follow the uniform distribution. Concretely, \(\texttt {Aggreg}_{\texttt {P}}\), respectively \(\texttt {Aggreg}_{\texttt {L}}\), combines the real-valued predictions, respectively the labels, returned by the view-specific classifiers. In other words, we have

$$\begin{aligned} \texttt {Aggreg}_{\texttt {P}}(\mathbf{x})\ =\ \textstyle \frac{1}{5} \sum _{v=1}^5 h^v(x^v), \text {and } \texttt {Aggreg}_{\texttt {L}}(\mathbf{x})\ =\ \textstyle \frac{1}{5} \sum _{v=1}^5 {\text {sign}}\left[ h^v(x^v)\right] , \end{aligned}$$

with \(h^v(x^v)\) the real-valued prediction of the view-specific classifier learned on view v.
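
The two aggregation rules can be sketched directly from their definitions; the toy scores below are hypothetical:

```python
import numpy as np

def aggreg_p(view_scores):
    """Aggreg_P: uniform average of the real-valued predictions h^v(x^v)."""
    return np.mean(view_scores, axis=0)

def aggreg_l(view_scores):
    """Aggreg_L: uniform average of the predicted labels sign(h^v(x^v))."""
    return np.mean(np.sign(view_scores), axis=0)

# Toy scores h^v(x^v) for 5 views (rows) and 2 examples (columns).
scores = np.array([
    [ 0.9, -0.2],
    [ 0.1, -0.8],
    [-0.3,  0.4],
    [ 0.5, -0.1],
    [ 0.2, -0.6],
])
print(np.sign(aggreg_p(scores)))  # final labels from averaged scores
print(np.sign(aggreg_l(scores)))  # final labels from averaged votes
```

The two rules can disagree: a view with a very confident score can dominate \(\texttt {Aggreg}_{\texttt {P}}\), whereas \(\texttt {Aggreg}_{\texttt {L}}\) gives every view exactly one vote.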

We compare the above one-step methods to the two following late fusion approaches, which only differ at the second level. Concretely, at the first level we construct from \(S_1\) different view-specific majority votes, expressed as linear SVM modelsFootnote 4 with different hyperparameter C values (12 values between \(10^{-8}\) and \(10^{3}\)): we do not perform cross-validation at the first level. This has the advantage of (i) lightening the first-level learning process, since we do not need to validate models, and (ii) potentially increasing the expressivity of the final model.

At the second level, as is often done for late fusion, we learn from \(S_2\) the final weighted combination over the view-specific voters using an RBF kernel. The methods referred to as \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\), respectively \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\), make use of SVM, respectively the PAC-Bayesian algorithm CqBoost [26]. Note that, as recalled in Sect. 2, CqBoost is an algorithm that tends to minimize the C-Bound of Eq. (2): it directly captures a trade-off between accuracy and disagreement.

We follow a 5-fold cross-validation procedure for selecting the hyperparameters of each learning algorithm. For \(\texttt {Mono}_v\), \(\texttt {Concat}_{\texttt {SVM}}\), \(\texttt {Aggreg}_{\texttt {P}}\) and \(\texttt {Aggreg}_{\texttt {L}}\) the hyperparameter C is chosen over a set of 12 values between \(10^{-8}\) and \(10^{3}\). For \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) and \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) the hyperparameter \(\gamma \) of the RBF kernel is chosen over 9 values between \(10^{-6}\) and \(10^{2}\). For \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\), the hyperparameter C is chosen over a set of 12 values between \(10^{-8}\) and \(10^{3}\). For \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\), the hyperparameter \(\mu \) is chosen over a set of 8 values between \(10^{-8}\) and \(10^{-1}\). Note that we made use of the scikit-learn [24] implementation for learning our SVM models.
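
The two-level construction of \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) can be sketched with scikit-learn. This is a sketch on synthetic data, not the experimental code: the data, the fixed \(\gamma \) and C of the second-level SVM (selected by cross-validation in the actual experiments), and all names are illustrative; only the grid of 12 C values and the absence of first-level cross-validation follow the protocol above:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the multiview data: V = 3 views, 10 features each.
rng = np.random.RandomState(0)
V, n1, n2 = 3, 120, 80
y1 = rng.choice([-1, 1], size=n1)  # labels of S1 (first level)
y2 = rng.choice([-1, 1], size=n2)  # labels of S2 (second level)
views1 = [y1[:, None] * rng.rand(n1, 10) + 0.1 * rng.randn(n1, 10) for _ in range(V)]
views2 = [y2[:, None] * rng.rand(n2, 10) + 0.1 * rng.randn(n2, 10) for _ in range(V)]

# First level: on S1, one linear SVM per view and per C value, no cross-validation.
C_grid = np.logspace(-8, 3, 12)
first_level = [SVC(kernel="linear", C=C).fit(Xv, y1)
               for Xv in views1 for C in C_grid]

# Second level: an RBF-kernel SVM learned on S2 over the voters' outputs.
def voter_outputs(views):
    return np.column_stack([clf.decision_function(views[i // len(C_grid)])
                            for i, clf in enumerate(first_level)])

fusion = SVC(kernel="rbf", gamma=1e-2, C=1.0).fit(voter_outputs(views2), y2)
print(fusion.score(voter_outputs(views2), y2))  # training accuracy of the fused model
```

Replacing the second-level SVC by CqBoost yields the \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) variant.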

First of all, from Table 1, the two-step approaches provide the best results on average. Secondly, according to a Wilcoxon rank sum test [19] with \(p < 0.02\), the PAC-Bayesian late fusion approach \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) is significantly the best method in terms of accuracy; moreover, except for the smallest learning sample size (\(m=150\)), \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) and \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) produce models with similar F1-measures. We can also remark that \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) is more “stable” than \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) according to the standard deviation values. These results confirm the potential of PAC-Bayesian approaches for multiview learning, where one can control a trade-off between accuracy and diversity among voters.

6 Conclusion and Future Work

In this paper, we proposed a first PAC-Bayesian analysis of weighted majority vote classifiers for multiview learning when observations are described by more than two views. Our analysis is based on a hierarchy of distributions, i.e. weights, over the views and voters: (i) for each view v, posterior and prior distributions over the view-specific voters’ set, and (ii) a hyper-posterior and a hyper-prior distribution over the set of views. We derived a general PAC-Bayesian theorem tailored to this setting, which can be specialized to any convex function comparing the empirical and true risks of the stochastic Gibbs classifier associated with the weighted majority vote. We also presented a similar theorem for the expected disagreement, a notion that turns out to be crucial in multiview learning. Moreover, while usual PAC-Bayesian analyses are expressed as probabilistic bounds over the random choice of the learning sample, we presented here bounds in expectation over the data, which is very interesting from a PAC-Bayesian standpoint, where the posterior distribution is data dependent. Following the distributions’ hierarchy, we evaluated a simple two-step learning algorithm (based on late fusion) on a multiview benchmark. We compared the accuracies obtained when using SVM and the PAC-Bayesian algorithm CqBoost for weighting the view-specific classifiers. The latter revealed itself a better strategy, as it deals nicely with the trade-off between accuracy and disagreement promoted by our PAC-Bayesian analysis of the multiview hierarchical approach.

We believe that our theoretical and empirical results are a first step toward the goal of theoretically understanding multiview learning through the PAC-Bayesian point of view, and toward the objective of deriving new multiview learning algorithms. This work gives rise to exciting perspectives. Among them, we would like to specialize our result to linear classifiers, for which PAC-Bayesian approaches are known to lead to tight bounds and efficient learning algorithms [9]. This clearly opens the door to deriving theoretically founded algorithms for multiview learning. Another possible algorithmic direction is to take into account second statistical moment information in order to link it explicitly to important properties between views, such as diversity or agreement [1, 13]. A first direction is to work with our multiview PAC-Bayesian C-Bound of Lemma 1, which already takes into account such a notion of diversity [23], in order to derive an algorithm as done in the mono-view setting by Laviolette et al. [17, 26]. Another perspective is to extend our bounds to diversity-dependent priors, similarly to the approach of Sun et al. [29], but for more than two views; this would allow us to additionally consider a priori knowledge on the diversity. Moreover, we would like to explore semi-supervised multiview learning, where one has access to unlabeled data \(S_u=\{\mathbf{x}_j\}_{j=1}^{m_u}\) along with labeled data \(S_l=\{(\mathbf{x}_i,y_i)\}_{i=1}^{m_l}\) during training. Indeed, an interesting behaviour of our theorem is that it can easily be extended to this situation: the bound becomes a combination of a bound over \(\tfrac{1}{2} d_{S_u}^{\textsc {mv}}(\rho )\) (depending on \(m_u\)) and a bound over \(e_{S_l}^{\textsc {mv}}(\rho )\) (depending on \(m_l\)). The main difference with the supervised bound is that the Kullback-Leibler divergence is multiplied by a factor 2.