Abstract
We study two-level multiview learning with more than two views under the PAC-Bayesian framework. This approach, sometimes referred to as late fusion, consists in sequentially learning multiple view-specific classifiers at the first level, and then combining these view-specific classifiers at the second level. Our main theoretical result is a generalization bound on the risk of the majority vote that exhibits a term reflecting the diversity of the view-specific classifiers' predictions. This result shows that controlling the trade-off between diversity and accuracy is a key element for multiview learning, which complements other results in multiview learning. Finally, we evaluate our principle on multiview datasets extracted from the Reuters RCV1/RCV2 collection.
1 Introduction
With the ever-increasing number of observations produced by more than one source, multiview learning has been expanding over the past decade, spurred by the seminal work of Blum and Mitchell [4] on co-training. Most existing methods try to combine multimodal information, either by directly merging the views or by combining models learned from the different viewsFootnote 1 [28], in order to produce a model that is more reliable for the considered task. Our goal is to propose a theoretically grounded criterion to “correctly” combine the views. With this in mind, we propose to study multiview learning through the PAC-Bayesian framework (introduced in [21]), which allows one to derive generalization bounds for models expressed as a combination over a set of voters. When learning from a single view, the PAC-Bayesian theory assumes a prior distribution over the voters involved in the combination, and aims at learning—from the learning sample—a posterior distribution that leads to a well-performing combination expressed as a weighted majority vote. In this paper we extend the PAC-Bayesian theory to multiview learning with more than two views. Concretely, given a set of view-specific classifiers, we define a hierarchy of posterior and prior distributions over the views, such that (i) for each view v, we consider a prior \(P_v\) and a posterior \(Q_v\) distribution over each view-specific set of voters, and (ii) a prior \(\pi \) and a posterior \(\rho \) distribution over the set of views (see Fig. 1), respectively called hyper-prior and hyper-posteriorFootnote 2. In this way, our approach encompasses that of Amini et al. [1], who considered a uniform distribution to combine the view-specific classifiers’ predictions. Moreover, compared to the PAC-Bayesian work of Sun et al. [29], we are interested here in the more general and natural case of multiview learning with more than two views.
Note also that Lecué and Rigollet [18] proposed a non-PAC-Bayesian theoretical analysis of a combination of voters (called Q-Aggregation) that is able to take into account a prior and a posterior distribution but in a single-view setting.
Our theoretical study also includes a notion of disagreement between all the voters, allowing us to take into account a notion of diversity between them, which is known to be a key element in multiview learning [1, 6, 13, 20]. Finally, we empirically evaluate a two-level learning approach on the Reuters RCV1/RCV2 corpus to show that our analysis is sound.
In the next section, we recall the general PAC-Bayesian setup, and present PAC-Bayesian expectation bounds—while most of the usual PAC-Bayesian bounds are probabilistic bounds. In Sect. 3, we then discuss the problem of multiview learning, adapting the PAC-Bayesian expectation bounds to the specificity of the two-level multiview approach. In Sect. 4, we discuss the relation between our analysis and previous works. Before concluding in Sect. 6, we present experimental results obtained on a collection of the Reuters RCV1/RCV2 corpus in Sect. 5.
2 The Single-View PAC-Bayesian Theorem
In this section, we state a new general mono-view PAC-Bayesian theorem, inspired by the work of Germain et al. [10], that we extend to multiview learning in Sect. 3.
2.1 Notations and Setting
We consider binary classification tasks on data drawn from a fixed yet unknown distribution \(\mathcal{D}\) over \(\mathcal{X}\times \mathcal{Y}\), where \(\mathcal {X} \subseteq \mathbb {R}^d\) is a d-dimensional input space and \(\mathcal {Y} = \{-1,+1\}\) is the label/output set. A learning algorithm is provided with a training sample of m examples denoted by \(S=\{ (x_i,y_i ) \}_{i=1}^{m} \in (\mathcal{X}\times \mathcal{Y})^m\), assumed to be independently and identically distributed (i.i.d.) according to \(\mathcal{D}\). The notation \(\mathcal{D}^m\) stands for the distribution of such an m-sample, and \(\mathcal{D}_\mathcal{X}\) for the marginal distribution on \(\mathcal{X}\). We consider a set \(\mathcal{H}\) of classifiers or voters such that \(\forall h\in \mathcal{H},\ h:\mathcal{X}\rightarrow \mathcal{Y}\). In addition, the PAC-Bayesian approach requires a prior distribution \(P\) over \(\mathcal{H}\) that models an a priori belief on the voters from \(\mathcal{H}\) before observing the learning sample S. Given \(S\sim \mathcal{D}^m\), the learner aims at finding a posterior distribution \(Q\) over \(\mathcal{H}\) leading to an accurate \(Q\)-weighted majority vote \(B_Q(x)\) defined as
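the sign of the \(Q\)-average of the voters’ outputs, stated here in its standard PAC-Bayesian form:

```latex
B_Q(x) \;=\; \operatorname{sign}\Big[\, \mathbb{E}_{h\sim Q}\, h(x) \Big].
```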
In other words, one wants to learn \(Q\) over \(\mathcal{H}\) such that it minimizes the true risk \(R_{\mathcal{D}}(B_{Q})\) of \(B_Q(x)\):
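in symbols, the usual zero-one risk under \(\mathcal{D}\):

```latex
R_{\mathcal{D}}(B_{Q}) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\ \mathbbm{1}_{[B_{Q}(x)\ne y]},
```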
where \(\mathbbm {1}_{[\pi ]} =1\) if predicate \(\pi \) holds, and 0 otherwise. However, a PAC-Bayesian generalization bound does not directly focus on the risk of the deterministic \(Q\)-weighted majority vote \(B_{Q}\). Instead, it upper-bounds the risk of the stochastic Gibbs classifier \(G_{Q}\), which predicts the label of an example x by first drawing a classifier h from \(\mathcal{H}\) according to the posterior distribution \(Q\), and then returning h(x). The true risk \(R_{\mathcal{D}}(G_Q)\) of the Gibbs classifier on a data distribution \(\mathcal{D}\) and its empirical risk \(R_{S}(G_Q)\) estimated on a sample \(S \sim \mathcal{D}^m\) are respectively given by
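the expected voter error under \(Q\), on the distribution and on the sample respectively:

```latex
R_{\mathcal{D}}(G_{Q}) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\ \mathbb{E}_{h\sim Q}\ \mathbbm{1}_{[h(x)\ne y]},
\qquad
R_{S}(G_{Q}) \;=\; \frac{1}{m}\sum_{i=1}^{m}\ \mathbb{E}_{h\sim Q}\ \mathbbm{1}_{[h(x_i)\ne y_i]}.
```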
The above Gibbs classifier is closely related to the \(Q\)-weighted majority vote \(B_{Q}\). Indeed, if \(B_{Q}\) misclassifies \(x \in \mathcal{X}\), then at least half of the classifiers (under measure \(Q\)) make an error on x. Therefore, we have
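the classical factor-2 relation:

```latex
R_{\mathcal{D}}(B_{Q}) \;\le\; 2\, R_{\mathcal{D}}(G_{Q}).
```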
Thus, an upper bound on \(R_\mathcal{D}(G_Q)\) gives rise to an upper bound on \(R_\mathcal{D}(B_Q)\). Other tighter relations exist [10, 14, 16], such as the so-called C-Bound [14], which involves the expected disagreement \(d_{\mathcal{D}}(Q)\) between all the pairs of voters, and which can be expressed as follows (when \(R_{\mathcal{D}}(G_Q)\le \frac{1}{2}\)):
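in the notation above, this C-Bound reads:

```latex
R_{\mathcal{D}}(B_{Q}) \;\le\; 1 - \frac{\big(1-2\,R_{\mathcal{D}}(G_{Q})\big)^{2}}{1-2\, d_{\mathcal{D}}(Q)},
\quad\text{where}\quad
d_{\mathcal{D}}(Q) \;=\; \mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\ \mathbb{E}_{(h,h')\sim Q^{2}}\ \mathbbm{1}_{[h(x)\ne h'(x)]}.
```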
Moreover, Germain et al. [10] have shown that the Gibbs classifier’s risk can be rewritten in terms of \(d_{\mathcal{D}}(Q)\) and the expected joint error \(e_{\mathcal{D}}(Q)\) between all the pairs of voters as
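namely the decomposition:

```latex
R_{\mathcal{D}}(G_{Q}) \;=\; \tfrac{1}{2}\, d_{\mathcal{D}}(Q) + e_{\mathcal{D}}(Q),
\quad\text{with}\quad
e_{\mathcal{D}}(Q) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\ \mathbb{E}_{(h,h')\sim Q^{2}}\ \mathbbm{1}_{[h(x)\ne y]}\ \mathbbm{1}_{[h'(x)\ne y]}.
```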
It is worth noting that, from a multiview learning standpoint where the notion of diversity among voters is known to be important [1, 2, 13, 20, 29], Eqs. (2) and (3) directly capture the trade-off between diversity and accuracy. Indeed, \(d_{\mathcal{D}}(Q)\) involves the diversity between voters [23], while \(e_{\mathcal{D}}(Q)\) takes into account their errors. Note that the principle of controlling the trade-off between diversity and accuracy through the C-bound of Eq. (2) has been exploited by Laviolette et al. [17] and Roy et al. [26] to derive well-performing PAC-Bayesian algorithms that aim at minimizing it. For our experiments in Sect. 5, we make use of CqBoost [26]—one of these algorithms—for multiview learning.
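As a concrete illustration of these quantities, the following sketch (synthetic voters and made-up weights, not the paper's data) estimates the Gibbs risk, the expected disagreement and the expected joint error on a sample, and checks the identity of Eq. (3) together with the C-bound of Eq. (2) and the factor-2 bound, all of which hold for the empirical distribution as well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: m examples, n voters with {-1,+1} outputs, posterior weights q.
m, n = 500, 7
y = rng.choice([-1, 1], size=m)
# Voters: noisy copies of the labels (each flips y with its own noise rate).
noise = rng.uniform(0.1, 0.4, size=n)
H = np.where(rng.random((m, n)) < noise, -y[:, None], y[:, None])
q = rng.dirichlet(np.ones(n))           # posterior distribution Q over voters

errors = (H != y[:, None]).astype(float)
p = errors @ q                          # per-example Q-weighted error probability

gibbs = p.mean()                        # R_S(G_Q)
joint = (p ** 2).mean()                 # e_S(Q): both drawn voters err
disag = (2 * p * (1 - p)).mean()        # d_S(Q): the two drawn voters disagree

# Identity (3): the Gibbs risk decomposes into disagreement and joint error.
assert np.isclose(gibbs, disag / 2 + joint)

# Majority-vote risk, the C-bound of Eq. (2) (valid since gibbs <= 1/2),
# and the classical factor-2 bound.
mv_risk = np.mean(np.sign(H @ q) != y)
c_bound = 1 - (1 - 2 * gibbs) ** 2 / (1 - 2 * disag)
assert mv_risk <= c_bound + 1e-12
assert mv_risk <= 2 * gibbs + 1e-12
```

Note how the per-example disagreement \(2p(1-p)\) uses the fact that the two voters are drawn independently from \(Q\): they disagree exactly when one errs and the other does not.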
Last but not least, PAC-Bayesian generalization bounds take into account the given prior distribution \(P\) on \(\mathcal{H}\) through the Kullback-Leibler divergence between the learned posterior distribution \(Q\) and \(P\):
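that is, the usual divergence:

```latex
\operatorname{KL}(Q\,\Vert\,P) \;=\; \mathbb{E}_{h\sim Q}\ \ln\frac{Q(h)}{P(h)}.
```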
2.2 A New PAC-Bayesian Theorem as an Expected Risk Bound
In the following we introduce a new variation of the general PAC-Bayesian theorem of Germain et al. [9, 10]; it takes the form of an upper bound on the “deviation” between the true risk \( R_{\mathcal{D}}(G_Q)\) and the empirical risk \( R_{S}(G_Q) \) of the Gibbs classifier, according to a convex function \(D {:} [0, 1] \times [0, 1]{\rightarrow }\mathbb {R}\). While most PAC-Bayesian bounds are probabilistic bounds, we state here an expected risk bound. More specifically, Theorem 1 below is a tool to upper-bound \(\mathbb E_{{S\sim \mathcal{D}^m}} R_{\mathcal{D}}(G_{Q_S})\)—where \({Q_S}\) is the posterior distribution output by a given learning algorithm after observing the learning sample S—whereas PAC-Bayes usually bounds \(R_{\mathcal{D}}(G_Q)\) uniformly for all distributions \(Q\), with high probability over the draw of \(S \sim \mathcal{D}^m\). Since by definition posterior distributions are data dependent, this different point of view on PAC-Bayesian analysis has the advantage of involving an expectation over all possible learning samples (of a given size) in the bound itself.
Theorem 1
For any distribution \(\mathcal {D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of voters \(\mathcal{H}\), for any prior distribution \(P\) on \(\mathcal {H}\), for any convex function \(D: [0, 1] \times [0, 1] \rightarrow \mathbb {R}\), we have
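Consistent with the discussion of Sect. 2.2 and with the general theorem of Germain et al. [9, 10]—a deviation term on the left, a KL term and an exponential moment on the right—the bound can be stated as:

```latex
D\!\left( \mathbb{E}_{S\sim\mathcal{D}^m} R_{S}(G_{Q_S}),\ \mathbb{E}_{S\sim\mathcal{D}^m} R_{\mathcal{D}}(G_{Q_S}) \right)
\;\le\;
\frac{1}{m}\left[\, \mathbb{E}_{S\sim\mathcal{D}^m} \operatorname{KL}(Q_S\Vert P)
\;+\; \ln\ \mathbb{E}_{S\sim\mathcal{D}^m}\ \mathbb{E}_{h\sim P}\ e^{m\,D(R_S(h),\,R_{\mathcal{D}}(h))} \right],
```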
where \(R_\mathcal{D}(h)\) and \(R_S(h)\) are respectively the true and the empirical risks of individual voters.
Similarly to Germain et al. [9, 10], by selecting a well-suited deviation function D and by upper-bounding \(\mathbb E_{S}\, \mathbb E_{h} e^{m\,D(R_S(h),R_\mathcal{D}(h))}\), we can prove the expected bound counterparts of the classical PAC-Bayesian theorems of Catoni [5], McAllester [21], Seeger [27]. The proof presented below borrows the straightforward proof technique of Bégin et al. [3]. Interestingly, this approach highlights that the expectation bounds are obtained simply by replacing the Markov inequality by the Jensen inequality (respectively Theorems 5 and 6, in Appendix).
Proof of Theorem 1
The last three inequalities below are obtained by applying Jensen’s inequality on the convex function D, the change of measure inequality [as stated by [3], Lemma 3], and Jensen’s inequality on the concave function \(\ln \).
\(\square \)
Since the C-bound of Eq. (2) involves the expected disagreement \(d_{\mathcal{D}}(Q)\), we also derive below the expected bound that upper-bounds the deviation between \(\mathbb {E}_{{S\sim \mathcal{D}^m}} d_S(Q_S)\) and \(\mathbb {E}_{S\sim \mathcal{D}^m}d_\mathcal{D}(Q_S)\) under a convex function D. Theorem 2 can be seen as the expectation version of the probabilistic bounds over \(d_S(Q_S)\) proposed by Germain et al. [10] and Lacasse et al. [14].
Theorem 2
For any distribution \(\mathcal {D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of voters \(\mathcal {H}\), for any prior distribution \(P\) on \(\mathcal {H}\), for any convex function \(D : [0, 1] \times [0, 1] \rightarrow \mathbb {R}\), we have
where \(d_\mathcal{D}(h,h') = \mathbb {E}_{x \sim \mathcal{D}_\mathcal{X}}\, \mathbbm {1}_{[h(x) {\ne } h'(x)]}\) is the disagreement of voters h and \(h'\) on the distribution \(\mathcal{D}\), and \(d_S(h,h')\) is its empirical counterpart.
Proof
First, we apply the exact same steps as in the proof of Theorem 1:
Then, we use the fact that \({\text {KL}}({Q_S^2}\Vert P^2) = 2{\text {KL}}({Q_S}\Vert P) \) [see [10], Theorem 25]. \(\square \)
In the following we provide an extension of this PAC-Bayesian framework to multiview learning with more than two views.
3 Multiview PAC-Bayesian Approach
3.1 Notations and Setting
We consider binary classification problems where the multiview observations \(\mathbf{x}= (x^1,\ldots ,x^V)\) belong to a multiview input set \(\mathcal{X} = \mathcal {X}_1\times \ldots \times \mathcal {X}_V\), where \(V\ge 2\) is the number of views, which do not necessarily have the same dimension. We denote by \(\mathcal{V}\) the set of the V views. In binary classification, we assume that examples are pairs \((\mathbf{x}, y)\), with \(y\in \mathcal{Y} = \{-1,+1\}\), drawn according to an unknown distribution \(\mathcal{D}\) over \(\mathcal{X}\times \mathcal {Y}\). To model the two-level multiview approach, we consider the following setting. For each view \(v \in \mathcal{V}\), we consider a view-specific set \(\mathcal{H}_v\) of voters \(h: \mathcal {X}_v \rightarrow \mathcal{Y}\), and a prior distribution \(P_v\) on \(\mathcal{H}_v\). Given a hyper-prior distribution \(\pi \) over the views \(\mathcal{V}\) and a multiview learning sample \(S = \{(\mathbf{x}_i,y_i)\}_{i=1}^m{\sim }(\mathcal{D})^m\), the objective of our PAC-Bayesian learner is twofold: (i) finding a posterior distribution \(Q_v\) over \(\mathcal {H}_v\) for all views \(v \in \mathcal{V}\); (ii) finding a hyper-posterior distribution \(\rho \) on the set of views \(\mathcal{V}\). This hierarchy of distributions is illustrated by Fig. 1. The learned distributions express a multiview weighted majority vote \(B_{\rho }^{\textsc {mv}}\) defined as
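the sign of the \(\rho\)- and \(Q_v\)-weighted vote over views and voters:

```latex
B_{\rho}^{\textsc{mv}}(\mathbf{x}) \;=\; \operatorname{sign}\Big[\, \mathbb{E}_{v\sim\rho}\ \mathbb{E}_{h\sim Q_v}\ h(x^{v}) \Big].
```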
Thus, the learner aims at constructing the posterior and hyper-posterior distributions that minimize the true risk \(R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}})\) of the multiview weighted majority vote:
As pointed out in Sect. 2, the PAC-Bayesian approach deals with the risk of the stochastic Gibbs classifier \(G_{\rho }^{\textsc {mv}}\) defined as follows in our multiview setting, and that can be rewritten in terms of expected disagreement \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) and expected joint error \(e_{\mathcal{D}}^{\textsc {mv}}(\rho )\):
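in symbols:

```latex
R_{\mathcal{D}}(G_{\rho}^{\textsc{mv}}) \;=\; \mathbb{E}_{v\sim\rho}\ \mathbb{E}_{h\sim Q_v}\ R_{\mathcal{D}}(h)
\;=\; \tfrac{1}{2}\, d_{\mathcal{D}}^{\textsc{mv}}(\rho) + e_{\mathcal{D}}^{\textsc{mv}}(\rho).
```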
Obviously, the empirical counterpart of the Gibbs classifier’s risk \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) is
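mirroring the decomposition of Eq. (3):

```latex
R_{S}(G_{\rho}^{\textsc{mv}}) \;=\; \tfrac{1}{2}\, d_{S}^{\textsc{mv}}(\rho) + e_{S}^{\textsc{mv}}(\rho),
```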
where \(d_{S}^{\textsc {mv}}(\rho )\) and \(e_{S}^{\textsc {mv}}(\rho )\) are respectively the empirical estimations of \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) and \(e_{\mathcal{D}}^{\textsc {mv}}(\rho )\) on the learning sample S. As in the single-view PAC-Bayesian setting, the multiview weighted majority vote \(B_{\rho }^{\textsc {mv}}\) is closely related to the stochastic multiview Gibbs classifier \(G_{\rho }^{\textsc {mv}}\), and a generalization bound for \(G_{\rho }^{\textsc {mv}}\) gives rise to a generalization bound for \(B_{\rho }^{\textsc {mv}}\). Indeed, it is easy to show that \(R_{\mathcal {D}}(B_{\rho }^{\textsc {mv}})\le 2R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\), meaning that an upper bound over \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) gives an upper bound for the majority vote. Moreover, the C-Bound of Eq. (2) can be extended to our multiview setting by Lemma 1 below. Equation (5) is a straightforward generalization of the single-view C-bound of Eq. (2), while Eq. (6) is obtained by rewriting \(R_{\mathcal {D}}(G_{\rho }^{\textsc {mv}})\) as the \(\rho \)-average of the risks associated with each view, and lower-bounding \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) by the \(\rho \)-average of the disagreements associated with each view.
Lemma 1
Let \(V \ge 2\) be the number of views. For all posterior \(\{Q_v\}_{v=1}^V\) and hyper-posterior \(\rho \) distributions, if \(R_{\mathcal{D}}(G_{\rho }^{\textsc {mv}}) \le \frac{1}{2}\), then we have
Proof
Equation (5) follows from the Cantelli-Chebyshev’s inequality (Theorem 7, in Appendix). To prove Eq. (6), we first notice that in the binary setting where \(y\in \{-1,1\}\) and \(h:\mathcal{X}\rightarrow \{-1,1\}\), we have \(\mathbbm {1}_{[h(x^v) \ne y]} = \frac{1}{2} (1-y\,h(x^v))\), and
Moreover, we have
From Jensen’s inequality (Theorem 6, in Appendix), it follows that
By replacing \(R_{\mathcal {D}}(G_{\rho }^\textsc {mv})\) and \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) in Eq. (5), we obtain
\(\square \)
Similarly to the mono-view setting, Eqs. (4) and (5) suggest that a good trade-off between the risk of the Gibbs classifier and the disagreement \(d_{\mathcal{D}}^{\textsc {mv}}(\rho )\) between pairs of voters will lead to a well-performing majority vote. Equation (6) exhibits the role of diversity among the views through the expectation of the disagreement over the views \(\mathbb {E}_ {v \sim \rho } d_{\mathcal{D}}(Q_v)\).
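To make the two-level hierarchy concrete, the following sketch (synthetic voters and randomly drawn distributions, purely illustrative) builds the multiview Gibbs quantities, checks the multiview analogue of Eq. (3), the lower bound on \(d_{S}^{\textsc{mv}}(\rho)\) by the \(\rho\)-average of per-view disagreements used in Eq. (6), and the factor-2 relation for \(B_{\rho}^{\textsc{mv}}\):

```python
import numpy as np

rng = np.random.default_rng(1)

m, V, n = 400, 3, 5                     # examples, views, voters per view
y = rng.choice([-1, 1], size=m)

# H[v][i, j] = output of voter j of view v on example i (noisy label copies).
H = [np.where(rng.random((m, n)) < rng.uniform(0.1, 0.4, n),
              -y[:, None], y[:, None]) for _ in range(V)]
Q = [rng.dirichlet(np.ones(n)) for _ in range(V)]   # posterior per view
rho = rng.dirichlet(np.ones(V))                     # hyper-posterior over views

# Per-example error probability of a voter drawn via (v ~ rho, h ~ Q_v).
p_view = np.stack([(H[v] != y[:, None]) @ Q[v] for v in range(V)])  # (V, m)
p = rho @ p_view

gibbs = p.mean()                                    # R_S(G_rho^MV)
disag = (2 * p * (1 - p)).mean()                    # d_S^MV(rho)
joint = (p ** 2).mean()                             # e_S^MV(rho)
assert np.isclose(gibbs, disag / 2 + joint)         # multiview analogue of Eq. (3)

# rho-average of the per-view disagreements lower-bounds d_S^MV (Jensen).
disag_views = rho @ np.array([(2 * p_view[v] * (1 - p_view[v])).mean()
                              for v in range(V)])
assert disag_views <= disag + 1e-12

# Multiview majority vote B_rho^MV and its factor-2 bound.
scores = sum(rho[v] * (H[v] @ Q[v]) for v in range(V))
mv_risk = np.mean(np.sign(scores) != y)
assert mv_risk <= 2 * gibbs + 1e-12
```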
3.2 General Multiview PAC-Bayesian Theorems
Now we state our general PAC-Bayesian theorem suitable for the above multiview learning setting with a two-level hierarchy of distributions over views (or voters). A key step in PAC-Bayesian proofs is the use of a change of measure inequality [22], based on the Donsker-Varadhan inequality [8]. Lemma 2 below extends this tool to our multiview setting.
Lemma 2
For any set of priors \(\{P_v\}_{v=1}^V\) and any set of posteriors \(\{Q_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) on views \(\mathcal{V}\) and hyper-posterior distribution \(\rho \) on \(\mathcal{V}\), and for any measurable function \(\phi : \mathcal {H}_v \rightarrow \mathbb {R}\), we have
Proof
We have
According to the Kullback-Leibler definition, we have
By applying Jensen’s inequality (Theorem 6, in Appendix) on the concave function \(\ln \), we have
Finally, we apply again the Jensen inequality (Theorem 6) on \(\ln \) to obtain the lemma. \(\square \)
Based on Lemma 2, the following theorem can be seen as a generalization of Theorem 1 to multiview. Note that we still rely on a general convex function \(D: [0,1] \times [0,1] \rightarrow \mathbb {R}\), that measures the “deviation” between the empirical disagreement/joint error and the true risk of the Gibbs classifier.
Theorem 3
Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), for any convex function \(D : [0,1] \times [0,1] \rightarrow \mathbb {R}\), we have
Proof
We follow the same steps as in the proof of Theorem 1:
where the last inequality is obtained using Lemma 2. After distributing the expectation over \(S\sim \mathcal{D}^m\), the final statement follows from Jensen’s inequality (Theorem 6)
and from Eq. (3): \(R_S(G_{\rho _S}^{\textsc {mv}}) = \tfrac{1}{2} d_{S}^{\textsc {mv}}(\rho _S)+ e_{S}^{\textsc {mv}}(\rho _S)\). \(\square \)
It is interesting to compare this generalization bound to Theorem 1. The main difference lies in the introduction of view-specific prior and posterior distributions, which mainly leads to an additional term \(\mathbb {E}_{v \sim \rho } {\text {KL}}(Q_v \Vert P_v)\), expressed as the expectation over the views \(\mathcal{V}\), according to the hyper-posterior distribution \(\rho \), of the view-specific Kullback-Leibler divergence. We also introduce the empirical disagreement, allowing us to directly highlight the presence of diversity between voters and between views. Like Theorem 1, Theorem 3 provides a tool to derive PAC-Bayesian generalization bounds for a multiview supervised learning setting. Indeed, by making use of the same trick as Germain et al. [9, 10], generalization bounds can be derived from Theorem 3 by choosing a suitable convex function D and upper-bounding \(\mathbb {E}_S \mathbb {E}_v \mathbb {E}_h e^{ m\, D(R_S(h), R_\mathcal{D}(h))} \). We provide an example of such a specialization in Sect. 3.3, following McAllester’s [21] point of view. Note that we provide the specialization to the other classical PAC-Bayesian approaches of Catoni [5], Langford [15], and Seeger [27] in our research report Goyal et al. [11, Sect. 3.3.].
Following the same approach, we can obtain a multiview bound for the expected disagreement.
Theorem 4
Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_v\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), for any convex function \(D : [0,1] \times [0,1] \rightarrow \mathbb {R}\), we have
Proof
The result is obtained straightforwardly by following the proof steps of Theorem 3, using the disagreement instead of the Gibbs risk. Then, similarly to what we did to obtain Theorem 2, we substitute \({\text {KL}}({Q_{v,S}^2}\Vert P^2_v)\) with \(2{\text {KL}}({Q_{v,S}}\Vert P_v)\), and \({\text {KL}}(\rho _S^2 \Vert \pi ^2)\) with \(2{\text {KL}}(\rho _S\Vert \pi )\). \(\square \)
3.3 Specialization of Our Theorem to the McAllester’s Approach
We derive here the specialization of our multiview PAC-Bayesian theorem to the point of view of McAllester [22]. To do so, we follow the same principle as Germain et al. [9, 10] to obtain Corollary 1.
Corollary 1
Let \(V \ge 2\) be the number of views. For any distribution \(\mathcal{D}\) on \(\mathcal{X}\times \mathcal{Y}\), for any set of prior distributions \(\{P_{v}\}_{v=1}^V\), for any hyper-prior distribution \(\pi \) over \(\mathcal{V}\), we have
Proof
To prove the above result, we apply Theorem 3 with \(D(a,b)\, =\, 2(a-b)^2\), and we upper-bound \(\mathop {\mathbb {E}}\limits _{S \sim \mathcal{D}^m} \mathop {\mathbb {E}}\limits _{ v \sim \pi }\ \mathop {\mathbb {E}}\limits _{ h \sim P_v} e^{ m\, D(R_S(h), R_\mathcal{D}(h))}\). According to Pinsker’s inequality, we have \(D(a,b) \le {\text {kl}}(a,b) =\ a\ln \frac{a}{b}+(1-a)\ln \frac{1-a}{1-b}\). Then, by considering \(R_S(h)\) as a random variable following a binomial distribution of m trials with probability of success \(R_\mathcal{D}(h)\), we obtain
\(\square \)
4 Discussion on Related Work
In this section, we discuss two theoretical studies of multiview learning that relate to the notion of Gibbs classifier.
Amini et al. [1] proposed a Rademacher analysis of the risk of the stochastic Gibbs classifier over the view-specific models (for more than two views), where the distribution over the views is restricted to the uniform distribution. In their work, each view-specific model is found by minimizing the empirical risk: \(\displaystyle h_v^* \ = \mathop {\text{ argmin }}\limits _{h \in \mathcal {H}_v} \frac{1}{m}\sum _{(\mathbf {x},y) \in S} \mathbbm {1}_{[h(x^v) \ne y]}.\) The prediction for a multiview example \(\mathbf {x}\) is then based on the stochastic Gibbs classifier defined according to the uniform distribution, i.e., \(\forall v\in \mathcal{V},\ \rho (v)=\frac{1}{V}\). The risk of the multiview Gibbs classifier is hence given by
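plugging the uniform hyper-posterior into the Gibbs risk of Sect. 3 (we write \(u\) for this uniform distribution, a notation introduced here for convenience):

```latex
R_{\mathcal{D}}(G_{u}^{\textsc{mv}}) \;=\; \frac{1}{V}\sum_{v=1}^{V}\ \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\ \mathbbm{1}_{[h_{v}^{*}(x^{v})\ne y]}.
```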
Moreover, Sun et al. [29] proposed a PAC-Bayesian analysis for multiview learning over the concatenation of the views, where the number of views is set to two, and deduced an SVM-like learning algorithm from this framework. The key idea of their approach is to define a prior distribution that promotes similar classification among the two views; the notion of diversity among the views is thus handled by a different strategy than ours. We believe that the two approaches are complementary: in the general case of more than two views that we consider in our work, we can also use an informative prior similar to the one proposed by Sun et al. [29] for learning.
5 Experiments
In this section, we present experiments highlighting the usefulness of our theoretical analysis by following a two-level hierarchical strategy. To do so, we learn a multiview model in two stages by following a classifier late fusion approach [28] (sometimes referred to as stacking [30]). Concretely, we first learn view-specific classifiers for each view at the base level of the hierarchy, each view-specific classifier being expressed as a majority vote of kernel functions. Then, we learn a weighted combination based on the predictions of the view-specific classifiers. It is worth noting that this is the procedure followed by Morvant et al. [23] in a PAC-Bayesian fashion, but without any theoretical justification and in a ranking setting.
We consider a publicly available multilingual multiview text categorization corpus extracted from the Reuters RCV1/RCV2 corpus [1]Footnote 3, which contains more than 110,000 documents from five different languages (English, German, French, Italian, Spanish) distributed over six classes. To transform the dataset into a binary classification task, we consider six one-versus-all classification problems: for each class, we learn a multiview binary classification model by considering all documents from that class as positive examples and all others as negative examples. We then split the dataset into training and testing sets: we reserve a test sample containing \(30 \%\) of the documents. In order to highlight the benefits of the information brought by multiple views, we train the models with small learning sets by randomly choosing the learning sample S from the remaining documents; the numbers of learning examples m considered are 150, 200, 250 and 300. For each fusion-based approach, we split the learning sample S into two parts: \(S_1\) for learning the view-specific classifiers at the first level and \(S_2\) for learning the final multiview model at the second level, such that \(|S_1|=\frac{3}{5} m\) and \(|S_2|=\frac{2}{5} m\) (with \(m=|S|\)). In addition, the reported results are averaged over 20 runs of experiments, each run being done with a new random learning sample. Since the classes are highly unbalanced, we report in Table 1 the accuracy along with the F1-measure, which is the harmonic mean of precision and recall, computed on the test sample.
To assess that multiview learning with late fusion makes sense for our task, we consider as baselines the four following one-step learning algorithms (provided with the learning sample S). First, we learn a view-specific model on each view and report, as \(\texttt {Mono}_v\), their average performance. We also follow an early fusion procedure, referred to as \(\texttt {Concat}_{\texttt {SVM}}\), consisting in learning one single model using SVM [7] over the simple concatenation of the features of the five views. Moreover, we look at two simple voters’ combinations, respectively denoted by \(\texttt {Aggreg}_{\texttt {P}}\) and \(\texttt {Aggreg}_{\texttt {L}}\), for which the weights associated with each view follow the uniform distribution. Concretely, \(\texttt {Aggreg}_{\texttt {P}}\), respectively \(\texttt {Aggreg}_{\texttt {L}}\), combines the real-valued predictions, respectively the labels, returned by the view-specific classifiers. In other words, we have
with \(h^v(x^v)\) the real-valued prediction of the view-specific classifier learned on view v.
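As a toy illustration of the difference between the two aggregations (the decision values below are made up, not actual view-specific outputs), a single confident view can flip \(\texttt{Aggreg}_{\texttt{P}}\) while \(\texttt{Aggreg}_{\texttt{L}}\) follows the majority of the predicted labels:

```python
import numpy as np

# Hypothetical real-valued predictions h^v(x^v) of the V = 5 view-specific
# classifiers on a single example x (e.g. SVM decision values).
scores = np.array([2.0, -0.5, -0.5, -0.4, -0.3])

aggreg_p = np.sign(scores.mean())           # uniform vote on real-valued outputs
aggreg_l = np.sign(np.sign(scores).mean())  # uniform vote on predicted labels

# One large positive score outweighs the four negative ones for Aggreg_P...
assert aggreg_p == 1.0
# ...while Aggreg_L sides with the majority of the predicted labels.
assert aggreg_l == -1.0
```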
We compare the above one-step methods to the two following late fusion approaches that only differ at the second level. Concretely, at the first level we construct from \(S_1\) different view-specific majority votes expressed as linear SVM modelsFootnote 4 with different hyperparameter C values (12 values between \(10^{-8}\) and \(10^{3}\)): we do not perform cross-validation at the first level. This has the advantage of (i) lightening the first-level learning process, since we do not need to validate models, and (ii) potentially increasing the expressivity of the final model.
At the second level, as is often done for late fusion, we learn from \(S_2\) the final weighted combination over the view-specific voters using a RBF kernel. The methods referred to as \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\), respectively \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\), make use of SVM, respectively of the PAC-Bayesian algorithm CqBoost [26]. Note that, as recalled in Sect. 2, CqBoost is an algorithm that tends to minimize the C-Bound of Eq. (2): it directly captures a trade-off between accuracy and disagreement.
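A minimal sketch of this two-level pipeline using scikit-learn, with synthetic stand-in views (the data, feature dimensions and hyperparameter grid are illustrative, and only the \(\texttt{Fusion}_{\texttt{SVM}}^{\texttt{all}}\) variant is shown, since CqBoost is not part of scikit-learn):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a multiview learning sample S: V views of m examples.
m, V = 300, 3
y = rng.choice([-1, 1], size=m)
views = [y[:, None] * rng.normal(1.0, 2.0, size=(m, 10)) for _ in range(V)]

# Split S into S1 (first level) and S2 (second level): |S1| = 3m/5, |S2| = 2m/5.
idx1, idx2 = train_test_split(np.arange(m), train_size=0.6, random_state=0)

# First level: one linear SVM per view and per value of C, with no validation.
Cs = [0.01, 1.0, 100.0]  # illustrative grid, smaller than the paper's 12 values
first_level = [[LinearSVC(C=C, max_iter=10000).fit(views[v][idx1], y[idx1])
                for C in Cs] for v in range(V)]

def level1_outputs(idx):
    """Stack the view-specific voters' real-valued predictions on examples idx."""
    return np.column_stack([clf.decision_function(views[v][idx])
                            for v in range(V) for clf in first_level[v]])

# Second level: the weighted combination over the voters' outputs, learned
# with an RBF kernel (here an SVM, i.e. the Fusion_SVM^all variant).
fusion = SVC(kernel="rbf", gamma=0.1).fit(level1_outputs(idx2), y[idx2])
acc = fusion.score(level1_outputs(idx2), y[idx2])
```

In practice the first-level classifiers are kept as they are (no validation), and only the second-level model is tuned by cross-validation, as described below.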
We follow a 5-fold cross-validation procedure for selecting the hyperparameters of each learning algorithm. For \(\texttt {Mono}_v\), \(\texttt {Concat}_{\texttt {SVM}}\), \(\texttt {Aggreg}_{\texttt {P}}\) and \(\texttt {Aggreg}_{\texttt {L}}\) the hyperparameter C is chosen over a set of 12 values between \(10^{-8}\) and \(10^{3}\). For \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) and \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) the hyperparameter \(\gamma \) of the RBF kernel is chosen over 9 values between \(10^{-6}\) and \(10^{2}\). For \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\), the hyperparameter C is chosen over a set of 12 values between \(10^{-8}\) and \(10^{3}\). For \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\), the hyperparameter \(\mu \) is chosen over a set of 8 values between \(10^{-8}\) and \(10^{-1}\). Note that we made use of the scikit-learn [24] implementation for learning our SVM models.
First of all, from Table 1, the two-step approaches provide the best results on average. Secondly, according to a Wilcoxon rank sum test [19] with \(p < 0.02\), the PAC-Bayesian late fusion approach \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) is significantly the best method in terms of accuracy; moreover, except for the smallest learning sample size (\(m=150\)), \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) and \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) produce models with similar F1-measure. We can also remark that \(\texttt {Fusion}_{\texttt {Cq}}^{\texttt {all}}\) is more “stable” than \(\texttt {Fusion}_{\texttt {SVM}}^{\texttt {all}}\) according to the standard deviation values. These results confirm the potential of PAC-Bayesian approaches for multiview learning, where one can control a trade-off between accuracy and diversity among voters.
6 Conclusion and Future Work
In this paper, we proposed a first PAC-Bayesian analysis of weighted majority vote classifiers for multiview learning when observations are described by more than two views. Our analysis is based on a hierarchy of distributions, i.e. weights, over the views and voters: (i) for each view v, posterior and prior distributions over the view-specific voters’ set, and (ii) hyper-posterior and hyper-prior distributions over the set of views. We derived a general PAC-Bayesian theorem tailored for this setting, which can be specialized to any convex function to compare the empirical and true risks of the stochastic Gibbs classifier associated with the weighted majority vote. We also presented a similar theorem for the expected disagreement, a notion that turns out to be crucial in multiview learning. Moreover, while usual PAC-Bayesian analyses are expressed as probabilistic bounds over the random choice of the learning sample, we presented here bounds in expectation over the data, which is very interesting from a PAC-Bayesian standpoint where the posterior distribution is data dependent. Following the distributions’ hierarchy, we evaluated a simple two-step learning algorithm (based on late fusion) on a multiview benchmark. We compared the accuracies obtained when using SVM and the PAC-Bayesian algorithm CqBoost for weighting the view-specific classifiers. The latter revealed itself as the better strategy, as it deals nicely with the trade-off between accuracy and disagreement promoted by our PAC-Bayesian analysis of the multiview hierarchical approach.
We believe that our theoretical and empirical results are a first step toward the goal of theoretically understanding multiview learning through the PAC-Bayesian point of view, and toward the objective of deriving new multiview learning algorithms. This opens exciting perspectives. Among them, we would like to specialize our result to linear classifiers, for which PAC-Bayesian approaches are known to lead to tight bounds and efficient learning algorithms [9]. This clearly opens the door to deriving theoretically founded algorithms for multiview learning. Another possible algorithmic direction is to take into account a second statistical moment in order to link it explicitly to important properties between views, such as diversity or agreement [1, 13]. A first direction is to exploit our multiview PAC-Bayesian C-Bound of Lemma 1—which already takes into account such a notion of diversity [23]—in order to derive an algorithm, as done in a mono-view setting by Laviolette et al. [17] and Roy et al. [26]. Another perspective is to extend our bounds to diversity-dependent priors, similarly to the approach of Sun et al. [29], but for more than two views. This would allow us to additionally consider a priori knowledge on the diversity. Moreover, we would like to explore semi-supervised multiview learning, where one has access to unlabeled data \(S_u=\{\mathbf{x}_j\}_{j=1}^{m_u}\) along with labeled data \(S_l=\{(\mathbf{x}_i,y_i)\}_{i=1}^{m_l}\) during training. Indeed, an interesting behaviour of our theorem is that it can easily be extended to this situation: the bound will be a combination of a bound over \(\tfrac{1}{2} d_{S_u}^{\textsc {mv}}(\rho )\) (depending on \(m_u\)) and a bound over \(e_{S_l}^{\textsc {mv}}(\rho )\) (depending on \(m_l\)). The main difference with the supervised bound is that the Kullback-Leibler divergence will be multiplied by a factor of 2.
Notes
- 1.
The fusion of descriptions (resp. of models) is sometimes called early fusion (resp. late fusion).
- 2.
Our notion of hyper-prior and hyper-posterior distributions is different from the one proposed for lifelong learning [25], where the hyper-prior and hyper-posterior are basically considered over the set of possible priors: the prior distribution P over the voters' set is viewed as a random variable.
- 3.
- 4.
We use a linear SVM model, as is usually done for text classification tasks [e.g., 12].
References
Amini, M.-R., Usunier, N., Goutte, C.: Learning from multiple partially observed views - an application to multilingual text categorization. In: NIPS, pp. 28–36 (2009)
Atrey, P.K., Hossain, M.A., El-Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimedia Syst. 16(6), 345–379 (2010)
Bégin, L., Germain, P., Laviolette, F., Roy, J.-F.: PAC-Bayesian bounds based on the Rényi divergence. In: AISTATS, pp. 435–444 (2016)
Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: COLT, pp. 92–100 (1998)
Catoni, O.: PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, vol. 56. Institute of Mathematical Statistics, Shaker Heights (2007)
Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning, 1st edn. The MIT Press, Cambridge (2010). ISBN 0262514125, 9780262514125
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Donsker, M.D., Varadhan, S.S.: Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 28(1), 1–47 (1975)
Germain, P., Lacasse, A., Laviolette, F., Marchand, M.: PAC-Bayesian learning of linear classifiers. In: ICML, pp. 353–360 (2009)
Germain, P., Lacasse, A., Laviolette, F., Marchand, M., Roy, J.: Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. JMLR 16, 787–860 (2015)
Goyal, A., Morvant, E., Germain, P., Amini, M.-R.: PAC-Bayesian analysis for a two-step hierarchical multiview learning approach. arXiv preprint arXiv:1606.07240 (2016)
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026683. ISBN 3-540-64417-2
Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, Hoboken (2004). ISBN 0471210781
Lacasse, A., Laviolette, F., Marchand, M., Germain, P., Usunier, N.: PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In: NIPS, pp. 769–776 (2006)
Langford, J.: Tutorial on practical prediction theory for classification. JMLR 6, 273–306 (2005)
Langford, J., Shawe-Taylor, J.: PAC-Bayes & margins. In: NIPS, pp. 423–430. MIT Press (2002)
Laviolette, F., Marchand, M., Roy, J.-F.: From PAC-Bayes bounds to quadratic programs for majority votes. In: ICML (2011)
Lecué, G., Rigollet, P.: Optimal learning with Q-aggregation. Ann. Statist. 42(1), 211–224 (2014). https://doi.org/10.1214/13-AOS1190
Lehmann, E.: Nonparametric Statistical Methods Based on Ranks. McGraw-Hill, New York (1975)
Maillard, O.-A., Vayatis, N.: Complexity versus agreement for many views. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS (LNAI), vol. 5809, pp. 232–246. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04414-4_21
McAllester, D.A.: Some PAC-Bayesian theorems. Mach. Learn. 37, 355–363 (1999)
McAllester, D.A.: PAC-Bayesian stochastic model selection. Mach. Learn. 51, 5–21 (2003)
Morvant, E., Habrard, A., Ayache, S.: Majority vote of diverse classifiers for late fusion. In: Fränti, P., Brown, G., Loog, M., Escolano, F., Pelillo, M. (eds.) S+SSPR 2014. LNCS, vol. 8621, pp. 153–162. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44415-3_16
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pentina, A., Lampert, C.H.: A PAC-Bayesian bound for lifelong learning. In: ICML, pp. 991–999 (2014)
Roy, J.-F., Marchand, M., Laviolette, F.: A column generation bound minimization approach with PAC-Bayesian generalization guarantees. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1241–1249 (2016)
Seeger, M.W.: PAC-Bayesian generalisation error bounds for Gaussian process classification. JMLR 3, 233–269 (2002)
Snoek, C., Worring, M., Smeulders, A.W.M.: Early versus late fusion in semantic video analysis. In: ACM Multimedia, pp. 399–402 (2005)
Sun, S., Shawe-Taylor, J., Mao, L.: PAC-Bayes analysis of multi-view learning. CoRR, abs/1406.5614 (2016)
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992)
Acknowledgments
This work was partially funded by the French ANR project LIVES ANR-15-CE23-0026-03, the “Région Rhône-Alpes”, and the CIFAR program in Learning in Machines & Brains.
Appendix—Mathematical Tools
Theorem 5
(Markov’s ineq.). For any random variable X s.t. \(\mathbb {E}(|X|)\,=\, \mu \), for any \(a\,>\,0\), we have \(\displaystyle \mathbb {P}(|X| \ge a) \le \frac{\mu }{a}.\)
Theorem 6
(Jensen’s ineq.). For any random variable X, for any concave function g, we have \(\displaystyle g(\mathop {\mathbb {\mathbb {E}}}\limits [X]) \ \ge \ \mathop {\mathbb {\mathbb {E}}}\limits [g(X)].\)
Theorem 7
(Cantelli-Chebyshev ineq.). For any random variable X s.t. \(\mathbb {E}(X)\,=\, \mu \) and \(\mathbf {Var}(X)=\sigma ^2\), and for any \(a\,>\,0\), we have \(\mathbb {P}(X - \mu \ge a) \le \frac{\sigma ^2}{\sigma ^2 + a^2}.\)
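The three inequalities above can be illustrated numerically. The following quick Monte Carlo sanity check (not part of the paper; the exponential distribution and the threshold are arbitrary choices for the illustration) verifies Markov's and the Cantelli-Chebyshev bounds on simulated draws.

```python
# Empirical check of Markov's and Cantelli-Chebyshev's inequalities
# on an exponential random variable (mean 1, variance 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
mu, var = x.mean(), x.var()

a = 2.0
markov_lhs = np.mean(np.abs(x) >= a)   # P(|X| >= a)
markov_rhs = mu / a                    # mu / a

cantelli_lhs = np.mean(x - mu >= a)    # P(X - mu >= a)
cantelli_rhs = var / (var + a**2)      # sigma^2 / (sigma^2 + a^2)

print("Markov:   %.4f <= %.4f" % (markov_lhs, markov_rhs))
print("Cantelli: %.4f <= %.4f" % (cantelli_lhs, cantelli_rhs))
```

For this distribution the true tail probabilities (\(e^{-2}\) and \(e^{-3}\)) sit well below the respective bounds, as expected.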
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Goyal, A., Morvant, E., Germain, P., Amini, M.-R. (2017). PAC-Bayesian Analysis for a Two-Step Hierarchical Multiview Learning Approach. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2017. Lecture Notes in Computer Science, vol. 10535. Springer, Cham. https://doi.org/10.1007/978-3-319-71246-8_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71245-1
Online ISBN: 978-3-319-71246-8