1 Introduction

Multi-view data are now generated continuously from multiple information channels; for example, hundreds of YouTube videos consisting of visual, audio and text features are uploaded every minute. Different views usually contain complementary information, and multi-view learning can exploit this information to learn representations that are more expressive than those of single-view learning methods. Multi-view representation learning has therefore become a promising topic with wide applicability, and it has attracted considerable interest in the past decades (Zhao et al. 2017; Quang et al. 2013; Sun and Chao 2013; Li et al. 2016; Ye et al. 2015; Liu et al. 2016; Chen and Zhou 2018). There are many multi-view learning approaches, e.g., multiple kernel learning (Gönen and Alpaydın 2011), disagreement-based multi-view learning (Blum and Mitchell 1998), late fusion methods which combine outputs of the models constructed from different view features (Ye et al. 2012) and subspace learning methods for multi-view data (Chen et al. 2012). Among them, the multi-view subspace learning approaches aim at obtaining a subspace shared by multiple views and then learning models in the shared subspace (Sharma et al. 2012; Hardoon et al. 2004; Guo and Xiao 2012; Zhang et al. 2017; Brbić and Kopriva 2018). They are very useful for cross-view classification and retrieval tasks. However, these approaches may overfit small training sets. A max-margin harmonium model (MMH) (Chen et al. 2012) was proposed to avoid overfitting by introducing the max-margin principle to the latent subspace Markov network for multi-view data. But MMH is formulated under the maximum entropy discrimination framework and cannot automatically infer the penalty parameter of max-margin models in a Bayesian style. Du et al. (2015) proposed a posterior-regularized Bayesian approach that combines principal component analysis (PCA) with max-margin learning; it can infer the penalty parameter of max-margin models but cannot address multi-view data.

On the other hand, multi-view data often cannot be collected all at once due to temporal and spatial constraints in applications, while traditional multi-view algorithms need to store the entire training set. Online learning is an efficient way to address this problem. Online learning started from the Perceptron algorithm (Rosenblatt 1958), and many efforts have been devoted to it (Cesa-Bianchi and Lugosi 2006; Hazan et al. 2007; Chechik et al. 2010). Unfortunately, there are few studies of online multi-view learning methods. OPMV (Zhu et al. 2015) is one of the few online multi-view learning algorithms. OPMV jointly optimizes the composite objective functions with linear consistency constraints for different views, but it does not take the relevance of different views into account. OPMV is also formulated as a point estimate by optimizing a deterministic objective function, without consideration of the model uncertainty. Online passive-aggressive (PA) learning provides a method for online large-margin learning (Crammer et al. 2006). Although it enjoys strong discriminative ability suitable for predictive tasks, it is likewise formulated as a point estimate by optimizing a deterministic objective function. A point estimate can be seriously affected by inappropriate regularization, outliers and noise, especially when the training data arrive sequentially. Based on online PA learning, Shi and Zhu (2013) proposed a Bayesian PA learning method which infers a posterior under the Bayesian framework instead of a point estimate. Nevertheless, these online learning methods cannot process multi-view data. To the best of our knowledge, there have been few efforts focused on online multi-view learning under the Bayesian framework.

We address the aforementioned problems by developing an online Bayesian multi-view subspace learning method with the max-margin principle. Specifically, we first propose a predictive subspace learning method based on factor analysis and define a latent margin loss for classification in the subspace. Then we cast the learning problem into a variational Bayesian framework by exploiting the pseudo-likelihood and data augmentation idea (Zhu et al. 2014) which allows us to automatically infer the penalty parameter. With the variational approximate posterior inferred from the past samples, we can naturally combine historical knowledge with new arriving data, in a Bayesian passive-aggressive style. We update our model with the training data coming batch by batch, instead of storing all training data.

In our previous work, we proposed a Bayesian multi-view learning algorithm (He et al. 2016) to learn a predictive subspace with the max-margin principle and then extended this model to the online scenario with streaming data. However, that preliminary work mainly focuses on classification tasks and cannot predict continuous response values such as rating scores for hotel reviews and movie reviews.

In this paper, we further extend our Bayesian multi-view subspace learning model to solve multi-view regression problems, based on the \(\epsilon \)-insensitive Support Vector Regression (SVR) (Zhu et al. 2009). Regression has been successfully applied in many fields such as finance (Jiang and He 2012; Kazem et al. 2013), transportation (Lin et al. 2013) and meteorology (Kalteh 2013; Chen and Jie 2014). We apply the proposed regression model to several real multi-view regression data sets to predict the global rating score of a hotel, the rating of a movie and the relative location of a CT image on the axial axis; our classification model cannot handle these tasks because the responses are continuous. Furthermore, we extend our batch regression model to the online scenario so that it can deal with streaming data. Experiments on synthetic data and on various real classification and regression tasks show that both our batch and online models achieve superior performance compared with a number of competitors.

The paper is structured as follows. Section 2 introduces the related work. Section 3 presents the Bayesian subspace multi-view classification (regression) models and their online versions. Section 4 presents the variational inference for our models. Section 5 presents empirical results and Sect. 6 concludes.

2 Related work

The earliest works on multi-view learning were introduced by Blum and Mitchell (1998) and Yarowsky (1995). Nowadays, there are many multi-view learning approaches, e.g., multiple kernel learning (Gönen and Alpaydın 2011), disagreement-based multi-view learning (Blum and Mitchell 1998) and late fusion methods which combine outputs of the models constructed from different view features (Ye et al. 2012). In particular, multi-view subspace learning algorithms learn a latent salient representation of multi-view data (Sharma et al. 2012; Hardoon et al. 2004; Zhang et al. 2017; Brbić and Kopriva 2018). This approach aims at obtaining a subspace shared by multiple views and then learning models in the shared subspace. However, most of these approaches are formulated as a point estimate by optimizing a deterministic objective function. A point estimate can be seriously affected by inappropriate regularization, outliers and noise.

The above methods mainly focus on multi-view classification tasks, while multi-view regression tasks, in which the response values are continuous, also arise in many fields. Most of the existing multi-view regression models are based on maximum likelihood or on optimizing a deterministic objective function (Zheng et al. 2015; Lan et al. 2016; Wang et al. 2013; Merugu et al. 2006); their objective is to learn an optimal value for each parameter in the model, so they can be seriously affected by outliers and noise. Moreover, most existing multi-view regression solvers require the user to pre-specify the penalty parameter as input, but in many real machine learning applications the optimal penalty parameter is hard to determine in advance. Although these solvers can select the parameter automatically by cross-validation on the training samples, they may overfit small training sets.

Online learning started from the Perceptron algorithm (Rosenblatt 1958) and has attracted much attention during the past years (Cesa-Bianchi and Lugosi 2006; Hazan et al. 2007; Grangier and Bengio 2008; Chechik et al. 2010; Chen et al. 2017). Crammer et al. (2006) proposed online passive-aggressive (PA) learning, which provides a general framework for online large-margin learning, with many applications (Chiang et al. 2008). Online Bayesian passive-aggressive learning presents a generic framework for performing online learning for Bayesian max-margin models (Shi and Zhu 2013). Chen et al. (2017) propose an online partial least squares optimization method to study a non-convex formulation for multi-view representation learning, which can be efficiently solved by a simple stochastic gradient method. Xie et al. (2017) propose dynamic multi-view hashing (DMVH), which can adaptively augment hash codes according to dynamic changes of images. An online unsupervised multi-view model has also been proposed for feature selection (Shao et al. 2016).

Unfortunately, there are few online learning methods for multi-view data. OPMV (Zhu et al. 2015) is one of the few online multi-view classification models. This approach jointly optimizes the composite objective functions with linear consistency constraints for different views, but it does not take the cross-view latent relationship into consideration. OPMV is also formulated as a point estimate by optimizing a deterministic objective function, without consideration of the model uncertainty. For regression, there are some online regression models (Parrella 2007; Ting 2011; Nguyen-Tuong et al. 2009; Deng et al. 2016). These methods can handle multi-view data by concatenating all views to form a new single view. However, they do not take the relevance of the different views into account, and most of them are based on point estimation, without consideration of the model uncertainty. As a result, their prediction performance can be seriously affected by outliers and noise in the online data.

3 The proposed model

In this section, we first propose max-margin subspace learning based on factor analysis. Then we develop multi-view classification and multi-view regression models with max-margin subspace learning under the Bayesian framework. Finally, we extend the batch model to the online scenario, in which the model is trained with samples arriving batch by batch.

3.1 Max-margin subspace learning

Suppose we have a set of N observations \({\mathbf {x}}^{(n)}, n=1,\ldots ,N\) in a d-dimensional feature space. Factor analysis projects an observation into a low-dimensional space that captures the latent features of the data. In factor analysis, the generative process for each observation \({\mathbf {x}}\) is as follows:

$$\begin{aligned} \mathbf {z}\sim & {} \mathcal {N}({\mathbf {z}}|{\mathbf {0}}, \ {\mathbf {I}}_m)\nonumber \\ \varepsilon\sim & {} \mathcal {N}(\varepsilon |{\mathbf {0}}, \ \varPhi )\nonumber \\ \mathbf {x}= & {} {\varvec{\mu }}+{\mathbf {W}}{\mathbf {z}}+\varepsilon , \end{aligned}$$
(1)

where \(\varepsilon \in \mathbb {R}^{d\times 1}\) denotes the Gaussian noise, \(\varPhi \in \mathbb {R}^{d\times d}\) is the covariance matrix of \(\varepsilon \), \({\varvec{\mu }} \in \mathbb {R}^{d\times 1}\) is the mean of \({\mathbf {x}}\), \({\mathbf {W}} \in \mathbb {R}^{d\times m}\) is the factor loading matrix, and \({\mathbf {z}}\) is an m-dimensional latent variable.

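Integrating out \({\mathbf {z}}\) gives \({\mathbf {x}}\sim \mathcal {N}({\varvec{\mu }}, {\mathbf {W}}{\mathbf {W}}^T+\varPhi )\), so maximizing the log-likelihood \(\ell _s\) of \(\{ {\mathbf {x}}^{(n)} \}, n=1,\ldots ,N\) amounts to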

$$\begin{aligned} \max \limits _{{\varvec{\mu }},{\mathbf {W}},\varPhi } \ell _s({\varvec{\mu }},{\mathbf {W}},\varPhi ) = \max \limits _{{\varvec{\mu }},{\mathbf {W}},\varPhi } \log \prod \limits _{n=1}^{N}\frac{\exp \left( -\frac{1}{2}({\mathbf {x}}^{(n)}-{\varvec{\mu }})^T ({\mathbf {W}}{{\mathbf {W}}}^T+\varPhi )^{-1}({\mathbf {x}}^{(n)}-{\varvec{\mu }})\right) }{(2\pi )^{d/2}| {\mathbf {W}}{{\mathbf {W}}}^T+\varPhi |^{1/2}}. \end{aligned}$$
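
The log-likelihood above can be evaluated directly from the marginal covariance \({\mathbf {W}}{\mathbf {W}}^T+\varPhi \). The following is a minimal NumPy sketch for illustration, not the implementation used in this paper; the function name `fa_log_likelihood` and the one-observation-per-column layout of `X` are assumptions made for the example.

```python
import numpy as np

def fa_log_likelihood(X, mu, W, Phi):
    """Log-likelihood of factor analysis with the latent z integrated out.

    X   : (d, N) array, one observation per column (assumed layout)
    mu  : (d,) mean vector
    W   : (d, m) factor loading matrix
    Phi : (d, d) noise covariance matrix
    """
    d, N = X.shape
    S = W @ W.T + Phi                       # marginal covariance of x
    diff = X - mu[:, None]                  # centered observations
    sign, logdet = np.linalg.slogdet(S)     # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("W W^T + Phi must be positive definite")
    # sum_n diff_n^T S^{-1} diff_n, using a linear solve instead of an explicit inverse
    quad = np.sum(diff * np.linalg.solve(S, diff))
    return -0.5 * (N * d * np.log(2.0 * np.pi) + N * logdet + quad)
```

Using `slogdet` and `solve` rather than forming the determinant and inverse explicitly is a numerical-stability choice of this sketch, not something prescribed by the model.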

However, factor analysis is an unsupervised model, which learns the latent variables of the observations without using any supervised side information. The max-margin principle can be introduced to incorporate supervised side information into the factor analysis model. To this end, we need a loss function that integrates the max-margin principle for prediction with latent subspace discovery. Suppose we have a \(1\times N\) response vector \({\mathbf {y}}\) with elements \(y^{(n)}, n=1,\ldots ,N\). For a classification task, we assume \(y^{(n)}\in \{+1,-1\}\); for a regression task, \(y^{(n)}\) is a continuous value. We define \(\tilde{{\mathbf {z}}}=[{\mathbf {z}}^T, 1]^T\) as the augmented latent representation of observation \({\mathbf {x}}\), and let \(f({\mathbf {x}};\tilde{{\mathbf {z}}},\varvec{\eta })=\varvec{\eta }^T\tilde{\mathbf {z}}\) be a discriminant function parameterized by \(\varvec{\eta }\). For classification, we can compute the following margin loss on training data \(({\mathbf {X}},{\mathbf {y}})\) with fixed values of \({\mathbf {Z}}\) and \(\varvec{\eta }\):

$$\begin{aligned} \ell _m({\mathbf {Z}},\eta )=\sum _{n=1}^{N}\max (0, 1 - y^{(n)} f(\mathbf {x}^{(n)})). \end{aligned}$$
(2)

For regression, we choose to minimize the \(\epsilon \)-insensitive loss, which is used in standard support vector regression (SVR) (Smola and Schölkopf 2004):

$$\begin{aligned} \ell _m({\mathbf {Z}},\eta )=\sum _{n=1}^{N}\max (0, |y^{(n)}-f(\mathbf {x}^{(n)})|-\epsilon ), \end{aligned}$$
(3)

where \(\epsilon \in \mathbb {R}_{+}\) is the precision parameter, which is usually small.
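
The two losses in Eqs. (2) and (3) can be written in a few lines. The sketch below is only illustrative; the function names are hypothetical, and `f` is assumed to hold the already computed discriminant values \(f({\mathbf {x}}^{(n)})\).

```python
def hinge_loss(y, f):
    """Margin loss of Eq. (2): y holds labels in {+1, -1}, f the discriminant values."""
    return sum(max(0.0, 1.0 - yn * fn) for yn, fn in zip(y, f))

def eps_insensitive_loss(y, f, eps):
    """Epsilon-insensitive loss of Eq. (3): errors smaller than eps cost nothing."""
    return sum(max(0.0, abs(yn - fn) - eps) for yn, fn in zip(y, f))
```

For example, with labels `[1, -1, 1]` and discriminant values `[2.0, 0.5, 0.0]`, the first sample lies outside the margin and contributes nothing, while the other two contribute 1.5 and 1.0.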

The max-margin subspace learning (\(\hbox {M}^2\)SL) model can be formulated as follows:

$$\begin{aligned} \max \limits _{{\varvec{\mu }},{\mathbf {W}},\varPhi ,{\mathbf {Z}},\eta } \ell _s({\varvec{\mu }},{\mathbf {W}},\varPhi )-C\ell _m({\mathbf {Z}},\eta ), \end{aligned}$$
(4)

where C is the regularization parameter.

3.2 Multi-view classification with Bayesian \(\hbox {M}^2\)SL

We now propose a Bayesian max-margin subspace multi-view learning (\(\hbox {BM}^{2}\)SMVL) model. In this section, we present the classification model. We consider general multi-class classification; the binary case can be derived similarly.

We assume that \(N_v\) is the number of views, \(N_c\) is the number of classes, and \(d_i\) is the dimension of the i-th view. The data matrix of the i-th view is \({{\mathbf {X}}}_i \in \mathbb {R}^{d_i\times N}\), consisting of N observations \({\mathbf {x}}^{(n)}_i\) in a \(d_i\)-dimensional feature space. We define \({\mathbf {y}}\), with \(y^{(n)} \in \{1,\ldots ,N_c\}, n=1,\ldots ,N\), as a \(1\times N\) label vector and \({{\mathbf {x}}}^{(n)} = \{{\mathbf {x}}_i^{(n)}\},i=1,\ldots ,N_v\) as the n-th observation.

In our \(\hbox {BM}^{2}\)SMVL model, we learn the latent variable \({\mathbf {z}}^{(n)}\) from which the observation \({{\mathbf {x}}}^{(n)}\) is generated: each view \({\mathbf {x}}^{(n)}_i\) of the n-th observation \({{\mathbf {x}}}^{(n)}\) is generated from the shared latent variable \({\mathbf {z}}^{(n)}\). We then impose prior distributions over the parameters in Eq. (1), which gives the following generative process for the n-th observation:

$$\begin{aligned} {{{{\varvec{\mu }}}}}_i\sim & {} \mathcal {N}({{{{\varvec{\mu }}}}}_i|{\mathbf {0}}, \ {\beta ^{-1}_i}{{\mathbf {I}}}_{d_i})\\ {{{\alpha }}}_i\sim & {} \prod _{j=1}^{m}\varGamma ({\alpha }_{ij}|a_{\alpha _i},b_{\alpha _i})\\ {\mathbf {W}_i}|{{{\alpha }}}_i\sim & {} \prod _{j=1}^{d_i}\mathcal {N}(w_{ij}|{\mathbf {0}}, \ \mathrm {diag}({{{\alpha }}}_i))\\ \phi _i\sim & {} \varGamma (\phi _i|a_{\phi _i},b_{\phi _i})\\ {\mathbf {x}}^{(n)}_i|{{\mathbf {z}}}^{(n)}\sim & {} \mathcal {N}({\mathbf {x}}^{(n)}_i|{\mathbf {W}}_i{\mathbf {z}}^{(n)}+{{\varvec{\mu }}_i}, \ \phi _i^{-1}{{\mathbf {I}}}_{d_i}), \end{aligned}$$

where \(\varGamma (\cdot )\) is the Gamma distribution, \(\beta _i\), \(a_{\alpha _i}\), \(b_{\alpha _i}\), \(a_{\phi _i}\), \(b_{\phi _i}\) are the hyper-parameters, and \({\mathbf {W}_i} \in \mathbb {R}^{d_i\times m}\). The prior on \({\mathbf {W}_i}\) and \({{{\alpha }}}_i\) follows automatic relevance determination (ARD) (Reents and Urbanczik 1998). To improve the efficiency of our algorithm, we define the covariance matrix \(\varPhi _i\) of \({\mathbf {x}}^{(n)}_i\) as the diagonal matrix \(\phi _i^{-1}{{\mathbf {I}}}_{d_i}\). Let \(\varOmega =({{{{\varvec{\mu }}}}}, {{{\alpha }}}, {\mathbf {W}}, \phi , {{\mathbf {z}}})\) denote all the parameters and latent variables, and let \(p_0(\varOmega )=p_0({{{{\varvec{\mu }}}}})p_0({\mathbf {W},\alpha })p_0(\phi )p_0({\mathbf {z}})\) be the prior of \(\varOmega \). Following regularized Bayesian inference (Zhu et al. 2014), we re-express Eq. (1) as a Bayesian posterior distribution. It can be verified that the Bayesian posterior distribution \(p(\varOmega |{\mathbf {X}})=p_0(\varOmega )p({\mathbf {X}}|\varOmega )/p({\mathbf {X}})\) is equal to the solution of the following optimization problem:

$$\begin{aligned} \min _{q(\varOmega )\in \mathcal {P}}\text {KL}(q(\varOmega )\Vert p_0(\varOmega ))- \mathbb {E}_{q(\varOmega )}[\text {log}p({\mathbf {X}}|\varOmega )], \end{aligned}$$
(5)

where \(\text {KL}(q\Vert p)\) is the Kullback–Leibler (KL) divergence, and \(\mathcal {P}\) is the space of probability distributions. When the observations are given, \(p({\mathbf {X}})\) is a constant.
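
The equivalence between Bayes' rule and problem (5) can be checked numerically on a toy example. The sketch below is illustrative only: it assumes a parameter taking three discrete values, and the prior and likelihood numbers, as well as the function names `objective` and `bayes_posterior`, are made up for the example. At the Bayes posterior the objective equals \(-\log p({\mathbf {X}})\), and any other distribution gives a larger value.

```python
import math

def objective(q, prior, lik):
    """KL(q || prior) - E_q[log lik] for a parameter with finitely many values."""
    total = 0.0
    for qk, pk, lk in zip(q, prior, lik):
        if qk > 0.0:                       # terms with q_k = 0 contribute nothing
            total += qk * (math.log(qk / pk) - math.log(lk))
    return total

def bayes_posterior(prior, lik):
    """Normalized product of prior and likelihood, together with the evidence p(X)."""
    evidence = sum(pk * lk for pk, lk in zip(prior, lik))
    return [pk * lk / evidence for pk, lk in zip(prior, lik)], evidence

prior = [0.5, 0.3, 0.2]        # toy prior p_0 over three parameter values (assumed)
lik = [0.10, 0.40, 0.70]       # toy likelihood p(X | Omega) at each value (assumed)
post, evidence = bayes_posterior(prior, lik)
```

The gap between the objective at an arbitrary `q` and at `post` is exactly \(\text {KL}(q\Vert p(\varOmega |{\mathbf {X}}))\), which is why the posterior is the unique minimizer.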

Then we introduce multi-class classification using the large-margin approach based on the one-vs-rest idea for SVM. We redefine the label \(y^{(n)}\) of the n-th observation as a \(1\times N_c\) label vector: if the n-th observation belongs to the c-th class, we define \(y^{(n)}_{c}=+1\), otherwise \(y^{(n)}_c=-1\). We have \(N_c\) classifiers; taking the c-th classifier as an example, \(f_c({\mathbf {x}}^{(n)};\tilde{{\mathbf {z}}}^{(n)},\varvec{\eta }_c)=\varvec{\eta }_c^T \tilde{\mathbf {z}}^{(n)}\) denotes its discriminant function. Under the Bayesian framework, we impose a prior on \(\varvec{\eta }_c\) as follows:

$$\begin{aligned} \nu _c \sim p_0(\nu _c)= & {} \varGamma (\nu _c|a_{{\nu }_c},b_{{\nu }_c})\\ p(\varvec{\eta }_c|\nu _c)= & {} \mathcal {N}(\varvec{\eta }_c|{\mathbf {0}},\nu _c^{-1}{\mathbf {I}}_{(m+1)}), \end{aligned}$$

where \(a_{{\nu }_c}\) and \(b_{{\nu }_c}\) are hyper-parameters. \(\nu _c\) controls the inverse variance of \(\varvec{\eta }_c\), playing a role similar to that of the penalty parameter in conventional SVM. If \(\nu _c\) has a posterior distribution concentrated at large values, \(\varvec{\eta }_c\) will tend to be small; the model will then be simple and may not fit the training data very well. In conventional SVM, the penalty parameter is usually selected by time-consuming cross-validation. In our model, \(\nu _c\) is instead inferred automatically as part of Bayesian inference. For simplicity, let \(\varTheta =\{(\varvec{\eta }_c,\nu _c)\}_{c=1}^{N_c}\).
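
The one-vs-rest construction can be sketched as follows. This is a hypothetical illustration rather than the paper's code: the helper names are invented, and predicting the class with the largest discriminant value is the usual one-vs-rest decision rule, assumed here because the text does not spell it out.

```python
def one_vs_rest_labels(y, num_classes):
    """Map labels in {1, ..., num_classes} to rows of +1/-1 targets, one entry per class."""
    return [[1 if label == c else -1 for c in range(1, num_classes + 1)] for label in y]

def discriminant(eta_c, z):
    """f_c = eta_c^T [z; 1], with the bias stored as the last entry of eta_c."""
    z_aug = list(z) + [1.0]                 # augmented latent representation
    return sum(e * v for e, v in zip(eta_c, z_aug))

def predict(etas, z):
    """Return the 1-based class whose discriminant value is largest (assumed rule)."""
    scores = [discriminant(eta_c, z) for eta_c in etas]
    return 1 + max(range(len(scores)), key=scores.__getitem__)
```

Each row produced by `one_vs_rest_labels` plays the role of the \(1\times N_c\) label vector \(y^{(n)}\), and each entry of `etas` corresponds to one \(\varvec{\eta }_c\in \mathbb {R}^{m+1}\).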

Then we can replace the margin loss with the expected margin loss for the classification. So we introduce

$$\begin{aligned} \varphi ({\mathbf {y}}|{\mathbf {Z}},\varvec{\eta }) =\prod _{n=1}^{N}\prod _{c=1}^{Nc}\exp \{-2C\cdot \max (0,1 - y^{(n)}_c \varvec{\eta }_c^T\tilde{\mathbf {z}}^{(n)})\} \end{aligned}$$
(6)

as the pseudo-likelihood of the label variables. Our final model is then:

$$\begin{aligned} \min \limits _{q(\varOmega ,\varTheta )\in \mathcal {P}}\text {KL}(q(\varOmega ,\varTheta )\Vert p_0(\varOmega , \varTheta ))-\mathbb {E}_{q(\varOmega )}[\text {log}p({\mathbf {X}}|\varOmega )]-\mathbb {E}_{q (\varOmega ,\varTheta )}[\log (\varphi ({\mathbf {y}}|{\mathbf {Z}},\varvec{\eta }))], \end{aligned}$$
(7)

where \(p_0(\varOmega ,\varTheta )\) is the prior, \(p_0(\varOmega ,\varTheta )=p_0(\varOmega )p_0(\varTheta )\), \(p_0(\varTheta _c)=p(\varvec{\eta }_c|\nu _c)p_0(\nu _c)\) and C is the regularization parameter. Solving problem (7), we can get the posterior distribution

$$\begin{aligned} q(\varOmega ,\varTheta ) =\frac{p_0(\varOmega ,\varTheta )p({\mathbf {X}}|\varOmega )\varphi ({\mathbf {y}} |{\mathbf {Z}},\varvec{\eta })}{\phi ({\mathbf {X}},{\mathbf {y}})}, \end{aligned}$$
(8)

where \(\phi ({\mathbf {X}},{\mathbf {y}})\) is the normalization constant. To approximate \(q(\varOmega ,\varTheta )\), we use variational approximate inference, which is introduced in Sect. 4.

3.3 Multi-view regression with Bayesian \(\hbox {M}^2\)SL

In this section, we present the Bayesian max-margin subspace multi-view learning (\(\hbox {BM}^2\)SMVL) model for regression.

The Bayesian multi-view subspace learning model is the same as in the classification case; the only difference is the loss function. We use the \(\epsilon \)-insensitive loss for regression (Zhu et al. 2009), so we introduce

$$\begin{aligned} \varphi _R({\mathbf {y}}|{\mathbf {Z}},\varvec{\eta },\epsilon ) = \prod \limits _{n=1}^{N}\exp \{-2C_R\cdot \max (|y^{(n)}-\varvec{\eta }^T\tilde{\mathbf {z}}^{(n)}|-\epsilon ,0)\} \end{aligned}$$

as the pseudo-likelihood of the responses. Under the Bayesian framework, we impose a prior on \(\varvec{\eta }\) as follows:

$$\begin{aligned} p_0(\nu )&= \varGamma (\nu |a_{\nu },b_{\nu })\\ p(\varvec{\eta }|\nu )&= \mathcal {N}(\varvec{\eta }|{\mathbf {0}},\nu ^{-1}{\mathbf {I}}_{(m+1)}). \end{aligned}$$

For simplicity, let \(\varTheta =\{\varvec{\eta },\nu , \epsilon \}\) for regression. Similar to the classification model, we can get the posterior distribution for regression as:

$$\begin{aligned} q(\varOmega ,\varTheta ) =\frac{p_0(\varOmega ,\varTheta )p({\mathbf {X}}|\varOmega )\varphi _R ({\mathbf {y}}|{\mathbf {Z}},\varvec{\eta },\epsilon )}{\phi ({\mathbf {X}}, {\mathbf {y}})}. \end{aligned}$$
(9)

3.4 Online version of \(\hbox {BM}^{2}\)SMVL

The goal of online learning is to minimize the cumulative loss for a certain prediction task from the sequentially arriving training samples. In this section, we present an online \(\hbox {BM}^{2}\)SMVL (\(\hbox {OBM}^{2}\)SMVL) based on the online Bayesian passive-aggressive (BayesPA) learning framework for Bayesian max-margin models (Shi and Zhu 2013).

Assume we have already obtained the posterior \(q_t(\varOmega ,\varTheta )\) at time t. When the new data \(({\mathbf {x}}^{(t+1)},y^{(t+1)})\) arrive, we need to compute the new posterior distribution \(q_{t+1}(\varOmega ,\varTheta )\). For simplicity, we denote \({\mathbf {x}}^{(t+1)}=\{{\mathbf {x}}^{(t+1)}_i\}_{i=1}^{N_v}\). Generally, we define \(\omega \) as the parameters of the model and \(\ell (\omega ;{\mathbf {x}}^{(t+1)},y^{(t+1)})\) as the loss on the new data \(({\mathbf {x}}^{(t+1)},y^{(t+1)})\).

Instead of updating a point estimate of \(\omega \), online BayesPA sequentially infers a new posterior distribution \(q_{t+1}(\omega )\) on the arrival of new data \(({\mathbf {x}}^{(t+1)},y^{(t+1)})\) by solving the following optimization problem:

$$\begin{aligned} \min \limits _{q(\omega )\in \mathcal {P}}\text {KL}(q(\omega )\Vert q_{t}(\omega ))- \mathbb {E}_{q(\omega )}[\text {log}p({\mathbf {x}}^{(t+1)}|\omega )] +\ell (\omega ;{\mathbf {x}}^{(t+1)},y^{(t+1)}). \end{aligned}$$

The online update balances three requirements. First, \(\text {KL}(q(\omega )\Vert q_{t}(\omega ))\) should be as small as possible, which means that \(q_{t+1}(\omega )\) stays close to \(q_{t}(\omega )\). Second, the expected log-likelihood of the new data \(\mathbb {E}_{q(\omega )}[\text {log}p({\mathbf {x}}^{(t+1)}|\omega )]\) should be high. Third, the loss on the new data \(\ell (\omega ;{\mathbf {x}}^{(t+1)},y^{(t+1)})\) should be as small as possible, which means that the new model \(q_{t+1}(\omega )\) suffers little loss on the new data. Note that the latent variable \({\mathbf {z}}\) of each sample is also included in \(\omega \), but the prior distribution of the latent variable \({\mathbf {z}}^{t+1}\) for the newly arrived data \(({\mathbf {x}}^{(t+1)},y^{(t+1)})\) is not the posterior of \({\mathbf {z}}^{t}\) but \(p_0({\mathbf {z}})\). So the updating rule for the latent variables \({\mathbf {z}}\) differs from that for the other parameters.

In practice, several training samples are sometimes obtained within a short period, so we can use them as a mini-batch to learn the model, which helps to reduce the effect of noise in the data and to cut the time for online learning. Suppose that we have a mini-batch of L samples at time \(t+1\); for simplicity, we use \({\mathbf {X}}^{(t+1)} = \{{\mathbf {x}}^{(l)}\},{\mathbf {y}}^{(t+1)} =\{{y}^{(l)}\} \, (l = 1,2,\ldots ,L) \) to denote the mini-batch observed at time \(t+1\).
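
The way a posterior is carried forward from one mini-batch to the next can be illustrated on a much simpler model than ours. The sketch below is a deliberately reduced example, not the proposed algorithm: it assumes a scalar Gaussian mean with known noise precision, drops the loss term, and uses hypothetical function names. Under these assumptions the online update reduces to the ordinary Bayes update, and processing mini-batches sequentially gives the same posterior as processing all data at once without storing earlier batches.

```python
def gaussian_mean_update(prior_mean, prior_prec, batch, noise_prec=1.0):
    """Posterior over an unknown mean after one mini-batch, given a Gaussian prior.

    With only the KL and likelihood terms (no loss term), the minimizer of the
    online objective is proportional to q_t(mean) * p(batch | mean).
    """
    post_prec = prior_prec + noise_prec * len(batch)
    post_mean = (prior_prec * prior_mean + noise_prec * sum(batch)) / post_prec
    return post_mean, post_prec

def online_fit(batches, mean0=0.0, prec0=1.0):
    """Process mini-batches one at a time; each posterior becomes the next prior."""
    mean, prec = mean0, prec0
    for batch in batches:
        mean, prec = gaussian_mean_update(mean, prec, batch)
    return mean, prec
```

For instance, `online_fit([[1.0, 2.0], [3.0], [0.5, 1.5, 2.5]])` and a single call to `gaussian_mean_update` on the six pooled samples return the same mean and precision.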

To introduce the online idea to the above multi-view classification \(\hbox {BM}^2\)SMVL, we let \((\varOmega ,\varTheta )\) denote \(\omega \) and \(y^{(t+1)}=\{y^{(t+1)}_c\}_{c=1}^{N_c}\). A new posterior distribution \(q_{t+1}(\varOmega ,\varTheta )\) on the arrival of new data \(({\mathbf {X}}^{(t+1)},{\mathbf {y}}^{(t+1)})\) can be obtained by solving the following optimization problem:

$$\begin{aligned}&\min \limits _{q(\varOmega ,\varTheta )\in \mathcal {P}}\text {KL}(q(\varOmega ,\varTheta )\Vert q_{t}(\varOmega ,\varTheta ))-\mathbb {E}_{q(\varOmega ,\varTheta )}[\text {log}p({\mathbf {X}}^{(t+1)} |\varOmega ,\varTheta )]\\&\quad +\,\ell (\varOmega ,\varTheta ;\mathbf {X} ^{(t+1)},\mathbf y ^{(t+1)}). \end{aligned}$$

As above, we introduce the function \(\varphi (\cdot )\) as the pseudo-likelihood to replace the hinge loss, so the problem becomes:

$$\begin{aligned}&\min \limits _{q(\varOmega ,\varTheta )\in \mathcal {P}}\text {KL}(q(\varOmega ,\varTheta )\Vert q_t (\varOmega ,\varTheta ))-\mathbb {E}_{q(\varOmega )}[\text {log}p({\mathbf {X}}^{(t+1)}|\varOmega )] \\&\quad -\,\mathbb {E}_{q(\varOmega ,\varTheta )}[\log (\varphi (\mathbf y ^{(t+1)}|\tilde{{\mathbf {z}}}, \varvec{\eta }))]. \end{aligned}$$

Similar to Eq. (8), we can get the posterior distribution:

$$\begin{aligned} q_{t+1}(\varOmega ,\varTheta ) =\frac{q_t(\varOmega ,\varTheta )p({\mathbf {X}}^{(t+1)}|\varOmega ) \varphi (\mathbf y ^{(t+1)}|\tilde{{\mathbf {z}}},\varvec{\eta })}{\phi ({\mathbf {X}}^{(t+1)},\mathbf y ^{(t+1)})}, \end{aligned}$$

where \(\phi ({\mathbf {X}}^{(t+1)},{\mathbf {y}}^{(t+1)})\) is the normalization constant. Note that the latent variable \({\mathbf {z}}^t\) is unrelated to the new posterior, because the prior of \({\mathbf {z}}^{t+1}\) is \(p_0({\mathbf {z}})\). Let \((\varOmega ,\varTheta \backslash {\mathbf {z}}^{t})\) denote all variables in \(\varOmega \) and \(\varTheta \) except \({\mathbf {z}}^{t}\); then we can further get

$$\begin{aligned} q_{t+1}(\varOmega ,\varTheta ) = \frac{q_t(\varOmega ,\varTheta \backslash {\mathbf {z}^{t}})p_0({\mathbf {z}})p({\mathbf {X}}^{(t+1)}|\varOmega )\varphi (\mathbf y ^{(t+1)}|\tilde{{\mathbf {z}}},\varvec{\eta })}{\phi ({\mathbf {X}}^{(t+1)}, \mathbf y ^{(t+1)})}. \end{aligned}$$
(10)

The online \(\hbox {BM}^{2}\)SMVL for regression is similar to that for classification. So we can easily get the new posterior with the new arrival data for regression:

$$\begin{aligned} q_{t+1}(\varOmega ,\varTheta ) = \frac{q_t(\varOmega ,\varTheta \backslash {\mathbf {z}^{t}} )p_0 ({\mathbf {z}})p({\mathbf {X}}^{(t+1)}|\varOmega )\varphi _R(\mathbf y ^{(t+1)}| \tilde{{\mathbf {z}}},\varvec{\eta },\epsilon )}{\phi ({\mathbf {X}}^{(t+1)}, \mathbf y ^{(t+1)})}. \end{aligned}$$
(11)

In order to approximate \(q_{t+1}(\varOmega ,\varTheta )\) we use variational approximate inference which is introduced in Sect. 4.

4 Variational inference

Because the posterior is intractable to compute, we apply the variational inference method (Beal 2003) to approximate the posteriors in Eqs. (8) and (9) for \(\hbox {BM}^{2}\)SMVL and in Eqs. (10) and (11) for \(\hbox {OBM}^{2}\)SMVL. This method is much more efficient than sampling based methods (Gilks 2005).

4.1 Data augmentation

The pseudo-likelihood function \(\varphi (\cdot )\) involves a max operator, which makes posterior inference difficult and inefficient. We therefore re-express the pseudo-likelihood function as the integral of a function with an augmented variable, based on the data augmentation idea (Polson and Scott 2011; Zhu et al. 2014). For classification in \(\hbox {BM}^2\)SMVL, we write each factor \(\varphi (y^{(n)}_c|\tilde{\mathbf {z}}^{(n)},\varvec{\eta }_c)\) of the pseudo-likelihood \(\varphi (\cdot )\) as:

$$\begin{aligned} \varphi (\cdot )=\int _0^\infty \frac{\exp \left\{ \frac{-1}{2\lambda ^{(n)}_c}[\lambda ^{(n)}_c+C(1 - y^{(n)}_c\varvec{\eta }_c^T\tilde{\mathbf {z}}^{(n)})]^2\right\} d\lambda ^{(n)}_c}{\sqrt{2\pi \lambda ^{(n)}_c}}, \end{aligned}$$

where \(\lambda _c^{(n)} \ (n = 1,\ldots ,N)\) are the auxiliary variables introduced to deal with the \(\max \) function. Let \(\varvec{\lambda }=[\lambda _1,\ldots ,\lambda _N]^T\), then the augmented pseudo likelihood of \({\mathbf {y}},\varvec{\lambda }\) can be expressed as:

$$\begin{aligned} \varphi ({\mathbf {y}},\varvec{\lambda }|{\mathbf {Z}},\varvec{\eta })= \prod _{n=1}^{N}\prod _{c=1}^{N_c}\frac{\exp \left\{ \frac{-1}{2\lambda ^{(n)}_c} [\lambda ^{(n)}_c+C(1 - y^{(n)}_c\varvec{\eta }_c^T\tilde{\mathbf {z}}^{(n)})]^2\right\} }{\sqrt{2\pi \lambda ^{(n)}_c}}. \end{aligned}$$
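
The scale-mixture identity behind this augmentation can be verified numerically. The sketch below uses only the standard library and is meant as a sanity check, not as part of the inference algorithm; the function names, the shorthand `zeta` for \(C(1 - y^{(n)}_c\varvec{\eta }_c^T\tilde{\mathbf {z}}^{(n)})\), and the integration settings are assumptions of this example.

```python
import math

def augmented_integral(zeta, num_steps=20000, s_max=12.0):
    """Integrate (2*pi*lam)^(-1/2) * exp(-(lam + zeta)^2 / (2*lam)) over lam > 0.

    The substitution lam = s^2 removes the 1/sqrt(lam) singularity at the origin;
    the midpoint rule is then applied on (0, s_max).
    """
    h = s_max / num_steps
    total = 0.0
    for k in range(num_steps):
        s = (k + 0.5) * h                  # midpoint of the k-th subinterval
        lam = s * s
        total += math.exp(-(lam + zeta) ** 2 / (2.0 * lam))
    return total * h * 2.0 / math.sqrt(2.0 * math.pi)

def pseudo_likelihood(zeta):
    """Closed form exp(-2 * max(0, zeta)) of one factor of the pseudo-likelihood."""
    return math.exp(-2.0 * max(0.0, zeta))
```

For both negative and positive values of `zeta`, the two functions agree to numerical precision, which is what allows the max operator to be traded for a Gaussian-like term in \(\varvec{\eta }_c\) and \(\tilde{\mathbf {z}}^{(n)}\) once \(\lambda ^{(n)}_c\) is introduced.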

Similarly, we introduce the augmented variable to the pseudo-likelihood function \(\varphi (\cdot )\) for classification in \(\hbox {OBM}^{2}\)SMVL:

$$\begin{aligned} \varphi (\mathbf y ^{(t+1)},\varvec{\lambda }^{(t+1)}|{\mathbf {Z}},\varvec{\eta }) =\prod \limits _{l=1}^{L}\prod \limits _{c=1}^{N_c}\frac{\exp \left\{ \frac{-1}{2\lambda ^{(l)}_c} [\lambda ^{(l)}_c+C(1-y^{(l)}_c\varvec{\eta }_c^T\tilde{\mathbf {z}}^{(l)})]^2\right\} }{\sqrt{2\pi \lambda ^{(l)}_c}}. \end{aligned}$$

For regression, the pseudo-likelihood \(\varphi _R(y^{(n)}|{\mathbf {z}}^{(n)},\varvec{\eta })\) of the \(\epsilon \)-insensitive loss can be represented as a dual scale mixture of Gaussian distributions based on data augmentation:

$$\begin{aligned} \varphi _R(\cdot )= & {} \int _0^\infty \exp \left\{ \frac{-[\lambda ^{(n)}+ {C_R}(y^{(n)}- \varvec{\eta }^T\tilde{\mathbf {z}}^{(n)}-\epsilon ) ]^2}{2\lambda ^{(n)}}\right\} \cdot \frac{d\lambda ^{(n)}}{\sqrt{2\pi \lambda ^{(n)}}}\\&\quad \times \, \int _0^\infty \exp \left\{ \frac{-[\theta ^{(n)}+ {C_R}(\varvec{\eta }^T \tilde{\mathbf {z}}^{(n)}-y^{(n)}-\epsilon ) ]^2}{2\theta ^{(n)}}\right\} \cdot \frac{d\theta ^{(n)}}{\sqrt{2\pi \theta ^{(n)}}}, \end{aligned}$$

where \(\lambda ^{(n)}\) and \(\theta ^{(n)}\ (n = 1,\ldots ,N)\) are the auxiliary variables introduced to deal with the \(\max \) function. Let \(\varvec{\lambda }=[\lambda _1,\ldots ,\lambda _N]^T\) and \(\varvec{\theta }=[\theta _1,\ldots ,\theta _N]^T\), then the augmented pseudo likelihood \(\varphi _R(\mathbf y ,\varvec{\lambda },\varvec{\theta }|{\mathbf {Z}},\varvec{\eta }, \epsilon )\) of \(\mathbf y , \varvec{\lambda }, \varvec{\theta }\) can be expressed as:

$$\begin{aligned} \varphi _R(\cdot )= & {} \prod \limits _{n=1}^{N} \frac{ \exp \left\{ \frac{1}{-2\lambda ^{(n)}} [\lambda ^{(n)}+ {C_R}(y^{(n)}-\varvec{\eta }^T\tilde{\mathbf {z}}^{(n)}-\epsilon ) ]^2 \right\} }{\sqrt{2\pi \lambda ^{(n)}}}\\&\times \,\frac{ \exp \left\{ \frac{1}{-2\theta ^{(n)}} [\theta ^{(n)}+ {C_R}(\varvec{\eta }^T \tilde{\mathbf {z}}^{(n)}-y^{(n)}-\epsilon ) ]^2 \right\} }{\sqrt{2\pi \theta ^{(n)}}}. \end{aligned}$$
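
As a consistency check, integrating \(\lambda ^{(n)}\) and \(\theta ^{(n)}\) out of the augmented pseudo-likelihood should recover the \(\epsilon \)-insensitive pseudo-likelihood. A short derivation, assuming \(C_R>0\) and \(\epsilon \ge 0\), with the shorthand \(r=y^{(n)}-\varvec{\eta }^T\tilde{\mathbf {z}}^{(n)}\) introduced here for brevity and the same scale-mixture identity as in the classification case:

$$\begin{aligned} \int _0^\infty \frac{\exp \left\{ -[\lambda +C_R(r-\epsilon )]^2/(2\lambda )\right\} }{\sqrt{2\pi \lambda }}\,d\lambda&= \exp \{-2C_R\max (0,\,r-\epsilon )\},\\ \int _0^\infty \frac{\exp \left\{ -[\theta +C_R(-r-\epsilon )]^2/(2\theta )\right\} }{\sqrt{2\pi \theta }}\,d\theta&= \exp \{-2C_R\max (0,\,-r-\epsilon )\},\\ \max (0,\,r-\epsilon )+\max (0,\,-r-\epsilon )&= \max (0,\,|r|-\epsilon ). \end{aligned}$$

The last line holds because at most one of the two terms is positive when \(\epsilon \ge 0\). The product of the two integrals is therefore \(\exp \{-2C_R\max (0,|r|-\epsilon )\}\), which is the n-th factor of \(\varphi _R\).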

Similarly, we introduce the augmented variable to the pseudo-likelihood function \(\varphi _R\)\((\mathbf y ^{(t+1)},\varvec{\lambda }^{(t+1)},\varvec{\theta }^{(t+1)}|{\mathbf {Z}},\varvec{\eta }, \epsilon )\) for regression in \(\hbox {OBM}^{2}\)SMVL:

$$\begin{aligned} \varphi ^{(t+1)}_R(\cdot )= & {} \prod \limits _{l=1}^{L} \frac{ \exp \left\{ \frac{1}{-2\lambda ^{(l)}} [\lambda ^{(l)}+ {C_R}(y^{(l)}-\varvec{\eta }^T\tilde{\mathbf {z}}^{(l)}-\epsilon ) ]^2 \right\} }{\sqrt{2\pi \lambda ^{(l)}}}\\&\times \, \frac{ \exp \left\{ \frac{1}{-2\theta ^{(l)}} [\theta ^{(l)}+ {C_R}(\varvec{\eta }^T\tilde{\mathbf {z}}^{(l)}-y^{(l)}-\epsilon ) ]^2 \right\} }{\sqrt{2\pi \theta ^{(l)}}}. \end{aligned}$$

4.2 Variational approximate inference

Next, we apply the mean-field variational method to approximate the posterior distributions.

4.2.1 Variational inference in \(\hbox {BM}^{2}\)SMVL

In this section, we take the classification model as an example. Firstly, we define a family of factorized but free-form variational distributions:

$$\begin{aligned} V(\varOmega ,\varTheta ,\varvec{\lambda })=V({{{\varvec{\mu }}}})V({\mathbf {W}})V ({{\alpha }})V(\phi )V({\mathbf {Z}})V(\varvec{\eta })V(\varvec{\lambda }) V(\nu ). \end{aligned}$$

The main idea of variational Bayesian inference is to minimize the KL divergence \(\text {KL}(V(\varOmega ,\varTheta ,\varvec{\lambda })\Vert q(\varOmega ,\varTheta ,\varvec{\lambda }))\) between the approximating distribution and the target posterior. We first initialize the factors of \(V(\varOmega ,\varTheta ,\varvec{\lambda })\) and then iteratively update each factor while holding the others fixed at their current estimates. The joint distribution of data and parameters is:

$$\begin{aligned} p(\varOmega ,\varTheta ,\varvec{\lambda },{\mathbf {X}},{\mathbf {y}})= & {} p_0({{{\varvec{\mu }}}}) p({\mathbf {W|\alpha }})p_0({{\alpha }})p_0(\phi )p_0({\mathbf {Z}}) p(\varvec{\eta }|\nu )\\&\cdot \ p_0(\nu )p({\mathbf {X}|{{\varvec{\mu }}}}, {\mathbf {W}}, \phi , {\mathbf {Z}})\varphi ({\mathbf {y}},\varvec{\lambda }|{\mathbf {Z}},\varvec{\eta }). \end{aligned}$$

It can be shown that when keeping all other factors fixed, the optimal distribution \(V^*({\varvec{\lambda }})\) satisfies

$$\begin{aligned} V^*({\varvec{\lambda }})\propto \exp \{\mathbb {E}_{-{\varvec{\lambda }}}[\log p(\varOmega ,\varTheta ,\varvec{\lambda },{\mathbf {X}},{\mathbf {y}})]\}, \end{aligned}$$

where \(\mathbb {E}_{-{\varvec{\lambda }}}\) denotes the expectation with respect to \(V(\varOmega ,\varTheta ,\varvec{\lambda })\) over all variables except for \({\varvec{\lambda }}\). Then we can get the updating formula for \(V^*({\varvec{\lambda }})\):

$$\begin{aligned} V^*({{{\varvec{\lambda }}}})= & {} \prod _{c=1}^{N_c}\prod _{n=1}^{N} \mathcal {GIG}\left( \lambda _c^{(n)}|\frac{1}{2},1,\chi _c^{(n)}\right) \nonumber \\ \chi _c^{(n)}= & {} C^2\langle (1-y_c^{(n)}\varvec{\eta }_c^T \tilde{{\mathbf {z}}}^{(n)})^2\rangle \nonumber \\ \mathbb {E}_{\lambda _c^{(n)}} [ (\lambda _c^{(n)})^{-1} ]= & {} 1/\sqrt{\chi _c^{(n)}}, \end{aligned}$$
(12)

where \(\langle \cdot \rangle \) represents the expectation and \(\mathcal {GIG}(\cdot )\) is the generalized inverse Gaussian distribution. Similarly, we can get the updating formulas for all other factors. The main steps of mean-field methods can be found in Footnote 2. Since the derivations are tedious but straightforward, here we only provide the equations for \({\mathbf {Z}}\); the other updating formulas can be found in the “Appendix”.

$$\begin{aligned} V^*({\mathbf {Z}})= & {} \prod _{n=1}^{N}\mathcal {N}({\mathbf {z}}^{(n)}|\mu _{\mathbf {z}} ^{(n)},\varSigma _{\mathbf {z}}^{(n)})\nonumber \\ \varSigma _{\mathbf {z}}^{(n)}= & {} \left\{ C^2{\sum _{c=1}^{N_c}\langle \tilde{{\varvec{\eta }}_c}\tilde{{\varvec{\eta }}_c}^T \rangle }{\langle {\lambda _c^{(n)}}^{-1} \rangle }+{\mathbf {I}}_m \right. \nonumber \\&+\,\left. \sum _{i=1}^{N_v}{\langle \phi _i\rangle \langle {{\mathbf {W}_i}}^T{{\mathbf {W}_i}} \rangle }\right\} ^{-1} \nonumber \\ \mu _{\mathbf {z}}^{(n)}= & {} \varSigma _{\mathbf {z}}^{(n)}\left\{ \sum _{i=1}^{N_v} {\langle \phi _i\rangle \langle {\mathbf {W}_i}^T\rangle }({\mathbf {x}}^{(n)}_{i} -{\langle {{{\varvec{\mu }}_i}}\rangle }) \right. \nonumber \\&+\,\{C(1 +C{\langle (\lambda ^{(n)})^{-1}\rangle })y^{(n)}{\langle \tilde{\varvec{\eta }}\rangle } \nonumber \\&-\,\left. \ C^2{\langle (\lambda ^{(n)})^{-1}\rangle }{\langle \eta _{(m+1)} \tilde{\varvec{\eta }}\rangle }\}\right\} \nonumber \\ \mathbb {E}_{{\mathbf {z}}^{(n)}} [{\mathbf {z}}^{(n)}]= & {} \mu _{\mathbf {z}}^{(n)}, \end{aligned}$$
(13)

where \(\tilde{\varvec{\eta }}\) denotes the first m dimensions of \(\varvec{\eta }\), i.e., \(\varvec{\eta }=[\tilde{\varvec{\eta }}, \eta _{(m+1)}]\).
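To see where the label-dependent terms of \(\varSigma _{\mathbf {z}}^{(n)}\) and \(\mu _{\mathbf {z}}^{(n)}\) come from, it may help to expand the exponent of the augmented pseudo-likelihood for one sample. The following is a sketch for a single label \(y\in \{-1,+1\}\) (so \(y^2=1\)), with the sample index dropped and \(\varvec{\eta }^T\tilde{\mathbf {z}}=\tilde{\varvec{\eta }}^T{\mathbf {z}}+\eta _{(m+1)}\); terms that do not depend on \({\mathbf {z}}\) are collected in the constant:

$$\begin{aligned} -\frac{[\lambda +C(1-y\varvec{\eta }^T\tilde{\mathbf {z}})]^2}{2\lambda }= & {} C(1+C\lambda ^{-1})\,y\,\tilde{\varvec{\eta }}^T{\mathbf {z}} -C^2\lambda ^{-1}\eta _{(m+1)}\tilde{\varvec{\eta }}^T{\mathbf {z}}\\&-\,\frac{C^2}{2\lambda }{\mathbf {z}}^T\tilde{\varvec{\eta }}\tilde{\varvec{\eta }}^T{\mathbf {z}}+\text {const}. \end{aligned}$$

Taking expectations over \(\varvec{\eta }\) and \(\lambda \) under the factorized distribution, the quadratic term contributes \(C^2\langle \tilde{\varvec{\eta }}\tilde{\varvec{\eta }}^T\rangle \langle \lambda ^{-1}\rangle \) to the precision matrix, and the two linear terms give the label-dependent part of the mean.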

For regression, only the variables \({\mathbf {Z}}\), \(\varvec{\eta }\), \({{{\varvec{\lambda }}}}\) and \(\varvec{\theta }\) are different from the classification model. The updating formulas for \({\mathbf {Z}}\), \(\varvec{\eta }\), \({{{\varvec{\lambda }}}}\) and \(\varvec{\theta }\) can also be found in the “Appendix”. We summarize the proposed \(\hbox {BM}^{2}\)SMVL model in Algorithm 1.

figure a

4.2.2 Variational inference in \(\hbox {OBM}^{2}\)SMVL

Now, we use variational inference to approximate \(q_{t+1}(\varOmega ,\varTheta )\) in the \(\hbox {OBM}^{2}\)SMVL model. The procedure is very similar to that for the posterior \(q(\varOmega ,\varTheta )\). First, we give the joint distribution of data and parameters:

$$\begin{aligned} p(\varOmega ,\varTheta ,\varvec{\lambda }^{(t+1)},{\mathbf {x}}^{(t+1)},{\mathbf {y}}^{(t+1)})= & {} p_0({{{\varvec{\mu }}}})p({\mathbf {W|\alpha }})p_0({{\alpha }})p_0(\phi )p_0({\mathbf {z}}) p(\varvec{\eta }|\nu )p_0(\nu )\\&\cdot \, p({{\mathbf {x}}}^{(t+1)}|{{\varvec{\mu }}}, {\mathbf {W}}, \phi ,{\mathbf {Z}})\varphi ({\mathbf {y}}^{(t+1)},\varvec{\lambda } ^{(t+1)}|{\mathbf {z}},\varvec{\eta }). \end{aligned}$$

It can be shown that when keeping all other factors fixed, the optimal distribution \(V^*(\varvec{\lambda }^{(t+1)})\) satisfies

$$\begin{aligned} V^*(\varvec{\lambda }^{(t+1)})\propto \exp \{\mathbb {E}_{-\varvec{\lambda }^{(t+1)}}[\log p(\varOmega ,\varTheta ,\varvec{\lambda }^{(t+1)},{\mathbf {x}}^{(t+1)},{\mathbf {y}}^{(t+1)})]\}, \end{aligned}$$

where \(\mathbb {E}_{-\varvec{\lambda }^{(t+1)}}\) denotes the expectation with respect to \(V(\varOmega ,\varTheta ,\varvec{\lambda }^{(t+1)})\) over all variables except for \(\varvec{\lambda }^{(t+1)}\). Then we can get the updating formula for \(V^*(\varvec{\lambda }^{(t+1)})\):

$$\begin{aligned} V^*({{{\varvec{\lambda }}}}^{(t+1)})= & {} \prod _{l=1}^{L}\prod _{c=1}^{N_c} \mathcal {GIG}\left( \lambda _c^{(l)}|\frac{1}{2},1,\chi _c^{(l)}\right) \nonumber \\ \chi _c^{(l)}= & {} C^2\langle (1-y_c^{(l)}(\varvec{\eta }^{(t+1)}_c) ^T\tilde{{\mathbf {z}}}^{(l)})^2\rangle \nonumber \\ \mathbb {E}_{{\lambda }^{(l)}_c} [ {({{\lambda }}^{(l)}_c)^{-1}} ]= & {} 1/\sqrt{\chi _c^{(l)}}. \end{aligned}$$
(14)
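In practice, the update in Eq. (14) needs only first and second moments of \(\varvec{\eta }_c\) and \(\tilde{\mathbf {z}}^{(l)}\). The sketch below is one way to compute \(\chi \) and \(\langle \lambda ^{-1}\rangle \) for a single sample and class, assuming a label in \(\{-1,+1\}\) and independence of \(\varvec{\eta }_c\) and \(\tilde{\mathbf {z}}\) under the factorized distribution. The function name, the argument layout and the small numerical floor on \(\chi \) are assumptions made for illustration; the floor is not part of the paper.

```python
import math

def gig_update(y, C, eta_mean, eta_second, z_mean, z_second, floor=1e-12):
    """Return (chi, E[1/lambda]) for one sample and one class.

    eta_mean, z_mean:     E[eta], E[z_tilde] as lists of length m + 1
    eta_second, z_second: E[eta eta^T], E[z_tilde z_tilde^T] as nested lists
    y must be +1 or -1 so that y * y == 1.
    """
    dim = len(eta_mean)
    # E[eta^T z_tilde] factorizes under the mean-field assumption.
    linear = sum(e * z for e, z in zip(eta_mean, z_mean))
    # E[(eta^T z_tilde)^2] = trace(E[eta eta^T] E[z_tilde z_tilde^T]).
    quadratic = sum(eta_second[i][j] * z_second[i][j]
                    for i in range(dim) for j in range(dim))
    chi = C * C * (1.0 - 2.0 * y * linear + quadratic)
    chi = max(chi, floor)  # illustrative guard against division by zero
    return chi, 1.0 / math.sqrt(chi)
```

The same two moments are reused in the update of \({\mathbf {Z}}^{(t+1)}\), so they can be computed once per sample and class in each iteration.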

Similarly, we can get the updating formulas for all other factors. Since the derivations are tedious but straightforward, here we only provide the equations for \({\mathbf {Z}}^{(t+1)}\); the other updating formulas can be found in the “Appendix”.

$$\begin{aligned} V^*({\mathbf {Z}}^{(t+1)})= & {} \prod _{l=1}^{L}\mathcal {N} ({\mathbf {z}}^{(l)}| \mu _{\mathbf {z}}^{(l)},\varSigma _{\mathbf {z}}^{(l)})\nonumber \\ \varSigma _{\mathbf {z}}^{(l)}= & {} \left\{ C^2{\sum _{c=1}^{N_c}\langle {\tilde{{\varvec{\eta }}}}_c^{(t+1)} ({\tilde{\varvec{\eta }}}_c^{(t+1)})^T \rangle }{\langle {\lambda _c^{(l)}}^{-1} \rangle }+{\mathbf {I}}_m \right. \nonumber \\&+\,\left. \sum _{i=1}^{N_v}{\langle \phi ^{(t+1)}_i\rangle \langle ({\mathbf {W}^{(t+1)}_i})^T {{\mathbf {W}^{(t+1)}_i}}\rangle }\right\} ^{-1} \nonumber \\ \mu _{\mathbf {z}}^{(l)}= & {} \varSigma _{\mathbf {z}}^{(l)}\left\{ \sum _{i=1}^{N_v}{\langle \phi ^{(t+1)}_i\rangle \langle ({\mathbf {W}^{(t+1)}_i})^T\rangle }({\mathbf {x}}^{(l)}_{i} -{\langle {{{\varvec{\mu }}^{(t+1)}_i}}\rangle }) \right. \nonumber \\&+\,\{C(1 +C{\langle {(\lambda ^{(l)})}^{-1}\rangle })y^{(l)}{\langle \tilde{\varvec{\eta }}^{(t+1)}\rangle } \nonumber \\&-\,\left. \ C^2{\langle {(\lambda ^{(l)})}^{-1}\rangle }{\langle \eta ^{(t+1)}_{(m+1)}\tilde{\varvec{\eta }}^{(t+1)}\rangle }\}\right\} \nonumber \\ \mathbb {E}_{{\mathbf {z}}^{(l)}} [{\mathbf {z}}^{(l)}]= & {} \mu _{\mathbf {z}}^{(l)}, \end{aligned}$$
(15)

where \(\tilde{\varvec{\eta }}_c\) denotes the first m dimensions of \(\varvec{\eta }_c\), i.e., \(\varvec{\eta }_c=[\tilde{\varvec{\eta }}_c, \eta _c^{(m+1)}]\).

For regression, only the variables \({\mathbf {Z}}^{(t+1)}\), \(\varvec{\eta }^{(t+1)}\), \({{{\varvec{\lambda }}}}^{(t+1)}\) and \(\varvec{\theta }^{(t+1)}\) are different from the classification model; the updating formulas for \({\mathbf {Z}}^{(t+1)}\), \(\varvec{\eta }^{(t+1)}\), \({{{\varvec{\lambda }}}}^{(t+1)}\) and \(\varvec{\theta }^{(t+1)}\) can be found in the “Appendix”. A full description of the proposed \(\hbox {OBM}^{2}\)SMVL model is given in Algorithm 2. Here we use T to represent the total number of mini-batches, so the total number of training samples is \(N = T \times L\). By setting L to 1, the algorithm handles the case we first assumed, i.e., learning from samples one by one.

figure b
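The streaming structure described above (T mini-batches of size L, with L = 1 as the one-by-one case) can be sketched as a simple driver loop. This is only a schematic under stated assumptions: `update` is a hypothetical caller-supplied function standing in for the variational updates of Algorithm 2, and `state` stands in for the current posterior approximation; neither name comes from the paper.

```python
def mini_batches(samples, L):
    """Yield consecutive mini-batches of size L (the last may be shorter)."""
    for start in range(0, len(samples), L):
        yield samples[start:start + L]

def online_pass(samples, L, update, state=None):
    """Feed mini-batches to update(state, batch) -> new state.

    Returns the final state and the number of mini-batches T.
    The previous state plays the role of the prior for the next batch.
    """
    T = 0
    for batch in mini_batches(samples, L):
        state = update(state, batch)
        T += 1
    return state, T
```

With N samples and N divisible by L, the loop runs exactly T = N / L times, matching \(N = T \times L\).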

4.3 Prediction on unseen data

Suppose we have a set of test data that is unseen during the model training phase. To apply our models learned above, we have to first project the new data to the same low-dimensional feature space as that for training data. Given the optimal variational distributions \(V^*(\varvec{\eta })\), \(V^*({\mathbf {W}}_i)\), \(V^*(\varvec{\mu }_i)\), and \(V^*(\phi _i)\) learned in the training phase, we use a single step variational method to approximate the posterior latent representation \(p({\mathbf {z}}_{new}|{\mathbf {x}}_{new})\) for test data \({\mathbf {x}}_{new}\):

$$\begin{aligned} V^*({\mathbf {z}}_{new})= & {} \mathcal {N}({\mathbf {z}}_{new}|\mu _{\mathbf {z}}^{new}, \varSigma _{\mathbf {z}}^{new})\\ \varSigma _{\mathbf {z}}^{new}= & {} \left\{ {\mathbf {I}}_m+ \sum _{i=1}^{N_v}\mathbb {E}_{\phi _i} [\phi _i]\mathbb {E}_{{\mathbf {W}}_i}[{\mathbf {W}}_i^T{{\mathbf {W}}}_i]\right\} ^{-1}\\ \mu _{\mathbf {z}}^{new}= & {} \varSigma _{\mathbf {z}}^{new}\left\{ \sum _{i=1}^{N_v} \mathbb {E}_{\phi _i}[\phi _i]\mathbb {E}_{{\mathbf {W}}_i}[{\mathbf {W}}_i^T] ({\mathbf {x}}^{new}_i-\mathbb {E}_{\varvec{{\mu }}_i}[{\varvec{\mu }}_i])\right\} , \end{aligned}$$

where the expectations are taken over the optimal variational distributions of \({\mathbf {W}}_i\), \({\varvec{\mu }}_i\) and \(\phi _i\).

Then with the optimal variational approximation \(V^*(\varvec{\eta })\) for the posterior distribution of classification parameter \(\varvec{\eta }\), we can predict the class label of \({\mathbf {x}}^{new}\) for classification by

$$\begin{aligned} \tilde{\mu }_{\mathbf {z}}^{new}= & {} [(\mu _{\mathbf {z}}^{new})^T, \ 1]^T \\ y_{new}= & {} \mathop {\arg \max }\limits _c (\mathbb {E}_{\varvec{\eta }_c,{\mathbf {z}_{\mathrm{new}}}} [\varvec{\eta }_c^T\tilde{{\mathbf {z}}}_{{new}}]) \\= & {} \mathop {\arg \max }\limits _c (\mu _{{\varvec{\eta }}_c}^T\tilde{\mu }_{\mathbf {z}}^{new}). \end{aligned}$$

For regression, we can directly predict the response value:

$$\begin{aligned} y_{new}= & {} \mathbb {E}_{\varvec{\eta },{\mathbf {z}_{\mathrm{new}}}}[\varvec{\eta }^T \tilde{{\mathbf {z}}}_{{new}}]\\= & {} \mu _{\varvec{\eta }}^T\tilde{\mu }_{\mathbf {z}}^{new}. \end{aligned}$$
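The prediction steps above can be sketched in a few lines. The code below is a minimal illustration that uses plain Python lists and a small Gauss-Jordan solver so that it is self-contained; all function and argument names are hypothetical. It assumes each \({\mathbf {W}}_i\) is stored as a \(d_i\times m\) nested list and that \(\mathbb {E}[{\mathbf {W}}_i^T{\mathbf {W}}_i]\) is supplied separately, since it generally differs from \(\mathbb {E}[{\mathbf {W}}_i]^T\mathbb {E}[{\mathbf {W}}_i]\).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def project(x_views, W_mean, WtW_mean, mu_mean, phi_mean):
    """Posterior mean of z_new: solve
    (I + sum_i phi_i E[W_i^T W_i]) mu_z = sum_i phi_i E[W_i]^T (x_i - E[mu_i])."""
    m = len(WtW_mean[0])
    precision = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    rhs = [0.0] * m
    for x, W, WtW, mu, phi in zip(x_views, W_mean, WtW_mean, mu_mean, phi_mean):
        centered = [xi - mi for xi, mi in zip(x, mu)]
        for r in range(m):
            for c in range(m):
                precision[r][c] += phi * WtW[r][c]
            rhs[r] += phi * sum(W[d][r] * centered[d] for d in range(len(x)))
    return solve(precision, rhs)

def predict_value(mu_z, eta_mean):
    """Regression output: E[eta]^T [mu_z; 1]."""
    return sum(e * z for e, z in zip(eta_mean, mu_z + [1.0]))

def predict_class(mu_z, eta_means):
    """Index (0-based) of the class with the largest score."""
    scores = [predict_value(mu_z, eta) for eta in eta_means]
    return max(range(len(scores)), key=scores.__getitem__)
```

Solving the linear system directly avoids forming \(\varSigma _{\mathbf {z}}^{new}\) explicitly, which is sufficient when only the posterior mean is needed for prediction.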

4.4 Computational complexity

For each iteration of parameter updating in our batch learning model \(\hbox {BM}^{2}\)SMVL, we need \(O(NN_v\bar{d}m^{2})\) computation, where \(\bar{d}\) is the average dimension of all \(N_v\) views. Most of the computation is spent on calculating \(\varSigma _{\mathbf {z}}^{(n)}, n=1,\ldots ,N\), where the matrix multiplication \(\langle {{\mathbf {W}_i}}^T{{\mathbf {W}_i}}\rangle \) costs \(O(d_im^{2})\). Each iteration of parameter updating in our online learning model \(\hbox {OBM}^{2}\)SMVL costs \(O(LN_v\bar{d}m^{2})\) for a mini-batch of L newly arrived samples.

5 Experiments

We evaluate the proposed batch learning model \(\hbox {BM}^{2}\)SMVL and online learning model \(\hbox {OBM}^{2}\)SMVL on various classification and regression tasks. For regression, we use the root-mean-square error (RMSE) to evaluate the results. RMSE is formulated as follows:

$$\begin{aligned} RMSE = \sqrt{\frac{\sum _{n=1}^{N_{test}}(\hat{y}^{(n)} - y^{(n)})^2}{N_{test}}}, \end{aligned}$$

where \({y^{(n)}}\) is the ground truth of the n-th sample, \(\hat{y}^{(n)}\) is the corresponding predicted value, and \(N_{test}\) is the total number of the testing samples.
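For reference, the RMSE definition above translates directly into code. This is a minimal sketch; the function name and the input check are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between ground truth and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))
```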

5.1 Real data sets

There are seven data sets for classification tasks and three data sets for regression tasks in our experiments. Trecvid contains 1078 manually labeled video shots that belong to five categories (Chen et al. 2012). Each shot is represented by a 1894-dim binary vector of text features and a 165-dim vector of HSV color histogram. The WebKB data set has two views: the content features of the web pages and the link features extracted from the link structures. It consists of 877 web pages from the computer science departments of four universities, i.e., Cornell, Washington, Wisconsin and Texas, and each university has five document classes, i.e., course, faculty, student, project and staff. We select the web pages from these four universities as our experimental data (Footnote 3). These four data sets each have five classes and two views. The 20Newsgroups data set is widely used for classification. It has approximately 20,000 newsgroup documents, which are divided into 20 categories. We follow Long et al. (2008) to construct multi-view learning problems. The NUS-WIDE data set is a subset selected from Chua et al. (2009). It contains 21,935 web images that belong to three categories (‘water’, ‘vehicle’, ‘flowers’). Each image includes six types of low-level features (64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments). We use the tf-idf weighting scheme to represent the documents, and a document-frequency threshold of 5 is adopted to reduce the number of word features. The details of these data sets are shown in Table 1.

Table 1 Statistics of the multiclass data sets

The hotel review data set (Footnote 4) consists of 5000 hotel reviews randomly collected from TripAdvisor (Footnote 5). Each review document is associated with two-view features (i.e., 12,000-dim bag-of-words features and 14-dim contextual features) as well as a global rating score that ranges from 1 to 5. In our experiment, we predict the global rating scores of the reviews.

Another regression task, rating prediction, is studied on the MovieLens data set, following Lu et al. (2017). Each rating in this data set has three views, i.e., users, movies and tags. The user view consists of binary feature vectors for user ids, so for each rating there is only one non-zero feature in the user view, i.e., the associated user id; the same holds for the movie view. The tags of the movies are used for the tag view.

The CT-Image data set (Footnote 6) was retrieved from a set of 53,500 CT images from 74 different patients (43 male, 31 female). Each CT slice is described by two histograms in polar space. The first histogram (240-dim) describes the location of bone structures in the image and the second histogram (144-dim) describes the location of air inclusions inside the body. The target value is the relative location of the image on the axial axis, which lies in the range 0 to 180, where 0 denotes the top of the head and 180 denotes the soles of the feet. The details of these regression data sets are shown in Table 2.

Table 2 Statistics of the regression data sets

5.2 Competitors

We compare our classification model with the following five competitors:

  • VMRML (Quang et al. 2013): it is a vector-valued manifold regularization multi-view learning model. The regularization parameters are set as the default value in their paper, and we choose the best parameter \(\sigma \) for ‘rbf’ from \(10^{[-5:5]}\) by fivefold cross-validation in each data set;

  • MVMED (Sun and Chao 2013): it presents a multi-view maximum entropy discrimination model. We use the model with the one-vs-rest strategy for multiclass problems. According to the paper, we choose the best parameter from \(2^{[-5:5]}\) by fivefold cross-validation for each data set;

  • MMH (Chen et al. 2012): it is a large-margin predictive latent subspace learning method for multi-view data. Based on the parameters given in its code (Footnote 7), we carefully tune the four parameters to choose the best values for each data set;

  • SVM-FULL: it concatenates all views to form a new single view, and applies SVM for classification. We choose the linear kernel and execute fivefold cross-validation on training sets to decide the cost parameter c from \(10^{[-3:3]}\);

  • OPMV (Zhu et al. 2015): it is an online multi-view learning method. According to the paper, the learning rate parameter is chosen from \(2^{[-8:8]}\), the regularization parameter is chosen from \(10^{[-16:0]}\), and the penalty parameter is pre-defined as 1.

Table 3 Batch and online learning comparisons on multiclass data sets
Table 4 Batch and online learning comparisons on multiclass data sets

And the compared regression models are as follows:

  • CoR-LS (Lan et al. 2016): it is a co-regularized least squares regression model for multi-view data;

  • LCFS (Wang et al. 2013): it unifies coupled linear regressions, \(\ell _{21}\)-norms and trace norm into a generic minimization formulation so that subspace learning and coupled feature selection can be performed simultaneously;

  • SVR-FULL: it concatenates all views to form a new single view, and applies SVR (Footnote 8) for regression. We use fivefold cross-validation on training sets to decide the regularization parameter c from \(10^{[-3:3]}\) and the precision parameter \(\epsilon \) from \(10^{[-1:1]}\);

  • OSVR-FULL: it is the online SVR method. We concatenate all views to form a new single view and apply OSVR, using the code provided by Parrella (Footnote 9). For each data set, we use fivefold cross-validation on training sets to decide the regularization parameter c from \(10^{[-3:3]}\) and the precision parameter \(\epsilon \) from \(10^{[-1:1]}\).

5.3 Parameter setting

In our batch learning, the regularization parameter C is chosen from the integer set \(\{1,2,3\}\) and the subspace dimension m from the integer set \(\{20,30,50\}\) for each data set by performing fivefold cross-validation on the training data. In our online learning, the regularization parameter C is chosen from the integer set \(\{1,5,15\}\) and the subspace dimension m from the integer set \(\{30,50,70\}\). The hyperparameters are set to the same values for both batch and online learning, i.e., \(a_\alpha =b_\alpha = 1\hbox {e}{-}3\), \(a_\phi =1\hbox {e}{-}2\), \(a_\nu = 1\hbox {e}{-}1\), \(b_\phi =b_\nu =\beta = 1\hbox {e}{-}5\). We set the maximum number of iterations to 200.

For the regression tasks, the regularization parameter C is chosen from the set \(\{2^{(-5)},\ldots ,2^{(10)}\}\) and the subspace dimension m from the integer set \(\{50,100,150\}\) for each data set by performing fivefold cross-validation on the training data. The hyperparameters are set to the same values, i.e., \(a_\alpha =b_\alpha = 1\hbox {e}{-}3\), \(a_\phi =1\hbox {e}{-}2\), \(a_\nu =1\hbox {e}{-}1\), \(b_\phi =b_\nu =\beta = 1\hbox {e}{-}5\), \(\epsilon =0.01\). We set the maximum number of iterations to 20 for each mini-batch.
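The parameter selection described above amounts to a grid search with fivefold cross-validation. The sketch below shows one possible arrangement. It assumes that the regression grid contains every integer power of two between \(2^{-5}\) and \(2^{10}\), and it relies on a hypothetical caller-supplied `score(C, m, train_idx, val_idx)` function that trains on the training indices and returns a validation score where larger is better (for RMSE, the negated value could be returned). None of these names come from the paper.

```python
import itertools
import random

C_GRID_BATCH = [1, 2, 3]
M_GRID_BATCH = [20, 30, 50]
# Assumed: consecutive integer exponents from -5 to 10.
C_GRID_REGRESSION = [2.0 ** k for k in range(-5, 11)]
M_GRID_REGRESSION = [50, 100, 150]

def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def grid_search(n, c_grid, m_grid, score, k=5):
    """Return the (C, m) pair with the best mean validation score."""
    folds = k_fold_indices(n, k)
    best = None
    for C, m in itertools.product(c_grid, m_grid):
        total = 0.0
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            total += score(C, m, train_idx, val_idx)
        mean = total / k
        if best is None or mean > best[0]:
            best = (mean, C, m)
    return best[1], best[2]
```

Fixing the random seed keeps the folds identical across all grid points, so differences in the mean score reflect the parameters rather than the split.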

5.4 Experimental results

Since a normal prior with zero mean is imposed on the observation data, we normalize the observation data to have zero mean and unit variance. In the batch learning experiments, the results of all models on all data sets are averaged over 20 independent runs. We adopt two evaluation metrics, accuracy and F1 score, for the classification tasks; the accuracy results are shown in Table 3 and the F1 score results in Table 4. The fraction of data sampled for training is 0.5 for the six data sets Trecvid, Washington, Cornell, Texas, Wisconsin and NUS-WIDE, and 0.05 for News4Gv. Since the MMH code (Footnote 10) can only deal with two views, its results are missing for News4Gv and NUS-WIDE in Table 3. In addition, the MMH software does not report the F1 score, so its results are missing in Table 4. In the online learning experiments, we use the same training/testing splits as in the batch learning experiments. We sample 0.1 of the training data for the initial batch training, and the rest arrive one by one. Since OPMV can only deal with two-view data, its results are missing for News4Gv and NUS-WIDE in Tables 3 and 4. From Tables 3 and 4, we make the following observations:

  • Our \(\hbox {BM}^{2}\)SMVL achieves the best performance on the Trecvid, NUS-WIDE, Washington, Cornell, Texas and Wisconsin data sets and performs only slightly worse than SVM-FULL on the News4Gv data set. We attribute this to the fact that our method can automatically infer the penalty parameter of the max-margin model based on the data augmentation idea, while MVMED and MMH are both under the maximum entropy discrimination framework and cannot infer the penalty parameter. SVM-FULL makes full use of all the information from the observations by concatenating all views to form a new single view. This may be the reason why it performs better than our \(\hbox {BM}^{2}\)SMVL on News4Gv. On the other data sets, however, some information from the observations is not helpful for classification, and in this case SVM-FULL cannot achieve a good performance.

  • Our method infers a posterior under the Bayesian framework instead of a point estimate as in VMRML. With Bayesian model averaging over the posterior, we can make more robust predictions than VMRML.

  • We also find that \(\hbox {OBM}^{2}\)SMVL performs better than OPMV on all data sets and only slightly worse than \(\hbox {BM}^{2}\)SMVL. Unlike OPMV, which seeks a point estimate by optimizing a deterministic objective function, our online model infers a posterior under the Bayesian framework. A point estimate can be seriously affected by inappropriate regularization, outliers and noise, especially when the training data arrive sequentially.

The results of every independent experimental run for all models and data sets are shown in Figs. 1, 2 and 3.

Fig. 1 Classification (accuracy) results on different datasets

Fig. 2 Classification (F1 score)

For the regression tasks, the results of all models are averaged over 5 independent runs on each data set. All the results are shown in Table 5. The fraction of data sampled for training is 0.5 for the HotelReview, CT-Image and MovieLens data sets. Since CoR-LS and LCFS can only deal with two-view data, their results are missing for MovieLens in Table 5.

In the online regression learning experiments, we use the same training/testing splits as in the batch learning experiments. We sample 0.1 of the training data for the initial batch training, and the remaining samples arrive in mini-batches of 30. From Table 5, we make the following observations:

  • \(\hbox {BM}^{2}\)SMVL consistently outperforms SVR-FULL. The reason may be that SVR-FULL simply concatenates all views to form a new single view, while some information from the observations is not helpful for regression on some data sets. In addition, SVR-FULL does not take the relevance of different views into account.

  • Our \(\hbox {BM}^{2}\)SMVL achieves the best performance on the HotelReview, MovieLens and CT-Image data sets. We attribute this to the fact that our method can automatically infer the penalty parameter of the max-margin model based on the data augmentation idea, while neither CoR-LS nor LCFS can infer the penalty parameter. Moreover, our method infers a posterior under the Bayesian framework instead of a point estimate as in CoR-LS and LCFS. With Bayesian model averaging over the posterior, we can make more robust predictions than CoR-LS and LCFS.

  • We also find that \(\hbox {OBM}^{2}\)SMVL performs better than OSVR-FULL on all data sets. Unlike OSVR-FULL, which seeks a point estimate by optimizing a deterministic objective function, our online model infers a posterior under the Bayesian framework. A point estimate can be seriously affected by inappropriate regularization, outliers and noise, especially when the training data arrive sequentially. Moreover, OSVR-FULL does not consider the relevance of different views.

As we can see, our model performs better in most of the experimental runs. For some data sets, the training/testing split influences the performance of all models, which is why the standard deviations on some data sets are relatively large. From Figs. 1, 2 and 3, we can also see that our model performs the best on most of the data sets across runs.

Fig. 3 Regression (RMSE) results on different datasets

Table 5 Comparisons on regression data sets
Table 6 Training time (s) on classification tasks
Table 7 Training time (s) on classification data set NUS
Table 8 Training time (s) on regression Tasks

5.5 Computation complexity analysis

We compare the efficiency of the different algorithms on classification and regression tasks; the results are reported in Tables 6, 7 and 8. From Table 6, we can see that VMRML costs the least time on most of the small-scale data sets. To further analyze the computational complexity, we conduct experiments on NUS-WIDE with different numbers of training samples (1000, 2000, 3000, 5000, 10,000), which are reported in Table 7. From Table 7, we find that our models \(\hbox {BM}^2\)SMVL and \(\hbox {OBM}^2\)SMVL scale linearly with the number of training samples N, which coincides with the computational complexity discussed in Sect. 4.4. Although the training time of VMRML is shorter than that of \(\hbox {BM}^2\)SMVL and \(\hbox {OBM}^2\)SMVL in Table 7, VMRML appears to scale quadratically with the number of training samples, and it needs to store 5 Gram matrices for both training and testing data on the NUS-WIDE data set. Furthermore, when we conduct the experiment with 20,000 training samples, VMRML runs out of memory while \(\hbox {BM}^2\)SMVL and \(\hbox {OBM}^2\)SMVL still work. Besides, \(\hbox {BM}^2\)SMVL performs better than VMRML on all considered data. From Table 7, we can also see that the training time of SVM-FULL is the shortest, because it adopts the ‘linear’ kernel with low algorithmic complexity. However, SVM-FULL achieves the lowest accuracy and F1 score on NUS-WIDE compared with the other offline multi-view methods.

Fig. 4 (a) Results on different data sets with different parameters m in \(\hbox {BM}^{2}\)SMVL; (b) Results on different data sets with different regularization parameters C in \(\hbox {BM}^{2}\)SMVL

For regression, our online model \(\hbox {OBM}^2\)SMVL takes less time than OSVR-FULL on MovieLens and CT-Image. In batch learning, SVR-FULL takes the least time on MovieLens and HotelReview, and LCFS takes the least time on CT-Image. \(\hbox {BM}^2\)SMVL takes less time than CoR-LS and LCFS on HotelReview because the dimensions of the views are very high on this data set. \(\hbox {BM}^2\)SMVL learns low-dimensional representations from multiple views, so it shows advantages when the dimensions of the views are high. Moreover, our model achieves the lowest RMSE on all considered data sets. Although our model is not the fastest on all data sets, we believe it offers an acceptable and reasonable trade-off between model complexity and performance.

5.6 Sensitivity analysis

We study the sensitivity of \(\hbox {BM}^{2}\)SMVL and \(\hbox {OBM}^{2}\)SMVL with respect to the subspace dimension m, and the regularization parameter C.

When we study the influence of m, C (batch) is set to 2 for \(\hbox {BM}^{2}\)SMVL and C (online) is set to 15 for \(\hbox {OBM}^{2}\)SMVL. The averaged results are shown in Figs. 4a and 5a. We find that the test accuracy increases as m becomes larger and remains stable once m is large enough.

When we study the influence of C, m is set to 30 for both batch and online learning. From the results in Figs. 4b and 5b, we can see that different data sets may prefer different values of C. In batch learning, C (batch) balances the classification model and the subspace learning model, so our model cannot achieve the best performance when C (batch) is too large or too small. C (online) reflects the importance of newly arrived data in our online model. When C (online) is too small, the newly arrived data play a tiny role in the online model and offer little help in improving its performance. For some data sets such as Cornell, when C (online) is too large, the performance of \(\hbox {OBM}^{2}\)SMVL degrades because the online model does not take full advantage of the historical knowledge. Other data sets, such as Trecvid and Washington, are less sensitive to C (online) once it is large enough.

Fig. 5 (a) Results on different data sets with different subspace dimensions m (online) in \(\hbox {OBM}^{2}\)SMVL; (b) Results on different data sets with different regularization parameters C (online) in \(\hbox {OBM}^{2}\)SMVL

6 Conclusion

We propose an online Bayesian method to learn a predictive subspace for multi-view data. Specifically, the proposed method is based on the data augmentation idea for max-margin learning, which allows us to automatically infer the weight and penalty parameters and to find the most appropriate predictive subspace simultaneously under the Bayesian framework. Experiments on various classification and regression tasks show that both our batch model \(\hbox {BM}^{2}\)SMVL and our online model \(\hbox {OBM}^{2}\)SMVL achieve superior performance compared with a number of state-of-the-art competitors.