1 Introduction

In real applications, each object can be described by multiple views or features [30]. For example, as shown in Fig. 1, objects can be represented by texture, color, shape, text and speech. These multiview representations provide complementary information to each other [47]. Integrating information from multiple views and uncovering the common latent structure they share are the main concerns of multi-view representation learning [55]. The traditional approach is to concatenate all the features into a single vector and then apply existing algorithms to this vector. This approach ignores the differences in statistical properties between views and also lacks physical meaning [39]. In fact, multi-view data contains both consistent and complementary information across views [28, 31], and leveraging the complementary information among views yields better generalization ability than using a single view [7, 23, 25, 30, 46].

Fig. 1

Examples of multiview data. (a) An object can be described from different views; (b) a fruit can be depicted by color, shape and so on; (c) the identity of a person can be represented by face, iris and fingerprint; (d) a word can be expressed in different languages

In recent years, multi-view learning algorithms have been proposed and applied successfully to image processing and computer vision [20, 42, 45, 48]. These methods can be mainly categorized into three classes: co-training, multiple kernel learning and shared subspace learning. Co-training [2, 26] seeks to maximize the mutual agreement between two different views alternately. Based on the assumption that different kernels correspond to different views, multiple kernel learning [27, 51] combines different kernels to improve performance. Different from co-training and multiple kernel learning, subspace learning [6, 14, 19, 22, 32, 48] aims to obtain a latent subspace under the assumption that the different views are generated from this common latent subspace. A classical subspace method is canonical correlation analysis [17], which obtains the latent subspace by maximizing the correlation between different views. Although these approaches are successful in multi-view learning, they do not perform well on purely non-negative features such as pixel values or color histograms [21].

As an effective technique for data analysis [10, 43], nonnegative matrix factorization (NMF) has been widely used for non-negative feature extraction [13, 24, 41]. In recent years, many NMF-related feature extraction algorithms have been proposed, for example LNMF [29], GNMF [4], DNMF [53], RDPNA [52] and so on. These algorithms can generate superior clustering results, but they only deal with single-view data. In the past five years, several extensions of NMF to multi-view data have been proposed [5, 12, 34, 35, 36, 38, 44, 56]. For example, Liu et al. [12] presented a multi-view clustering approach via joint NMF, which aims at finding a consensus matrix by minimizing the disagreement between the coefficient matrix of each view and the consensus matrix. However, it performs well only for homogeneous views. Akata et al. [1] presented an approach to learn a common representation from image features and the associated tags via a common coefficient matrix constraint. Xiangnan et al. [18] extended NMF to multi-view clustering by jointly factorizing the multiple matrices through co-regularization, which has shown better performance for views with varying levels of quality. Sunil et al. [15] proposed a partial shared nonnegative subspace learning method for two views, which is effective in social media retrieval. Furthermore, Liu et al. [31] generalized this idea to multiview nonnegative data with more than two views. Recently, Shao et al. [37] proposed an online multi-view clustering method for incomplete views by imposing lasso regularization on the representation of each view. Further related work can be found in [16, 33, 40, 50]. These methods are useful for nonnegative multiview data analysis; however, they are not suitable for noisy or incomplete views, which are often encountered in real applications. For example, in the clustering of bilingual documents, the two languages can be regarded as two different views, but many documents are available in only one language.

In this paper, we propose a co-regularized nonnegative matrix factorization method with correlation constraint for robust multi-view feature learning, which provides an explicit latent representation via capturing complementary and consistent information across different views. As shown in Fig. 2, different views are represented in heterogeneous feature spaces. The proposed method aims to learn robust features from all views simultaneously via exploiting the complementary and consistent information between different views. More specifically, we learn the latent representation shared by different views via maximizing the correlation between the coefficient matrix and consensus matrix. Meanwhile, we impose similarity constraints on the latent representation by co-regularizing each pair of views during the factorization process. The main contributions are summarized as follows:

  • co-regularization: we exploit co-regularization for each pair of views, which is effective to accommodate the imbalance of the quality of multiple views;

  • correlation constraint: we impose correlation constraint on the low-dimensional space to obtain the compact latent representation shared by different views;

  • robustness to noisy views: the experimental results show that the proposed method is more robust than existing methods, especially for noisy views.

Fig. 2

Illustration of the proposed approach. First, different features, such as color, texture and shape, are extracted from the objects. Then, the low-dimensional representations are obtained by multi-view nonnegative matrix factorization, in which the co-regularization on each pair of coefficient matrices and the correlation constraints are imposed during the factorization process

The remainder of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we present the proposed multi-view NMF via co-regularization with correlation constraint. In Section 4, we give the details of the optimization algorithm. We then report the experimental results in Section 5 and conclude the paper in Section 6.

2 Related works

In this section, we briefly review nonnegative matrix factorization (NMF) and multiview NMF.

2.1 NMF

Given a nonnegative matrix X, NMF decomposes X into the product of two non-negative matrices U and V [10], i.e., \(X\approx UV^{T}\). The objective function of NMF can be formulated as follows [5]:

$$\begin{array}{@{}rcl@{}} && \min\limits_{U,V}{\left\| X-UV^{T}\right\| }_{F}^{2} \\ && s.t. ~~ U\geq 0,V\geq 0. \end{array} $$
(1)

NMF has shown good performance in pattern recognition and computer vision [41, 49].
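
As a minimal sketch of how problem (1) is commonly solved, the snippet below uses the standard multiplicative update rules for the Frobenius objective. The function name `nmf`, the iteration count, the small constant `eps` and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative X (m x N) by U @ V.T with U (m x K), V (N x K)."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    U = rng.random((m, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        # Multiplicative updates keep U and V nonnegative by construction.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Toy data: an exactly rank-3 nonnegative matrix (illustrative only).
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 30))
U, V = nmf(X, K=3)
rel_err = np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The constant `eps` only guards against division by zero; it is not part of objective (1).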

2.2 Multi-view NMF

Consider a multiview nonnegative dataset consisting of N samples with \(n_{v}\) different views, denoted by \(\{X^{(1)},X^{(2)},\dots ,X^{(n_{v})}\}\). For each view \(X^{(v)}\), multiview NMF [12] factorizes \(X^{(v)}\approx U^{(v)}(V^{(v)})^{T}\) and learns a latent representation \(V^{*}\) shared across all the views via the following objective function:

$$\begin{array}{@{}rcl@{}} &&\min\sum\limits_{v=1}^{n_{v}}\left\{\left\|X^{(v)}-U^{(v)}(V^{(v)})^{T}\right\|_{F}^{2}+\lambda_{v} \left\|V^{(v)}-V^{*}\right\|_{F}^{2}\right\}, \\ && s.t.~U^{(v)},V^{(v)},V^{*} \geq 0, \end{array} $$
(2)

where \(\lambda_{v}\) is the regularization parameter, which balances the importance of different views and the reconstruction error.

3 Co-regularized multiview NMF with correlation constraint

In this section, we present the co-regularized multiview NMF with correlation constraint for nonnegative representation learning. Given a multiview data set consisting of N samples with \(n_{v}\) views as \({\Gamma }=\{X^{{(v)}}\in \mathbb {R}_{+}^{m_{v}\times N}\}_{v=1}^{n_{v}}\), \(X^{{(v)}}=\{ \vec {x}_{1}^{(v)},{\cdots } ,\vec {x}_{N}^{(v)} \}\), \( X_{i}=\{\vec {x}_{i}^{(1)},\dots ,\vec {x}_{i}^{(n_{v})}\} \), where \(X^{(v)}\) denotes the N samples of the v-th view with dimensionality \(m_{v}\), and \(X_{i}\) is the i-th sample across the different views. We want to learn the common latent representation \(V^{*}\) across different views under the framework of NMF.

3.1 Co-regularization

The intuitive method for multiview representation learning is to learn a common representation by regularizing the representation matrices of different views. This idea works well for homogeneous views or for views of similar quality. However, in real applications, the quality of the views might vary drastically, in which case the existing methods may fail.

In this paper, we impose similarity constraints on each pair of views, which encourages the coefficient matrices learned from any pair of views to complement each other during the factorization process. Suppose that the data \(X^{(v)}\) of view v is contaminated, with associated representation \(V^{(v)}\), while the data \(X^{(t)}\) of view t is clean, with associated representation \(V^{(t)}\). By minimizing \(\left \|V^{(v)}-V^{(t)}\right \|_{F}^{2}\), the coefficient matrices \(V^{(v)}\) and \(V^{(t)}\) complement each other, so that the representation of the contaminated view is pulled toward that of the clean view. Thus, the problem of quality imbalance between different views can be addressed efficiently. Considering all pairs of views, the co-regularization term for view v is defined as follows:

$$\begin{array}{@{}rcl@{}} \sum\limits_{t=1}^{n_{v}}{\lambda_{vt}\left\|V^{(v)}-V^{(t)}\right\|_{F}^{2} } \end{array} $$
(3)

where \(\lambda_{vt}\) is the parameter that balances the importance of the similarity constraint between \(V^{(v)}\) and \(V^{(t)}\).

3.2 Correlation constraint

As is well known, different views are complementary to each other and capture the same latent structure of the same entity [12, 21]. To utilize this information, we propose a correlation constraint on the low-dimensional representation to learn a compact and shared latent representation. Given the coefficient vector \(V_{i,\cdot }^{(v)} \) and the consensus vector \( V_{i,\cdot }^{*}\) of the i-th sample, we encourage the correlation between \(V_{i,\cdot }^{(v)} \) and \( V_{i,\cdot }^{*}\) to be as large as possible. This can be formulated as follows:

$$\max\left\{V_{i,\cdot}^{(v)}(V_{i,\cdot}^{*})^{T}\right\}.$$

Considering all the N samples, we have \({\sum }_{i=1}^{N}V_{i,\cdot }^{(v)}(V_{i,\cdot }^{*})^{T}=Tr(V^{(v)}(V^{*})^{T})\), where \(Tr\) denotes the trace of a matrix. Thus, the correlation constraint can be formulated as follows:

$$\begin{array}{@{}rcl@{}} \min\left\{Tr\left[V^{*}(V^{*})^{T}-V^{(v)}(V^{*})^{T}\right]\right\}. \end{array} $$
(4)

Here, the term \( V_{i,\cdot }^{*}(V_{i,\cdot }^{*})^{T}\) is imposed as a constraint on \( V_{i,\cdot }^{*}\) in order to learn a meaningful representation.
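
The identity between the sum of row-wise inner products and the trace can be checked numerically. The following is a small sketch on random toy matrices; the sizes and the helper name `correlation_term` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3                      # toy sizes, chosen only for illustration
V_v = rng.random((N, K))         # coefficient matrix of one view
V_star = rng.random((N, K))      # consensus matrix

# Sum of per-sample inner products between matching rows.
row_sum = sum(V_v[i] @ V_star[i] for i in range(N))
# The same quantity written as a trace.
trace_form = np.trace(V_v @ V_star.T)

def correlation_term(V_v, V_star):
    """Tr[V* V*^T - V^(v) V*^T] from (4), computed without forming N x N matrices."""
    return np.sum(V_star * V_star) - np.sum(V_v * V_star)

print(np.isclose(row_sum, trace_form))
print(correlation_term(V_star, V_star))   # vanishes when the view equals the consensus
```

Using element-wise products avoids building the N × N matrices inside the trace, which matters when N is large.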

3.3 Objective function

Incorporating the co-regularization (3) and correlation constraint (4) into the NMF framework, we obtain the objective function for the proposed method:

$$\begin{array}{@{}rcl@{}} &&\min\sum\limits_{v=1}^{n_{v}}\left\{{\left\|{X^{(v)}-U^{(v)}(V^{(v)})^{T}}\right\| }_{F}^{2}+\sum\limits_{t=1}^{n_{v}}{\lambda_{vt}\left\|{ V^{(v)}-V^{(t)}}\right\| }_{F}^{2}\right.\\ && \qquad\qquad\left.+ \sigma_{v} Tr\left[V^{*}(V^{*})^{T}-V^{(v)}(V^{*})^{T}\right]\vphantom{\sum\limits_{t=1}^{n_{v}} \lambda_{vt}}\right\},\\ && s.t.~U^{(v)},V^{(v)},V^{*} \geq 0,\left\| U_{\cdot ,k}^{(v)}\right\|_{1}=1, \forall 1\leq k\leq K \end{array} $$
(5)

where \(\lambda_{vt}\) and \(\sigma_{v}\) are the regularization parameters and K is the dimensionality of the low-dimensional subspace. The constraint \(\| U_{\cdot ,k}^{(v)}\|_{1}=1\) normalizes each basis vector, following the relationship between NMF and probabilistic latent semantic analysis [11].
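
To make the three terms of (5) concrete, the sketch below evaluates the objective for given factors. The function names, the argument layout (lists of per-view arrays, a matrix `lam` of \(\lambda_{vt}\) values) and the toy data are assumptions for illustration only.

```python
import numpy as np

def objective(Xs, Us, Vs, V_star, lam, sigma):
    """Value of objective (5).

    Xs, Us, Vs: lists of per-view arrays (m_v x N, m_v x K, N x K).
    lam: n_v x n_v array of lambda_{vt}; sigma: length-n_v sequence of sigma_v.
    """
    n_v = len(Xs)
    total = 0.0
    for v in range(n_v):
        # Reconstruction error of view v.
        total += np.linalg.norm(Xs[v] - Us[v] @ Vs[v].T) ** 2
        # Pairwise co-regularization, term (3).
        for t in range(n_v):
            total += lam[v, t] * np.linalg.norm(Vs[v] - Vs[t]) ** 2
        # Correlation constraint, term (4).
        total += sigma[v] * (np.sum(V_star * V_star) - np.sum(Vs[v] * V_star))
    return total

def normalize_columns(U):
    """Rescale each basis vector so that its L1 norm equals one."""
    return U / U.sum(axis=0, keepdims=True)

# Toy check with two views that share one representation exactly.
rng = np.random.default_rng(0)
N, K = 5, 2
V = np.ones((N, K))
Us = [normalize_columns(rng.random((4, K))), normalize_columns(rng.random((7, K)))]
Xs = [U @ V.T for U in Us]
lam = np.full((2, 2), 0.001)
sigma = [0.05, 0.05]
value = objective(Xs, Us, [V, V], 2 * V, lam, sigma)
print(value)
```

In this toy setting the reconstruction and co-regularization terms vanish, so the value comes only from the correlation term.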

4 Optimization algorithm

To simplify the computation, we absorb the normalization constraint on the basis matrix \(U^{(v)}\) into the following diagonal matrix:

$$Q^{(v)}=Diag\left( \sum\limits_{i=1}^{m_{v}}U_{i,1}^{(v)},\sum\limits_{i=1}^{m_{v}}U_{i,2}^{(v)},\ldots, \sum\limits_{i=1}^{m_{v}}U_{i,K}^{(v)} \right) $$

Thus, problem (5) can be reformulated as below:

$$\begin{array}{@{}rcl@{}} &&\min\sum\limits_{v=1}^{n_{v}}\left\{{\left\|{X^{(v)}-U^{(v)}(V^{(v)})^{T}}\right\|}_{F}^{2}+\sum\limits_{t=1}^{n_{v}} \lambda_{vt}\left\|{V^{(v)}Q^{(v)}-V^{(t)}}\right\|_{F}^{2}\right.~~~\\ &&\qquad\qquad\left.+ \sigma_{v} Tr\left[V^{*}(V^{*})^{T}-V^{(v)}Q^{(v)}(V^{*})^{T}\right]\vphantom{\sum\limits_{t=1}^{n_{v}} \lambda_{vt}}\right\}\\ &&~s.t.~U^{(v)},V^{(v)},V^{*} \geq 0 \end{array} $$
(6)

4.1 Optimize \(U^{(v)}\) and \(V^{(v)}\) for given \(V^{*}\)

We utilize an alternating update scheme, i.e., solving for one variable with the others fixed. When \(V^{*}\) is fixed, the computation of \(U^{(v)}\) and \(V^{(v)}\) can be carried out for each view v separately. Therefore, we use X, U, V, \(\lambda_{t}\), \(\sigma\) and Q to denote \(X^{(v)}\), \(U^{(v)}\), \(V^{(v)}\), \(\lambda_{vt}\), \(\sigma_{v}\) and \(Q^{(v)}\) for brevity.

4.1.1 Optimize U for given V and \(V^{*}\)

Given V and \(V^{*}\), problem (6) can be solved for U via the following Lagrangian function:

$$\begin{array}{@{}rcl@{}} L(U)&=&\left\| X-UV^{T}\right\|_{F}^{2}+\sum\limits_{t=1}^{n_{v}}{\lambda_{t}\left\|{VQ-V^{(t)}}\right\| }_{F}^{2}\\ &&+\sigma Tr\left[V^{*}(V^{*})^{T}-VQ(V^{*})^{T}\right]+Tr({\Theta}^{T}U), \end{array} $$
(7)

where \( {\Theta } =[\theta _{i,k}]\in \mathbb {R}^{m\times K} \) is the matrix of Lagrange multipliers for the non-negativity constraint \(U\geq 0\). The partial derivative of L(U) with respect to \(U_{i,k}\) is:

$$\begin{array}{@{}rcl@{}} \frac{\partial{L(U)}}{\partial{U_{i,k}}}=-2(XV)_{i,k}+2(UV^{T}V)_{i,k}+S_{i,k}+\sigma T_{i,k}+\theta_{i,k} , \end{array} $$
(8)

where \(S_{i,k}\) is the derivative of \( {\sum }_{t=1}^{n_{v}}{\lambda _{t}\left \|{VQ-V^{(t)}}\right \| }_{F}^{2} \) and \(T_{i,k}\) is the derivative of \(Tr\left[V^{*}(V^{*})^{T}-VQ(V^{*})^{T}\right]\), both with respect to \(U_{i,k}\). Since \(Q_{k,k}={\sum }_{l=1}^{m}U_{l,k}\), their explicit forms are:

$$\begin{array}{@{}rcl@{}} S_{i,k}&=& \frac{\partial{\sum}_{t=1}^{n_{v}}{\lambda_{t}\left\|{VQ-V^{(t)}}\right\| }_{F}^{2}}{\partial {U_{i,k}}} \\ &=&2\sum\limits_{t=1}^{n_{v}}\lambda_{t}\left\{\sum\limits_{l=1}^{m}U_{l,k}\left( \sum\limits_{j=1}^{N}V_{j,k}V_{j,k}\right)-\sum\limits_{j=1}^{N}V_{j,k}V_{j,k}^{(t)}\right\}\\ T_{i,k}&=& \frac{\partial{Tr\left[V^{*}(V^{*})^{T}-VQ(V^{*})^{T}\right]}}{\partial {U_{i,k}}}=-\sum\limits_{j=1}^{N}V_{j,k}V_{j,k}^{*} \end{array} $$
(9)

Setting (8) to zero and utilizing the KKT conditions \(\theta_{i,k}U_{i,k}=0\), we get the following equation for \(U_{i,k}\):

$$\left( -2(XV)_{i,k}+2(UV^{T}V)_{i,k}+S_{i,k}+\sigma T_{i,k}+\theta_{i,k}\right) U_{i,k} =0 $$

This equation leads to the following update rule for \(U_{i,k}\):

$$\begin{array}{@{}rcl@{}} U_{i,k}\leftarrow U_{i,k} \frac{2(XV)_{i,k}+2{\sum}_{t=1}^{n_{v}}\lambda_{t}{\sum}_{j=1}^{N}V_{j,k}V_{j,k}^{(t)}+\sigma{\sum}_{j=1}^{N}V_{j,k} V_{j,k}^{*}}{2(UV^{T}V)_{i,k}+2{\sum}_{t=1}^{n_{v}}\lambda_{t}{\sum}_{l=1}^{m}U_{l,k}{\sum}_{j=1}^{N}V_{j,k}V_{j,k}} \end{array} $$
(10)
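
A vectorized sketch of one step of rule (10) is given below. It reads the inner sums column-wise, i.e. as sums over j of products of k-th column entries of V, which is what follows from \(Q_{k,k}={\sum }_{l}U_{l,k}\). The function name `update_U`, the argument layout and the constant `eps` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def update_U(X, U, V, V_all, lam, V_star, sigma, eps=1e-10):
    """One multiplicative step of rule (10) for a single view.

    X: m x N, U: m x K, V: N x K (current view), V_all: list of the N x K
    matrices V^(t), lam: matching list of lambda_t, V_star: N x K.
    """
    col_sq = np.sum(V * V, axis=0)                    # sum_j V_jk^2, shape (K,)
    cross = sum(l * np.sum(V * Vt, axis=0) for l, Vt in zip(lam, V_all))
    corr = np.sum(V * V_star, axis=0)                 # sum_j V_jk V*_jk
    q = U.sum(axis=0)                                 # Q_kk = sum_l U_lk
    # The (K,)-shaped terms broadcast over the m rows of U.
    numer = 2 * (X @ V) + 2 * cross + sigma * corr
    denom = 2 * (U @ (V.T @ V)) + 2 * sum(lam) * q * col_sq
    return U * numer / (denom + eps)

rng = np.random.default_rng(0)
m, N, K = 8, 12, 3
U = rng.random((m, K))
V = rng.random((N, K))
X = U @ V.T
V_other = rng.random((N, K))
V_star = 0.5 * (V + V_other)

U_new = update_U(X, U, V, [V, V_other], [0.001, 0.001], V_star, sigma=0.05)
# With all regularization switched off and X = U V^T exactly, U is a fixed point.
U_same = update_U(X, U, V, [V, V_other], [0.0, 0.0], V_star, sigma=0.0)
```

Because every factor in the ratio is nonnegative, the updated U stays nonnegative without an explicit projection.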

4.1.2 Optimize V for given U and \(V^{*}\)

To optimize V, we first normalize the columns of U using Q as follows:

$$U\leftarrow UQ^{-1},V\leftarrow VQ $$

Then, problem (6) is equivalent to minimizing the following objective function:

$$\begin{array}{@{}rcl@{}} L(V)&=&\left\| X-UV^{T}\right\|_{F}^{2}+\sum\limits_{t=1}^{n_{v}}{\lambda_{t}\left\|{V-V^{(t)}}\right\| }_{F}^{2} \\ &&+\sigma Tr\left[V^{*}(V^{*})^{T}-V(V^{*})^{T}\right]+Tr({\Psi}^{T}V), \end{array} $$
(11)

where \( {\Psi } =[\psi _{j,k}]\in \mathbb {R}^{N\times K} \) is the matrix of Lagrange multipliers for the non-negativity constraint \(V\geq 0\). The partial derivative of L(V) with respect to \(V_{j,k}\) is:

$$\begin{array}{@{}rcl@{}} \frac{\partial{L(V)}}{\partial{V_{j,k}}}&=&-2(X^{T}U)_{j,k}+2(VU^{T}U)_{j,k}+2\sum\limits_{t=1}^{n_{v}}\lambda_{t}V_{j,k} \\ &&-2\sum\limits_{t=1}^{n_{v}}\lambda_{t}V_{j,k}^{(t)}-\sigma V_{j,k}^{*}+\psi_{j,k} \end{array} $$
(12)

Setting (12) to zero and utilizing the KKT conditions \(\psi_{j,k}V_{j,k}=0\), we get the following equation for \(V_{j,k}\):

$$\begin{array}{@{}rcl@{}} (-2(X^{T}U)_{j,k}\,+\,2(VU^{T}U)_{j,k}\,+\,2\sum\limits_{t=1}^{n_{v}}\lambda_{t}V_{j,k} \,-\,2\sum\limits_{t=1}^{n_{v}}\lambda_{t}V_{j,k}^{(t)}\,-\,\sigma V_{j,k}^{*}+\psi_{j,k}) V_{j,k} =0 \end{array} $$
(13)

Thus, the update rule for \(V_{j,k}\) is:

$$ V_{j,k}\leftarrow V_{j,k}\frac{2(X^{T}U)_{j,k}+2{\sum}_{t=1}^{n_{v}}\lambda_{t}V_{j,k}^{(t)}+\sigma V_{j,k}^{*}}{2(VU^{T}U)_{j,k}+2{\sum}_{t=1}^{n_{v}}\lambda_{t}V_{j,k}} $$
(14)
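
The normalization step and rule (14) can be sketched as follows. The helper names `normalize` and `update_V`, the argument layout and the constant `eps` are assumptions for illustration.

```python
import numpy as np

def normalize(U, V):
    """Apply U <- U Q^{-1}, V <- V Q with Q = diag(column sums of U)."""
    q = U.sum(axis=0)
    return U / q, V * q

def update_V(X, U, V, V_all, lam, V_star, sigma, eps=1e-10):
    """One multiplicative step of rule (14) for a single view."""
    pull = sum(l * Vt for l, Vt in zip(lam, V_all))   # sum_t lambda_t V^(t)
    numer = 2 * (X.T @ U) + 2 * pull + sigma * V_star
    denom = 2 * (V @ (U.T @ U)) + 2 * sum(lam) * V
    return V * numer / (denom + eps)

rng = np.random.default_rng(0)
U = rng.random((8, 3))
V = rng.random((12, 3))
X = U @ V.T

U_n, V_n = normalize(U, V)          # the product U V^T is unchanged
V_other = rng.random((12, 3))
V_star = 0.5 * (V_n + V_other)
V_new = update_V(X, U_n, V_n, [V_n, V_other], [0.001, 0.001], V_star, sigma=0.05)
```

The rescaling by Q leaves the reconstruction \(UV^{T}\) untouched, which is why it can be applied before each update of V.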

4.2 Optimize \(V^{*}\) for given \(U^{(v)}\) and \(V^{(v)}\)

Taking the derivative of the objective function (6) with respect to \(V^{*}\), we obtain

$$ \frac{\partial R}{\partial V^{*}}=\sum\limits_{v=1}^{n_{v}}2\sigma_{v} V^{*}-\sum\limits_{v=1}^{n_{v}}\sigma_{v}V^{(v)}Q^{(v)} , $$
(15)

where \(R={\sum }_{v=1}^{n_{v}}\sigma _{v} Tr\left [V^{*}(V^{*})^{T}-V^{(v)}Q^{(v)}(V^{*})^{T}\right ]\). Setting (15) to zero, we get the closed-form solution for \(V^{*}\):

$$ \begin{aligned} V^{*} = \frac{{\sum}_{v=1}^{n_{v}}\sigma_{v} V^{(v)}Q^{(v)}}{{\sum}_{v=1}^{n_{v}}2\sigma_{v}}~~~~ \end{aligned} $$
(16)
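
The closed-form update (16) amounts to a weighted average of the rescaled coefficient matrices, divided by two. A short sketch, with the hypothetical helper name `update_V_star` and random toy factors, also verifies that the gradient (15) vanishes at this solution:

```python
import numpy as np

def update_V_star(Us, Vs, sigma):
    """Closed-form consensus (16): sum_v sigma_v V^(v) Q^(v) / (2 sum_v sigma_v)."""
    # V * U.sum(axis=0) scales column k of V by Q_kk, i.e. it computes V Q.
    weighted = sum(s * V * U.sum(axis=0) for s, U, V in zip(sigma, Us, Vs))
    return weighted / (2.0 * sum(sigma))

rng = np.random.default_rng(0)
N, K = 10, 3
Us = [rng.random((6, K)), rng.random((9, K))]
Vs = [rng.random((N, K)), rng.random((N, K))]
sigma = [0.05, 0.1]

V_star = update_V_star(Us, Vs, sigma)
# Gradient (15) evaluated at the closed-form solution.
grad = 2 * sum(sigma) * V_star - sum(s * V * U.sum(axis=0) for s, U, V in zip(sigma, Us, Vs))
```

Since all inputs are nonnegative, the resulting consensus matrix is nonnegative as well, so no projection is needed.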

\(U^{(v)}\), \(V^{(v)}\) and \(V^{*}\) are updated alternately via (10), (14) and (16), and this procedure repeats until convergence. It can be seen that \(U^{(v)}\), \(V^{(v)}\) and \(V^{*}\) remain non-negative after each update. Moreover, it can be proved that the objective function is non-increasing under the above iterative update rules, so convergence is guaranteed. The proof can be obtained by constructing an auxiliary function similar to [8]. The complete algorithm is summarized in Algorithm 1.

Algorithm 1
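
Independently of the original listing, the alternating procedure built from (10), (14) and (16) can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: a single scalar `lam` is used for every \(\lambda_{vt}\) and a single scalar `sigma` for every \(\sigma_{v}\), the initialization is random, the inner sums of (10) are read column-wise, and a fixed number of iterations replaces a convergence test.

```python
import numpy as np

def co_nmf(Xs, K, lam=0.001, sigma=0.05, n_iter=100, eps=1e-10, seed=0):
    """Alternate rules (10), (14) and (16) over all views.

    Xs: list of nonnegative arrays, one per view, each of shape m_v x N.
    Returns the per-view factors and the consensus matrix V*.
    """
    rng = np.random.default_rng(seed)
    N = Xs[0].shape[1]
    n_v = len(Xs)
    Us = [rng.random((X.shape[0], K)) for X in Xs]
    Vs = [rng.random((N, K)) for _ in Xs]
    V_star = sum(Vs) / n_v                      # assumed starting point
    for _ in range(n_iter):
        for v, X in enumerate(Xs):
            U, V = Us[v], Vs[v]
            # Rule (10): update the basis matrix of view v.
            col_sq = np.sum(V * V, axis=0)
            cross = sum(lam * np.sum(V * Vt, axis=0) for Vt in Vs)
            corr = np.sum(V * V_star, axis=0)
            numer = 2 * (X @ V) + 2 * cross + sigma * corr
            denom = 2 * (U @ (V.T @ V)) + 2 * n_v * lam * U.sum(axis=0) * col_sq
            U = U * numer / (denom + eps)
            # Normalization U <- U Q^{-1}, V <- V Q.
            q = np.maximum(U.sum(axis=0), eps)
            U, V = U / q, V * q
            Us[v], Vs[v] = U, V
            # Rule (14): update the coefficient matrix of view v.
            pull = sum(lam * Vt for Vt in Vs)
            numer = 2 * (X.T @ U) + 2 * pull + sigma * V_star
            denom = 2 * (V @ (U.T @ U)) + 2 * n_v * lam * V
            Vs[v] = V * numer / (denom + eps)
        # Rule (16); Q^(v) is the identity after normalization and all
        # sigma_v are equal, so the weights cancel.
        V_star = sum(Vs) / (2.0 * n_v)
    return Us, Vs, V_star

# Toy data: two views generated from one shared nonnegative representation.
rng = np.random.default_rng(1)
V_true = rng.random((40, 4))
Xs = [rng.random((15, 4)) @ V_true.T, rng.random((25, 4)) @ V_true.T]
Us, Vs, V_star = co_nmf(Xs, K=4)
```

The returned `V_star` plays the role of the common latent representation on which the clustering experiments of Section 5 are evaluated.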

4.3 Complexity analysis

We adopt the standard NMF as the baseline to analyse the time complexity of the proposed method, which is essentially an extension of NMF to multiple-view data. The complexity of basic NMF’s update rules in each iteration is \(O(mKN)\), where big O is the notation for complexity. For each update of U in our proposed method, the cost is \(O(n_{v}mKN)\). For each update of V, the additional cost in the proposed method is the second term in the numerator and denominator, whose time complexity is \(O(n_{v}KN)\). Therefore the time complexity of the proposed method for each view is \(O(n_{v}mK)\). Then, the total complexity of the proposed method in each iteration is \(O(n_{v}mKN)\), where \(n_{v}\) is the number of views.

5 Experimental results

In this section, we conduct experiments on four datasets to evaluate the performance of the proposed method compared to the following algorithms.

  • Single view. This method runs each view separately using the NMF. Both the best and the worst single view results are reported, which are denoted by BSV and WSV respectively.

  • Feature concatenation (FC). This method runs NMF directly on the concatenated features from all views.

  • Multi-view NMF (Multi-NMF) [12]. This method requires the representations of all views to share a common latent one through the term \( {\sum }_{v=1}^{n_{v}}\lambda _{v} \| V^{(v)}-V^{*} \|_{F}^{2}\). As the authors provided an NMF-based initialization, we use the same initialization method and set the regularization parameters to 0.01.

  • Multi-view RNMF (Multi-RNMF) [35]. This method learns the common latent representation under the nonnegative patch alignment framework and considers the local geometric structure for each view.

  • Co-regularization NMF (CoNMF) [18]. This method learns the common latent space via pair-wise co-regularization.

  • Our method. This method learns latent representation by simultaneously exploiting the complementary and consistent information from all views through the co-regularization and correlation constraint.

5.1 Data sets and evaluation

ORL dataset

The ORL dataset consists of 40 subjects with 10 different images per subject, giving 400 images in total. The images are grayscale and have been normalized to 32 × 32 pixels; some of them are shown in Fig. 3a. We adopt two different views. The first view is the raw pixel values, i.e., \( X^{1}\in \mathbb {R}_{+}^{1024\times 400}\), and the second view is the \(LBP_{8,1}\) feature, i.e., \( X^{2}\in \mathbb {R}_{+}^ {59\times 400}\).

Fig. 3

(a) Samples from the ORL dataset. (b) Samples from the CMU-PIE dataset with random block occlusion. (c) Samples from the handwritten digit dataset, consisting of original images and noisy images with salt-and-pepper noise at a noise level of 25%

CMU-PIE dataset

The CMU-PIE dataset contains 41,368 images of 68 persons under 13 different poses, 43 different illumination conditions and 4 different expressions. In our experiment, we choose 42 images at pose 27 for each person under different illumination conditions, at a resolution of 32 × 32, and add a white random block occlusion of size 10 × 10. There are 2856 images in all, and some examples are shown in Fig. 3b. We consider two different views: the first view is the raw pixel values \( X^{1}\in \mathbb {R}_{+}^{1024\times 2856}\), and the second view is the local binary pattern \( X^{2}\in \mathbb {R}_{+}^{256\times 2856}\).

UCI handwritten digit dataset

This handwritten digit (0–9) dataset is from the UCI repository. It consists of 2000 samples, with the first view being the 76 Fourier coefficients of the character shapes, the second view being the 240 pixel averages in 2 × 3 windows, the third view being the 216 profile correlations, and the fourth view being the 47 Zernike moments. In order to test the robustness of the proposed method, salt-and-pepper noise is added with the noise level varied over {5%, 10%, 15%, 20%, 25%}. Some examples are shown in Fig. 3c.
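
The text does not specify how the noise is generated. One common convention, sketched below as an assumption, corrupts a fraction `level` of the entries and sets each corrupted entry to either the maximum ("salt") or the minimum ("pepper") value with equal probability. The function name `add_salt_pepper` and the toy array are hypothetical.

```python
import numpy as np

def add_salt_pepper(X, level, low=None, high=None, seed=0):
    """Corrupt a fraction `level` of the entries of X with salt-and-pepper noise."""
    rng = np.random.default_rng(seed)
    low = X.min() if low is None else low
    high = X.max() if high is None else high
    noisy = X.astype(float).copy()
    mask = rng.random(X.shape) < level      # entries selected for corruption
    salt = rng.random(X.shape) < 0.5        # half of them become "salt"
    noisy[mask & salt] = high
    noisy[mask & ~salt] = low
    return noisy

# Toy nonnegative array standing in for one view of the data.
rng = np.random.default_rng(42)
X = rng.random((100, 100))
noisy = add_salt_pepper(X, level=0.25)
changed = np.mean(noisy != X)               # close to the requested noise level
```

Because the corrupted values stay within the original range, the noisy data remains nonnegative and can be passed to NMF-based methods unchanged.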

OuluVS dataset

Lipreading is a technology that interprets utterances solely from the visual information of lip movements. The OuluVS dataset records lipreading videos of 20 subjects, with a total of 817 videos. Each subject was asked to sit in front of a camera and speak the 10 different sentences shown in Table 1. The subjects are from four countries, with different speech habits and accents. Usually, multivariate time series are used to model the facial landmarks around the outer mouth. The extracted time series are then converted into texture images with a modified recurrence plot, and recognition is based on these texture images. In this dataset, we extract two different features from the texture images as two views. The first view uses the uniform local binary pattern operator \(LBP_{p,r}^{u2}\) with p = 8 and r = {1, 2, 3} to generate a 177-dimensional feature vector. The second view is the grey level co-occurrence matrix (GLCM): we use four directions (0°, 45°, 90° and 135°) and five distances (d = {1, 2, 3, 4, 5}) to calculate 20 GLCMs, from which a 400-dimensional feature vector is obtained.

Table 1 The ten different sentences in OuluVS dataset

Evaluation metrics

For quantitative evaluation, the accuracy (ACC) and the normalized mutual information metric (NMI) are used to measure the clustering performance [3, 54]. The detailed definitions are shown below.

  • Clustering accuracy (ACC). ACC compares the generated clusters with the ground truth. In detail, for each sample \(x_{i}\), let \(l_{i}\) and \(g_{i}\) be the cluster label and the ground truth label, respectively. ACC is defined as follows:

    $$ACC = \frac{1}{n}\sum\limits_{i=1}^{n}\delta(g_{i},map(l_{i})),$$

    where n is the total number of samples, and δ(x, y) is the delta function that equals one if x = y and zero otherwise. map(⋅) is the permutation mapping function, which maps each cluster label to a ground truth label; the best mapping is found with the Kuhn-Munkres algorithm [9]. ACC ranges over [0, 1], and a larger value indicates a better clustering result.

  • Normalized mutual information (NMI). Let \(\mathcal {C}\) denote the set of clusters obtained from the ground truth and \(\mathcal {C}^{\prime }\) the set of clusters produced by the algorithm. The mutual information is defined as follows:

    $$\begin{array}{@{}rcl@{}} MI = \sum\limits_{\mathbf{c}_{i} \in \mathcal{C}, \mathbf{c}_{j}^{\prime}\in \mathcal{C}^{\prime}} p(\mathbf{c}_{i},\mathbf{c}_{j}^{\prime}) log\frac{p(\mathbf{c}_{i},\mathbf{c}^{\prime}_{j})}{p(\mathbf{c}_{i})p(\mathbf{c}_{j}^{\prime})}, \end{array} $$

    where \(p(\mathbf {c}_{i})\) and \(p(\mathbf {c}_{j}^{\prime })\) are the probabilities that a sample arbitrarily selected from the data set belongs to the clusters \(\mathbf {c}_{i}\) and \(\mathbf {c}_{j}^{\prime }\), respectively, and \(p(\mathbf {c}_{i},\mathbf {c}_{j}^{\prime })\) is the joint probability that the selected sample belongs to both \(\mathbf {c}_{i}\) and \(\mathbf {c}_{j}^{\prime }\) simultaneously. In our experiments, the NMI is defined as follows:

    $$NMI = \frac{MI(\mathcal{C},\mathcal{C}^{\prime})}{\max(H\mathbf{(\mathcal{C})},H\mathbf{(\mathcal{C}^{\prime})})},$$

    where \(H\mathbf {(\mathcal {C})}\) and \(H\mathbf {(\mathcal {C}^{\prime })}\) denote the entropies of \(\mathcal {C}\) and \(\mathcal {C}^{\prime }\), respectively. NMI ranges from 0 to 1, and a larger value indicates a better clustering result.
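
Both metrics defined above can be computed from the contingency table of predicted and ground truth labels. The sketch below uses `scipy.optimize.linear_sum_assignment` as the Kuhn-Munkres solver and normalizes the mutual information by the larger of the two entropies; the function names and the toy labels are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(truth, pred):
    """Count matrix with rows = predicted clusters, columns = ground truth classes."""
    _, t_idx = np.unique(truth, return_inverse=True)
    _, p_idx = np.unique(pred, return_inverse=True)
    counts = np.zeros((p_idx.max() + 1, t_idx.max() + 1), dtype=int)
    np.add.at(counts, (p_idx, t_idx), 1)
    return counts

def clustering_accuracy(truth, pred):
    """ACC with the best one-to-one mapping between clusters and classes."""
    counts = contingency(truth, pred)
    # Negating the counts turns the maximum matching into a minimum-cost assignment.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / counts.sum()

def nmi_max(truth, pred):
    """Mutual information divided by max(H(C), H(C'))."""
    joint = contingency(truth, pred) / len(truth)
    p_pred, p_true = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_pred, p_true)[nz]))
    h_pred = -np.sum(p_pred * np.log(p_pred))
    h_true = -np.sum(p_true * np.log(p_true))
    denom = max(h_pred, h_true)
    return mi / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1, 2, 2]
pred_perm = [1, 1, 0, 0, 2, 2]      # same partition, permuted cluster names
pred_err = [1, 1, 0, 0, 2, 0]       # one sample assigned to the wrong cluster
acc_perm, nmi_perm = clustering_accuracy(truth, pred_perm), nmi_max(truth, pred_perm)
acc_err, nmi_err = clustering_accuracy(truth, pred_err), nmi_max(truth, pred_err)
```

Both scores are invariant to a renaming of the cluster labels, which is why the permuted prediction is treated as a perfect clustering.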

5.2 Experimental results with two views

In this section, we conduct clustering on three real-world datasets with two views, respectively. For the handwritten digit dataset, we select the first and second views. In order to evaluate the performance reliably, we run each experiment 30 times and report the average clustering results and the standard deviation.

The clustering results with different numbers of clusters K on the ORL dataset are shown in Tables 2 and 3. It can be seen that the clustering performance of all algorithms improves as K increases, and that the proposed method performs better than the other multi-view algorithms in most cases. Note that BSV records only the best single-view results, which are not stable for clustering.

Table 2 ACC with different K values on the ORL dataset
Table 3 NMI with different K values on the ORL dataset

The clustering results with different numbers of clusters K on the CMU-PIE dataset are shown in Tables 4 and 5. The multi-view algorithms outperform the single-view method, even compared with its best results. Among all the multiview methods, Multi-RNMF and our method perform better than the others, and our method slightly outperforms Multi-RNMF.

Table 4 ACC with different K values on the CMU-PIE dataset with occlusion
Table 5 NMI with different K values on the CMU-PIE dataset with occlusion

For the handwritten digit dataset and the OuluVS dataset, we fix the cluster number K at 10. Tables 6 and 7 show the average clustering performance on these two datasets, respectively. The clustering performance of Multi-NMF and Multi-RNMF is better than that of single-view NMF. Meanwhile, the performance of FC differs markedly between datasets, because it integrates multiple features only by concatenation and thus ignores the differences in statistical properties between views. In addition, our method obtains strong clustering performance in all cases, owing to the utilization of both consistent and complementary information across different views.

Table 6 Clustering results on the handwritten digit dataset (K = 10)
Table 7 Clustering results on the OuluVS dataset (K = 10)

5.3 Experimental results with four views

In this section, we conduct clustering on the handwritten digit dataset with four views. To test the robustness, salt-and-pepper noise is added with the noise level varying from 5 to 25%. The clustering results are shown in Figs. 4 and 5. The performance of the multi-view algorithms is better than that of single-view NMF and FC. Both Multi-NMF and Multi-RNMF achieve satisfactory clustering results, while Multi-RNMF performs better than Multi-NMF as the noise level increases. Meanwhile, the proposed method obtains the best clustering results among all algorithms. Specifically, the performance of the other algorithms drops sharply as the noise level increases, whereas that of the proposed method decreases only slightly. This is mainly because we utilize the co-regularization and correlation constraints to exploit the complementary and consistent information across different views.

Fig. 4

Clustering results of different methods on the handwritten digit dataset with four views in the presence of salt-and-pepper noise varying from 5 to 10%

Fig. 5

Clustering results of different methods on the handwritten digit dataset with four views in the presence of salt-and-pepper noise varying from 15 to 25%

5.4 Visualization of clustering results

To visualize the clustering results, we randomly select three subjects from the ORL dataset with two views. There are 10 samples for each subject and 30 samples in all. The dimensionality K of the latent space is set to 3. The learned common representation in the new space is shown in Fig. 6. It can be seen that the proposed method obtains more discriminative features.

Fig. 6

Visualization of the low-dimensional representation of three different classes, which are randomly selected from the ORL dataset. Different colors denote different classes, and the ten points in each class correspond to the ten images of each subject

5.5 Analysis of convergence

In this section, we demonstrate the convergence of our method by conducting an experiment on the handwritten digit dataset with the same initialization. As shown in Fig. 7, the objective function value is non-increasing under the proposed iterative update rules. Meanwhile, the proposed method converges faster than Multi-NMF: its objective function value decreases rapidly within 5 iterations. This is because the co-regularization term and the correlation constraints limit the solution space. These observations are also supported by the clustering results on the handwritten digit dataset in Figs. 4 and 5.

Fig. 7

The convergence curves of Multi-NMF and our method on the handwritten digit dataset

5.6 Parameters selection

Two kinds of parameters, \(\sigma_{v}\) and \(\lambda_{vt}\), need to be set. The parameter \(\sigma_{v}\) balances the correlation constraint and the latent representation, while \(\lambda_{vt}\) determines the importance of each pair of views in the co-regularization. In this section, we conduct experiments on three datasets with two views to study their influence. Figures 8 and 9 show the performance of the proposed method with one parameter varying while the other is fixed. Our method is relatively stable across a wide range of values, especially on the ORL dataset. For the OuluVS dataset, the results vary more strongly than on the other datasets. According to these results, we set \(\lambda_{vt}=0.001\) and \(\sigma_{v}=0.05\) in our experiments.

Fig. 8

The performance of different methods on three datasets with \(\sigma_{v}\) varying while fixing \(\lambda_{vt}=0.001\)

Fig. 9

The performance of different methods on three datasets with \(\lambda_{vt}\) varying while fixing \(\sigma_{v}=0.05\)

6 Conclusion

In this paper, we proposed a co-regularized multiview nonnegative matrix factorization method with correlation constraint for nonnegative representation learning. We exploited the complementary information through co-regularization to deal with views of imbalanced quality, so that the latent representations complement each other when one of the views is contaminated. Meanwhile, we imposed a correlation constraint on the common latent subspace to obtain the latent representation shared by different views. The experimental results show that the representation learned by the proposed method is more compact and discriminative, especially for noisy views. In future work, we will study supervised multiview nonnegative representation learning for classification.