1 Introduction

Recently, multiview data have become very common in many real-world applications [1, 2]. For example, in image processing, an image can be represented by diverse features, such as HOG, SIFT, and LBP [3]. In biometrics, a person’s identity can be recognized by faces, fingerprints, and voice [4]. Since different feature views depict different perspectives of the same object, extensive research has demonstrated that the performance of multiview learning can be substantially improved by exploiting the complementary information of multiview data [5].

Multiview clustering (MVC) is a fundamental task of multiview learning: it divides multiview data into different groups by efficiently integrating multiple feature views so that highly similar instances fall into the same group while dissimilar instances fall into different groups [6, 7]. In general, most existing MVC methods employ graph-based models since the similarity graph can effectively characterize the data structure [8]. Typically, these methods first construct a view-specific affinity graph using some similarity metric and then fuse all the constructed view-specific affinity graphs into a consensus affinity graph. Finally, a spectral or graph algorithm is applied to obtain the clustering [9, 10], or the clustering result is obtained directly from the fusion [11].

The clustering performance of graph-based methods largely depends on the quality of the constructed affinity graph [12]. To learn a better affinity graph, various affinity graph construction methods have been proposed, and they typically fall into three categories [13]. The first category predefines a similarity graph as the affinity graph [14]. This approach has two outstanding issues: (1) The construction of the affinity graph is easily affected by the choice of similarity metric, neighborhood size, and scaling parameter, all of which are data-dependent and noise-sensitive [15]; (2) The constructed affinity graph cannot capture the underlying graph structure of the data well. The second category is the adaptive neighbors graph approach, which assigns adaptive and optimal neighbors to each data point according to local distances to learn an affinity graph [16,17,18]. It does not need to specify the neighborhood size, and the similarity among data points is adaptively learned from the data. Generally, data points with a smaller distance receive a higher affinity value, while data points with a larger distance receive a lower affinity value. This approach is an effective way to preserve the local manifold structure [15]. The third category is the data self-expression approach, which reconstructs each data point as a linear combination of all other points in the same subspace and then uses the resulting coefficient matrix to build the affinity graph [19, 20]. This approach is an effective way to capture the global structure [21].

In general, the underlying data structures are unknown in advance, which poses a challenge for constructing an affinity graph that best captures the essential data structure [22]. Because both the local and global graph structures are crucial for uncovering the intrinsic structure of the data, they can provide each other with complementary information to boost performance. Therefore, it is necessary to integrate the adaptive neighbor graph approach and the data self-expression approach into a unified graph structure learning model to learn a view-specific affinity graph that not only automatically learns the similarity information from the data but also captures the local and global structures of the data.

Given that diverse views admit the same underlying cluster structure, we can obtain consensus information from the fusion of diverse views to better exploit the cluster structure [23]. Therefore, after obtaining the view-specific affinity graphs, we need to consider how to fuse them effectively. Simply taking their average [24] or directly generating a common graph from them [25] fails to account for the differing discriminative power of the views. Kang et al. [26] proposed a dynamically weighted graph fusion method to integrate multiview information, in which each view is treated as a perturbation of a consensus graph. Usually, the closer a view is to the consensus graph, the larger the weight it is assigned. This graph fusion method can merge different views into a consensus graph, distinguish the contributions of different views, and explore the heterogeneous complementary information effectively.

An ideal consensus affinity graph should have between-cluster affinities that are all zero and within-cluster affinities that are nonzero; that is, it should have a block diagonal structure, which facilitates good clustering performance [27,28,29,30]. Unfortunately, in real-life noisy applications, the consensus affinity graph may have only a weak block-diagonal structure, and the number of target blocks in it is difficult to control. Thus, we need to consider how to build a robust block diagonal representation for the consensus affinity graph. Because the k-block diagonal regularizer directly pursues a block diagonal structure containing exactly k blocks [27], it encourages the affinity graph to take the desired k-block diagonal form; therefore, we employ the k-block diagonal regularizer to improve the quality of the consensus affinity graph.

In summary, a novel MVC method, namely consensus affinity graph learning via structure graph fusion and block diagonal representation (CAGL-SGBD) for multiview clustering, is proposed, and the main contributions of this work are as follows:

  1. (1)

    Adaptive neighbor graph learning and the data self-expression model are integrated into a unified structure graph fusion framework that can capture the local and global structures of the data, be robust to noise, and guide the construction of the initial affinity graph for each individual view.

  2. (2)

    All the constructed structure affinity graphs are weighted and fused into a consensus affinity graph that not only incorporates the complementary affinity structure of important views but also preserves the consensus affinity structure that is unanimously admitted by multiple views.

  3. (3)

    The k-block diagonal regularizer is introduced for the consensus affinity graph to force it to have an explicit cluster structure.

  4. (4)

    Our method integrates structure graph fusion, consensus affinity graph learning, and k-block diagonal regularization, which helps to obtain an enhanced consensus affinity graph that maintains the graph structure characteristics of multiview data and is beneficial to clustering.

The rest of the paper is organized as follows. Section 2 briefly introduces the preliminaries that are necessary for the research. Section 3 gives a detailed description and formulation of CAGL-SGBD. Section 4 designs an efficient optimization algorithm, and some analyses are presented. Section 5 presents numerical experiments. Section 6 concludes our work.

2 Preliminaries

In this section, we first present the notations used throughout the paper and then briefly review some techniques that are necessary for this work.

2.1 Notations

In this paper, matrices are denoted as boldface capital letters, e.g., \({{\varvec{X}}}\). Vectors are written as boldface lower-case letters, e.g., \({{\varvec{x}}}\), and scalars are written as lower-case letters, e.g., x. For an arbitrary matrix \({{\varvec{X}}}\), its (i, j)-th entry is written as \(x_{ij}\), its j-th column is written as \({{\varvec{x}}}_j\), and its i-th row is written as \({{\varvec{x}}}^i\). For a vector \({{\varvec{x}}}\), its i-th entry is written as \(x_i\). The transpose and inverse of matrix \({{\varvec{X}}}\) are denoted as \({{\varvec{X}}}^T\) and \({{\varvec{X}}}^{-1}\), respectively. The data matrix of the v-th view is denoted as \({{\varvec{X}}}^{(v)}\). \(\Vert {{\varvec{X}}}\Vert _2\) and \(\Vert {{\varvec{X}}}\Vert _F\) represent the \(l_2\)-norm and Frobenius norm of matrix \({{\varvec{X}}}\), respectively. The identity matrix is denoted by \({{\varvec{I}}}\).

2.2 Local Structure Graph Learning

Recently, the adaptive neighbor graph approach has been widely employed in graph-based clustering to capture the local manifold structure [18, 21, 31]. Given n multiview observations \({\{[x_i^{(1)};\cdots ;x_i^{(n_v)}]\}}_{i=1}^n \) from \( n_v\) different views, \(x_i^{(v)}\) denotes the i-th data point of the v-th view, and \({{\varvec{X}}}^{(v)}=[x_1^{(v)},x_2^{(v)},\cdots ,x_n^{(v)}] \). \({{\varvec{Z}}}^{(v)}\) is the affinity matrix of the v-th view. The adaptive neighbors graph approach can be expressed as

$$\begin{aligned}&\min _{{{\varvec{Z}}}^{(v)}}\sum _{v=1}^{n_v}\sum _{i,j=1}^n\Vert x_i^{(v)}- x_j^{(v)}\Vert _2^2 z_{ij}^{(v)}\nonumber \\&s.t.\ \ {{\varvec{Z}}}^{(v)}{} {\textbf {1}}={\textbf {1}},\ {{\varvec{Z}}}^{(v)} \ge 0, \end{aligned}$$
(1)

In Eq. (1), the constraint term is added to guarantee that the sum of each row of \({{\varvec{Z}}}^{(v)}\) is one and to ensure the probability property of \({{\varvec{Z}}}^{(v)}\) [32]. The affinity matrices learned from Eq. (1) capture the local manifold structure adaptively. Since the local structure is prominent for its information discovery ability and is generally believed to be more informative than the global structure [33], local structure graph learning is a widely recognized graph clustering approach [34, 35]. However, it is susceptible to noise and ignores the global structure of the data.

2.3 Global Structure Graph Learning

The data self-expression approach is an effective way to automatically capture the global structure of data [22]. Its basic idea is that each data sample can be expressed as a linear combination of the other samples, and the combination coefficients indicate the similarities between samples [26]. The multiview self-expression model can be expressed as

$$\begin{aligned} {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)}{{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)},v\in {\{1,\cdots ,n_v\}} \end{aligned}$$
(2)

where \({{\varvec{X}}}^{(v)}\), \({{\varvec{Z}}}^{(v)}\), and \({{\varvec{E}}}^{(v)}\) stand for the data matrix, coefficient matrix, and error matrix of the v-th view, respectively. The coefficient matrix \({{\varvec{Z}}}^{(v)}\) characterizes the similarities among samples; a low-rank constraint can be applied to it to capture the global data structure [36], and an \(l_{2,1}\)-norm can be imposed on \({{\varvec{E}}}^{(v)}\) to address sample-specific outliers and corruptions [37]. Thus, global structure graph learning can be formulated as

$$\begin{aligned}&\min _{{{\varvec{Z}}}^{(v)}, {{\varvec{E}}}^{(v)}}\sum _{v=1}^{n_v}\Vert {{\varvec{Z}}}^{(v)}\Vert _* + \lambda \sum _{v=1}^{n_v}\Vert {{\varvec{E}}}^{(v)}\Vert _{2,1}\nonumber \\&s.t.\ {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)}, {{\varvec{Z}}}^{(v)}\ge 0 \end{aligned}$$
(3)

Equation (3) is also a typical objective function of subspace clustering based on the self-representation model, and the coefficient matrix \({{\varvec{Z}}}^{(v)}\) learned from it can capture the global structure information of the v-th view. Many existing subspace clustering methods are based on the above model to cooperatively learn the low-rank representation and the affinity matrix to improve the clustering performance [38, 39].

2.4 k-Block Diagonal Regularizer

The k-block diagonal structure of the affinity graph can promote perfect data clustering. For an arbitrary affinity matrix \({{\varvec{H}}}\in {{\varvec{R}}}^{n \times n}({{\varvec{H}}}\ge 0,{{\varvec{H}}}={{\varvec{H}}}^T) \), its corresponding Laplacian matrix is \( {{\varvec{L}}}_{{\varvec{H}}}=Diag({{\varvec{H}}}{} {\textbf {1}})-{{\varvec{H}}} \). The k-block diagonal regularizer [30, 40] is defined as

$$\begin{aligned} \parallel {{\varvec{H}}}\Vert _{\fbox {\textit{k}}}=\sum _{i=n-k+1}^n \lambda _i ({{\varvec{L}}}_{{\varvec{H}}}) \end{aligned}$$
(4)

where \(\lambda _i ({{\varvec{L}}}_{{\varvec{H}}})\) (\(i\in {\{1,\cdots ,n\}}\)) are the eigenvalues of \({{\varvec{L}}}_{{\varvec{H}}}\) in descending order.

The regularizer in Eq. (4) has an important property: the multiplicity k of the eigenvalue 0 of \({{\varvec{L}}}_{{\varvec{H}}}\) equals the number of connected components (blocks) in \({{\varvec{H}}}\), which will make the affinity graph have exactly k connected components for data with k clusters. Thus, the k-block diagonal regularizer can not only encourage the affinity matrix to have a block diagonal structure but also control the number of blocks [29, 30].
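
To make Eq. (4) concrete, the regularizer value is simply the sum of the k smallest eigenvalues of the Laplacian of the affinity matrix. A minimal NumPy sketch (the function name is ours and not part of the original formulation) is:

```python
import numpy as np

def k_block_diag_regularizer(H: np.ndarray, k: int) -> float:
    """Value of the k-block diagonal regularizer in Eq. (4):
    the sum of the k smallest eigenvalues of L_H = Diag(H 1) - H.
    H is assumed to be nonnegative and symmetric."""
    L_H = np.diag(H.sum(axis=1)) - H
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    eigvals = np.linalg.eigvalsh(L_H)
    return float(eigvals[:k].sum())
```

A value of zero indicates that the graph already has at least k connected components.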

3 Proposed Method

3.1 Structure Graph Fusion

To explore a more flexible local manifold structure and a better global structure representation capacity and to take full advantage of the possible complementary information provided by them, we integrate Eqs. (1) and (3) into a unified structure graph fusion framework to jointly learn a view-specific structure affinity graph, which can be formally expressed as

$$\begin{aligned}&\min _{{{\varvec{Z}}}^{(v)}, {{\varvec{E}}}^{(v)}}\sum _{v=1}^{n_v}\{{Tr({{\varvec{X}}}^{(v)}{{\varvec{L}}}_{{{\varvec{Z}}}}^{(v)}{{\varvec{X}}}^{(v)T})}+ \lambda _1\Vert {{\varvec{Z}}}^{(v)}\Vert _*+\lambda _2\Vert {{\varvec{E}}}^{(v)}\Vert _{2,1}\} \nonumber \\&s.t.\ {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)}{{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)},\ {{\varvec{Z}}}^{(v)}{} {\textbf {1}}={\textbf {1}},\ {{\varvec{Z}}}^{(v)}\ge 0 \end{aligned}$$
(5)

where \(\lambda _1\) and \(\lambda _2\) are trade-off parameters. \({{\varvec{L}}}_{{{\varvec{Z}}}}^{(v)}\) denotes the Laplacian matrix, \({{\varvec{L}}}_{{{\varvec{Z}}}}^{(v)}={{\varvec{D}}}_{{{\varvec{Z}}}}^{(v)}-({{\varvec{Z}}}^{(v)T}+{{\varvec{Z}}}^{(v)})/2\), \({{\varvec{D}}}_{{{\varvec{Z}}}}^{(v)}\) is a diagonal degree matrix, and its diagonal elements are \(\sum _{j}(z_{ij}^{(v)}+z_{ji}^{(v)})/2\). The first term in Eq. (5) ensures that each entry in \({{\varvec{Z}}}^{(v)}\) can directly describe the local similarity between data points in the v-th view, and the second term encourages \({{\varvec{Z}}}^{(v)}\) to follow the low-rank property to capture the global structure of the data. The third term addresses sample-specific corruption and outliers.

Thus, the structure affinity graph \({{{\varvec{Z}}}}^{(v)}\) learned from Eq. (5) can not only characterize the affinities between data points in the v-th view but also preserve the local and global structures of the data in the v-th view.

3.2 Consensus Affinity Graph Learning

To learn an optimal consensus affinity graph, we fuse all the structure affinity graphs into a consensus affinity graph \({{{\varvec{S}}}}\) based on two intuitive assumptions [26]: (1) Each \({{{\varvec{Z}}}}^{(v)}\) can be regarded as a perturbation of \({{{\varvec{S}}}}\); (2) The closer \({{{\varvec{Z}}}}^{(v)}\) is to \({{{\varvec{S}}}}\), the larger the weight assigned to the v-th view. Thus, we have

$$\begin{aligned}&\min _{{{\varvec{S}}}}\sum _{v=1}^{n_v}w^{(v)}\Vert {{\varvec{S}}}-{{{\varvec{Z}}}}^{(v)} \Vert _F^2 \nonumber \\&s.t.\ {{\varvec{s}}}^i {\textbf {1}}=1,s_{ij}\ge 0, w^{(v)}\ge 0 \end{aligned}$$
(6)

where \(w^{(v)}\) is the weight of the v-th view and represents the importance of the v-th view; the larger \(w^{(v)}\) is, the greater the importance of the v-th view. \({{\varvec{s}}}^i\) represents the i-th row of \({{\varvec{S}}}\), and the constraint terms are added to ensure the probabilistic nature of \({{\varvec{S}}}\).

To ensure that the learned consensus affinity graph \({{\varvec{S}}}\) can well characterize the affinities between data points from different views and capture the consensus affinity structure that is universally admitted by all views, we assume that, across all structure affinity graphs \(\{{{{\varvec{Z}}}}^{(v)}\}_{v=1}^{n_v}\), any two data points \(z_i\) and \(z_j\) share the same affinity value \(s_{ij}\). Then, we have

$$\begin{aligned}&\min _{{{\varvec{S}}}}\frac{1}{2}\sum _{v=1}^{n_v}\sum _{i,j}^{n}\Vert \textit{z}_{i}^{(v)} -\textit{z}_{j}^{(v)} \Vert _2^2s_{ij} \nonumber \\&s.t.\ {s}_{ij} \ge 0 \end{aligned}$$
(7)

Employing Eq. (7) to learn the similarities between data points is based on the intuitive assumption that the self-expressiveness coefficient matrix \({{\varvec{Z}}}^{(v)}\) learned from Eq. (5) can be deemed a substitute for the data matrix \({{\varvec{X}}}^{(v)}\) because each entry of \({{\varvec{Z}}}^{(v)}\) quantifies the similarity between two data points in \({{\varvec{X}}}^{(v)}\); namely, if two data points are close to each other in the original space, their new representations in the new space must also be similar to each other [13]. Furthermore, compared to the original data matrix \({{\varvec{X}}}^{(v)}\), the clean structure affinity matrix \({{\varvec{Z}}}^{(v)}\) can better describe the intrinsic structure of the real data; thus, a more robust consensus affinity matrix \({{\varvec{S}}}\) can be derived from \(\{{{{\varvec{Z}}}}^{(v)}\}_{v=1}^{n_v}\).

Then, the model of consensus affinity graph learning is

$$\begin{aligned}&\min _{{{\varvec{S}}},w^{(v)}}\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}}-{{{\varvec{Z}}}}^{(v)} \Vert _F^2+Tr({{\varvec{Z}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{Z}}}^{(v)T})\} \nonumber \\&s.t.\ {{\varvec{s}}}^i {\textbf {1}}=1,s_{ij}\ge 0, w^{(v)}\ge 0 \end{aligned}$$
(8)

where \({{\varvec{L}}}_{{{\varvec{S}}}}\) denotes the Laplacian matrix, \({{\varvec{L}}}_{{\varvec{S}}}={{\varvec{D}}}_{{\varvec{S}}}-({{\varvec{S}}}^T+{{\varvec{S}}})/2\), and \({{\varvec{D}}}_{{{\varvec{S}}}}\) is a diagonal degree matrix whose diagonal elements are \(\sum _j(s_{ij}+s_{ji})/2\). Equation (8) can adaptively learn a consensus affinity graph \({{\varvec{S}}}\) that preserves the consensus affinity structure admitted by all the structure affinity graphs. At the same time, during the fusion process, the structure affinity graphs are dynamically weighted, which effectively reduces the influence of noisy views and merges important complementary information.

3.3 Block Diagonal Representation

However, the consensus affinity graph \({{\varvec{S}}}\) learned from Eq. (8) may not have the block diagonal structure that is needed for clustering. Therefore, we introduce a k-block diagonal regularizer for \({{\varvec{S}}}\) to ensure that it satisfies the block diagonal characteristic; then, we have

$$\begin{aligned}&\Vert {{\varvec{S}}}\Vert _{\fbox {\textit{k}}}=\sum _{i=n-k+1}^n\lambda _i ({{\varvec{L}}}_{{\varvec{S}}}) \nonumber \\&s.t.\ {{\varvec{S}}}\ge 0,{{\varvec{S}}}={{\varvec{S}}}^T \end{aligned}$$
(9)

where \({{\varvec{S}}}=[{{\varvec{s}}}_1,\cdots ,{{\varvec{s}}}_j,\cdots ,{{\varvec{s}}}_n]\). The \({{\varvec{S}}}\) learned from Eq. (9) is k-block diagonal and has an explicit cluster structure.

3.4 Objective Function

By integrating the structure graph fusion in Eq. (5), the consensus affinity graph learning in Eq. (8), and the block diagonal representation in Eq. (9) into a unified model, the final objective function is

$$\begin{aligned}&\min _{{{\varvec{S}}}, w^{(v)}, {{\varvec{Z}}}^{(v)}, {{\varvec{E}}}^{(v)}}\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{Z}}}^{(v)}\Vert _F^2 + Tr({{\varvec{Z}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{Z}}}^{(v)T})+2\lambda _1 Tr({{\varvec{X}}}^{(v)}{{\varvec{L}}}_{{{\varvec{Z}}}}^{(v)}{{\varvec{X}}}^{(v)T}) \nonumber \\&\qquad \qquad +\lambda _2 \Vert {{\varvec{Z}}}^{(v)}\Vert _* +\lambda _3\Vert {{\varvec{E}}}^{(v)} \Vert _{2,1}\} + \lambda _4\Vert {{\varvec{S}}}\Vert _ {\fbox {\textit{k}}} \nonumber \\&s.t.\ {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)},\ {{\varvec{Z}}}^{(v)}{} {\textbf {1}}={\textbf {1}},\ {{\varvec{Z}}}^{(v)}\ge 0, \nonumber \\&{{\varvec{s}}}^i {\textbf {1}}=1,{{\varvec{S}}}\ge 0,{{\varvec{S}}}={{\varvec{S}}}^T, w^{(v)}\ge 0 \end{aligned}$$
(10)

where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) and \(\lambda _4\) are trade-off parameters for balancing the corresponding terms. The first term fuses the structural affinity graphs \({\{{{\varvec{Z}}}^{(v)}}\}_{v=1}^{n_v}\) of different views adaptively into a consensus affinity graph \({{\varvec{S}}}\). The second term encourages the learned \({{\varvec{S}}}\) to capture the affinities among data points that are unanimously admitted by all views. The third term ensures that \({{\varvec{Z}}}^{(v)}\) can adaptively capture the local manifold structure of the original data. The fourth term imposes a low-rank constraint on the representation matrix \({{\varvec{Z}}}^{(v)}\) to capture the global structure. Through the joint learning of the first four terms, the structural affinity matrix \({{\varvec{Z}}}^{(v)}\) can capture both the local and global structures of the data in each individual view, and meanwhile, the intrinsic structures of data that are contained in the structural affinity graphs \({{\varvec{Z}}}^{(v)}\) can be well integrated into the consensus affinity graph \({{\varvec{S}}}\). The fifth term resists sample-specific corruptions and outliers to enhance the robustness of the model. The sixth term is the k-block diagonal representation of the learned \({{\varvec{S}}}\).

Consequently, the consensus affinity graph \({{\varvec{S}}}\) learned from Eq. (10) can well integrate the underlying data structure of \({\{{{\varvec{Z}}}^{(v)}}\}_{v=1}^{n_v}\), characterize the similarity among data points, and have an explicit k-block diagonal structure.

4 Optimization

In this section, the augmented Lagrange multiplier method with an alternating direction minimizing strategy is used to solve Eq. (10). Specifically, Eq. (10) can be optimized alternately by introducing auxiliary variables: \({{\varvec{S}}}={{\varvec{M}}}\), \({{\varvec{Z}}}^{(v)}={{\varvec{A}}}^{(v)}\), \({{\varvec{Z}}}^{(v)}={{\varvec{B}}}^{(v)}\), \({{\varvec{Z}}}^{(v)}={{\varvec{C}}}^{(v)}\) and \({{\varvec{Z}}}^{(v)}={{\varvec{D}}}^{(v)}\), \(v\in \{1,\dots ,n_v\}\), to make problem (10) separable. Then Eq. (10) is converted to the following optimization problem:

$$\begin{aligned}&\min _{\begin{array}{c} {{\varvec{S}}}, w^{(v)}, {{\varvec{E}}}^{(v)},{{\varvec{M}}},{{\varvec{Z}}}^{(v)},\\ {{\varvec{A}}}^{(v)},{{\varvec{B}}}^{(v)},{{\varvec{C}}}^{(v)},{{\varvec{D}}}^{(v)} \end{array} }\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{A}}}^{(v)}\Vert _F^2 + Tr({{\varvec{B}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{B}}}^{(v)T})+ \nonumber \\&2\lambda _1 Tr({{\varvec{X}}}^{(v)}{{\varvec{L}}}_{{{\varvec{C}}}}^{(v)}{{\varvec{X}}}^{(v)T})+\lambda _2 \Vert {{\varvec{D}}}^{(v)}\Vert _* +\lambda _3\Vert {{\varvec{E}}}^{(v)} \Vert _{2,1}\} + \lambda _4\Vert {{\varvec{M}}}\Vert _{\fbox {\textit{k}}} \nonumber \\&s.t.\ {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)},{{\varvec{Z}}}^{(v)}={{\varvec{A}}}^{(v)}, {{\varvec{Z}}}^{(v)}={{\varvec{B}}}^{(v)}, {{\varvec{Z}}}^{(v)}={{\varvec{C}}}^{(v)}, \nonumber \\&{{\varvec{Z}}}^{(v)}={{\varvec{D}}}^{(v)},{{\varvec{S}}}={{\varvec{M}}},{{\varvec{C}}}^{(v)}{} {\textbf {1}}={\textbf {1}},{{\varvec{C}}}^{(v)}\ge 0,{{\varvec{D}}}^{(v)}\ge 0,{{\varvec{M}}}\ge 0,\nonumber \\&{{\varvec{M}}}={{\varvec{M}}}^T,{{\varvec{s}}}^i {\textbf {1}}=1,s_{ij}\ge 0 ,w^{(v)}\ge 0 \end{aligned}$$
(11)

The augmented Lagrange function of Eq. (11) is

$$\begin{aligned}&\min _{\begin{array}{c} {{\varvec{S}}}, w^{(v)}, {{\varvec{E}}}^{(v)},{{\varvec{M}}},{{\varvec{Z}}}^{(v)},\\ {{\varvec{A}}}^{(v)},{{\varvec{B}}}^{(v)},{{\varvec{C}}}^{(v)},{{\varvec{D}}}^{(v)} \end{array} }\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{A}}}^{(v)}\Vert _F^2 + Tr({{\varvec{B}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{B}}}^{(v)T})+ \nonumber \\&2\lambda _1 Tr({{\varvec{X}}}^{(v)}{{\varvec{L}}}_{{{\varvec{C}}}}^{(v)}{{\varvec{X}}}^{(v)T})+\lambda _2 \Vert {{\varvec{D}}}^{(v)}\Vert _* +\lambda _3\Vert {{\varvec{E}}}^{(v)} \Vert _{2,1}\} + \lambda _4\Vert {{\varvec{M}}}\Vert _{\fbox {\textit{k}}}+ \nonumber \\&\sum _{v=1}^{n_v}\frac{\mu }{2}(\Vert {{\varvec{X}}}^{(v)}-{{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}- {{\varvec{E}}}^{(v)}+ \frac{{{\varvec{Y}}}_1^{(v)}}{\mu }\Vert _F^2 + \nonumber \\&\Vert {{\varvec{Z}}}^{(v)}-{{\varvec{A}}}^{(v)}+ \frac{{{\varvec{Y}}}_2^{(v)}}{\mu }\Vert _F^2 +\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{B}}}^{(v)}+ \frac{{{\varvec{Y}}}_3^{(v)}}{\mu }\Vert _F^2 + \nonumber \\&\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{C}}}^{(v)}+ \frac{{{\varvec{Y}}}_4^{(v)}}{\mu }\Vert _F^2+\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{D}}}^{(v)}+ \frac{{{\varvec{Y}}}_5^{(v)}}{\mu }\Vert _F^2)+ \frac{\mu }{2}\Vert {{\varvec{S}}} - {{\varvec{M}}} + \frac{{{\varvec{Y}}}_6}{\mu }\Vert _F^2 \nonumber \\&s.t. {{\varvec{C}}}^{(v)}{} {\textbf {1}}={\textbf {1}},{{\varvec{C}}}^{(v)}\ge 0,{{\varvec{D}}}^{(v)}\ge 0,{{\varvec{M}}}\ge 0,{{\varvec{M}}}={{\varvec{M}}}^T,{{\varvec{s}}}^i {\textbf {1}}=1,s_{ij}\ge 0 ,w^{(v)}\ge 0 \end{aligned}$$
(12)

where \({{\varvec{Y}}}_1^{(v)}\), \({{\varvec{Y}}}_2^{(v)}\), \({{\varvec{Y}}}_3^{(v)}\), \({{\varvec{Y}}}_4^{(v)}\),\({{\varvec{Y}}}_5^{(v)}\) and \({{\varvec{Y}}}_6\) are Lagrange multipliers, and \(\mu > 0\) is a penalty parameter. Equation (12) can be solved by alternately updating each variable while fixing all the other variables. The update rules are as follows:

Update \({{\varvec{A}}}^{(v)}\)   Fixing all variables except \({{\varvec{A}}}^{(v)}\), Eq. (12) can be written as

$$\begin{aligned} \min _{{{\varvec{A}}}^{(v)}}w^{(v)} \Vert {{\varvec{S}}} - {{\varvec{A}}}^{(v)}\Vert _F^2 +\frac{\mu }{2} \Vert {{\varvec{Z}}}^{(v)}-{{\varvec{A}}}^{(v)}+ \frac{{{\varvec{Y}}}_2^{(v)}}{\mu }\Vert _F^2 \end{aligned}$$
(13)

Taking the derivative of Eq. (13) w.r.t. \({{\varvec{A}}}^{(v)}\) and setting it to 0, the updating rule of \({{\varvec{A}}}^{(v)}\) is

$$\begin{aligned}&{{\varvec{A}}}^{(v)^{t+1}}= (2w^{(v)}+ \mu )^{-1} (2w^{(v)}{{\varvec{S}}} + \mu {{\varvec{Z}}}^{(v)}+ {{\varvec{Y}}}_2^{(v)}) \end{aligned}$$
(14)
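
A minimal NumPy sketch of this closed-form update (function and argument names are ours) is:

```python
import numpy as np

def update_A(S, Z_v, Y2_v, w_v, mu):
    """Eq. (14): A^(v) = (2 w^(v) S + mu Z^(v) + Y2^(v)) / (2 w^(v) + mu)."""
    return (2.0 * w_v * S + mu * Z_v + Y2_v) / (2.0 * w_v + mu)
```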

Update \({{\varvec{B}}}^{(v)}\)   Fixing all variables except \({{\varvec{B}}}^{(v)}\), Eq. (12) can be simplified as

$$\begin{aligned} \min _{{{\varvec{B}}}^{(v)}}Tr({{\varvec{B}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{B}}}^{(v)T}) +\frac{\mu }{2} \Vert {{\varvec{Z}}}^{(v)}-{{\varvec{B}}}^{(v)}+ \frac{{{\varvec{Y}}}_3^{(v)}}{\mu }\Vert _F^2 \end{aligned}$$
(15)

Taking the derivative of Eq. (15) w.r.t. \({{\varvec{B}}}^{(v)}\) and setting it to 0, the updating rule of \({{\varvec{B}}}^{(v)}\) is

$$\begin{aligned}&{{\varvec{B}}}^{(v)^{t+1}}= (\mu {{\varvec{Z}}}^{(v)}+ {{\varvec{Y}}}_3^{(v)}) (2{{\varvec{L}}}_{{{\varvec{S}}}}+\mu {{\varvec{I}}})^{-1} \end{aligned}$$
(16)
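
For concreteness, Eq. (16) can be implemented by solving a linear system instead of forming the inverse explicitly; a hedged NumPy sketch (names are ours) is:

```python
import numpy as np

def update_B(Z_v, Y3_v, L_S, mu):
    """Eq. (16): B^(v) = (mu Z^(v) + Y3^(v)) (2 L_S + mu I)^{-1}."""
    n = L_S.shape[0]
    rhs = 2.0 * L_S + mu * np.eye(n)   # symmetric positive definite for mu > 0
    # B rhs = (mu Z + Y3)  <=>  rhs B^T = (mu Z + Y3)^T  (rhs is symmetric)
    return np.linalg.solve(rhs, (mu * Z_v + Y3_v).T).T
```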

Update \({{\varvec{C}}}^{(v)}\)   Fixing all variables except \({{\varvec{C}}}^{(v)}\), Eq. (12) can be simplified as

$$\begin{aligned}&\min _{{{\varvec{C}}}^{(v)}}\lambda _1\sum _{i,j=1}^n\Vert x_i^{(v)}- x_j^{(v)}\Vert _2^2 c_{ij}^{(v)}+\frac{\mu }{2}\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{C}}}^{(v)}+ \frac{{{\varvec{Y}}}_4^{(v)}}{\mu }\Vert _F^2 \nonumber \\&s.t.{{\varvec{C}}}^{(v)}{} {\textbf {1}}={\textbf {1}},{{\varvec{C}}}^{(v)}\ge 0 \end{aligned}$$
(17)

To simplify notation, the view index is tentatively omitted. Let \(h_{ij}\) be the (i, j)-th element of \({{\varvec{H}}}=({{\varvec{Z}}}+ \frac{{{\varvec{Y}}}_4}{\mu })\). Noting that Eq. (17) is independent across rows, we can address the following problem separately for each i,

$$\begin{aligned} \min _{{{\varvec{c}}}^{i}{} {\textbf {1}}=1,c_{ij}\ge 0}\lambda _1\sum _{j=1}^n\Vert x_i - x_j\Vert _2^2 c_{ij}+\sum _{j=1}^n(\frac{\mu }{2}c_{ij}^2 - \mu h_{ij}c_{ij}) \end{aligned}$$
(18)

Denote \(g_{ij}=\lambda _1\Vert x_i-x_j\Vert _2^2- \mu h_{ij}\) as the j-th element of \({{\varvec{g}}}_{i} \in \mathcal {R}^{1 \times n}\); then, \({{\varvec{c}}}^{i}\) in Eq. (18) can be updated as

$$\begin{aligned}&\min _{{{\varvec{c}}}^{i}{} {\textbf {1}}=1,{{\varvec{c}}}^{i}\ge 0}\Vert {{\varvec{c}}}^{i} + \frac{{{\varvec{g}}}_{i}}{\mu }\Vert _2^2 \end{aligned}$$
(19)

The Lagrangian function of Eq. (19) is

$$\begin{aligned} \mathcal {L}({{\varvec{c}}}^{i},\delta ,\varphi ) =\Vert {{\varvec{c}}}^{i} + \frac{{{\varvec{g}}}_{i}}{\mu }\Vert _2^2 + \delta (1-{{\varvec{c}}}^{i}{} {\textbf {1}}) + \varphi ^T(-{{\varvec{c}}}^{i}) \end{aligned}$$
(20)

where \(\delta \) and \(\varphi \ge 0 \) are the Lagrangian multipliers, and the optimal solution \({{\varvec{c}}}^{i}\) is

$$\begin{aligned} {{{\varvec{c}}}^{i}}^{t+1} =(-\frac{{{\varvec{g}}}_{i}}{\mu } + \delta {\textbf {1}}^{T})_+ \end{aligned}$$
(21)

where \((\cdot )_+ =\max (\cdot ,0)\).
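
Equation (21) is the Euclidean projection of \(-{{\varvec{g}}}_{i}/\mu \) onto the probability simplex, with \(\delta \) playing the role of the optimal shift. A small NumPy sketch of this row-wise update (the helper and argument names are ours) is:

```python
import numpy as np

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection onto {c : c >= 0, sum(c) = 1}; this realizes
    Eq. (21) with the optimal Lagrange multiplier delta."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    delta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + delta, 0.0)

def update_C_row(x_dist_sq_row, z_row, y4_row, mu, lam1):
    """Row-wise update of C^(v) following Eqs. (17)-(21).
    x_dist_sq_row[j] = ||x_i - x_j||_2^2; z_row and y4_row are the i-th
    rows of Z^(v) and Y4^(v)."""
    h_row = z_row + y4_row / mu                  # i-th row of H = Z + Y4/mu
    g_row = lam1 * x_dist_sq_row - mu * h_row    # g_i in the text
    return project_simplex(-g_row / mu)
```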

Update \({{\varvec{D}}}^{(v)}\)   Fixing all variables except \({{\varvec{D}}}^{(v)}\), Eq. (12) is equivalent to

$$\begin{aligned} \min _{{{\varvec{D}}}^{(v)}}\frac{\lambda _2}{\mu }\Vert {{\varvec{D}}}^{(v)}\Vert _* +\frac{1}{2} \Vert {{\varvec{D}}}^{(v)}-({{\varvec{Z}}}^{(v)}+ \frac{{{\varvec{Y}}}_5^{(v)}}{\mu })\Vert _F^2 \end{aligned}$$
(22)

Equation (22) can be solved by the singular value thresholding operator [41].
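
A hedged NumPy sketch of this singular value thresholding step (names are ours) is:

```python
import numpy as np

def update_D(Z_v, Y5_v, mu, lam2):
    """Eq. (22): shrink the singular values of (Z^(v) + Y5^(v)/mu)
    by the threshold lam2/mu."""
    T = Z_v + Y5_v / mu
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    s_shrunk = np.maximum(s - lam2 / mu, 0.0)
    return (U * s_shrunk) @ Vt
```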

Update \({{\varvec{Z}}}^{(v)}\)   Fixing all variables except \({{\varvec{Z}}}^{(v)}\), Eq. (12) is equivalent to

$$\begin{aligned}&\min _{{{\varvec{Z}}}^{(v)}}\Vert {{\varvec{X}}}^{(v)}-{\textbf {X}}^{(v)} {{\varvec{Z}}}^{(v)}- {{\varvec{E}}}^{(v)}+ \frac{{{\varvec{Y}}}_1^{(v)}}{\mu }\Vert _F^2 + \nonumber \\&\Vert {{\varvec{Z}}}^{(v)}-{{\varvec{A}}}^{(v)}+ \frac{{{\varvec{Y}}}_2^{(v)}}{\mu }\Vert _F^2 +\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{B}}}^{(v)}+ \frac{{{\varvec{Y}}}_3^{(v)}}{\mu }\Vert _F^2 + \nonumber \\&\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{C}}}^{(v)}+ \frac{{{\varvec{Y}}}_4^{(v)}}{\mu }\Vert _F^2 + \Vert {{\varvec{Z}}}^{(v)}- {{\varvec{D}}}^{(v)}+ \frac{{{\varvec{Y}}}_5^{(v)}}{\mu }\Vert _F^2 \end{aligned}$$
(23)

Taking the derivative of Eq. (23) w.r.t. \({{\varvec{Z}}}^{(v)}\) and setting it to 0, the updating rule of \({{\varvec{Z}}}^{(v)}\) is

$$\begin{aligned} {{\varvec{Z}}}^{(v)^{t+1}}=({{\varvec{X}}}^{(v)T}{{\varvec{X}}}^{(v)}+4{{\varvec{I}}})^{-1} ({{\varvec{X}}}^{(v)T}{{\varvec{V}}}_1 + {{\varvec{V}}}_2 + {{\varvec{V}}}_3) \end{aligned}$$
(24)

where \({{\varvec{V}}}_1={{\varvec{X}}}^{(v)}-{{\varvec{E}}}^{(v)}+{{\varvec{Y}}}_1^{(v)}/\mu \), \({{\varvec{V}}}_2={{\varvec{A}}}^{(v)}+{{\varvec{B}}}^{(v)}+{{\varvec{C}}}^{(v)}+{{\varvec{D}}}^{(v)}\), \({{{\varvec{V}}}}_3=({{\varvec{Y}}}_2^{(v)}+{{\varvec{Y}}}_3^{(v)}+{{\varvec{Y}}}_4^{(v)}+{{\varvec{Y}}}_5^{(v)})/{\mu }\).
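
A minimal NumPy sketch of Eq. (24) (the dictionary holding the multipliers is our own bookkeeping convention) is:

```python
import numpy as np

def update_Z(X_v, E_v, A_v, B_v, C_v, D_v, Y, mu):
    """Eq. (24): Z^(v) = (X^T X + 4I)^{-1} (X^T V1 + V2 + V3),
    where Y holds the multipliers Y1..Y5 of this view."""
    n = X_v.shape[1]
    V1 = X_v - E_v + Y['Y1'] / mu
    V2 = A_v + B_v + C_v + D_v
    V3 = (Y['Y2'] + Y['Y3'] + Y['Y4'] + Y['Y5']) / mu
    lhs = X_v.T @ X_v + 4.0 * np.eye(n)
    return np.linalg.solve(lhs, X_v.T @ V1 + V2 + V3)
```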

Update \({{\varvec{E}}}^{(v)}\)   Fixing all variables except \({{\varvec{E}}}^{(v)}\), Eq. (12) is equivalent to

$$\begin{aligned} \min _{{{\varvec{E}}}^{(v)}}\frac{\lambda _3}{\mu }\Vert {{\varvec{E}}}^{(v)}\Vert _{2,1} +\frac{1}{2} \Vert {{\varvec{E}}}^{(v)}-{{\varvec{F}}}^{(v)}\Vert _F^2 \end{aligned}$$
(25)

where \({{\varvec{F}}}^{(v)}={{\varvec{X}}}^{(v)}-{{\varvec{X}}}^{(v)}{{\varvec{Z}}}^{(v)}+{{\varvec{Y}}}_1^{(v)}/ \mu \). According to [37], if the optimal solution to Eq. (25) is \({{\varvec{E}}}^{(v)}\), then the j-th column of \({{\varvec{E}}}^{(v)}\) is

$$\begin{aligned}{}[{{\varvec{E}}}^{(v)^{t+1}}]_{:,j}={\left\{ \begin{array}{ll} \frac{\Vert [{{\varvec{F}}}^{(v)}]_{:,j}\Vert _2-\lambda _3/\mu }{\Vert [{{\varvec{F}}}^{(v)}]_{:,j}\Vert _2}[{{\varvec{F}}}^{(v)}]_{:,j},&{}if \Vert [{{\varvec{F}}}^{(v)}]_{:,j}\Vert _2>\frac{\lambda _3}{\mu }\\ 0,&{}otherwise \end{array}\right. } \end{aligned}$$
(26)
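
A hedged NumPy sketch of this column-wise \(l_{2,1}\) shrinkage (names are ours) is:

```python
import numpy as np

def update_E(X_v, Z_v, Y1_v, mu, lam3):
    """Eq. (26): column-wise shrinkage of F = X - X Z + Y1/mu."""
    F = X_v - X_v @ Z_v + Y1_v / mu
    norms = np.linalg.norm(F, axis=0)
    # columns with norm <= lam3/mu are set to zero; the others are rescaled
    scale = np.maximum(norms - lam3 / mu, 0.0) / np.maximum(norms, 1e-12)
    return F * scale
```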

Update \({{\varvec{S}}}\)   Fixing all variables except \({{\varvec{S}}}\), Eq. (12) is equivalent to

$$\begin{aligned} \min _{{{\varvec{S}}}}\sum _{v=1}^{n_v}w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{A}}}^{(v)}\Vert _F^2 +&\frac{1}{2}\sum _{v=1}^{n_v}\sum _{i,j=1}^n\Vert b_i^{(v)}- b_j^{(v)}\Vert _2^2 s_{ij}+\frac{\mu }{2}\Vert {{\varvec{S}}}-{{\varvec{M}}} + \frac{{{\varvec{Y}}}_6}{\mu }\Vert _F^2 \nonumber \\&s.t.\ {{\varvec{s}}}^{i}{} {\textbf {1}} =1, {{\varvec{S}}}\ge 0 \end{aligned}$$
(27)

Equation (27) is further written as

$$\begin{aligned} \min _{{{\varvec{s}}}^{i}{} {\textbf {1}}=1,s_{ij}\ge 0}\sum _{v=1}^{n_v}w^{(v)}\sum _{i,j=1}^n(s_{ij}-a_{ij}^{(v)})^2+\sum _{i,j=1}^n o_{ij}s_{ij}+\sum _{i,j=1}^n \frac{\mu }{2}(s_{ij}+l_{ij})^2 \end{aligned}$$
(28)

where \(o_{ij}=\frac{1}{2}\sum _{v=1}^{n_v}\Vert b_i^{(v)}- b_j^{(v)}\Vert _2^2\) is the j-th element of \({{\varvec{o}}}_{i} \in \mathcal {R}^{1 \times n}\), and \(l_{ij}\) is the (i, j)-th element of \({{\varvec{L}}}=-{{\varvec{M}}}+{{\varvec{Y}}}_6/\mu \). Equation (28) is independent across different i, and we can address the following problem separately for each i,

$$\begin{aligned} \min _{{{\varvec{s}}}^{i}{} {\textbf {1}}=1,{{\varvec{s}}}^{i}\ge 0}\sum _{j=1}^{n}\sum _{v=1}^{n_v}w^{(v)}(s_{ij}-a_{ij}^{(v)})^2+\sum _{j=1}^n o_{ij}s_{ij}+\sum _{j=1}^n(\frac{\mu }{2}s_{ij}^2+\mu l_{ij}s_{ij}) \end{aligned}$$
(29)

Then, \({{\varvec{s}}}^i\) in Eq. (29) can be updated by optimizing the following equation:

$$\begin{aligned}&{{\varvec{s}}}^{i^{t+1}}=\mathop {\arg \min }_{{{\varvec{s}}}^{i}{} {\textbf {1}}=1,{{\varvec{s}}}^{i}\ge 0}\Vert {{\varvec{s}}}^{i} + \frac{ - 2{\sum _{v=1}^{n_v}w^{(v)}}{{\varvec{a}}}_i^{(v)}+{{\varvec{o}}}_{i}+{\mu }{{\varvec{l}}}_i}{2{\sum _{v=1}^{n_v}w^{(v)}}+\mu }\Vert _2^2 \end{aligned}$$
(30)

where \({{\varvec{a}}}_i^{(v)}\) and \({{\varvec{l}}}_i\) are \(1\times n\) vectors whose j-th elements are \(a_{ij}^{(v)}\) and \(l_{ij}\), respectively.
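
Equation (30) is again a Euclidean projection onto the probability simplex, so the same helper used for the \({{\varvec{C}}}^{(v)}\) update applies. A sketch of this row-wise update (names are ours; project_simplex is the helper defined in the \({{\varvec{C}}}^{(v)}\) sketch above) is:

```python
import numpy as np

def update_S_row(a_rows, o_row, l_row, w, mu):
    """Eq. (30): project the weighted target row onto the simplex.
    a_rows is the list of i-th rows of A^(v) over all views, o_row and
    l_row are o_i and l_i, and w collects the view weights w^(v)."""
    w_sum = float(np.sum(w))
    weighted_a = sum(wv * av for wv, av in zip(w, a_rows))
    target = (2.0 * weighted_a - o_row - mu * l_row) / (2.0 * w_sum + mu)
    return project_simplex(target)  # defined in the C^(v) update sketch
```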

Update \(w^{(v)}\)   The weight of the v-th view is computed by

$$\begin{aligned} w^{(v)^{t+1}}=\frac{1}{2\Vert {{\varvec{S}}}-{{\varvec{A}}}^{(v)}\Vert _F} \end{aligned}$$
(31)

Proof: Motivated by the iteratively reweighted technique in [12, 42], we define an auxiliary problem without \(w^{(v)}\) as follows:

$$\begin{aligned} \min _{{{\varvec{A}}}^{(v)}}\sqrt{\Vert {{\varvec{S}}}-{{\varvec{A}}}^{(v)}\Vert _F^2} +\frac{\mu }{2} \Vert {{\varvec{Z}}}^{(v)}-{{\varvec{A}}}^{(v)}+\frac{{{\varvec{Y}}}_2^{(v)}}{\mu }\Vert _F^2 \end{aligned}$$
(32)

Taking the derivative of Eq. (32) with respect to \({{\varvec{A}}}^{(v)}\) and setting the derivative to 0, we have

$$\begin{aligned} {\widehat{w}}^{(v)}\frac{\partial {\Vert {{\varvec{S}}}-{{\varvec{A}}}^{(v)}\Vert _F^2}}{\partial {{{\varvec{A}}}^{(v)}}}+\frac{\partial {(\frac{\mu }{2} \Vert {{\varvec{Z}}}^{(v)}-{{\varvec{A}}}^{(v)}+\frac{{{\varvec{Y}}}_2^{(v)}}{\mu }\Vert _F^2)}}{\partial {{{\varvec{A}}}^{(v)}}}=0 \end{aligned}$$
(33)

where \({\widehat{w}}^{(v)}=1/(2\Vert {{\varvec{S}}}-{{\varvec{A}}}^{(v)}\Vert _F)\). Obviously, Eq. (33) is the same as the derivative process of Eq. (13) with respect to \({{\varvec{A}}}^{(v)}\). Thus, \({\widehat{w}}^{(v)}\) can be considered as \(w^{(v)}\) in Eq. (13). Theoretically, to avoid dividing by 0, \({\widehat{w}}^{(v)}\) can be transformed into

$$\begin{aligned} w^{(v)^{t+1}}=\frac{1}{2\Vert {{\varvec{S}}}-{{\varvec{A}}}^{(v)}\Vert _F+\delta } \end{aligned}$$
(34)

where \(\delta \) is a small positive constant close to 0. The proof is completed.
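
The resulting weight update is a one-liner; a hedged sketch (names are ours) is:

```python
import numpy as np

def update_w(S, A_v, eps=1e-8):
    """Eq. (34): a view whose structure graph stays close to the consensus
    graph receives a larger weight; eps avoids division by zero."""
    return 1.0 / (2.0 * np.linalg.norm(S - A_v, 'fro') + eps)
```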

Update \({{\varvec{M}}}\)   Fixing all variables except \({{\varvec{M}}}\), Eq. (12) is equivalent to

$$\begin{aligned}&\lambda _4 \Vert {{\varvec{M}}}\Vert _{\boxed {k}} + \frac{\mu }{2} \Vert {{\varvec{S}}} - {{\varvec{M}}} + \frac{{{\varvec{Y}}}_6}{\mu }\Vert _F^2 \nonumber \\&s.t.\ {{\varvec{M}}} \ge 0,{{\varvec{M}}}={{\varvec{M}}}^T, \end{aligned}$$
(35)

In view of the nonconvexity of \(\Vert {{\varvec{M}}}\Vert _{\boxed {k}}\) in Eq. (35), we introduce the following theorem:

Theorem 1. [30] Let \({{\varvec{M}}} \in \mathcal {R}^{n \times n}\) and \({{\varvec{M}}}\succeq 0\). Then,

$$\begin{aligned}&\sum _{i=n-k+1}^n \lambda _i({{\varvec{M}}})=\min _{{\varvec{Q}}} \langle {{\varvec{M}}},{{\varvec{Q}}}\rangle \nonumber \\&s.t.\ 0\preceq {{\varvec{Q}}} \preceq {{\varvec{I}}},Tr({{\varvec{Q}}})=k, \end{aligned}$$
(36)

where \({{\varvec{M}}}\) and \({{\varvec{Q}}}\) are symmetric matrices, \({{\varvec{M}}}\succeq 0\) represents that \({{\varvec{M}}}\) is positive semidefinite, \({{\varvec{Q}}} \preceq {{\varvec{I}}}\) represents \({{\varvec{Q}}}-{{\varvec{I}}}\preceq 0\), \(Tr({{\varvec{Q}}})\) represents the sum of the main diagonal elements of \({{\varvec{Q}}}\). We can reformulate \(\Vert {{\varvec{M}}}\Vert _{\boxed {k}}\) as a convex programming problem,

$$\begin{aligned} \Vert {{\varvec{M}}}\Vert _{\boxed {k}}=\min _{{{\varvec{Q}}}} \langle {{\varvec{L}}}_{{\varvec{M}}},{{\varvec{Q}}} \rangle \nonumber \\ s.t.\ 0\preceq {{\varvec{Q}}} \preceq {{\varvec{I}}},Tr({{\varvec{Q}}})=k \end{aligned}$$
(37)

Therefore, Eq. (35) is equivalent to

$$\begin{aligned} \lambda _4 \langle Diag({{\varvec{M}}1}) - {{\varvec{M}}},{{\varvec{Q}}} \rangle + \frac{\mu }{2}\Vert {{\varvec{S}}} - {{\varvec{M}}} + \frac{{{\varvec{Y}}}_6}{\mu }\Vert _F^2 \nonumber \\ s.t.\ {{\varvec{M}}} \ge 0,{{\varvec{M}}}={{\varvec{M}}}^T,0\preceq {{\varvec{Q}}} \preceq {{\varvec{I}}},Tr({{\varvec{Q}}})=k \end{aligned}$$
(38)

Equation (38) can be optimized by solving \({{\varvec{M}}}\) and \({{\varvec{Q}}}\) alternatively. The specific updating rules are as follows:

  1. (1)

    \({{\varvec{Q}}}\) can be optimized with fixed variable \({{\varvec{M}}}\) as

    $$\begin{aligned}&{{\varvec{Q}}}^{t+1}=\mathop {\arg \min }_{{\varvec{Q}}} \lambda _4 \langle Diag({{\varvec{M}}1})-{{\varvec{M}}},{{\varvec{Q}}} \rangle \nonumber \\&s.t.\quad 0\preceq {{\varvec{Q}}} \preceq {{\varvec{I}}},Tr({{\varvec{Q}}})=k \end{aligned}$$
    (39)

It can be solved by

$$\begin{aligned}&{{\varvec{Q}}}^{t+1} = {{\varvec{U}}}{{\varvec{U}}}^T \end{aligned}$$
(40)

where \({{\varvec{U}}} \in {{\varvec{R}}}^{n \times k}\) consists of the k eigenvectors associated with the k smallest eigenvalues of \(Diag({{\varvec{M}}1}) - {{\varvec{M}}}\) [28].
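
A minimal NumPy sketch of this eigenvector step (names are ours) is:

```python
import numpy as np

def update_Q(M, k):
    """Eq. (40): Q = U U^T, where the columns of U are the eigenvectors of
    Diag(M 1) - M associated with its k smallest eigenvalues."""
    L_M = np.diag(M.sum(axis=1)) - M
    # eigh returns eigenvalues (and matching eigenvectors) in ascending order
    _, eigvecs = np.linalg.eigh(L_M)
    U = eigvecs[:, :k]
    return U @ U.T
```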

  1. (2)

    \({{\varvec{M}}}\) can be optimized with fixed variable \({{\varvec{Q}}}\) as

    $$\begin{aligned}&{{\varvec{M}}}^{t+1} = \mathop {\arg \min }_{{\varvec{M}}} \lambda _4 \langle Diag({{\varvec{M}}1})-{{\varvec{M}}},{{\varvec{Q}}} \rangle + \frac{\mu }{2} \Vert {{\varvec{S}}} - {{\varvec{M}}} + \frac{{{\varvec{Y}}}_6}{\mu }\Vert _F^2 \nonumber \\&s.t.{{\varvec{M}}} \ge 0,{{\varvec{M}}}={{\varvec{M}}}^T \end{aligned}$$
    (41)

Equation (41) is equivalent to

$$\begin{aligned}&{{\varvec{M}}}^{t+1} = \mathop {\arg \min }_{{\varvec{M}}} \frac{1}{2}\Vert {{\varvec{M}}} - {{\varvec{S}}}- \frac{{{\varvec{Y}}}_6}{\mu } + \frac{\lambda _4}{\mu }(diag({{\varvec{Q}}}){\textbf {1}}^T-{{\varvec{Q}}})\Vert ^2 \nonumber \\&s.t.\ {{\varvec{M}}} \ge 0,{{\varvec{M}}}={{\varvec{M}}}^T \end{aligned}$$
(42)

Let \({{\varvec{D}}} = {{\varvec{S}}} + \frac{{{\varvec{Y}}}_6}{\mu } - \frac{\lambda _4}{\mu }(diag({{\varvec{Q}}}){\textbf {1}}^T-{{\varvec{Q}}})\), according to [30],

$$\begin{aligned} {{\varvec{M}}}^{t+1}=[(\widehat{{{\varvec{D}}}}+\widehat{{{\varvec{D}}}}^T) / 2]_+ \end{aligned}$$
(43)

where \(\widehat{{{\varvec{D}}}}={{\varvec{D}}}-Diag(diag({{\varvec{D}}}))\).
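
A hedged NumPy sketch of this projection step (names are ours) is:

```python
import numpy as np

def update_M(S, Q, Y6, mu, lam4):
    """Eqs. (42)-(43): build D, zero out its diagonal, symmetrize, and
    project onto the nonnegative orthant."""
    D = S + Y6 / mu - (lam4 / mu) * (np.diag(Q)[:, None] - Q)
    D_hat = D - np.diag(np.diag(D))          # D - Diag(diag(D))
    return np.maximum((D_hat + D_hat.T) / 2.0, 0.0)
```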

Update \(\mu \) and multipliers   The penalty parameter \(\mu \) and the multipliers \({{\varvec{Y}}}_1^{(v)}\), \({{\varvec{Y}}}_2^{(v)}\), \({{\varvec{Y}}}_3^{(v)}\), \({{\varvec{Y}}}_4^{(v)}\), \({{\varvec{Y}}}_5^{(v)}\) and \({{\varvec{Y}}}_6\) are updated as follows:

$$\begin{aligned}&\mu = \min (\rho \mu ,\mu _{max})\nonumber \\&{{\varvec{Y}}}_1^{(v)}= {{\varvec{Y}}}_1^{(v)}+ \mu ({{\varvec{X}}}^{(v)}- {{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}- {{\varvec{E}}}^{(v)})\nonumber \\&{{\varvec{Y}}}_2^{(v)}= {{\varvec{Y}}}_2^{(v)}+ \mu ({{\varvec{Z}}}^{(v)}- {{\varvec{A}}}^{(v)})\nonumber \\&{{\varvec{Y}}}_3^{(v)}= {{\varvec{Y}}}_3^{(v)} +\mu ({{\varvec{Z}}}^{(v)}- {{\varvec{B}}}^{(v)} )\nonumber \\&{{\varvec{Y}}}_4^{(v)}= {{\varvec{Y}}}_4^{(v)}+ \mu ({{\varvec{Z}}}^{(v)}- {{\varvec{C}}}^{(v)} ) \nonumber \\&{{\varvec{Y}}}_5^{(v)}= {{\varvec{Y}}}_5^{(v)}+ \mu ({{\varvec{Z}}}^{(v)}- {{\varvec{D}}}^{(v)} ) \nonumber \\&{{\varvec{Y}}}_6 = {{\varvec{Y}}}_6 + \mu ({{\varvec{S}}}-{{\varvec{M}}}) \end{aligned}$$
(44)

With the help of the alternate optimization scheme, the final S can be obtained and used for clustering. The specific optimization process is summarized in Algorithm 1.

Algorithm 1 Optimization algorithm for CAGL-SGBD

4.1 Complexity Analysis

In our model, there are ten unknown variables (\({{\varvec{S}}}\), \({{\varvec{Q}}}\), \({{\varvec{M}}}\), w, \({{\varvec{Z}}}\), \({{\varvec{A}}}\), \({{\varvec{B}}}\), \({{\varvec{C}}}\), \({{\varvec{D}}}\), \({{\varvec{E}}}\)), and the overall problem is a nonconvex optimization problem. We alternately update each variable. Let \(n_v\), \(t_1\) and n be the number of views, iterations, and data points, respectively; we mainly consider the computationally expensive operations. The complexity of updating \({{\varvec{A}}}\) is \(O(n_v n)\). The main complexity of updating \({{\varvec{B}}}\) and \({{\varvec{Z}}}\) is the matrix inversion, which is \(O(n_v n^3)\). For updating \({{\varvec{D}}}\) (the nuclear norm proximal operator), the main complexity is \(O(n_v n^3)\). The main complexity of updating \({{\varvec{C}}}\) and \({{\varvec{S}}}\) is calculating the Euclidean distances, which requires \(O(n_v n^2)\). The complexity of updating \({{\varvec{E}}}\) is \(O(n_v n)\). The complexities of updating \({{\varvec{Q}}}\) and \({{\varvec{M}}}\) are \(O(n^3)\) and O(n), respectively. Since \(n_v \ll n\), the total complexity is \(O(t_1 n^3 )\).

Since Eq. (10) is nonconvex, it is difficult to theoretically guarantee that the algorithm converges to a local minimum. Fortunately, most subproblems have a closed-form solution during optimization, and the empirical convergence analysis on real datasets also demonstrates that the proposed algorithm has good convergence behavior.

5 Experiments

5.1 Datasets

  1. (1)

    MSRC [43] consists of 210 images and 7 classes. For each image, five visual feature vectors are extracted, including color moment CM (24), CENT (254), LBP (256), GIST (512) and HOG (576).

  2. (2)

    ORL [44] consists of 400 images belonging to 40 distinct subjects with 10 images for each subject. For each image, four feature vectors are extracted, including LBP (59), CENT (254), GIST (512) and HOG (864).

  3. (3)

    HW [45] consists of 2000 digit images corresponding to 10 classes. For each image, six features are extracted, namely, FOR (76), FAC (216), KAR (64), MOR (6), ZER (47), and PIX (240).

  4. (4)

    100 leaves [45] consists of 1600 samples from 100 plant species. For each sample, three features are extracted, namely, 64-D texture histogram, 64-D fine-scale margin, and 64-D shape descriptor.

  5. (5)

    COIL20 [11] consists of 1440 images and 20 object categories. For each image, three different feature vectors are extracted, namely, the 1024-D intensity feature, 3304-D LBP feature, and 6750-D Gabor feature.

  6. (6)

    BBCSport [45] is a document dataset consisting of 544 documents belonging to 5 classes from the BBC Sport website. In our experiments, two views are used whose dimensions are 3183 and 3203.

5.2 Comparison Methods and Evaluation Metrics

We compare the proposed method with the following methods: Ncut [46], S-MVSC [44], MCGC [23], MVGL [11], DiMSC [24], CSMSC [47], LMSC [25], MCLES [48], GBS-KO [45], LMVSC [49], CGD [8] and GFSC [26]. For the single-view clustering method Ncut, we adapt it to MVC by concatenating the features of all views in a columnwise manner and feeding the concatenated features into Ncut.

Evaluation Metrics: To facilitate evaluation, we use six widely used metrics to evaluate the clustering performance, including accuracy (ACC), normalized mutual information (NMI), purity, precision, F-score, and adjusted Rand index (ARI). For each metric, a larger value indicates better clustering performance. The running time is also recorded to better reflect time utilization.

5.3 Experimental Settings

In the experiments, we fix the number of nearest neighbors to 15. For each comparison method, we either use the default parameter settings recommended by the original paper as much as possible in our experiments (if the parameters were provided), or we manually tune them and retain those with the best performances. For our proposed method, the default values of the four parameters on the six datasets when the optimal clustering performances are achieved are listed in Table 1, and the default values will be used in the comparison experiments, parameter analysis, ablation study, visualization, and convergence analysis. Without loss of generality, we run each method 30 times and report the average score and standard deviation.

Table 1 The default values of the four parameters

5.4 Experimental Results

The results are shown in Tables 2, 3, 4, 5, 6 and 7. For each metric, the best and the second-best values are bolded and underlined, respectively. By observing the experimental results, we can obtain the following conclusions.

Table 2 Clustering performances on the MSRC dataset
Table 3 Clustering performances on the ORL dataset
Table 4 Clustering performances on the HW dataset
Table 5 Clustering performances on the 100 leaves dataset
Table 6 Clustering performances on the COIL20 dataset
Table 7 Clustering performances on the BBCSport dataset

Our proposed method significantly outperforms all baselines on the MSRC, HW, 100 leaves and COIL20 datasets. In particular, on the COIL20 dataset, our method achieves perfect clustering results of \(100\%\) on all metrics. On the MSRC dataset, our method outperforms the single-view clustering method Ncut by \(43.1\%\) in terms of NMI and outperforms the second-best MVC method MCLES by \(4.52\%\) in terms of NMI. On the HW dataset, in terms of NMI, our method outperforms Ncut by \(24.41\%\) and outperforms the second-best method MCGC by \(2.25\%\). On the 100 leaves dataset, in terms of NMI, our method outperforms Ncut by \(8.68\%\) and outperforms the second-best method GBS-KO by \(0.47\%\). On the ORL dataset, our method outperforms all baselines except in NMI, purity and precision, which are \(0.14\%\), \(0.24\%\), and \(3.17\%\) lower than GBS-KO, MCLES and LMVSC, respectively. On the BBCSport dataset, our method achieves the best results in terms of ACC, NMI and purity. The experimental results show that the proposed method is a promising MVC method.

The performances of MVC methods are not always better than those of the single-view clustering methods. The key problem of MVC is to make full use of the consistency and complementarity information of different views, which is still a challenging problem in practice.

Among the baselines, S-MVSC, MCGC, MVGL, GBS-KO, CGD and GFSC are methods that learn a unified graph/matrix. GBS-KO shows good clustering performance on all six datasets. CGD also shows good performance on all datasets except ORL. MCGC shows the second-best performance on the HW dataset. All three baselines consider the manifold structure of the original data and exploit the complementary information of multiple graphs to generate a unified graph for clustering, but they consider neither the global structure of the data nor a block diagonal representation of the unified graph. Thus, their clustering results are inferior to those of the proposed CAGL-SGBD. GFSC employs self-weighting to automatically learn a unified graph for all views, and its clustering performance is not good on the 100 leaves, COIL20 and BBCSport datasets. S-MVSC performs well on the BBCSport dataset and shows highly efficient computation on all datasets.

Both LMSC and MCLES learn a similarity matrix or cluster indicator matrix based on a unified latent embedding representation in the embedding space. MCLES exhibits the second-best clustering performance on the MSRC dataset, and its clustering performance on the ORL and BBCSport datasets is also good, but it could not be tested on the HW, 100 leaves and COIL20 datasets because it takes more than three hours to run. In addition, both LMSC and MCLES may lose some important discriminant information when embedding the data from the original space into the embedding space, resulting in lower clustering performance than CAGL-SGBD.

The proposed CAGL-SGBD is superior to the above unified graph/matrix/representation-based methods. This is mainly because it not only integrates the intra-view local manifold structure and global structure information through structure graph fusion but also dynamically integrates the inter-view complementary information through consensus affinity graph learning, and ultimately, it enforces a structured k-block diagonal representation on the learned consensus affinity graph.

Furthermore, the running time of CAGL-SGBD is also within an acceptable range compared to all the baselines.

5.5 Parameter Analysis

Our model has four parameters \(\lambda _1\), \(\lambda _2\), \(\lambda _3\), \(\lambda _4\), and we conduct a grid search for them. \(\lambda _1\) is tuned from {0.001, 0.01, 0.1, 1, 10, 100} on the MSRC dataset, is tuned from {0.001, 0.01, 0.1} on the HW and BBCSport datasets, and is tuned from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} on the ORL, 100 leaves and COIL20 datasets. \(\lambda _2\), \(\lambda _3\), \(\lambda _4\) are tuned from {0.1, 0.5, 0.7, 1, 5, 7, 10}, {1, 5, 7, 10, 50, 70, 100} and {0.001, 0.005, 0.007, 0.01, 0.05, 0.07, 0.1}, respectively, on all datasets.

When analyzing one parameter, we keep the other three parameters at their default values. Figure 1 takes the ORL, 100 leaves, and COIL20 datasets as examples to show how the NMI values of CAGL-SGBD vary with the four parameters. As shown in Fig. 1, the proposed method is not very sensitive to changes in these four parameters on the three datasets, except that it is more sensitive to changes in \(\lambda _3\) on the 100 leaves dataset.

Fig. 1 Variations in NMI versus the four parameters

5.6 Ablation Study

In this subsection, we conduct an ablation study on the proposed model. Specifically, we learn a consensus affinity graph through the following two approaches: (1) Consider only the structure graph fusion, namely, \(\lambda _4=0\), denoted as CAGL-SG; (2) Consider only the block diagonal representation, namely, \(\lambda _1=\lambda _2=\lambda _3=0\), denoted as CAGL-BD. For the first approach, we have

$$\begin{aligned}&\min _{{{\varvec{S}}}, w^{(v)}, {{\varvec{Z}}}^{(v)}, {{\varvec{E}}}^{(v)}}\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{Z}}}^{(v)}\Vert _F^2+Tr({{\varvec{Z}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{Z}}}^{(v)T})+\nonumber \\&2\lambda _1 Tr({{\varvec{X}}}^{(v)}{{\varvec{L}}}_{{{\varvec{Z}}}}^{(v)}{{\varvec{X}}}^{(v)T})+\lambda _2 \Vert {{\varvec{Z}}}^{(v)}\Vert _* +\lambda _3\Vert {{\varvec{E}}}^{(v)} \Vert _{2,1}\} \nonumber \\&s.t.\ {{\varvec{X}}}^{(v)}={{\varvec{X}}}^{(v)}{{\varvec{Z}}}^{(v)}+{{\varvec{E}}}^{(v)}, {{\varvec{Z}}}^{(v)}{} {\textbf {1}}={\textbf {1}}, {{\varvec{Z}}}^{(v)}\ge 0, {{\varvec{s}}}^i{\textbf {1}}=1, {{\varvec{S}}}\ge 0, w^{(v)}\ge 0 \end{aligned}$$
(45)

For the second approach, we have

$$\begin{aligned}&\min _{{{\varvec{S}}}, w^{(v)}}\sum _{v=1}^{n_v}\{w^{(v)}\Vert {{\varvec{S}}} - {{\varvec{Z}}}^{(v)}\Vert _F^2 + Tr({{\varvec{Z}}}^{(v)}{{\varvec{L}}}_{{{\varvec{S}}}}{{\varvec{Z}}}^{(v)T})\}+\lambda _4\Vert {{\varvec{S}}}\Vert _{\fbox {\textit{k}}} \nonumber \\&s.t.\ {{\varvec{s}}}^i {\textbf {1}}=1,{{\varvec{S}}}\ge 0,{{\varvec{S}}}={{\varvec{S}}}^T, w^{(v)}\ge 0 \end{aligned}$$
(46)

where \({{\varvec{Z}}}^{(v)}\) is initialized by the KNN graph in both Eqs. (45) and (46). Table 8 reports the clustering results of Eqs. (45), (46) and (10) on the MSRC, ORL, HW, 100 leaves and COIL20 datasets.

Table 8 Ablation study of the proposed method

As shown in Table 8, the clustering performance of Eq. (46) is very poor on some datasets. The model in Eq. (46) directly fuses \(\{{{{\varvec{Z}}}}^{(v)}\}_{v=1}^{n_v}\), which are obtained by the KNN graph, into a consensus affinity matrix \({{\varvec{S}}}\) and enforces \({{\varvec{S}}}\) to have a block diagonal representation. However, the \({{\varvec{S}}}\) learned by Eq. (46) can neither capture the local and global structural information of the data well nor resist noise; therefore, its clustering performance is not good. Equation (45) obtains better clustering results than Eq. (46) on all datasets, which indicates that structure graph fusion may play a more important role than block diagonal representation in improving the quality of the consensus affinity matrix. The results of Eq. (10) are better than those of Eqs. (45) and (46), indicating that the joint learning of structure graph fusion and block diagonal representation is beneficial to improving the quality of the consensus affinity matrix. Thus, CAGL-SGBD is superior to CAGL-SG and CAGL-BD.

5.7 Visualization

The visualization results of the consensus affinity matrix learned by the proposed method on the six datasets are demonstrated in Fig. 2.

The consensus affinity matrices learned by the proposed method on the MSRC, ORL, HW, 100 leaves and BBCSport datasets all exhibit obvious block-diagonal structures, and the number of blocks equals the number of corresponding classes. In Fig. 2d, since there are 100 blocks on the 100 leaves dataset, each block is so small that the block structure on the diagonal appears to be a straight line. Although the consensus affinity matrix learned on the COIL20 dataset has no obvious block-diagonal structure, the data points are all concentrated on the diagonal.

Fig. 2 Visualization results of the consensus affinity matrix learned by CAGL-SGBD on all datasets

We also use t-SNE to visualize the learned consensus affinity matrix, as shown in Fig. 3. It can be seen that data points belonging to the same category are close together, while data points belonging to different categories are far apart. On the COIL20 dataset, our proposed method achieves perfect clustering results.

Fig. 3 Visualization of the consensus affinity matrix learned by CAGL-SGBD on all datasets using t-SNE

5.8 Convergence Analysis

In this subsection, we conduct a convergence analysis to verify the convergence property of the proposed method. We calculate the log objective function value and the primal residual (computed as \(\max (\Vert {{\varvec{X}}}^{(v)}- {{\varvec{X}}}^{(v)} {{\varvec{Z}}}^{(v)}- {{\varvec{E}}}^{(v)} \Vert _\infty ,\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{A}}}^{(v)}\Vert _\infty ,\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{B}}}^{(v)}\Vert _\infty ,\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{C}}}^{(v)}\Vert _\infty ,\Vert {{\varvec{Z}}}^{(v)}- {{\varvec{D}}}^{(v)}\Vert _\infty ,\Vert {{\varvec{S}}}-{{\varvec{M}}}\Vert _\infty )\)) at each iteration. Since our algorithm tends to converge at approximately 50 iterations on the benchmark datasets, we extend the number of iterations to 150 to intuitively demonstrate that the proposed method has converged by 100 iterations. Figure 4 illustrates how the log objective function value and the primal residual change with the number of iterations on the six datasets. The results empirically confirm the convergence behavior of the proposed algorithm within 100 iterations.

Fig. 4 Convergence curves of the proposed algorithm

6 Conclusion

In this paper, we propose a new MVC method, CAGL-SGBD. The proposed method can automatically learn the intra-view graph structure information, the inter-view complementary information, and the structured representation of the consensus affinity graph by structure graph fusion, consensus affinity graph learning and k-block diagonal representation, respectively. Extensive experimental results on six benchmark datasets show that CAGL-SGBD is effective, can compete with other advanced MVC methods, and obtains the best performance on the MSRC, HW, 100 leaves and COIL20 datasets. The results of the ablation study show that jointly learning the structure graph fusion and block diagonal representation can greatly improve the clustering performance. The visualization results on the MSRC, ORL, HW, 100 leaves, and BBCSport datasets show that the consensus affinity matrix learned by CAGL-SGBD displays an explicit block-diagonal structure. In the future, we will consider how to address the graph structure and structured representation problem of large-scale multiview data.