Introduction

Multiple kernel clustering (MKC) aims to optimally integrate the consensus information among multiple base kernels to generate a consensus kernel for improving clustering performance. In recent years, MKC methods have been widely applied in various applications, benefiting from a more powerful ability to handle non-linear data than traditional single kernel clustering methods [1,2,3,4,5]. In general, existing MKC methods fall into two main classes, i.e., linear kernel fusion (LKF) scheme-based methods and non-linear kernel fusion (NKF) scheme-based methods.

Usually, LKF assumes that a linear combination of multiple pre-defined base kernels can yield an optimal kernel. Motivated by this, SCMK [6] uses a linear combination of base kernels to learn an optimal kernel. Furthermore, LKGr [7] imposes an additional low-rank constraint to capture the structure information of the base kernels by considering the low-rank property of the samples. Recently, many efforts have been devoted to improving the effectiveness of LKF by leveraging other non-linear kernel fusion schemes in the kernel matrix space. For instance, the neighbor-kernel-based MKC method [2] non-linearly constructs multiple neighbor kernels, from which a consensus kernel is derived by considering the neighborhood structure among base kernels. Nevertheless, in practical applications, LKF tends to over-reduce the feasible set of the consensus kernel; that is, the representation capability of the consensus kernel fails to outperform a single kernel in some cases [8]. Therefore, the aforementioned MKC methods with the LKF scheme are not always effective in handling non-linear clustering tasks.

Different from LKF, NKF is developed to better handle non-linear clustering, and it involves two intuitive assumptions: (1) the consensus kernel lies in the neighborhood of the candidate kernels; (2) larger weights are assigned to more important/similar candidate kernels, and vice versa. Hence, the flexibility and reliability of the consensus kernel are significantly enhanced. Recently, NKF has given rise to a number of excellent MKC methods [8,9,10,11]. For instance, SMKL [8] indirectly yields an affinity graph for clustering by using NKF. JMKSC [9] integrates correntropy-based NKF and a block diagonal constraint to learn an affinity graph with an optimal block diagonal property. Considering the suboptimal low-rank constraint in LKGr [7], LLMKL [10] proposes to use a low-rank substitute of the consensus kernel to upgrade the affinity graph. SPMKC [11] captures the global and local structure beyond SMKL [8] to control the similarity between the consensus kernel and the affinity graph, and improves the clustering performance significantly. Overall, the aforementioned LKF- and NKF-based MKC methods are referred to as multiple kernel subspace clustering (MKSC) methods [2, 6,7,8,9,10,11], which typically work as follows: (1) generating multiple base kernels from a single sample dataset; (2) fusing these kernels with an LKF or NKF scheme to obtain a consensus kernel at the matrix level; (3) utilizing self-expressiveness subspace learning on the learned consensus kernel to adaptively learn an affinity graph for spectral clustering.

Although effective, these methods usually concentrate more on learning the consensus kernel instead of the affinity graph, which violates the intention of MKSC, i.e., the ultimate goal of MKSC is to learn an optimal affinity graph [3] for clustering. Therefore, CAGL [3] innovatively learns multiple candidate affinity graphs from multiple base kernels first, and then fuses these candidate graphs to directly learn a consensus graph at the matrix level. Note herein that CAGL adopts a two-step way to learn the candidate graphs and the consensus graph separately. Although CAGL has achieved great progress compared with other state-of-the-art MKSC methods, it still suffers from the following flaws: (1) it essentially ignores the high-order correlations hidden in different base kernels, such that the consistent and complementary information of the given multiple kernels may not be fully explored; (2) learning the weights of multiple candidate graphs at the matrix level rather than the tensor level lacks proper guidance and is sensitive to noise, which leads to unreliable weights; and (3) it adopts a two-step way to learn the candidate affinity graphs and the consensus affinity graph separately, such that the obtained solution is usually inferior.

In light of the aforementioned limitations, a novel MKSC method, dubbed auto-weighted multiple kernel tensor clustering (AMKTC), is proposed to improve clustering performance. Concretely, by leveraging self-expressiveness subspace learning with multiple base kernels, multiple candidate affinity graphs are first learned. Then, we stack these graphs into a graph tensor in the reproducing kernel Hilbert space, where the consistency and complementarity of the candidate graphs are fully exploited by imposing a tensor singular value decomposition (t-SVD)-based tensor nuclear norm (t-TNN). Finally, with the high-quality candidate graphs at hand, an auto-weighted graph fusion scheme is developed to obtain an optimal consensus affinity graph under the guidance of the t-TNN constraint. The main contributions of this paper are summarized as follows:

  • This paper proposes a novel MKSC method, dubbed AMKTC, to effectively handle non-linear data clustering. AMKTC integrates consensus affinity graph learning and candidate affinity graph learning into a unified framework, where they are jointly learned in a mutually reinforcing manner.

  • This paper proposes to explore the high-order correlations of base kernels by imposing the t-TNN constraint on the rotated graph tensor. Meanwhile, an auto-weighted graph fusion scheme enforces the learning of the consensus affinity graph with optimal weights.

  • Compared with other state-of-the-art MKSC methods, the superiority of AMKTC is thoroughly demonstrated through extensive experiments.

The remainder of the paper is organized as follows. The next section introduces notations and tensor preliminaries. The subsequent section briefly reviews works related to MKSC and the t-SVD-based tensor nuclear norm. Then the AMKTC method and its solver are presented, followed by the experimental results. The last section concludes the paper.

Notations and preliminaries

Notations

In this paper, matrices, vectors, and their entries are denoted by bold upper case letters, bold lower case letters, and lower case letters, e.g., \(\textbf{B}\), \(\textbf{b}\), and \(b_{ij}\), respectively. Tensors are denoted by bold calligraphic letters, e.g., \({\varvec{\mathcal {B}}}\). For a 3-order tensor \({\varvec{\mathcal {B}}}\), two main notions are involved, i.e., fibers and slices. A fiber is obtained by fixing two orders and letting the remaining order vary, i.e., \({\varvec{\mathcal {B}}}(:, j, k), {\varvec{\mathcal {B}}}(i,:, k)\) and \({\varvec{\mathcal {B}}}(i, j,:)\). A slice is obtained by fixing only one order and letting the other two vary, i.e., the horizontal slice \({\varvec{\mathcal {B}}}(i,:,:)\), lateral slice \({\varvec{\mathcal {B}}}(:, j,:)\), and frontal slice \({\varvec{\mathcal {B}}}(:,:, k)\). For convenience, \({\varvec{\mathcal {B}}}(:,:, k)\) is abbreviated as \(\textbf{B}^{(k)}\) or \({\varvec{\mathcal {B}}}^{(k)}\). \({\varvec{\mathcal {B}}}_{f}=\texttt {fft}({\varvec{\mathcal {B}}},[\ ], 3)\) is the fast Fourier transform (FFT) of tensor \({\varvec{\mathcal {B}}}\) along the third mode, and its inverse FFT is denoted as \({\varvec{\mathcal {B}}}=\texttt {ifft}({\varvec{\mathcal {B}}}_{f},[\ ], 3)\). \(\texttt {bvec}({\varvec{\mathcal {B}}})=[\textbf{B}^{(1)}; \textbf{B}^{(2)}; \cdots ; \textbf{B}^{(n_{3})}] \in \mathbb {R}^{{n_1 n_3}\times n_2}\) and \(\texttt {fold}(\texttt {bvec}({\varvec{\mathcal {B}}}))={\varvec{\mathcal {B}}}\) denote the block vectorizing operator of \({\varvec{\mathcal {B}}}\) and its inverse, respectively. \(\texttt {bcirc}({\varvec{\mathcal {B}}}) \in \mathbb {R}^{{n_1n_3}\times {n_2n_3}}\) and \(\texttt {bdiag}({\varvec{\mathcal {B}}}) \in \mathbb {R}^{{n_1n_3}\times {n_2n_3}}\) denote the corresponding block circulant matrix and block diagonal matrix, respectively, i.e.

$$\texttt {bcirc}({\varvec{\mathcal {B}}})=\left[ \begin{array}{cccc}\textbf{B}^{(1)} &{} \textbf{B}^{\left( n_{3}\right) } &{} \cdots &{} \textbf{B}^{(2)} \\ \textbf{B}^{(2)} &{} \textbf{B}^{(1)} &{} \cdots &{} \textbf{B}^{(3)} \\ \vdots &{} \ddots &{} \ddots &{} \vdots \\ \textbf{B}^{\left( n_{3}\right) } &{} \textbf{B}^{\left( n_{3}-1\right) } &{} \cdots &{} \textbf{B}^{(1)}\end{array}\right] $$
$$\texttt {bdiag}({\varvec{\mathcal {B}}})=\left[ \begin{array}{cccc}\textbf{B}^{(1)}&{}0&{}0&{}0 \\ 0&{}\textbf{B}^{(2)}&{}0&{}0 \\ 0&{}0&{}\ddots &{}0 \\ 0&{}0&{}0&{}\textbf{B}^{\left( n_{3}\right) }\end{array}\right] $$

where the inverse block diagonalizing operator is denoted as \(\texttt {fold}(\texttt {bdiag}({\varvec{\mathcal {B}}}))={\varvec{\mathcal {B}}}\).
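For readers who prefer a computational view, the following minimal NumPy sketch realizes these operators for a 3-order tensor stored as an array of shape \((n_1, n_2, n_3)\) with frontal slices `B[:, :, k]`; the function names simply mirror the notation above and are illustrative, not part of any library.

```python
import numpy as np
from scipy.linalg import block_diag

def bvec(B):
    # Stack the frontal slices vertically: (n1*n3) x n2.
    return np.concatenate([B[:, :, k] for k in range(B.shape[2])], axis=0)

def fold_bvec(M, n3):
    # Inverse of bvec: split the (n1*n3) x n2 matrix back into frontal slices.
    return np.stack(np.split(M, n3, axis=0), axis=2)

def bcirc(B):
    # Block circulant matrix: block (i, j) is the frontal slice (i - j) mod n3.
    n1, n2, n3 = B.shape
    cols = [np.concatenate([B[:, :, (i - j) % n3] for i in range(n3)], axis=0)
            for j in range(n3)]
    return np.concatenate(cols, axis=1)

def bdiag(B):
    # Block diagonal matrix with the frontal slices on the diagonal.
    return block_diag(*[B[:, :, k] for k in range(B.shape[2])])
```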

Preliminaries of tensor

In this section, to facilitate a better understanding of the tensor nuclear norm, some definitions about matrix and tensor decompositions are introduced below [12, 13].

Definition 1

(Identity Tensor) The identity tensor \({\varvec{\mathcal {I}}}\in \mathbb {R}^{n_1\times n_1 \times n_3}\) is the tensor whose first frontal slice is the \(n_{1} \times n_{1}\) identity matrix and whose other frontal slices are all zero.

Definition 2

(Orthogonal Tensor) A tensor \({\varvec{\mathcal {E}}} \in \mathbb {R}^{n_{1} \times n_{1} \times n_{3}}\) is orthogonal if

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {E}}}^{\intercal } * {\varvec{\mathcal {E}}}={\varvec{\mathcal {E}}} * {\varvec{\mathcal {E}}}^{\intercal }={\varvec{\mathcal {I}}} \end{aligned} \end{aligned}$$
(1)

Definition 3

(Tensor Transpose) The transpose \({\varvec{\mathcal {B}}}^\intercal \in \mathbb {R}^{n_2\times n_1 \times n_3}\) of a tensor \({\varvec{\mathcal {B}}}\in \mathbb {R}^{n_1\times n_2 \times n_3}\) is obtained by transposing each frontal slice of \({\varvec{\mathcal {B}}}\) and then reversing the order of the transposed frontal slices 2 through \(n_3\).

Definition 4

(t-product) Let \({\varvec{\mathcal {B}}}\in \mathbb {R}^{n_1 \times n_2 \times n_3}\) and \({\varvec{\mathcal {N}}}\in \mathbb {R}^{n_2 \times n_4 \times n_3}\); the t-product \({\varvec{\mathcal {B}}} * {\varvec{\mathcal {N}}}\) is the \(n_1 \times n_4 \times n_3\) tensor

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {B}}} * {\varvec{\mathcal {N}}}= \texttt {fold} (\texttt {bcirc}({\varvec{\mathcal {B}}}) \texttt {bvec}({\varvec{\mathcal {N}}})) \end{aligned} \end{aligned}$$
(2)

where \(\texttt {bcirc}\) and \(\texttt {bvec}\) are defined as in “Notations”.
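As a hedged illustration, the t-product can be computed without forming the large block circulant matrix by exploiting its well-known equivalence to slice-wise multiplication in the Fourier domain; the sketch below assumes real-valued input tensors.

```python
import numpy as np

def t_product(B, N):
    # B: n1 x n2 x n3, N: n2 x n4 x n3  ->  n1 x n4 x n3.
    # Equivalent to fold(bcirc(B) @ bvec(N)) in Eq. (2), computed via the FFT
    # along the third mode followed by slice-wise matrix products.
    n3 = B.shape[2]
    Bf = np.fft.fft(B, axis=2)
    Nf = np.fft.fft(N, axis=2)
    Cf = np.stack([Bf[:, :, k] @ Nf[:, :, k] for k in range(n3)], axis=2)
    return np.real(np.fft.ifft(Cf, axis=2))
```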

Definition 5

(Tensor Singular Value Decomposition (t-SVD)) The t-SVD of \({\varvec{\mathcal {B}}}\) is defined as

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {B}}}={\varvec{\mathcal {U}}} * {\varvec{\mathcal {S}}} * {\varvec{\mathcal {V}}}^{\intercal } \end{aligned} \end{aligned}$$
(3)

where \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n_1\times n_2 \times n_3}\) is f-diagonal, and \({\varvec{\mathcal {U}}}\in \mathbb {R}^{n_1\times n_1 \times n_3}\) and \({\varvec{\mathcal {V}}}\in \mathbb {R}^{n_2\times n_2 \times n_3}\) are orthogonal. The t-SVD operator is illustrated in Fig. 1 to facilitate understanding.

Fig. 1 The t-SVD operator of tensor \({\varvec{\mathcal {B}}} \in \mathbb {R}^{n_1 \times n_2 \times n_3}\)

Definition 6

(t-SVD-based Tensor Nuclear Norm (t-TNN)) The t-SVD-based tensor nuclear norm \(\Vert {\varvec{\mathcal {B}}}\Vert _{\circledast }\) of \({\varvec{\mathcal {B}}} \in \mathbb {R}^{n_1 \times n_2 \times n_3}\) is defined as the sum of the singular values of all the frontal slices of \({\varvec{\mathcal {B}}}_{f}\), i.e.

$$\begin{aligned} \begin{aligned} \Vert {\varvec{\mathcal {B}}}\Vert _{\circledast }=\sum _{k=1}^{n_{3}}\left\| {\varvec{\mathcal {B}}}_{f}^{(k)}\right\| _{*}=\sum _{i=1}^{\min \left( n_{1}, n_{2}\right) } \sum _{k=1}^{n_{3}} \left|{\varvec{\mathcal {S}}}_{f}^{(k)}(i, i)\right|\end{aligned} \end{aligned}$$
(4)

where \({\varvec{\mathcal {S}}}_{f}^{(k)}\) is obtained from the complex-valued matrix SVD as \({\varvec{\mathcal {B}}}_{f}^{(k)}={\varvec{\mathcal {U}}}_{f}^{(k)} {\varvec{\mathcal {S}}}_{f}^{(k)} {{\varvec{\mathcal {V}}}_{f}^{(k)}}^\intercal \).
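A direct NumPy sketch of Eq. (4): take the FFT along the third mode and sum the singular values of every frontal slice of \({\varvec{\mathcal {B}}}_{f}\).

```python
import numpy as np

def t_tnn(B):
    # t-SVD-based tensor nuclear norm of Eq. (4).
    Bf = np.fft.fft(B, axis=2)
    return sum(np.linalg.svd(Bf[:, :, k], compute_uv=False).sum()
               for k in range(B.shape[2]))
```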

Related work

Multiple kernel subspace clustering

Given a data matrix \(\textbf{X}=\left[ \textbf{x}_{1}, \textbf{x}_{2}, \ldots , \textbf{x}_{n}\right] \in \mathbb {R}^{d\times n}\), where d, n, and \(\textbf{x}_i\) denote the sample dimensionality, the sample size, and the i-th sample, respectively, the self-expressiveness subspace learning (SESL) model [14,15,16,17,18] is formulated as

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}} \mathbf {\Psi }(\textbf{X}, \textbf{X S}) +\alpha \mathcal {R}( \textbf{S}) \end{aligned} \end{aligned}$$
(5)

where \(\alpha >0\) is a regularization parameter, \(\mathbf {\Psi }\) stands for the loss function, and \(\textbf{S}\) is the desired coefficient matrix; the regularization term \(\mathcal {R}(\textbf{S})\) is commonly instantiated as a sparse [19] or low-rank [20] constraint. Usually, the affinity graph can be calculated as \(({\textbf{S}}^\intercal +\textbf{S})/{2}\) for performing spectral clustering [21, 22]. However, Eq. (5) cannot handle well the non-linear data that widely exists in practice. Consequently, Eq. (5) can be extended to kernel space by using a kernel mapping function \(\phi (\cdot )\), i.e., \(\textbf{x}_i \rightarrow \phi (\textbf{x}_i)\). With r mapping functions \(\{\phi ^{(k)}(\cdot )\}_{k=1}^r\), multiple kernel subspace clustering (MKSC) can be formulated as

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}}\underbrace{\mathbf {\Psi }(\phi (\textbf{X}), \phi (\textbf{X})\textbf{S}) +\alpha \Phi _1(\textbf{S})}_{\text {kernel self-expressiveness subspace learning}} +\underbrace{\beta \mathcal {F}\left( \{\textbf{H}^{(k)}\}_{k=1}^r, \textbf{H}\right) }_{\text{ kernel } \text{ fusion } \text{ scheme }} \end{aligned}\nonumber \\ \end{aligned}$$
(6)

where \(\textbf{H}^{(k)}=(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\) is the pre-defined k-th kernel matrix; \(\Phi _1\) indicates certain regularization terms on \(\textbf{S}\); \(\mathcal {F}\) expresses the linear kernel fusion [7] or non-linear kernel fusion [11] scheme; and \(\beta \) is the balance parameter. Taking full advantage of the complementary information beneath different base kernels, MKSC methods have demonstrated performance enhancements compared with their single kernel counterparts [2, 6,7,8,9,10,11, 23,24,25,26]. Note herein that Eq. (6) focuses more on consensus kernel learning to indirectly yield an affinity graph [2, 6,7,8,9,10,11]. However, graph learning is the final goal for spectral clustering methods [3, 27,28,29]. Thus, the most advanced MKSC method, i.e., CAGL, proposes a two-step pure graph learning approach that intensively learns a consensus graph rather than a consensus kernel for clustering, and it has obtained promising results [3].

t-SVD-based tensor nuclear norm

Tensor computation has been widely used in machine learning, signal processing, data mining, computer vision, remote sensing, and biomedical engineering [30]. Owing to the validity of the nuclear norm and tensor computation, [31] extends Eq. (5) to a tensor version, i.e.

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}^{(k)}} \mathbf {\Psi }(\textbf{X}^{(k)},\textbf{X}^{(k)} \textbf{S}^{(k)}) +\alpha \Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*} \end{aligned} \end{aligned}$$
(7)

where \(\Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*}=\sum _{k=1}^{r} \xi ^{(k)}\Vert \textbf{S}^{(k)}\Vert _{*}\), and the weights \(\{\xi ^{(k)}\}_{k=1}^r>0\) are set equal with \(\sum _{k=1}^{r} \xi ^{(k)}=1\) in [31]; \(\hat{{\varvec{\mathcal {S}}}}\) and \(\textbf{S}^{(k)}\) are the merged 3-order tensor and its unfolded matrices, respectively; \(\Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*}\) is the generalized tensor nuclear norm (g-TNN) of \(\hat{{\varvec{\mathcal {S}}}}\). Note that g-TNN lacks a clear physical meaning for general tensors, and it is illogical to penalize the ranks of \(\hat{{\varvec{\mathcal {S}}}}\) with the same weights [13]. Different from g-TNN, t-TNN not only possesses a clear physical meaning, but also exploits the high-order relationships among different affinity graphs [5, 32]. Besides, t-TNN captures the structural information of a tensor better than the unfolding-based TNN [33]. Therefore, t-TNN is employed in this paper to exploit the structural information of the graph tensor.

Proposed methodology

Auto-weighted multiple kernel tensor clustering

By leveraging self-expressiveness subspace learning with the r mapping functions \(\{\phi ^{(k)}(\cdot )\}_{k=1}^r\), the kernelized self-expressiveness is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{\textbf{S}^{(k)}} \Vert \phi ^{(k)}(\textbf{X})-\phi ^{(k)}(\textbf{X}) \textbf{S}^{(k)}\Vert _F^2+\alpha \mathcal {R}(\textbf{S}^{(k)})\\&\quad =\min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}[(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})-2{(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\textbf{S}^{(k)}}\\&\qquad +(\textbf{S}^{(k)})^\intercal {(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\textbf{S}^{(k)}}]+\alpha \mathcal {R}( \textbf{S}^{(k)})\\&\quad =\min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}(\textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+{\textbf{S}^{(k)}}^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)})+\alpha \mathcal {R}( \textbf{S}^{(k)})\\ \end{aligned}\nonumber \\ \end{aligned}$$
(8)

where \(\texttt {Tr}(\cdot )\) is the trace of a square matrix. Note here that the kernel trick, defined as \(\mathrm{{ker}}(\textbf{x}_i,\textbf{x}_j)=\phi (\textbf{x}_i)^\intercal \phi (\textbf{x}_j)\), yields \(\textbf{H}^{(k)}=(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\). Considering that \(\mathcal {R}(\textbf{S}^{(k)})= \Vert \textbf{S}^{(k)}\Vert _F^2\) can avoid the trivial solution (i.e., \(\textbf{S}^{(k)}=\textbf{I}\)), where \(\textbf{I}\) is the identity matrix, we then have

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}(\textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+\textbf{S}^{(k)^\intercal } \textbf{H}^{(k)}\textbf{S}^{(k)})+\alpha \Vert \textbf{S}^{(k)}\Vert _{F}^{2} \end{aligned}\nonumber \\ \end{aligned}$$
(9)

where \(\Vert \cdot \Vert _{F}\) is the Frobenius norm, and r candidate affinity graphs \(\{\textbf{S}^{(k)}\}_{k=1}^r\) can be learned. To fully capture both the consistent and complementary information among these r graphs, they are stacked into a tensor \({\varvec{\mathcal {S}}}^{*}=\texttt {bvfold}([\textbf{S}^{(1)}; \cdots ; \textbf{S}^{(r)}]) \in \mathbb {R}^{n \times n \times r}\), which is then rotated from \(\mathbb {R}^{n \times n\times r}\) to \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n\times r \times n}\), i.e., \({\varvec{\mathcal {S}}}=\texttt {rotate}({\varvec{\mathcal {S}}}^*)\), as illustrated in Fig. 2. Compared with using \({\varvec{\mathcal {S}}}^*\) directly, the computational complexity is largely reduced (as shown in “Computational complexity and convergence”).
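To make the construction concrete, the sketch below solves Eq. (9) in closed form for each base kernel (setting the gradient to zero gives \(\textbf{S}^{(k)}=(\textbf{H}^{(k)}+\alpha \textbf{I})^{-1}\textbf{H}^{(k)}\)) and then stacks and rotates the candidate graphs as in Fig. 2. The argument `kernels` and the default `alpha` are hypothetical placeholders, and the axis permutation is one plausible realization of the \(\texttt {rotate}\) operator.

```python
import numpy as np

def candidate_graph(H, alpha):
    # Closed-form minimizer of Eq. (9):
    # -2H + 2HS + 2*alpha*S = 0  =>  S = (H + alpha*I)^{-1} H.
    n = H.shape[0]
    return np.linalg.solve(H + alpha * np.eye(n), H)

def build_graph_tensor(kernels, alpha=0.1):
    # `kernels`: list of r pre-computed n x n base kernel matrices (hypothetical input).
    S_list = [candidate_graph(H, alpha) for H in kernels]
    S_star = np.stack(S_list, axis=2)          # n x n x r tensor S*
    return np.transpose(S_star, (0, 2, 1))     # rotate(S*): n x r x n
```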

Fig. 2 The rotation process of the graph tensor in AMKTC

The different \(\textbf{S}^{(k)}\) should share some consensus structure information, since the r candidate affinity graphs originate from the same data source. Additionally, the graph tensor \({\varvec{\mathcal {S}}}\) should enjoy the low-rank property, because the number of samples is usually much larger than the number of clusters. Thus, we impose the low-rank tensor constraint \(\Vert {\varvec{\mathcal {S}}}\Vert _\circledast \) on the rotated tensor. Different from the traditional g-TNN, t-TNN possesses appealing properties for exploring the complementary information of the candidate graphs [13]. By imposing the t-TNN constraint, Eq. (9) is extended to

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}\sum _{k=1}^{r} \texttt {Tr}&\left( \textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\&\quad +\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {S}}}\Vert _\circledast \\ \text {s.t.}\ {}&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(10)

where \(\alpha \) is a balance parameter, \(\texttt {bvec}({\varvec{\mathcal {S}}})=[\textbf{S}^{(1)}; \textbf{S}^{(2)}; \cdots ; \textbf{S}^{(r)}]\) is the block vectorization of \({\varvec{\mathcal {S}}}\), and \(\beta >0\) is a parameter controlling the contribution of the t-TNN term. In this way, Eq. (10) can simultaneously explore consistent and complementary information in the kernelized low-rank tensor space.

Once \({\varvec{\mathcal {S}}}\) is learned by Eq. (10), the averaged graph \(\textbf{S}=\frac{\sum _{k=1}^{r}\textbf{S}^{(k)}}{r}\) can be obtained, on which spectral clustering is then performed. Nevertheless, assigning equal weights to different graphs ignores their varying importance, so that unreliable graphs may significantly decrease the clustering performance. To solve this problem, a well-designed self-weighted strategy [34] can be adopted to assign appropriate weights to different graphs, and its model is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{\textbf{Z},{w}^{(k)}}\sum _{k=1}^{r}({w}^{(k)})^{\lambda }\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \\&\text {s.t.} \quad z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1, \sum _{k=1}^r w^{(k)}=1, {w}^{(k)}\ge {0}\\ \end{aligned} \end{aligned}$$
(11)

where \(\lambda > 0\) is a scalar controlling the weight distribution, and the constraints \(z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\) guarantee the probability property. Inspired by the similar form applied in [35], Eq. (11) can be converted to the following form:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{Z},\textbf{w}}&\sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}+\lambda \Vert \textbf{w}\Vert _2^2 \\&\text {s.t.}\quad z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1, \sum _{k=1}^r w^{(k)}=1, {w}^{(k)}\ge {0} \end{aligned} \end{aligned}$$
(12)

where \(\textbf{w}=[w^{(1)},\ldots ,w^{(r)}]\), and \(\lambda > 0\) is a regularization parameter controlling the contributions of the different \(\textbf{S}^{(k)}\) to \(\textbf{Z}\). On the one hand, if \(\lambda \rightarrow 0\), Eq. (12) yields a trivial solution, i.e., only the \(\textbf{S}^{(k)}\) with the smallest distance to \(\textbf{Z}\) receives the weight \(w^{(k)}=1\) and the other weights are zeros, while for \(\lambda \rightarrow \infty \), all the elements of \(\textbf{w}\) become equal to \(\frac{1}{r}\). Although effective, the linear combination of candidate graphs largely limits the representation ability of the consensus graph. Besides, noise and outliers may result in inappropriate weight assignments. On the other hand, \(\lambda \) in Eq. (12) also needs to be tuned over a large range, i.e., \(0 \rightarrow \infty \), and the ideal \(\lambda \) with the optimal clustering performance varies across datasets.

To address the aforementioned problems, we assume that each candidate affinity graph can be deemed a perturbation of the consensus graph. It is worth noting that candidate graphs more important to the consensus graph should receive larger weights, and suboptimal graphs smaller weights. Motivated by this assumption, we develop an auto-weighted strategy to remove \(\lambda \) without degrading clustering performance via the following optimization problem:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{Z},{w}^{(k)}}\sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \quad \text {s.t.} \ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \\ \end{aligned} \end{aligned}$$
(13)

where the constraints on \({w}^{(k)}\) are also removed so as to give the weights more freedom. The \({w}^{(k)}\) of Eq. (13) can be derived via Theorem 1.

Theorem 1

The weight corresponding to the k-th candidate affinity graph is computed by \(w^{(k)}=\frac{1}{2 \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}}\).

Proof

Motivated by the iteratively re-weighted method in [36], an auxiliary problem without \(\textbf{w}\) is introduced as

$$\begin{aligned} \begin{aligned} \min _{\textbf{Z}} \sum _{k=1}^{r} \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}\quad \text {s.t.}\ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \end{aligned} \end{aligned}$$
(14)

which leads to the following Lagrange function:

$$\begin{aligned} \begin{aligned} \sum _{k=1}^{r} \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}+\Phi _2(\mathbf {\Lambda }, \textbf{Z}) \end{aligned} \end{aligned}$$
(15)

where \(\mathbf {\Lambda }\) and \(\Phi _2(\mathbf {\Lambda }, \textbf{Z})\) denote the Lagrange multiplier and the indicator function of the constraints on \(\textbf{Z}\), respectively. Setting the derivative of the Lagrange function w.r.t. \(\textbf{Z}\) to zero, Eq. (15) becomes

$$\begin{aligned} \begin{aligned} \sum _{k=1}^{r}\widehat{w}^{(k)}\frac{\partial {\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}}{\partial \textbf{Z}} +\frac{\partial \Phi _2 (\Lambda ,\textbf{Z})}{\partial \textbf{Z}}=0 \end{aligned} \end{aligned}$$
(16)

where \(\widehat{w}^{(k)}=1/(2\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F})\). It is obvious that Eq. (16) is also the derivative of the Lagrange function of Eq. (13). Note herein that \(\widehat{w}^{(k)}\) depends on \(\textbf{Z}\) and thus cannot be solved directly. However, if \(\textbf{Z}\) is kept fixed, \(\widehat{w}^{(k)}\) can be deemed a solution of problem Eq. (16). In practice, to avoid division by zero, \(\widehat{w}^{(k)}\) is computed as

$$\begin{aligned} \begin{aligned} {w}^{(k)}=\frac{1}{2\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}+\zeta } \end{aligned} \end{aligned}$$
(17)

where \(\zeta \) is a sufficiently small positive constant. \(\square \)
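A one-line realization of Eq. (17); the default value of \(\zeta \) below is an implementation choice rather than a prescribed setting.

```python
import numpy as np

def update_weights(Z, S_list, zeta=1e-12):
    # w^(k) = 1 / (2 * ||Z - S^(k)||_F + zeta), following Eq. (17).
    return np.array([1.0 / (2.0 * np.linalg.norm(Z - S, 'fro') + zeta)
                     for S in S_list])
```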

Learning \(\textbf{S}^{(k)}\) and \(\textbf{Z}\) separately would yield a suboptimal solution. In addition, the tensor low-rank constraint can suppress noise to further improve the reliability of the weights. To this end, we seamlessly integrate Eqs. (10) and (13) into a unified framework. Overall, the ultimate objective of AMKTC is formulated as

$$\begin{aligned} \min \limits _{\textbf{S}^{(k)},\textbf{Z},\textbf{w}}{} & {} \sum _{k=1}^{r} \texttt {Tr}\left( \textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \nonumber \\ +{} & {} \alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {S}}}\Vert _\circledast + \gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}\nonumber \\ \text {s.t.}\ {}{} & {} z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\nonumber \\ {}{} & {} {\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}$$
(18)

The learned consensus graph of Eq. (18) has the following advantages: (1) \(\textbf{Z}\) can capture the non-linear relationships well; (2) by utilizing the t-TNN constraint on \({\varvec{\mathcal {S}}}\), the high-order information underlying the multiple base kernels can be captured, so that more accurate candidate affinity graphs \(\{{\varvec{\textbf{S}}}^{(k)}\}_{k=1}^{r}\) lead to more effective and reasonable weights. Benefiting from these merits, a high-quality \(\textbf{Z}\) can be obtained to improve the clustering performance.

Optimization

Based on the alternating direction method of multipliers (ADMM) [37], an auxiliary variable \({\varvec{\mathcal {A}}}\) is first introduced to make Eq. (18) separable as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}, {\varvec{\mathcal {A}}},\textbf{Z},\textbf{w}}&\sum _{k=1}^{r} \texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\ +&\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \\ \text {s.t.}\quad&{\varvec{\mathcal {A}}}={\varvec{\mathcal {S}}},z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\\&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(19)

whose augmented Lagrangian function is formed as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)},{\varvec{\mathcal {A}}},\textbf{Z},\textbf{w}}&\sum _{k=1}^{r} \texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\ +&\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\frac{\mu }{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}}+\frac{{\varvec{\mathcal {Y}}}}{\mu } \Vert _F^2\\ +&\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \quad \text {s.t.}\ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \\&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(20)

where \(\mu \) and \({\varvec{\mathcal {Y}}}\) are the corresponding penalty parameter and Lagrangian multiplier, respectively. Equation (20) can be solved by alternately updating each variable with the remaining variables fixed.

Step 1. \(\textbf{S}\)-subproblem: By fixing \({\varvec{\mathcal {A}}}\), \(\textbf{w}\), and \(\textbf{Z}\), we update \(\{\textbf{S}^{(k)}\}_{k=1}^r\) via

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}&\texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) +\alpha \Vert \textbf{S}^{(k)}\Vert _F^2\\ {}&+\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} +\frac{\mu }{2}\Vert \textbf{A}^{(k)}-\textbf{S}^{(k)}+\frac{\textbf{Y}^{(k)}}{\mu } \Vert _F^2 \\ \end{aligned}\nonumber \\ \end{aligned}$$
(21)

whose closed-form solution can be obtained by taking the derivative of the above objective w.r.t. \(\textbf{S}^{(k)}\) and setting it to zero. We then have

$$\begin{aligned} \begin{aligned} \left( \textbf{S}^{(k)}\right) ^*=(2\textbf{H}^{(k)}+\textbf{J}_1^{(k)})^{-1}(2\textbf{H}^{(k)}+\textbf{J}_2^{(k)}) \end{aligned} \end{aligned}$$
(22)

where \(\textbf{J}_1^{(k)}=(2\alpha +2\gamma w^{(k)}+\mu )\textbf{I}\) and \(\textbf{J}_2^{(k)}=2\gamma w^{(k)}\textbf{Z}+\mu \textbf{A}^{(k)}+\textbf{Y}^{(k)}\).
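The update of Eq. (22) amounts to a single linear solve per kernel; a minimal sketch with illustrative argument names:

```python
import numpy as np

def update_S_k(H, Z, A, Y, w_k, alpha, gamma, mu):
    # Closed-form solution of Eq. (22): (2H + J1) S = 2H + J2.
    n = H.shape[0]
    J1 = (2 * alpha + 2 * gamma * w_k + mu) * np.eye(n)
    J2 = 2 * gamma * w_k * Z + mu * A + Y
    return np.linalg.solve(2 * H + J1, 2 * H + J2)
```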

Step 2. \({\varvec{\mathcal {A}}}\)-subproblem: By fixing \(\{\textbf{S}^{(k)}\}_{k=1}^r\) and removing the irrelevant terms, the optimization problem for \({\varvec{\mathcal {A}}}\) is

$$\begin{aligned} \begin{aligned} \min \limits _{{\varvec{\mathcal {A}}}}\ \beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\frac{\mu }{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}}+\frac{{\varvec{\mathcal {Y}}}}{\mu } \Vert _F^2 \\ \end{aligned} \end{aligned}$$
(23)

Let \({\varvec{{\varvec{\mathcal {M}}}}}={\varvec{\mathcal {S}}}-\frac{{\varvec{\mathcal {Y}}}}{\mu }\); this t-TNN minimization problem can be solved by applying the tensor tubal-shrinkage operator of Theorem 2 [13].

Theorem 2

For two 3-order tensors \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\) and \({\varvec{{\varvec{\mathcal {M}}}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\), and a given scalar \(\rho >0\), consider the low-rank tensor optimization problem

$$\begin{aligned} \begin{aligned} \min _{{\varvec{\mathcal {A}}}} \rho \Vert {\varvec{\mathcal {A}}}\Vert _{\circledast }+\frac{1}{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{{\varvec{\mathcal {M}}}}}\Vert _{F}^{2} \end{aligned} \end{aligned}$$
(24)

whose global optimal solution is given by the following tensor tubal-shrinkage operator:

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {A}}}={\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{{\varvec{\mathcal {M}}}}})={\varvec{\mathcal {U}}} * {\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{\mathcal {S}}}) * {\varvec{\mathcal {V}}}^{\intercal } \end{aligned} \end{aligned}$$
(25)

where \({\varvec{{\varvec{\mathcal {M}}}}}={\varvec{\mathcal {U}}} * {\varvec{\mathcal {S}}} * {\varvec{\mathcal {V}}}^{\intercal }\), \({\varvec{\mathcal {U}}}\) and \({\varvec{\mathcal {V}}}\) are orthogonal tensors of size \(n_1 \times n_1 \times n_3\) and \(n_2 \times n_2 \times n_3\), respectively, and \({\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{\mathcal {S}}})={\varvec{\mathcal {S}}}*{\varvec{\mathcal {Q}}}\), where \({\varvec{\mathcal {Q}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\) is an f-diagonal tensor whose diagonal elements in the Fourier domain are \({\varvec{\mathcal {Q}}}_f(i,i,j)=(1-\frac{n_3 \rho }{{\varvec{\mathcal {S}}}_f^{(j)}(i,i)})_+\).
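Under the t-TNN definition of Eq. (4), Theorem 2 amounts to soft-thresholding the singular values of each frontal slice of \({\varvec{\mathcal {M}}}_{f}\) by \(n_3\rho \); a minimal sketch is given below. For the \({\varvec{\mathcal {A}}}\)-subproblem it would be called with \({\varvec{\mathcal {M}}}={\varvec{\mathcal {S}}}-{\varvec{\mathcal {Y}}}/\mu \) and \(\rho =\beta /\mu \).

```python
import numpy as np

def tensor_tubal_shrinkage(M, rho):
    # Solves min_A rho*||A||_(t-TNN) + 0.5*||A - M||_F^2 (Theorem 2):
    # FFT along mode 3, shrink the singular values of each frontal slice by
    # n3*rho, then transform back and keep the real part.
    n1, n2, n3 = M.shape
    Mf = np.fft.fft(M, axis=2)
    Af = np.empty_like(Mf)
    for k in range(n3):
        U, s, Vh = np.linalg.svd(Mf[:, :, k], full_matrices=False)
        Af[:, :, k] = (U * np.maximum(s - n3 * rho, 0.0)) @ Vh
    return np.real(np.fft.ifft(Af, axis=2))
```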

Step 3. \(\textbf{Z}\)-subproblem: Eq. (18) reduces to the following subproblem:

$$\begin{aligned} \begin{aligned} \min \limits _{z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1}&\sum _{k=1}^r w^{(k)} \Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _F^2 \end{aligned} \end{aligned}$$
(26)

which can be rewritten in vector form as

$$\begin{aligned} \begin{aligned} \min _{\textbf{z}_i} \sum _{k=1}^{r}w^{(k)}\Vert \textbf{z}_i-\textbf{s}_i^{(k)}\Vert _{2}^{2}\quad \text {s.t.} \ 0 \le \textbf{z}_i, \textbf{1}^\intercal \textbf{z}_i=1 \end{aligned} \end{aligned}$$
(27)

which can be solved efficiently as in [3].
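Since the minimizer of Eq. (27) for each \(\textbf{z}_i\) is the Euclidean projection of the weighted average of the \(\textbf{s}_i^{(k)}\) onto the probability simplex, one concrete way to implement this step is the standard sort-based projection, costing about \(\mathcal {O}(n\log n)\) per vector. The sketch below treats \(\textbf{z}_i\) as the i-th row of \(\textbf{Z}\) (adapt if the \(\textbf{z}_i\) convention is column-wise); it is an assumption-laden illustration, not necessarily the solver used in [3].

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {z : z >= 0, sum(z) = 1} (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[idx]) / (idx + 1.0)
    return np.maximum(v + theta, 0.0)

def update_Z(S_list, w):
    # Solution of Eq. (27): project the weighted average of the candidate
    # graphs onto the probability simplex, one row at a time.
    avg = sum(wk * Sk for wk, Sk in zip(w, S_list)) / np.sum(w)
    return np.vstack([project_simplex(avg[i]) for i in range(avg.shape[0])])
```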

Step 4. \(\textbf{w}\)-subproblem: Ignoring the irrelevant items and fixing the other variables, we update \(\textbf{w}\) via Theorem 1.

Step 5. Multipliers-subproblem: The multiplier and penalty updates involved in ADMM are given by

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {Y}}}={\varvec{\mathcal {Y}}}+\mu ({\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}})\\ \mu =\min \left( \tau _1 \mu , \mu _{\max }\right) \end{aligned} \end{aligned}$$
(28)

where \(\tau _1\) and \(\mu _{\max }\) are scalars of ADMM. The stopping criterion is that the residual condition \(\max \{ |\textrm{obj}^{t+1}-\textrm{obj}^{t}|,\Vert {\varvec{\mathcal {S}}}^{t+1}-{\varvec{\mathcal {S}}}^{t}\Vert _{F}^{2}\} \le \epsilon \) is satisfied, where obj, t, and \(\epsilon \) are the objective value of Eq. (18), the iteration index, and the pre-defined threshold tolerance, respectively. Finally, the optimization procedure of AMKTC is summarized in Algorithm 1.

Algorithm 1 The optimization procedure of AMKTC
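Putting the steps together, the following is a minimal sketch of Algorithm 1 that wires up the helper functions sketched above (`candidate_graph`, `update_S_k`, `tensor_tubal_shrinkage`, `update_Z`, `update_weights`); the initialization, default parameter values, simplified stopping test, and final symmetrization are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def amktc(kernels, alpha, beta, gamma, mu=1e-3, tau1=2.0, mu_max=1e10,
          eps=1e-6, max_iter=100):
    # `kernels`: list of r pre-computed n x n base kernel matrices.
    r, n = len(kernels), kernels[0].shape[0]
    S = [candidate_graph(H, alpha) for H in kernels]        # warm start via Eq. (9)
    Z, w = sum(S) / r, np.ones(r) / r
    A = [np.zeros((n, n)) for _ in range(r)]
    Y = [np.zeros((n, n)) for _ in range(r)]
    for _ in range(max_iter):
        S_old = [Sk.copy() for Sk in S]
        # Step 1: candidate graphs, Eq. (22).
        S = [update_S_k(kernels[k], Z, A[k], Y[k], w[k], alpha, gamma, mu)
             for k in range(r)]
        # Step 2: low-rank tensor A via Theorem 2 (rotate, shrink, rotate back).
        S_rot = np.transpose(np.stack(S, axis=2), (0, 2, 1))
        Y_rot = np.transpose(np.stack(Y, axis=2), (0, 2, 1))
        A_back = np.transpose(
            tensor_tubal_shrinkage(S_rot - Y_rot / mu, beta / mu), (0, 2, 1))
        A = [A_back[:, :, k] for k in range(r)]
        # Steps 3 and 4: consensus graph and auto-weights.
        Z = update_Z(S, w)
        w = update_weights(Z, S)
        # Step 5: multipliers and penalty, Eq. (28).
        Y = [Y[k] + mu * (A[k] - S[k]) for k in range(r)]
        mu = min(tau1 * mu, mu_max)
        # Simplified residual check (the paper also monitors the objective value).
        if max(np.linalg.norm(S[k] - S_old[k], 'fro') for k in range(r)) < eps:
            break
    return (Z + Z.T) / 2   # symmetrized affinity graph for spectral clustering
```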

Computational complexity and convergence

Algorithm 1 involves five main subproblems, i.e., problems (22), (23), (27), (17), and (28), whose computational complexities are as follows. Problem (22) requires \(\mathcal {O}(rn^{3})\) because of its matrix inversion. Problem (23) first requires \(\mathcal {O}(rn^2\texttt {log}(n))\) for the FFT and inverse FFT of the tensor \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n\times r \times n}\), and then \(\mathcal {O}(r^2n^2)\) for the SVDs of its frontal slices; thus, its total complexity is \(\mathcal {O}(rn^2\texttt {log}(n)+r^2n^2)\) under the \(\texttt {rotate}\) operator. Note that if the \(\texttt {rotate}\) operator in Fig. 2 is not performed, the complexity increases to \(\mathcal {O}(rn^2\texttt {log}(r)+rn^3)\), where \(r\ll n\). Updating the weights via problem (17) costs \(\mathcal {O}(rn^{2})\), and problems (27) and (28) involve only element-wise operations costing at most \(\mathcal {O}(rn^2)\). Theoretically, the computational cost of Algorithm 1 is approximately \(\mathcal {O}(tn^3)\), where t denotes the total number of iterations. Since \(t\ll n\) and r is small in practice, the overall cost of our AMKTC can be deemed \(\mathcal {O}(n^3)\). Although the computational complexity of the competitors and AMKTC is similar, the clustering performance of the competitors is largely inferior to that of AMKTC.

Table 1 Summaries of the seven used datasets
Table 2 The choice of kernel function

Although it is difficult to prove the convergence of Algorithm 1 in general, it converges to a local optimum with high probability since all of its subproblems have closed-form solutions. In addition, the empirical evidence in “Convergence” demonstrates that Algorithm 1 has good convergence behavior.

Experiments

In this section, the effectiveness of our AMKTC is verified by conducting experiments on seven widely used datasets.

Benchmark datasets and kernel setting

In the experiments, seven widely used datasets, including Yale, Jaffe, AR, ORL, binaryalphadigs (BA), COIL-20, and Deep CIFAR-10 (DC), are employed to evaluate the clustering performance of AMKTC, where DC is a dataset of deep learning features constructed as in [3]. The details of these datasets are summarized in Table 1. Following the settings in [7, 8, 10, 23], we construct 12 base kernels (i.e., \(r=12\) is the number of kernel functions) and form the kernel pool listed in Table 2, including seven radial basis function (RBF) kernels \({ker}(x_{i}, x_{j})=\exp (-\Vert x_{i}-x_{j}\Vert _{2}^{2} /(2 \tau _2 \sigma ^{2}))\), where \(\tau _2\) varies in the range \(\{0.01,0.05,0.1,1,10,50,100\}\) and \(\sigma \) is the maximum distance between samples; four polynomial kernels \({ker}\left( x_{i}, x_{j}\right) =\left( a+x_{i}^{T} x_{j}\right) ^{b}\) with \(a\in \{0,1\}\) and \(b\in \{2,4\}\); and a cosine kernel \({\text {ker}}\left( x_{i}, x_{j}\right) =\left( x_{i}^{T} x_{j}\right) /\left( \left\| x_{i}\right\| \cdot \left\| x_{j}\right\| \right) \). Finally, all the kernel matrices \(\left\{ \textbf{H}^{(k)}\right\} _{k=1}^{r}\) are normalized to the [0, 1] range.
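For reproducibility, the sketch below builds the 12-kernel pool of Table 2 from a \(d\times n\) data matrix; the min-max normalization to [0, 1] is one common choice and is an assumption, since the exact normalization is not specified here.

```python
import numpy as np

def build_kernel_pool(X):
    # X: d x n data matrix; returns the 12 base kernels of Table 2
    # (seven RBF, four polynomial, one cosine), each scaled to [0, 1].
    G = X.T @ X                                              # inner products
    sq = np.diag(G)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)  # squared distances
    sigma = np.sqrt(D2.max())                                # max pairwise distance
    kernels = [np.exp(-D2 / (2 * t2 * sigma ** 2))
               for t2 in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    kernels += [(a + G) ** b for a in (0, 1) for b in (2, 4)]
    norms = np.sqrt(sq)
    kernels.append(G / np.outer(norms, norms))               # cosine kernel
    # Min-max scaling of each kernel matrix to the [0, 1] range (assumed).
    return [(K - K.min()) / (K.max() - K.min()) for K in kernels]
```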

Evaluation metrics

Three widely used evaluation metrics, namely clustering accuracy (ACC), normalized mutual information (NMI), and purity, are used to evaluate the clustering performance of the proposed AMKTC. A higher value indicates better clustering performance.
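ACC and purity can be computed as sketched below; ACC uses the standard Hungarian matching between predicted clusters and ground-truth classes, and NMI is available in scikit-learn as `normalized_mutual_info_score`. The implementation assumes integer class labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Best one-to-one cluster-to-class matching via the Hungarian algorithm.
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == t))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(y_true)

def purity(y_true, y_pred):
    # Each cluster is credited with its majority ground-truth class
    # (assumes non-negative integer labels).
    hits = sum(np.bincount(y_true[y_pred == c]).max() for c in np.unique(y_pred))
    return hits / len(y_true)
```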

Table 3 The ACC of some compared MKC methods on seven datasets
Table 4 The NMI of some compared MKC methods on seven datasets
Table 5 The purity of some compared MKC methods on seven datasets

Our AMKTC is evaluated by comparing it with eleven state-of-the-art methods, including MKKM [38], RMKKM [24], MVCLFA [25], LKGr [7], SCMK [6], SMKL [8], JMKSC [9], LLMKL [10], SPMKC-E [11], SPMKC [11], and CAGL [3]. The clustering results are reported in Tables 3, 4 and 5.

Fig. 3 Clustering ACC w.r.t. the parameters \(\alpha \) and \(\beta \) on the AR, BA, Deep CIFAR-10, COIL-20 and ORL datasets

Fig. 4 Clustering NMI w.r.t. the parameters \(\alpha \) and \(\beta \) on the AR, BA, Deep CIFAR-10, COIL-20 and ORL datasets

For fairness, we tune the parameters of the competitors by following the values recommended by their respective authors. Further, we run each competitor 20 times and report the average results. The concrete details of the k-means based methods are described as follows:

  • MKKM [38] is a multiple kernel extension of fuzzy k-means.

  • RMKKM [24] is the robust extension of the MKKM.

  • MVCLFA [25] divides the pre-defined kernel matrices into multiple low-dimensional partitions, which are then used to learn a consensus partition for MKC.

Please refer to “Introduction” for the details of the remaining MKSC comparison methods.

Experimental results and analysis

Tables 3, 4 and 5 report the average clustering results of the competitors, where the standard deviations are not shown since they are less than \(1\%\).

  • As shown in Tables 3, 4 and 5, the proposed AMKTC consistently achieves significant improvements under all evaluation metrics compared with the three k-means based MKC methods and eight MKSC methods. Noticeably, averaged over six datasets, AMKTC outperforms the second best method by approximately \(19.8\%\), \(16.5\%\), and \(18.4\%\) in terms of ACC, NMI, and purity, respectively. Especially on the Yale, BA, and Deep CIFAR-10 datasets, AMKTC outperforms the second best method by approximately \(20.0\%\), \(44.0\%\), and \(18.6\%\) in ACC, \(23.8\%\), \(32.4\%\), and \(24.3\%\) in NMI, and \(19.4\%\), \(34.3\%\), and \(17.8\%\) in purity, respectively.

  • Theoretically, the k-means based MKC methods [24, 25] mainly focus on learning a consensus kernel, and then employ kernel k-means to obtain the final clustering results. They face the problem of a narrow solution set and cannot well preserve the complex structure hidden in the data [3].

  • The MKSC methods adopt the widely used self-expressiveness subspace learning (SESL) to learn an affinity graph, and then employ spectral clustering to obtain the final clustering results, so as to effectively capture the subspace structure. Accordingly, MKSC methods perform better than k-means based MKC ones in most cases. Among these MKSC methods, only CAGL and the proposed AMKTC are pure graph learning methods, while the others persist in LKF or NKF schemes, which violates the intention that a high-quality affinity graph is the real expectation of MKSC methods [3]. From the results, it is obvious that CAGL and AMKTC are superior to the other competitors.

  • Compared with AMKTC, CAGL mainly suffers from the following flaws (see “Introduction”): (1) it neglects the high-order correlations among base kernels; and (2) it adopts a two-step strategy, resulting in a suboptimal solution. Conversely, by introducing t-TNN, AMKTC can directly learn a consensus graph with optimal weights, and meanwhile capture the high-order correlations in the kernelized tensor space. The results thoroughly demonstrate the superiority of our method.

Parameter sensitivity

Fig. 5 Original data \(\textbf{X}\), learned \(\textbf{Z}\) of AMKTC, and learned \(\textbf{Z}\) of CAGL visualized on the ORL and DC datasets

Fig. 6 Convergence curves of the proposed AMKTC method on the AR, BA, Deep CIFAR-10, COIL-20, ORL and Yale datasets

Three parameters \(\alpha \), \(\beta \), and \(\gamma \) of AMKTC need to be set properly. For simplicity, \(\gamma \) is fixed as \(\gamma =0.0001\), which experimentally yields the best clustering performance on all evaluation datasets. Specifically, the parameters \(\alpha \) and \(\beta \) balance the effects of \(\Vert \textbf{S}^{(k)}\Vert _F^2\) and \(\Vert {\varvec{\mathcal {S}}}\Vert _\circledast \), respectively. Using a grid search with a step factor of 10, we tune both \(\alpha \) and \(\beta \) from \(10^{-5}\) to \(10^2\). Taking the AR, ORL, COIL-20, BA, and Deep CIFAR-10 datasets as examples, as shown in Figs. 3 and 4, the best performance of AMKTC on the COIL-20 dataset is achieved with \(\alpha \in [10^{-2},10^{-1}]\) and \(\beta \in [10^{-5},10^{-4}]\), while on the remaining evaluation datasets, the best performance is obtained with \(\alpha \in [10^{-2},1]\) and \(\beta \in [10^{-5},10^{-3}]\). These analyses demonstrate that AMKTC is not sensitive to the parameter settings within these ranges.

Visualization of clustering result

The original data and the clustering results of two methods (i.e., our AMKTC and the second best competitor CAGL) are visualized on the ORL and DC datasets in Fig. 5. First, the clustering results of AMKTC and CAGL clearly exhibit a better structure distribution than the original data on both datasets. In addition, although the differently colored points of CAGL in Fig. 5b and e show a scattered structure distribution, CAGL still yields some incorrect cluster partitions compared with our method. In contrast, Fig. 5c and f of AMKTC show fewer wrong partitions than the second best competitor CAGL; especially in Fig. 5f, only a few partitions are wrong. This phenomenon demonstrates that our method generalizes well across different datasets.

Convergence

In this section, we evaluate the convergence of the proposed AMKTC method on six datasets (i.e., AR, BA, Deep CIFAR-10, COIL-20, ORL, and Yale), and the results are shown in Fig. 6. Obviously, the iterative algorithm converges rapidly to a stable point, and on some datasets it meets the stopping criterion within just eight iterations.

Conclusion

In this paper, a novel multiple kernel subspace clustering method, namely AMKTC, is proposed to address the challenging problem of capturing the high-order correlations hidden in different base kernels. Meanwhile, a non-linear auto-weighted graph fusion scheme is used to learn the consensus affinity graph with optimal weights. In AMKTC, we integrate candidate affinity graph learning, graph tensor learning, and auto-weighted consensus graph learning into a unified objective function, such that a consensus affinity graph with optimal weights is learned for clustering. Compared to the state-of-the-art MKC methods, the superior performance of our method is demonstrated via extensive experimental results. In the future, we will study tensor train and tensor ring decompositions to further enhance the representation capability of the graph tensor.