Introduction

Multiple kernel clustering (MKC) aims to optimally integrate the consensus information among multiple base kernels to generate a consensus kernel for improving clustering performance. In recent years, MKC methods have been widely applied in various applications, benefiting from a more powerful ability to handle non-linear data than traditional single kernel clustering methods [1,2,3,4,5]. In general, existing MKC methods fall into two main classes, i.e., linear kernel fusion (LKF) scheme-based methods and non-linear kernel fusion (NKF) scheme-based methods.

Usually, LKF assumes that a linear combination of multiple pre-defined base kernels can yield an optimal kernel. Motivated by this, SCMK [6] uses a linear combination of base kernels to learn an optimal kernel. Furthermore, LKGr [7] imposes an additional low-rank constraint to capture the structure information of the base kernels by considering the low-rank property of the samples. Recently, many efforts have been devoted to improving the effectiveness of LKF by leveraging other non-linear kernel fusion schemes in the kernel matrix space. For instance, the neighbor-kernel-based MKC method [2] non-linearly constructs multiple neighbor kernels, from which a consensus kernel is derived by considering the neighborhood structure among base kernels. Nevertheless, in practical applications, LKF tends to over-reduce the feasible set of the consensus kernel; that is, the representation capability of the consensus kernel fails to outperform a single kernel in some cases [8]. Therefore, the aforementioned MKC methods with the LKF scheme are not always effective in handling non-linear clustering tasks.

Different from LKF, NKF is developed to better handle non-linear clustering, and it involves two intuitive assumptions: (1) the consensus kernel lies in the neighborhood of the candidate kernels; (2) larger weights are assigned to more important/similar candidate kernels, and vice versa. Hence, the flexibility and reliability of the consensus kernel are significantly enhanced. Recently, NKF has given rise to a number of excellent MKC methods [8,9,10,11]. For instance, SMKL [8] indirectly yields an affinity graph for clustering by using NKF. JMKSC [9] integrates correntropy-based NKF and a block diagonal constraint to learn an affinity graph with an optimal block diagonal property. Considering the suboptimal low-rank constraint in LKGr [7], LLMKL [10] proposes to use a low-rank substitute of the consensus kernel to upgrade the affinity graph. SPMKC [11] captures the global and local structure beyond SMKL [8] to control the similarity between the consensus kernel and the affinity graph, and improves the clustering performance significantly. Overall, the aforementioned LKF- and NKF-based MKC methods are referred to as multiple kernel subspace clustering (MKSC) methods [2, 6,7,8,9,10,11], which typically work as follows: (1) generating multiple base kernels from a single sample dataset; (2) fusing these kernels with an LKF or NKF scheme to obtain a consensus kernel at the matrix level; (3) utilizing self-expressiveness subspace learning on the learned consensus kernel to adaptively learn an affinity graph for spectral clustering.

Although effective, these methods usually concentrate more on learning the consensus kernel instead of the affinity graph, which violates the intention of MKSC, i.e., the ultimate goal of MKSC is to learn an optimal affinity graph [3] for clustering. Therefore, CAGL [3] innovatively learns multiple candidate affinity graphs from multiple base kernels first, and then fuses these candidate graphs to directly learn a consensus graph at the matrix level. Note herein that CAGL adopts a two-step way to learn the candidate graphs and the consensus graph separately. Although CAGL has achieved great progress compared with other state-of-the-art MKSC methods, it still suffers from the following flaws: (1) it essentially ignores the high-order correlations hidden in different base kernels, such that the consistent and complementary information of the given multiple kernels may not be fully explored; (2) learning the weights of multiple candidate graphs at the matrix level rather than the tensor level lacks proper guidance and is sensitive to noise, which leads to unreliable weights; and (3) it adopts a two-step way to learn the candidate affinity graphs and the consensus affinity graph separately, such that the obtained solution is usually inferior.

In light of the aforementioned limitations, a novel MKSC method, dubbed auto-weighted multiple kernel tensor clustering (AMKTC), is proposed to improve clustering performance. Concretely, by leveraging self-expressiveness subspace learning with multiple base kernels, multiple candidate affinity graphs are first learned. Then, we stack these graphs into a graph tensor in the reproducing kernel Hilbert space, where the consistency and complementarity of the candidate graphs are fully exploited by imposing a tensor singular value decomposition (t-SVD)-based tensor nuclear norm (t-TNN). Finally, with the high-quality candidate graphs at hand, an auto-weighted graph fusion scheme is developed to obtain an optimal consensus affinity graph under the guidance of the t-TNN constraint. The main contributions of this paper are summarized as follows:

  • This paper proposes a novel MKSC method, dubbed AMKTC, to effectively handle non-linear data clustering. AMKTC integrates consensus affinity graph learning and candidate affinity graph learning into a unified framework, where they are jointly learned in a mutually reinforcing manner.

  • This paper proposes to explore the high-order correlations of base kernels by imposing the t-TNN constraint on the rotated graph tensor. Meanwhile, an auto-weighted graph fusion scheme enforces the learning of the consensus affinity graph with optimal weights.

  • Compared with other state-of-the-art MKSC methods, the superiority of AMKTC is thoroughly demonstrated through extensive experiments.

The remainder of the paper is organized as follows. The next section introduces notations and tensor preliminaries. The subsequent section briefly reviews works related to MKSC and the t-SVD-based tensor nuclear norm. Then the AMKTC method and its solver are presented, followed by the experimental results. The last section concludes the paper.

Notations and preliminaries

Notations

In this paper, matrices, vectors, and their entries are denoted by bold upper case letters, bold lower case letters, and lower case letters, e.g., \(\textbf{B}\), \(\textbf{b}\), and \(b_{ij}\), respectively. Tensors are denoted by bold calligraphic letters, e.g., \({\varvec{\mathcal {B}}}\). For a 3-order tensor \({\varvec{\mathcal {B}}}\), two main notions are involved, i.e., fibers and slices. A fiber is obtained by fixing two orders and letting the remaining order vary, i.e., \({\varvec{\mathcal {B}}}(:, j, k), {\varvec{\mathcal {B}}}(i,:, k)\) and \({\varvec{\mathcal {B}}}(i, j,:)\). A slice is obtained by fixing only one order and letting the other two vary, i.e., the horizontal slice \({\varvec{\mathcal {B}}}(i,:,:)\), lateral slice \({\varvec{\mathcal {B}}}(:, j,:)\), and frontal slice \({\varvec{\mathcal {B}}}(:,:, k)\). For convenience, \({\varvec{\mathcal {B}}}(:,:, k)\) is abbreviated as \(\textbf{B}^{(k)}\) or \({\varvec{\mathcal {B}}}^{(k)}\). \({\varvec{\mathcal {B}}}_{f}=\texttt {fft}({\varvec{\mathcal {B}}},[\ ], 3)\) is the fast Fourier transform (FFT) of tensor \({\varvec{\mathcal {B}}}\) along the third mode, and its inverse FFT is denoted as \({\varvec{\mathcal {B}}}=\texttt {ifft}({\varvec{\mathcal {B}}}_{f},[\ ], 3)\). \(\texttt {bvec}({\varvec{\mathcal {B}}})=[\textbf{B}^{(1)}; \textbf{B}^{(2)}; \cdots ; \textbf{B}^{(n_{3})}] \in \mathbb {R}^{{n_1 n_3}\times n_2}\) and \(\texttt {fold}(\texttt {bvec}({\varvec{\mathcal {B}}}))={\varvec{\mathcal {B}}}\) denote the block vectorizing operator of \({\varvec{\mathcal {B}}}\) and its inverse, respectively. \(\texttt {bcirc}({\varvec{\mathcal {B}}}) \in \mathbb {R}^{{n_1n_3}\times {n_2n_3}}\) and \(\texttt {bdiag}({\varvec{\mathcal {B}}}) \in \mathbb {R}^{{n_1n_3}\times {n_2n_3}}\) denote the corresponding block circulant matrix and block diagonal matrix, respectively, i.e.

$$\texttt {bcirc}({\varvec{\mathcal {B}}})=\left[ \begin{array}{cccc}\textbf{B}^{(1)} &{} \textbf{B}^{\left( n_{3}\right) } &{} \cdots &{} \textbf{B}^{(2)} \\ \textbf{B}^{(2)} &{} \textbf{B}^{(1)} &{} \cdots &{} \textbf{B}^{(3)} \\ \vdots &{} \ddots &{} \ddots &{} \vdots \\ \textbf{B}^{\left( n_{3}\right) } &{} \textbf{B}^{\left( n_{3}-1\right) } &{} \cdots &{} \textbf{B}^{(1)}\end{array}\right] $$
$$\texttt {bdiag}({\varvec{\mathcal {B}}})=\left[ \begin{array}{cccc}\textbf{B}^{(1)}&{}0&{}0&{}0 \\ 0&{}\textbf{B}^{(2)}&{}0&{}0 \\ 0&{}0&{}\ddots &{}0 \\ 0&{}0&{}0&{}\textbf{B}^{\left( n_{3}\right) }\end{array}\right] $$

where the inverse block diagonalizing operator is denoted as \(\texttt {fold}(\texttt {bdiag}({\varvec{\mathcal {B}}}))={\varvec{\mathcal {B}}}\).
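For readers who prefer a computational view, the following minimal NumPy sketch realizes these operators for a 3-order tensor stored as an array of shape \((n_1, n_2, n_3)\) with frontal slices `B[:, :, k]`; the function names simply mirror the notation above and are illustrative, not part of any library.

```python
import numpy as np
from scipy.linalg import block_diag

def bvec(B):
    # Stack the frontal slices vertically: (n1*n3) x n2.
    return np.concatenate([B[:, :, k] for k in range(B.shape[2])], axis=0)

def fold_bvec(M, n3):
    # Inverse of bvec: split the (n1*n3) x n2 matrix back into frontal slices.
    return np.stack(np.split(M, n3, axis=0), axis=2)

def bcirc(B):
    # Block circulant matrix: block (i, j) is the frontal slice (i - j) mod n3.
    n1, n2, n3 = B.shape
    cols = [np.concatenate([B[:, :, (i - j) % n3] for i in range(n3)], axis=0)
            for j in range(n3)]
    return np.concatenate(cols, axis=1)

def bdiag(B):
    # Block diagonal matrix with the frontal slices on the diagonal.
    return block_diag(*[B[:, :, k] for k in range(B.shape[2])])
```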

Preliminaries of tensor

In this section, to facilitate a better understanding of the tensor nuclear norm, some definitions about matrix and tensor decompositions are introduced below [12, 13].

Definition 1

(Identity Tensor) The identity tensor \({\varvec{\mathcal {I}}}\in \mathbb {R}^{n_1\times n_1 \times n_3}\) is the tensor whose first frontal slice is the \(n_{1} \times n_{1}\) identity matrix and whose other frontal slices are all zero.

Definition 2

(Orthogonal Tensor) A tensor \({\varvec{\mathcal {E}}} \in \mathbb {R}^{n_{1} \times n_{1} \times n_{3}}\) is orthogonal if

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {E}}}^{\intercal } * {\varvec{\mathcal {E}}}={\varvec{\mathcal {E}}} * {\varvec{\mathcal {E}}}^{\intercal }={\varvec{\mathcal {I}}} \end{aligned} \end{aligned}$$
(1)

Definition 3

(Tensor Transpose) The transpose \({\varvec{\mathcal {B}}}^\intercal \in \mathbb {R}^{n_2\times n_1 \times n_3}\) of a tensor \({\varvec{\mathcal {B}}}\in \mathbb {R}^{n_1\times n_2 \times n_3}\) is obtained by transposing each frontal slice of \({\varvec{\mathcal {B}}}\) and then reversing the order of the transposed frontal slices 2 through \(n_3\).

Definition 4

(t-product) Let \({\varvec{\mathcal {B}}}\in \mathbb {R}^{n_1 \times n_2 \times n_3}\) and \({\varvec{\mathcal {N}}}\in \mathbb {R}^{n_2 \times n_4 \times n_3}\); the t-product \({\varvec{\mathcal {B}}} * {\varvec{\mathcal {N}}}\) is the \(n_1 \times n_4 \times n_3\) tensor

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {B}}} * {\varvec{\mathcal {N}}}= \texttt {fold} (\texttt {bcirc}({\varvec{\mathcal {B}}}) \texttt {bvec}({\varvec{\mathcal {N}}})) \end{aligned} \end{aligned}$$
(2)

where \(\texttt {bcirc}\) and \(\texttt {bvec}\) are defined as in “Notations”.
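As a hedged illustration, the t-product can be computed without forming the large block circulant matrix by exploiting its well-known equivalence to slice-wise multiplication in the Fourier domain; the sketch below assumes real-valued input tensors.

```python
import numpy as np

def t_product(B, N):
    # B: n1 x n2 x n3, N: n2 x n4 x n3  ->  n1 x n4 x n3.
    # Equivalent to fold(bcirc(B) @ bvec(N)) in Eq. (2), computed via the FFT
    # along the third mode followed by slice-wise matrix products.
    n3 = B.shape[2]
    Bf = np.fft.fft(B, axis=2)
    Nf = np.fft.fft(N, axis=2)
    Cf = np.stack([Bf[:, :, k] @ Nf[:, :, k] for k in range(n3)], axis=2)
    return np.real(np.fft.ifft(Cf, axis=2))
```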

Definition 5

(Tensor Singular Value Decomposition (t-SVD)) The t-SVD of \({\varvec{\mathcal {B}}}\) is defined as

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {B}}}={\varvec{\mathcal {U}}} * {\varvec{\mathcal {S}}} * {\varvec{\mathcal {V}}}^{\intercal } \end{aligned} \end{aligned}$$
(3)

where \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n_1\times n_2 \times n_3}\) is f-diagonal, and \({\varvec{\mathcal {U}}}\in \mathbb {R}^{n_1\times n_1 \times n_3}\) and \({\varvec{\mathcal {V}}}\in \mathbb {R}^{n_2\times n_2 \times n_3}\) are orthogonal. The t-SVD operator is illustrated in Fig. 1 to facilitate understanding.

Fig. 1 The t-SVD operator of tensor \({\varvec{\mathcal {B}}} \in \mathbb {R}^{n_1 \times n_2 \times n_3}\)

Definition 6

(t-SVD-based Tensor Nuclear Norm (t-TNN)) The t-SVD-based tensor nuclear norm \(\Vert {\varvec{\mathcal {B}}}\Vert _{\circledast }\) of \({\varvec{\mathcal {B}}} \in \mathbb {R}^{n_1 \times n_2 \times n_3}\) is defined as the sum of the singular values of all the frontal slices of \({\varvec{\mathcal {B}}}_{f}\), i.e.

$$\begin{aligned} \begin{aligned} \Vert {\varvec{\mathcal {B}}}\Vert _{\circledast }=\sum _{k=1}^{n_{3}}\left\| {\varvec{\mathcal {B}}}_{f}^{(k)}\right\| _{*}=\sum _{i=1}^{\min \left( n_{1}, n_{2}\right) } \sum _{k=1}^{n_{3}} \left|{\varvec{\mathcal {S}}}_{f}^{(k)}(i, i)\right|\end{aligned} \end{aligned}$$
(4)

where \({\varvec{\mathcal {S}}}_{f}^{(k)}\) is obtained from the complex-valued matrix SVD as \({\varvec{\mathcal {B}}}_{f}^{(k)}={\varvec{\mathcal {U}}}_{f}^{(k)} {\varvec{\mathcal {S}}}_{f}^{(k)} {{\varvec{\mathcal {V}}}_{f}^{(k)}}^\intercal \).
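A direct NumPy sketch of Eq. (4): take the FFT along the third mode and sum the singular values of every frontal slice of \({\varvec{\mathcal {B}}}_{f}\).

```python
import numpy as np

def t_tnn(B):
    # t-SVD-based tensor nuclear norm of Eq. (4).
    Bf = np.fft.fft(B, axis=2)
    return sum(np.linalg.svd(Bf[:, :, k], compute_uv=False).sum()
               for k in range(B.shape[2]))
```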

Related work

Multiple kernel subspace clustering

Given a data matrix \(\textbf{X}=\left[ \textbf{x}_{1}, \textbf{x}_{2}, \ldots , \textbf{x}_{n}\right] \in \mathbb {R}^{d\times n}\), where d, n, and \(\textbf{x}_i\) denote the sample dimensionality, the sample size, and the i-th sample, respectively, the self-expressiveness subspace learning (SESL) model [14,15,16,17,18] is formulated as

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}} \mathbf {\Psi }(\textbf{X}, \textbf{X S}) +\alpha \mathcal {R}( \textbf{S}) \end{aligned} \end{aligned}$$
(5)

where \(\alpha >0\) is a regularization parameter, \(\mathbf {\Psi }\) stands for the loss function, and \(\textbf{S}\) is the desired coefficient matrix; the regularization term \(\mathcal {R}(\textbf{S})\) is commonly instantiated as a sparse [19] or low-rank [20] constraint. Usually, the affinity graph can be calculated as \(({\textbf{S}}^\intercal +\textbf{S})/{2}\) for performing spectral clustering [21, 22]. However, Eq. (5) cannot handle well the non-linear data that widely exists in practice. Consequently, Eq. (5) can be extended to kernel space by using a kernel mapping function \(\phi (\cdot )\), i.e., \(\textbf{x}_i \rightarrow \phi (\textbf{x}_i)\). With r mapping functions \(\{\phi ^{(k)}(\cdot )\}_{k=1}^r\), multiple kernel subspace clustering (MKSC) can be formulated as

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}}\underbrace{\mathbf {\Psi }(\phi (\textbf{X}), \phi (\textbf{X})\textbf{S}) +\alpha \Phi _1(\textbf{S})}_{\text {kernel self-expressiveness subspace learning}} +\underbrace{\beta \mathcal {F}\left( \{\textbf{H}^{(k)}\}_{k=1}^r, \textbf{H}\right) }_{\text{ kernel } \text{ fusion } \text{ scheme }} \end{aligned}\nonumber \\ \end{aligned}$$
(6)

where \(\textbf{H}^{(k)}=(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\) is the pre-defined k-th kernel matrix; \(\Phi _1\) indicates certain regularization terms on \(\textbf{S}\); \(\mathcal {F}\) expresses the linear kernel fusion [7] or non-linear kernel fusion [11] scheme; and \(\beta \) is the balance parameter. Taking full advantage of the complementary information beneath different base kernels, MKSC methods have demonstrated performance enhancements compared with their single kernel counterparts [2, 6,7,8,9,10,11, 23,24,25,26]. Note herein that Eq. (6) focuses more on consensus kernel learning to indirectly yield an affinity graph [2, 6,7,8,9,10,11]. However, graph learning is the final goal for spectral clustering methods [3, 27,28,29]. Thus, the most advanced MKSC method, i.e., CAGL, proposes a two-step pure graph learning approach that intensively learns a consensus graph rather than a consensus kernel for clustering, and it has obtained promising results [3].

t-SVD-based tensor nuclear norm

Tensor computation has been widely used in machine learning, signal processing, data mining, computer vision, remote sensing, and biomedical engineering [30]. Owing to the validity of the nuclear norm and tensor computation, [31] extends Eq. (5) to a tensor version, i.e.

$$\begin{aligned} \begin{aligned} \min \limits _{ \textbf{S}^{(k)}} \mathbf {\Psi }(\textbf{X}^{(k)},\textbf{X}^{(k)} \textbf{S}^{(k)}) +\alpha \Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*} \end{aligned} \end{aligned}$$
(7)

where \(\Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*}=\sum _{k=1}^{r} \xi ^{(k)}\Vert \textbf{S}^{(k)}\Vert _{*}\), and the weights \(\{\xi ^{(k)}\}_{k=1}^r>0\) are set equal with \(\sum _{k=1}^{r} \xi ^{(k)}=1\) in [31]; \(\hat{{\varvec{\mathcal {S}}}}\) and \(\textbf{S}^{(k)}\) are the merged 3-order tensor and its unfolded matrices, respectively; \(\Vert \hat{{\varvec{\mathcal {S}}}}\Vert _{*}\) is the generalized tensor nuclear norm (g-TNN) of \(\hat{{\varvec{\mathcal {S}}}}\). Note that g-TNN lacks a clear physical meaning for general tensors, and it is illogical to penalize the ranks of \(\hat{{\varvec{\mathcal {S}}}}\) with the same weights [13]. Different from g-TNN, t-TNN not only possesses a clear physical meaning, but also exploits the high-order relationships among different affinity graphs [5, 32]. Besides, t-TNN captures the structural information of a tensor better than the unfolding-based TNN [33]. Therefore, t-TNN is employed in this paper to exploit the structural information of the graph tensor.

Proposed methodology

Auto-weighted multiple kernel tensor clustering

By leveraging self-expressiveness subspace learning with the r mapping functions \(\{\phi ^{(k)}(\cdot )\}_{k=1}^r\), the kernelized self-expressiveness is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{\textbf{S}^{(k)}} \Vert \phi ^{(k)}(\textbf{X})-\phi ^{(k)}(\textbf{X}) \textbf{S}^{(k)}\Vert _F^2+\alpha \mathcal {R}(\textbf{S}^{(k)})\\&\quad =\min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}[(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})-2{(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\textbf{S}^{(k)}}\\&\qquad +(\textbf{S}^{(k)})^\intercal {(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\textbf{S}^{(k)}}]+\alpha \mathcal {R}( \textbf{S}^{(k)})\\&\quad =\min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}(\textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+{\textbf{S}^{(k)}}^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)})+\alpha \mathcal {R}( \textbf{S}^{(k)})\\ \end{aligned}\nonumber \\ \end{aligned}$$
(8)

where \(\texttt {Tr}(\cdot )\) is the trace of a square matrix. Note here that the kernel trick, defined as \(\mathrm{{ker}}(\textbf{x}_i,\textbf{x}_j)=\phi (\textbf{x}_i)^\intercal \phi (\textbf{x}_j)\), yields \(\textbf{H}^{(k)}=(\phi ^{(k)}(\textbf{X}))^\intercal \phi ^{(k)}(\textbf{X})\). Considering that \(\mathcal {R}(\textbf{S}^{(k)})= \Vert \textbf{S}^{(k)}\Vert _F^2\) can avoid the trivial solution (i.e., \(\textbf{S}^{(k)}=\textbf{I}\)), where \(\textbf{I}\) is the identity matrix, we then have

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}{} \texttt {Tr}(\textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+\textbf{S}^{(k)^\intercal } \textbf{H}^{(k)}\textbf{S}^{(k)})+\alpha \Vert \textbf{S}^{(k)}\Vert _{F}^{2} \end{aligned}\nonumber \\ \end{aligned}$$
(9)

where \(\Vert \cdot \Vert _{F}\) is the Frobenius norm, and r candidate affinity graphs \(\{\textbf{S}^{(k)}\}_{k=1}^r\) can be learned. To fully capture both the consistent and complementary information among these r graphs, they are stacked into a tensor \({\varvec{\mathcal {S}}}^{*}=\texttt {bvfold}([\textbf{S}^{(1)}; \cdots ; \textbf{S}^{(r)}]) \in \mathbb {R}^{n \times n \times r}\), which is then rotated from \(\mathbb {R}^{n \times n\times r}\) to \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n\times r \times n}\), i.e., \({\varvec{\mathcal {S}}}=\texttt {rotate}({\varvec{\mathcal {S}}}^*)\), as illustrated in Fig. 2. Compared with using \({\varvec{\mathcal {S}}}^*\) directly, the computational complexity is largely reduced (as shown in “Computational complexity and convergence”).
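To make the construction concrete, the sketch below solves Eq. (9) in closed form for each base kernel (setting the gradient to zero gives \(\textbf{S}^{(k)}=(\textbf{H}^{(k)}+\alpha \textbf{I})^{-1}\textbf{H}^{(k)}\)) and then stacks and rotates the candidate graphs as in Fig. 2. The argument `kernels` and the default `alpha` are hypothetical placeholders, and the axis permutation is one plausible realization of the \(\texttt {rotate}\) operator.

```python
import numpy as np

def candidate_graph(H, alpha):
    # Closed-form minimizer of Eq. (9):
    # -2H + 2HS + 2*alpha*S = 0  =>  S = (H + alpha*I)^{-1} H.
    n = H.shape[0]
    return np.linalg.solve(H + alpha * np.eye(n), H)

def build_graph_tensor(kernels, alpha=0.1):
    # `kernels`: list of r pre-computed n x n base kernel matrices (hypothetical input).
    S_list = [candidate_graph(H, alpha) for H in kernels]
    S_star = np.stack(S_list, axis=2)          # n x n x r tensor S*
    return np.transpose(S_star, (0, 2, 1))     # rotate(S*): n x r x n
```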

Fig. 2 The rotation process of the graph tensor in AMKTC

The different \(\textbf{S}^{(k)}\) should share some consensus structure information, since the r candidate affinity graphs originate from the same data source. Additionally, the graph tensor \({\varvec{\mathcal {S}}}\) should enjoy the low-rank property, because the number of samples is usually much larger than the number of clusters. Thus, we impose the low-rank tensor constraint \(\Vert {\varvec{\mathcal {S}}}\Vert _\circledast \) on the rotated tensor. Different from the traditional g-TNN, t-TNN possesses appealing properties for exploring the complementary information of the candidate graphs [13]. By imposing the t-TNN constraint, Eq. (9) is extended to

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}\sum _{k=1}^{r} \texttt {Tr}&\left( \textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\&\quad +\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {S}}}\Vert _\circledast \\ \text {s.t.}\ {}&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(10)

where \(\alpha \) is a balance parameter, \(\texttt {bvec}({\varvec{\mathcal {S}}})=[\textbf{S}^{(1)}; \textbf{S}^{(2)}; \cdots ; \textbf{S}^{(r)}]\) is the block vectorization of \({\varvec{\mathcal {S}}}\), and \(\beta >0\) is a parameter controlling the contribution of the t-TNN term. In this way, Eq. (10) can simultaneously explore consistent and complementary information in the kernelized low-rank tensor space.

Once \({\varvec{\mathcal {S}}}\) is learned by Eq. (10), the averaged graph \(\textbf{S}=\frac{\sum _{k=1}^{r}\textbf{S}^{(k)}}{r}\) can be obtained, on which spectral clustering is then performed. Nevertheless, assigning equal weights to different graphs ignores their varying importance, so that unreliable graphs may significantly decrease the clustering performance. To solve this problem, a well-designed self-weighted strategy [34] can be adopted to assign appropriate weights to different graphs, and its model is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{\textbf{Z},{w}^{(k)}}\sum _{k=1}^{r}({w}^{(k)})^{\lambda }\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \\&\text {s.t.} \quad z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1, \sum _{k=1}^r w^{(k)}=1, {w}^{(k)}\ge {0}\\ \end{aligned} \end{aligned}$$
(11)

where \(\lambda > 0\) is a scalar controlling the weight distribution, and the constraints \(z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\) guarantee the probability property. Inspired by the similar form applied in [35], Eq. (11) can be converted to the following form:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{Z},\textbf{w}}&\sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}+\lambda \Vert \textbf{w}\Vert _2^2 \\&\text {s.t.}\quad z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1, \sum _{k=1}^r w^{(k)}=1, {w}^{(k)}\ge {0} \end{aligned} \end{aligned}$$
(12)

where \(\textbf{w}=[w^{(1)},\ldots ,w^{(r)}]\), and \(\lambda > 0\) is a regularization parameter controlling the contributions of the different \(\textbf{S}^{(k)}\) to \(\textbf{Z}\). On the one hand, if \(\lambda \rightarrow 0\), Eq. (12) yields a trivial solution, i.e., only the \(\textbf{S}^{(k)}\) with the smallest distance to \(\textbf{Z}\) receives the weight \(w^{(k)}=1\) and the other weights are zeros, while for \(\lambda \rightarrow \infty \), all the elements of \(\textbf{w}\) become equal to \(\frac{1}{r}\). Although effective, the linear combination of candidate graphs largely limits the representation ability of the consensus graph. Besides, noise and outliers may result in inappropriate weight assignments. On the other hand, \(\lambda \) in Eq. (12) also needs to be tuned over a large range, i.e., \(0 \rightarrow \infty \), and the ideal \(\lambda \) with the optimal clustering performance varies across datasets.

To address the aforementioned problems, we assume that each candidate affinity graph can be deemed a perturbation of the consensus graph. It is worth noting that candidate graphs more important to the consensus graph should receive larger weights, and suboptimal graphs smaller weights. Motivated by this assumption, we develop an auto-weighted strategy to remove \(\lambda \) without degrading clustering performance via the following optimization problem:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{Z},{w}^{(k)}}\sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \quad \text {s.t.} \ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \\ \end{aligned} \end{aligned}$$
(13)

where the constraints on \({w}^{(k)}\) are also removed so as to give the weights more freedom. The \({w}^{(k)}\) of Eq. (13) can be derived via Theorem 1.

Theorem 1

The weight corresponding to the k-th candidate affinity graph is computed by \(w^{(k)}=\frac{1}{2 \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}}\).

Proof

Motivated by the iteratively re-weighted method in [36], an auxiliary problem without \(\textbf{w}\) is introduced as

$$\begin{aligned} \begin{aligned} \min _{\textbf{Z}} \sum _{k=1}^{r} \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}\quad \text {s.t.}\ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \end{aligned} \end{aligned}$$
(14)

which leads to the following Lagrange function:

$$\begin{aligned} \begin{aligned} \sum _{k=1}^{r} \sqrt{\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}+\Phi _2(\mathbf {\Lambda }, \textbf{Z}) \end{aligned} \end{aligned}$$
(15)

where \(\mathbf {\Lambda }\) and \(\Phi _2(\mathbf {\Lambda }, \textbf{Z})\) denote the Lagrange multiplier and the indicator function of the constraints on \(\textbf{Z}\), respectively. Setting the derivative of the Lagrange function w.r.t. \(\textbf{Z}\) to zero, Eq. (15) becomes

$$\begin{aligned} \begin{aligned} \sum _{k=1}^{r}\widehat{w}^{(k)}\frac{\partial {\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}}}{\partial \textbf{Z}} +\frac{\partial \Phi _2 (\Lambda ,\textbf{Z})}{\partial \textbf{Z}}=0 \end{aligned} \end{aligned}$$
(16)

where \(\widehat{w}^{(k)}=1/(2\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F})\). It is obvious that Eq. (16) is also the derivative of the Lagrange function of Eq. (13). Note herein that \(\widehat{w}^{(k)}\) depends on \(\textbf{Z}\) and thus cannot be solved directly. However, if \(\textbf{Z}\) is kept fixed, \(\widehat{w}^{(k)}\) can be deemed a solution of problem Eq. (16). In practice, to avoid division by zero, \(\widehat{w}^{(k)}\) is computed as

$$\begin{aligned} \begin{aligned} {w}^{(k)}=\frac{1}{2\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}+\zeta } \end{aligned} \end{aligned}$$
(17)

where \(\zeta \) is a sufficiently small positive constant. \(\square \)
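A one-line realization of Eq. (17); the default value of \(\zeta \) below is an implementation choice rather than a prescribed setting.

```python
import numpy as np

def update_weights(Z, S_list, zeta=1e-12):
    # w^(k) = 1 / (2 * ||Z - S^(k)||_F + zeta), following Eq. (17).
    return np.array([1.0 / (2.0 * np.linalg.norm(Z - S, 'fro') + zeta)
                     for S in S_list])
```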

Learning \(\textbf{S}^{(k)}\) and \(\textbf{Z}\) separately would yield a suboptimal solution. In addition, the tensor low-rank constraint can suppress noise to further improve the reliability of the weights. To this end, we seamlessly integrate Eqs. (10) and (13) into a unified framework. Overall, the ultimate objective of AMKTC is formulated as

$$\begin{aligned} \min \limits _{\textbf{S}^{(k)},\textbf{Z},\textbf{w}}{} & {} \sum _{k=1}^{r} \texttt {Tr}\left( \textbf{H}^{(k)}-2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \nonumber \\ +{} & {} \alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {S}}}\Vert _\circledast + \gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2}\nonumber \\ \text {s.t.}\ {}{} & {} z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\nonumber \\ {}{} & {} {\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}$$
(18)

The learned consensus graph of Eq. (18) has the following advantages: (1) \(\textbf{Z}\) can capture the non-linear relationships well; (2) by utilizing the t-TNN constraint on \({\varvec{\mathcal {S}}}\), the high-order information underlying the multiple base kernels can be captured, so that more accurate candidate affinity graphs \(\{{\varvec{\textbf{S}}}^{(k)}\}_{k=1}^{r}\) lead to more effective and reasonable weights. Benefiting from these merits, a high-quality \(\textbf{Z}\) can be obtained to improve the clustering performance.

Optimization

Based on the alternating direction method of multipliers (ADMM) [37], an auxiliary variable \({\varvec{\mathcal {A}}}\) is first introduced to make Eq. (18) separable as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}, {\varvec{\mathcal {A}}},\textbf{Z},\textbf{w}}&\sum _{k=1}^{r} \texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\ +&\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \\ \text {s.t.}\quad&{\varvec{\mathcal {A}}}={\varvec{\mathcal {S}}},z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1\\&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(19)

whose augmented Lagrangian function is formed as follows:

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)},{\varvec{\mathcal {A}}},\textbf{Z},\textbf{w}}&\sum _{k=1}^{r} \texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) \\ +&\alpha \Vert \textbf{S}^{(k)}\Vert _F^2+\beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\frac{\mu }{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}}+\frac{{\varvec{\mathcal {Y}}}}{\mu } \Vert _F^2\\ +&\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} \quad \text {s.t.}\ z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1 \\&{\varvec{\mathcal {S}}} = \texttt {rotate}\left( \texttt {bvfold}([\textbf{S}^{(1)};\cdots ; \textbf{S}^{(r)}])\right) \end{aligned}\nonumber \\ \end{aligned}$$
(20)

where \(\mu \) and \({\varvec{\mathcal {Y}}}\) are the corresponding penalty parameter and Lagrangian multiplier, respectively. Equation (20) can be solved by alternately updating each variable with the remaining variables fixed.

Step 1. \(\textbf{S}\)-subproblem: By fixing \({\varvec{\mathcal {A}}}\), \(\textbf{w}\), and \(\textbf{Z}\), we update \(\{\textbf{S}^{(k)}\}_{k=1}^r\) via

$$\begin{aligned} \begin{aligned} \min \limits _{\textbf{S}^{(k)}}&\texttt {Tr}\left( -2 \textbf{H}^{(k)}\textbf{S}^{(k)}+(\textbf{S}^{(k)})^\intercal \textbf{H}^{(k)}\textbf{S}^{(k)}\right) +\alpha \Vert \textbf{S}^{(k)}\Vert _F^2\\ {}&+\gamma \sum _{k=1}^{r}{w}^{(k)}\Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _{F}^{2} +\frac{\mu }{2}\Vert \textbf{A}^{(k)}-\textbf{S}^{(k)}+\frac{\textbf{Y}^{(k)}}{\mu } \Vert _F^2 \\ \end{aligned}\nonumber \\ \end{aligned}$$
(21)

whose closed-form solution can be obtained by taking the derivative of the above objective w.r.t. \(\textbf{S}^{(k)}\) and setting it to zero. We then have

$$\begin{aligned} \begin{aligned} \left( \textbf{S}^{(k)}\right) ^*=(2\textbf{H}^{(k)}+\textbf{J}_1^{(k)})^{-1}(2\textbf{H}^{(k)}+\textbf{J}_2^{(k)}) \end{aligned} \end{aligned}$$
(22)

where \(\textbf{J}_1^{(k)}=(2\alpha +2\gamma w^{(k)}+\mu )\textbf{I}\) and \(\textbf{J}_2^{(k)}=2\gamma w^{(k)}\textbf{Z}+\mu \textbf{A}^{(k)}+\textbf{Y}^{(k)}\).
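The update of Eq. (22) amounts to a single linear solve per kernel; a minimal sketch with illustrative argument names:

```python
import numpy as np

def update_S_k(H, Z, A, Y, w_k, alpha, gamma, mu):
    # Closed-form solution of Eq. (22): (2H + J1) S = 2H + J2.
    n = H.shape[0]
    J1 = (2 * alpha + 2 * gamma * w_k + mu) * np.eye(n)
    J2 = 2 * gamma * w_k * Z + mu * A + Y
    return np.linalg.solve(2 * H + J1, 2 * H + J2)
```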

Step 2. \({\varvec{\mathcal {A}}}\)-subproblem: By fixing \(\{\textbf{S}^{(k)}\}_{k=1}^r\) and removing the irrelevant terms, the optimization problem for \({\varvec{\mathcal {A}}}\) is

$$\begin{aligned} \begin{aligned} \min \limits _{{\varvec{\mathcal {A}}}}\ \beta \Vert {\varvec{\mathcal {A}}}\Vert _\circledast +\frac{\mu }{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}}+\frac{{\varvec{\mathcal {Y}}}}{\mu } \Vert _F^2 \\ \end{aligned} \end{aligned}$$
(23)

Let \({\varvec{{\varvec{\mathcal {M}}}}}={\varvec{\mathcal {S}}}-\frac{{\varvec{\mathcal {Y}}}}{\mu }\); this t-TNN minimization problem can be solved by applying the tensor tubal-shrinkage operator of Theorem 2 [13].

Theorem 2

For two 3-order tensors \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\) and \({\varvec{{\varvec{\mathcal {M}}}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\), and a given scalar \(\rho >0\), consider the low-rank tensor optimization problem

$$\begin{aligned} \begin{aligned} \min _{{\varvec{\mathcal {A}}}} \rho \Vert {\varvec{\mathcal {A}}}\Vert _{\circledast }+\frac{1}{2}\Vert {\varvec{\mathcal {A}}}-{\varvec{{\varvec{\mathcal {M}}}}}\Vert _{F}^{2} \end{aligned} \end{aligned}$$
(24)

whose global optimal solution is given by the following tensor tubal-shrinkage operator:

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {A}}}={\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{{\varvec{\mathcal {M}}}}})={\varvec{\mathcal {U}}} * {\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{\mathcal {S}}}) * {\varvec{\mathcal {V}}}^{\intercal } \end{aligned} \end{aligned}$$
(25)

where \({\varvec{{\varvec{\mathcal {M}}}}}={\varvec{\mathcal {U}}} * {\varvec{\mathcal {S}}} * {\varvec{\mathcal {V}}}^{\intercal }\), \({\varvec{\mathcal {U}}}\) and \({\varvec{\mathcal {V}}}\) are orthogonal tensors of size \(n_1 \times n_1 \times n_3\) and \(n_2 \times n_2 \times n_3\), respectively, and \({\varvec{\mathcal {C}}}_{n_{3} \rho }({\varvec{\mathcal {S}}})={\varvec{\mathcal {S}}}*{\varvec{\mathcal {Q}}}\), where \({\varvec{\mathcal {Q}}}\in \mathbb {R}^{n_1\times n_2\times n_3}\) is an f-diagonal tensor whose diagonal elements in the Fourier domain are \({\varvec{\mathcal {Q}}}_f(i,i,j)=(1-\frac{n_3 \rho }{{\varvec{\mathcal {S}}}_f^{(j)}(i,i)})_+\).
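Under the t-TNN definition of Eq. (4), Theorem 2 amounts to soft-thresholding the singular values of each frontal slice of \({\varvec{\mathcal {M}}}_{f}\) by \(n_3\rho \); a minimal sketch is given below. For the \({\varvec{\mathcal {A}}}\)-subproblem it would be called with \({\varvec{\mathcal {M}}}={\varvec{\mathcal {S}}}-{\varvec{\mathcal {Y}}}/\mu \) and \(\rho =\beta /\mu \).

```python
import numpy as np

def tensor_tubal_shrinkage(M, rho):
    # Solves min_A rho*||A||_(t-TNN) + 0.5*||A - M||_F^2 (Theorem 2):
    # FFT along mode 3, shrink the singular values of each frontal slice by
    # n3*rho, then transform back and keep the real part.
    n1, n2, n3 = M.shape
    Mf = np.fft.fft(M, axis=2)
    Af = np.empty_like(Mf)
    for k in range(n3):
        U, s, Vh = np.linalg.svd(Mf[:, :, k], full_matrices=False)
        Af[:, :, k] = (U * np.maximum(s - n3 * rho, 0.0)) @ Vh
    return np.real(np.fft.ifft(Af, axis=2))
```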

Step 3. \(\textbf{Z}\)-subproblem: Eq. (18) reduces to the following subproblem:

$$\begin{aligned} \begin{aligned} \min \limits _{z_{ij}\ge 0, \textbf{z}_i^\intercal \textbf{1}=1}&\sum _{k=1}^r w^{(k)} \Vert \textbf{Z}-\textbf{S}^{(k)}\Vert _F^2 \end{aligned} \end{aligned}$$
(26)

which can be rewritten in vector form as

$$\begin{aligned} \begin{aligned} \min _{\textbf{z}_i} \sum _{k=1}^{r}w^{(k)}\Vert \textbf{z}_i-\textbf{s}_i^{(k)}\Vert _{2}^{2}\quad \text {s.t.} \ 0 \le \textbf{z}_i, \textbf{1}^\intercal \textbf{z}_i=1 \end{aligned} \end{aligned}$$
(27)

which can be solved efficiently as in [3].
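Since the minimizer of Eq. (27) for each \(\textbf{z}_i\) is the Euclidean projection of the weighted average of the \(\textbf{s}_i^{(k)}\) onto the probability simplex, one concrete way to implement this step is the standard sort-based projection, costing about \(\mathcal {O}(n\log n)\) per vector. The sketch below treats \(\textbf{z}_i\) as the i-th row of \(\textbf{Z}\) (adapt if the \(\textbf{z}_i\) convention is column-wise); it is an assumption-laden illustration, not necessarily the solver used in [3].

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {z : z >= 0, sum(z) = 1} (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[idx]) / (idx + 1.0)
    return np.maximum(v + theta, 0.0)

def update_Z(S_list, w):
    # Solution of Eq. (27): project the weighted average of the candidate
    # graphs onto the probability simplex, one row at a time.
    avg = sum(wk * Sk for wk, Sk in zip(w, S_list)) / np.sum(w)
    return np.vstack([project_simplex(avg[i]) for i in range(avg.shape[0])])
```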

Step 4. \(\textbf{w}\)-subproblem: Ignoring the irrelevant items and fixing the other variables, we update \(\textbf{w}\) via Theorem 1.

Step 5. Multipliers-subproblem: The multiplier and penalty updates involved in ADMM are given by

$$\begin{aligned} \begin{aligned} {\varvec{\mathcal {Y}}}={\varvec{\mathcal {Y}}}+\mu ({\varvec{\mathcal {A}}}-{\varvec{\mathcal {S}}})\\ \mu =\min \left( \tau _1 \mu , \mu _{\max }\right) \end{aligned} \end{aligned}$$
(28)

where \(\tau _1\) and \(\mu _{\max }\) are scalars of ADMM. The stopping criterion is that the residual condition \(\max \{ |\textrm{obj}^{t+1}-\textrm{obj}^{t}|,\Vert {\varvec{\mathcal {S}}}^{t+1}-{\varvec{\mathcal {S}}}^{t}\Vert _{F}^{2}\} \le \epsilon \) is satisfied, where obj, t, and \(\epsilon \) are the objective value of Eq. (18), the iteration index, and the pre-defined threshold tolerance, respectively. Finally, the optimization procedure of AMKTC is summarized in Algorithm 1.

Algorithm 1 The optimization procedure of AMKTC
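Putting the steps together, the following is a minimal sketch of Algorithm 1 that wires up the helper functions sketched above (`candidate_graph`, `update_S_k`, `tensor_tubal_shrinkage`, `update_Z`, `update_weights`); the initialization, default parameter values, simplified stopping test, and final symmetrization are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def amktc(kernels, alpha, beta, gamma, mu=1e-3, tau1=2.0, mu_max=1e10,
          eps=1e-6, max_iter=100):
    # `kernels`: list of r pre-computed n x n base kernel matrices.
    r, n = len(kernels), kernels[0].shape[0]
    S = [candidate_graph(H, alpha) for H in kernels]        # warm start via Eq. (9)
    Z, w = sum(S) / r, np.ones(r) / r
    A = [np.zeros((n, n)) for _ in range(r)]
    Y = [np.zeros((n, n)) for _ in range(r)]
    for _ in range(max_iter):
        S_old = [Sk.copy() for Sk in S]
        # Step 1: candidate graphs, Eq. (22).
        S = [update_S_k(kernels[k], Z, A[k], Y[k], w[k], alpha, gamma, mu)
             for k in range(r)]
        # Step 2: low-rank tensor A via Theorem 2 (rotate, shrink, rotate back).
        S_rot = np.transpose(np.stack(S, axis=2), (0, 2, 1))
        Y_rot = np.transpose(np.stack(Y, axis=2), (0, 2, 1))
        A_back = np.transpose(
            tensor_tubal_shrinkage(S_rot - Y_rot / mu, beta / mu), (0, 2, 1))
        A = [A_back[:, :, k] for k in range(r)]
        # Steps 3 and 4: consensus graph and auto-weights.
        Z = update_Z(S, w)
        w = update_weights(Z, S)
        # Step 5: multipliers and penalty, Eq. (28).
        Y = [Y[k] + mu * (A[k] - S[k]) for k in range(r)]
        mu = min(tau1 * mu, mu_max)
        # Simplified residual check (the paper also monitors the objective value).
        if max(np.linalg.norm(S[k] - S_old[k], 'fro') for k in range(r)) < eps:
            break
    return (Z + Z.T) / 2   # symmetrized affinity graph for spectral clustering
```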

Computational complexity and convergence

Algorithm 1 involves five main subproblems, i.e., problems (22), (23), (27), (17), and (28), whose computational complexities are as follows. Problem (22) requires \(\mathcal {O}(rn^{3})\) because of its matrix inversion. Problem (23) first requires \(\mathcal {O}(rn^2\texttt {log}(n))\) for the FFT and inverse FFT of the tensor \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n\times r \times n}\), and then \(\mathcal {O}(r^2n^2)\) for the SVDs of its frontal slices; thus, its total complexity is \(\mathcal {O}(rn^2\texttt {log}(n)+r^2n^2)\) under the \(\texttt {rotate}\) operator. Note that if the \(\texttt {rotate}\) operator in Fig. 2 is not performed, the complexity increases to \(\mathcal {O}(rn^2\texttt {log}(r)+rn^3)\), where \(r\ll n\). Updating the weights via problem (17) costs \(\mathcal {O}(rn^{2})\), and problems (27) and (28) involve only element-wise operations costing at most \(\mathcal {O}(rn^2)\). Theoretically, the computational cost of Algorithm 1 is approximately \(\mathcal {O}(tn^3)\), where t denotes the total number of iterations. Since \(t\ll n\) and r is small in practice, the overall cost of our AMKTC can be deemed \(\mathcal {O}(n^3)\). Although the computational complexity of the competitors and AMKTC is similar, the clustering performance of the competitors is largely inferior to that of AMKTC.

Table 1 Summaries of the seven used datasets
Table 2 The choice of kernel function

Although it is difficult to prove the convergence of Algorithm 1 in general, it converges to a local optimum with high probability since all of its subproblems have closed-form solutions. In addition, the empirical evidence in “Convergence” demonstrates that Algorithm 1 has good convergence behavior.

Experiments

In this section, the effectiveness of our AMKTC is verified by conducting experiments on seven widely used datasets.

Benchmark datasets and kernel setting

In the experiments, seven widely used datasets, including Yale, Jaffe, AR, ORL, binaryalphadigs (BA), COIL-20, and Deep CIFAR-10 (DC), are employed to evaluate the clustering performance of AMKTC, where DC is a dataset of deep learning features constructed as in [3]. The details of these datasets are summarized in Table 1. Following the settings in [7, 8, 10, 23], we construct 12 base kernels (i.e., \(r=12\) is the number of kernel functions) and form the kernel pool listed in Table 2, including seven radial basis function (RBF) kernels \({ker}(x_{i}, x_{j})=\exp (-\Vert x_{i}-x_{j}\Vert _{2}^{2} /(2 \tau _2 \sigma ^{2}))\), where \(\tau _2\) varies in the range \(\{0.01,0.05,0.1,1,10,50,100\}\) and \(\sigma \) is the maximum distance between samples; four polynomial kernels \({ker}\left( x_{i}, x_{j}\right) =\left( a+x_{i}^{T} x_{j}\right) ^{b}\) with \(a\in \{0,1\}\) and \(b\in \{2,4\}\); and a cosine kernel \({\text {ker}}\left( x_{i}, x_{j}\right) =\left( x_{i}^{T} x_{j}\right) /\left( \left\| x_{i}\right\| \cdot \left\| x_{j}\right\| \right) \). Finally, all the kernel matrices \(\left\{ \textbf{H}^{(k)}\right\} _{k=1}^{r}\) are normalized to the [0, 1] range.
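For reproducibility, the sketch below builds the 12-kernel pool of Table 2 from a \(d\times n\) data matrix; the min-max normalization to [0, 1] is one common choice and is an assumption, since the exact normalization is not specified here.

```python
import numpy as np

def build_kernel_pool(X):
    # X: d x n data matrix; returns the 12 base kernels of Table 2
    # (seven RBF, four polynomial, one cosine), each scaled to [0, 1].
    G = X.T @ X                                              # inner products
    sq = np.diag(G)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)  # squared distances
    sigma = np.sqrt(D2.max())                                # max pairwise distance
    kernels = [np.exp(-D2 / (2 * t2 * sigma ** 2))
               for t2 in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    kernels += [(a + G) ** b for a in (0, 1) for b in (2, 4)]
    norms = np.sqrt(sq)
    kernels.append(G / np.outer(norms, norms))               # cosine kernel
    # Min-max scaling of each kernel matrix to the [0, 1] range (assumed).
    return [(K - K.min()) / (K.max() - K.min()) for K in kernels]
```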

Evaluation metrics

Three widely used evaluation metrics, namely clustering accuracy (ACC), normalized mutual information (NMI), and purity, are used to evaluate the clustering performance of the proposed AMKTC. A higher value indicates better clustering performance.
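ACC and purity can be computed as sketched below; ACC uses the standard Hungarian matching between predicted clusters and ground-truth classes, and NMI is available in scikit-learn as `normalized_mutual_info_score`. The implementation assumes integer class labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Best one-to-one cluster-to-class matching via the Hungarian algorithm.
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == t))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(y_true)

def purity(y_true, y_pred):
    # Each cluster is credited with its majority ground-truth class
    # (assumes non-negative integer labels).
    hits = sum(np.bincount(y_true[y_pred == c]).max() for c in np.unique(y_pred))
    return hits / len(y_true)
```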

Table 3 The ACC of some compared MKC methods on seven datasets
Table 4 The NMI of some compared MKC methods on seven datasets
Table 5 The purity of some compared MKC methods on seven datasets

Our AMKTC is evaluated by comparing it with eleven state-of-the-art methods, including MKKM [38], RMKKM [24], MVCLFA [25], LKGr [7], SCMK [6], SMKL [8], JMKSC [9], LLMKL [10], SPMKC-E [11], SPMKC [11], and CAGL [3]. The clustering results are reported in Tables 3, 4 and 5.

Fig. 3 Clustering ACC w.r.t. the parameters \(\alpha \) and \(\beta \) on the AR, BA, Deep CIFAR-10, COIL-20 and ORL datasets

Fig. 4 Clustering NMI w.r.t. the parameters \(\alpha \) and \(\beta \) on the AR, BA, Deep CIFAR-10, COIL-20 and ORL datasets

For fairness, we tune the parameters of the competitors by following the values recommended by their respective authors. Further, we run each competitor 20 times and report the average results. The concrete details of the k-means based methods are described as follows:

  • MKKM [38] is a multiple kernel extension of fuzzy k-means.

  • RMKKM [24] is the robust extension of the MKKM.

  • MVCLFA [25] divides the pre-defined kernel matrices into multiple low-dimensional partitions, which are then used to learn a consensus partition for MKC.

Please refer to “Introduction” for the details of the remaining MKSC comparison methods.

Experimental results and analysis

Tables 3, 4 and 5 report the average clustering results of the competitors, where the standard deviations are not shown since they are less than \(1\%\).

  • As shown in Tables 3, 4 and 5, the proposed AMKTC consistently achieves significant improvements under all evaluation metrics compared with the three k-means based MKC methods and eight MKSC methods. Noticeably, averaged over six datasets, AMKTC outperforms the second best method by approximately \(19.8\%\), \(16.5\%\), and \(18.4\%\) in terms of ACC, NMI, and purity, respectively. Especially on the Yale, BA, and Deep CIFAR-10 datasets, AMKTC outperforms the second best method by approximately \(20.0\%\), \(44.0\%\), and \(18.6\%\) in ACC, \(23.8\%\), \(32.4\%\), and \(24.3\%\) in NMI, and \(19.4\%\), \(34.3\%\), and \(17.8\%\) in purity, respectively.

  • Theoretically, the k-means based MKC methods [24, 25] mainly focus on learning a consensus kernel, and then employ kernel k-means to obtain the final clustering results. They face the problem of a narrow solution set and cannot well preserve the complex structure hidden in the data [3].

  • The MKSC methods adopt the widely used self-expressiveness subspace learning (SESL) to learn an affinity graph, and then employ spectral clustering to obtain the final clustering results, so as to effectively capture the subspace structure. Accordingly, MKSC methods perform better than k-means based MKC ones in most cases. Among these MKSC methods, only CAGL and the proposed AMKTC are pure graph learning methods, while the others persist in LKF or NKF schemes, which violates the intention that a high-quality affinity graph is the real expectation of MKSC methods [3]. From the results, it is obvious that CAGL and AMKTC are superior to the other competitors.

  • Compared with AMKTC, CAGL mainly suffers from the following flaws (see “Introduction”): (1) it neglects the high-order correlations among base kernels; and (2) it adopts a two-step strategy, resulting in a suboptimal solution. Conversely, by introducing t-TNN, AMKTC can directly learn a consensus graph with optimal weights, and meanwhile capture the high-order correlations in the kernelized tensor space. The results thoroughly demonstrate the superiority of our method.

Parameter sensitivity

Fig. 5 Original data \(\textbf{X}\), learned \(\textbf{Z}\) of AMKTC, and learned \(\textbf{Z}\) of CAGL visualized on the ORL and DC datasets

Fig. 6 Convergence curves of the proposed AMKTC method on the AR, BA, Deep CIFAR-10, COIL-20, ORL and Yale datasets

Three parameters \(\alpha \), \(\beta \), and \(\gamma \) of AMKTC need to be set properly. For simplicity, \(\gamma \) is fixed as \(\gamma =0.0001\), which experimentally yields the best clustering performance on all evaluation datasets. Specifically, the parameters \(\alpha \) and \(\beta \) balance the effects of \(\Vert \textbf{S}^{(k)}\Vert _F^2\) and \(\Vert {\varvec{\mathcal {S}}}\Vert _\circledast \), respectively. Using a grid search with a step factor of 10, we tune both \(\alpha \) and \(\beta \) from \(10^{-5}\) to \(10^2\). Taking the AR, ORL, COIL-20, BA, and Deep CIFAR-10 datasets as examples, as shown in Figs. 3 and 4, the best performance of AMKTC on the COIL-20 dataset is achieved with \(\alpha \in [10^{-2},10^{-1}]\) and \(\beta \in [10^{-5},10^{-4}]\), while on the remaining evaluation datasets, the best performance is obtained with \(\alpha \in [10^{-2},1]\) and \(\beta \in [10^{-5},10^{-3}]\). These analyses demonstrate that AMKTC is not sensitive to the parameter settings within these ranges.

Visualization of clustering result

The original data and the clustering results of two methods (i.e., our AMKTC and the second best competitor CAGL) are visualized on the ORL and DC datasets in Fig. 5. First, the clustering results of AMKTC and CAGL clearly exhibit a better structure distribution than the original data on both datasets. In addition, although the differently colored points of CAGL in Fig. 5b and e show a scattered structure distribution, CAGL still yields some incorrect cluster partitions compared with our method. In contrast, Fig. 5c and f of AMKTC show fewer wrong partitions than the second best competitor CAGL; especially in Fig. 5f, only a few partitions are wrong. This phenomenon demonstrates that our method generalizes well across different datasets.

Convergence

In this section, we evaluate the convergence of the proposed AMKTC method on six datasets (i.e., AR, BA, Deep CIFAR-10, COIL-20, ORL, and Yale), and the results are shown in Fig. 6. Obviously, the iterative algorithm converges rapidly to a stable point, and on some datasets it meets the stopping criterion within just eight iterations.

Conclusion

In this paper, a novel multiple kernel subspace clustering method, namely AMKTC, is proposed to address the challenging problem of capturing the high-order correlations hidden in different base kernels. Meanwhile, a non-linear auto-weighted graph fusion scheme is used to learn the consensus affinity graph with optimal weights. In AMKTC, we integrate candidate affinity graph learning, graph tensor learning, and auto-weighted consensus graph learning into a unified objective function, such that a consensus affinity graph with optimal weights is learned for clustering. Compared to the state-of-the-art MKC methods, the superior performance of our method is demonstrated via extensive experimental results. In the future, we will study tensor train and tensor ring decompositions to further enhance the representation capability of the graph tensor.