1 Introduction

Clustering is typically thought of as a single view problem in computer vision, where an algorithm groups individual data samples based on their overall qualities. These samples, however, may be the outcome of various interpretations or representations of the underlying data. For instance, we can generate different sets of samples as Gabor [1], CLD [2] and HOG [3] descriptors of the images. These representations may hold complementary properties that can be leveraged for improved clustering. This fact has recently piqued interest of the computer vision community, resulting in an emerging topic of multi-view clustering (MVC) [4,5,6,7,8,9,10,11,12].

Another contemporary line of research for image clustering favors graph-based methods [13,14,15,16,17]. The main benefit of graphs for the clustering problem is that they naturally have the capacity to encode data structure information. For instance, methods like [13, 14, 18,19,20,21,22] leverage trained Graph Convolutional Networks (GCNs) for images to reason about the linkage likelihoods between a given node and its neighbours for graph completion, thereby achieving more accurate clusters.

Fig. 1
figure 1

Illustrations of concepts used in the text. (a) Classic Contrastive Graph Learning (CGL) - learns and contrasts homogeneous graphs at node and graph level. (b) Nodes in a homogeneous graph can have multiple views. (c) Relations among multiple views constitute a heterogeneous affinity graph. \(v_{i}\) indicates a random original node in the dataset, and \({m_{i}^{j}}\) is the j-th view of \(v_{i}\). Different colors of nodes indicate different views. (d) A heterogeneous affinity graph can be broken down into multiple affinity homogeneous graphs. (e) Our Local Feature Propagation (LFP) explores local neighbourhood relations. (f) Our Influence-aware Feature Propagation (IFP) explores relationships from a target node to the influential node in each view

In general, graph-based methods are known to benefit from Contrastive Learning (CL) [23], which induces models using self-supervision. During training, it maximises the agreement between its predictions and the transformed samples of the original sample. For graphs, the analogous Contrastive Graph Learning (CGL) paradigm aims to maximise the prediction agreement on different views of the same underlying graph [4,5,6,7, 24]. These views are created by applying random operations, e.g., adding/deleting nodes/edges and dropping features, to an original graph. In line with the negative sample creation in CL, the CGL considers other original graphs as the negative samples. It learns node-level (intra-view) or graph-level (inter-view) representations - illustrated in Fig. 1(a) - with a graph neural network and a contrastive loss function.

The self-supervised CGL paradigm naturally suites to the multi-view perspective. For instance, [25] and [26] created different graph views and then utilised node-level and graph-level representations for multi-view contrastive learning. These methods consider structural semantics as global information for learning the node-level embeddings, neglecting the fact that each node can also have various features to provide more information. Coming back to our main problem of multi-view image clustering, existing methods generally first compute a data affinity matrix for raw features or learned representation under multiple views, and then perform clustering using the affinity matrix [27,28,29,30,31,32,33,34]. These methods concatenate multiple views to construct a denoised homogeneous graph for image clustering. We provide a simple illustration of a multi-view homogeneous graph for image data in Fig. 1(b), where views are defined using compositional properties. The graph denoising operations, however, can lead to the loss of important semantic information. Additionally, the heterogeneous properties of multi-view data may become meaningless if several views are combined into a homogeneous graph. Theoretically, by treating images as nodes in heterogeneous graphs, it is possible to use more complementary information for multi-view image clustering - Fig. 1(c).

Considering the above narrative, in this work, we propose an inductive Multi-view Image Clustering framework with self-supervised contrastive heterogeneous Graph co-learning (GoMIC). In GoMIC, we maintain the relationships between different views as a heterogeneous affinity graph, while preserving the uniqueness and independence of each view. Our heterogeneous graph consists of several homogeneous affinity graphs - Fig. 1(d). Each node can readily get the local neighbourhood data from each view by creating the heterogeneous affinity graph. To understand the propagation of node features, we created two encoding schemes. In the first, we propagate feature from a node to its neighbourhood in its own and other views through several hops - Fig. 1(e). The second strategy is influence-aware propagation that learns how each node feature propagates towards the densest nodes - Fig. 1(f). Both encoding strategies employ a proposed attention mechanism to control the feature influence of different nodes. Furthermore, to better exploit the learned embeddings under CGL, we explicitly contrast each pair of nodes for mutual information maximisation. We also customise the contrastive loss function to fit our contrastive objective.

Our key contributions are summarised below.

  1. 1.

    To the best of our knowledge, this is a pioneering multi-view image clustering method using heterogeneous information networks that leverages contrastive graph learning.

  2. 2.

    We devise two novel heterogeneous information network encoder strategies and an influence attention mechanism to learn the embedding of each node according to its local feature propagation and influence-aware feature propagation, respectively.

  3. 3.

    We enhance the loss function for contrastive graph learning to consider the mutual first neighbouring nodes as the mutual positive samples.

  4. 4.

    We conduct comprehensive experiments on three benchmark datasets, not only demonstrating large improvements over the existing self-supervised heterogeneous graph methods, but achieving better results than popular supervised methods across the datasets.

We discuss related work in Section 2. Section 3 demonstrates our proposed framework, GoMIC. In Section 4, we introduce three open multi-view image datasets and evaluate our proposed framework with comparing to 6 state-of-the-art MVC methods. Finally, Section 5 concludes this paper.

2 Prior art & background

We discuss the related work below. This discussion also includes analytical details that are later utilised in discussing the methodology.

2.1 Heterogeneous graph neural network

In recent years, heterogeneous graphs are becoming increasingly popular in neural network research [35, 36]. For instance, [37] studied the use of hierarchical attentions to depict node-level and semantic-level structures in heterogeneous graphs. Similarly, [38] incorporated intermediate nodes of meta-paths in the networks. [39] developed GTN to automatically identify useful graph connections. A technique dubbed MAHINDER is proposed in [40, 41] to employ and encode meta-paths over different views with attention on the importance of attributes and data views. In an unsupervised setting, a heterogeneous graph neural network is proposed in [42], which samples a fixed size of neighbours and fuses their features using LSTMs [43]. [44, 45] focused on network schema and preserved pairwise and network schema proximity simultaneously. [46] devised node- and edge-type dependent parameters to characterise heterogeneous attention over graph edges. The above methods rely strongly on supervised signals of data to encode graphs, whereas graph structures among the nodes are neglected. In [47], a heterogeneous network HeCo is proposed, which generates meta-paths and network schemas and exploits contrastive learning to further use signals of data in a self-supervised manner. [48] and [49] created item clusters and entity clusters to organise the objects and their nearby entities in the knowledge graph. After that, the hierarchically combining the heterogeneity data derived from the clusters with the weights produced by the hierarchical attention layers yields the representations. Nevertheless, encoding graphs and nodes while comprehensively considering node relations and graph structures still remains largely unresolved for the method.

2.2 Feature propagation

Considering that we devise feature propagation scheme in our technique, it is imperative to discuss related research in this direction in more detail. Expanding a node’s feature by propagation is commonly conducted under a generalisation of pagerank equation [50, 51], which can be expressed as

$$\begin{aligned} \widetilde{\varvec{X}}=\varvec{X} \mathcal {W}_{1}+A \widetilde{\varvec{X}} \mathcal {W}_{2}, \end{aligned}$$
(1)

where \(\varvec{X}\) contains the original features, A encodes the adjacency, \(\mathcal {W}_{1}\) and \(\mathcal {W}_{2}\) are coefficient matrices. However,  (1) is not naturally invertible. Therefore, [52] modified it with the degree matrix D as follows

$$\begin{aligned} \widetilde{\varvec{X}}=X \mathcal {W}_{1}+D^{-1} A \widetilde{\varvec{X}} \mathcal {W}_{2}. \end{aligned}$$
(2)

The above is convergent if \(\mathcal {W}_{2}\) is non-negative along other conditions. Still, this can only allow feature propagation on homogeneous graphs.

[53] attempted to extend feature propagation to heterogeneous graphs. In a heterogeneous graph, they assumed that there are two different sorts of nodes. With a threshold sparsifying the feature similarities, they first created a learnt feature similarity network for each type. Next, they generated the feature propagation graph for each type. Finally, the overall feature graph is obtained via channel attention [39]. Incidentally, there is a huge computational footprint of this method when dealing with a heterogeneous graph with large number of relations. More importantly, this method is more directed to heterogeneous graph sparsification rather than heterogeneous feature propagation.

2.3 Contrastive loss function

Contrastive Graph Learning (CGL) is derived from contrastive learning (CL) [23] for graph learning. CGL has been increasingly researched recently [4,5,6,7, 25] and has achieved excellent performance on graph or node classification by generating and contrasting positive and negative graph view pairs. Here, we organise our review by mainly focusing on contrastive loss function of the related CGL studies, which is helpful in understanding our contribution in Section 3.5. In [4, 7, 25, 54,55,56], the authors adopt the learning objective of CL rather straightforwardly. In doing so, they focused on node-level representations, and neglected the graph-level information. In [5, 6], for any node \(v_{i}\), its embedding generated in one view \(v^{\prime }_{i}\) and the embedding in the other view \(v^{\prime \prime }_{i}\), form a positive pair, whereas embeddings of other nodes are negative samples. The pairwise objective for each positive pair \((v^{\prime }_{i},v^{\prime \prime }_{i})\) is defined as

$$\begin{aligned} \ell \left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) =-\log \frac{\text {exp}(\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) / \tau )}{\underbrace{\text {exp}(\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) / \tau )_{\text {positive pair }}+\mathbbm {E}+\mathbbm {A}}} \end{aligned}$$
(3)

where \(\text {sim}\) denotes the function computing cosine similarity, \(\tau\) is the temperature parameter, \(\mathbbm {E}\) identifies contrasting of inter-view negative pairs as \(\mathbbm {E} = \displaystyle {\sum _{k=1}^{N} \mathbbm {1}_{\left[ k \ne i\right] } \text {exp}({\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{k}\right) / \tau )}}\), and \(\mathbbm {A}\) denotes the contrasting of intra-view negative pairs as \(\mathbbm {A} = \displaystyle { \sum _{k=1}^{N} \mathbbm {1}_{[k \ne i]} \text {exp}({\text {sim}\left( v^{\prime }_{i}, v^{\prime }_{k}\right) / \tau )}}\), where \(\mathbbm {1}_{[k \ne i]} \in \{0,1\}\) is an indicator function.

Connecting the above back to the heterogeneous graph neural networks, [47] proposed collaboratively contrastive optimization to expand the scope of defining positive samples and used it for self-supervised learning for heterogeneous graphs. However, in this state-of-the-art contrastive objective for heterogeneous graph learning, there is still a lack of consideration on feature propagation in graphs. Also, the frequent use of thresholds in the existing techniques decreases the feasibility of proposed models.

3 Proposed approach

Fig. 2
figure 2

Overview of the proposed GoMIC framework. For a single-view image dataset, we utilise M feature descriptors to generate M views. Then, the heterogeneous affinity graph G(.) with \(M+1\) views (including the original) is constructed, where \(M+1\) homogeneous affinity graphs are connected with the relations between the original feature view and each descriptor-based view. Next, we propose two feature propagation encoders, i.e. Local Feature Propagation (LFP) and Influence-aware Feature Propagation (IFP), for generating contrasting representations with cross-view contrastive attention and semantic attention respectively. Before the final contrastive objective, we extend the definition of positive samples to include the first neighbour of each node, which are discovered with the help of LFP

In this section, we discuss the proposed multi-view image clustering with self-supervised contrastive heterogeneous graph co-learning (GoMIC), illustrated in Fig. 2. Our method encodes nodes from the local neighbourhood context and influence-oriented context, which fully captures contrastive structure of the heterogeneous affinity graph. This ensures that our approach well involves clustering boundary nodes (i.e., vague nodes) in the computations. During the encoding, we design innovative attention mechanisms, which learn feature propagation embeddings naturally. Moreover, in our method, a novel contrastive graph learning framework enables embeddings to supervise each other informatively.

3.1 Preliminaries

In order to describe our method for multi-view image clustering with contrastive heterogeneous graph learning, we first formalise the following pertinent notions for better understanding.

Definition 1

(Multi-view Image Data) Given an image dataset \(\varvec{V} = \{v_{i}\}_{i=1}^{N}\) and its feature set \(\varvec{X} = \{\varvec{x}_{i}\}_{i=1}^{N}\), where N denotes the number of data samples. Each node \(v_{i}\) can be represented in multiple views \(\{{m_{i}^{n}}\}_{n=1}^{M}\) with multi-view features \(\{\varvec{x}_{i}^{n}\}_{n=1}^{M}\), where M is the number of views.

Definition 2

(Heterogeneous Affinity Graph) A heterogeneous affinity graph \(G = (V, E, \phi , \psi )\) is constructed from given and multi-view data. V and E are the node and the edge sets of the original \(\varvec{X}\), \(\phi\) is the view-based node type mapping function, and \(\psi\) is the view-based edge type mapping function. We let \(\phi _{u}(V)\) denote the node embedding of the u-th view, and \(\phi _{0}(V) = V\), i.e., the original node features are the 0-th view. By analogy, \(\psi _{0}(E) = E\). Also, we let \(\phi _{u}(v_{i}) = {m_{i}^{u}}\) and \(\psi _{u}(e_{i}) = {p_{i}^{u}}\), where \({p^{u}_{i}}\) indicates an edge \(e_{i}\) in the u-th view.

Definition 3

(Feature Propagation) Given the node feature set \(\varvec{X} = \{\varvec{x}_{i}\}_{i=1}^{N}\) and the edge weight set \(\varvec{W} = \{w_{i,j}\}\), where \(i, j \in [1,N]\) and \(i \ne j\), the propagated feature of a node \(v_{i}\) is governed by

$$\begin{aligned} \widetilde{\varvec{x}_{i}}=\mathcal {P}(\varvec{x}_{i}, \{\widetilde{\varvec{x}_{j}}\},\{w_{i,j}\}; \theta _{p}), \end{aligned}$$
(4)

where \(\widetilde{\varvec{x}_{j}}\) indicates the propagated feature set of \(v_{i}\) neighbours, \(\{w_{i,j}\}\) is the edge weight set of edges between \(v_{i}\) and its neighbours, \(\mathcal {P}\) is feature propagation function and \(\theta _{p}\) is the propagation parameter.

Definition 4

(Local Propagation Network) The local propagation network \(G^{\prime }(v_{i}) = (V^{\prime }(v_{i}), E^{\prime }(v_{i}),\phi , \psi )\) of a node \(v_{i}\) is a directed graph, which consists of several local directed subgraphs. A local directed subgraph starts from \(v_{i}\) though l hops, each of which finds \(k_{l}\) nearest neighbours extracting the neighbourhood of \(v_{i}\). Let \(v_{a}\) be in the \((h-1)\)-th hop and \(v_{b}\) in the h-th hop. If \(v_{a}\) and \(v_{b}\) are the mutual first neighbours, this path stops walking at \(v_{b}\), i.e., a path will stop walking when a node in it meets its mutual first neighbour in the next step, or when the path reaches the l-th hop. From the first to l-th hop, the feature propagation influence decreases.

Definition 5

(Influential Node) The target node \(v_{i}\) can walk l steps to find its influential node \(v_{i+l}\), which identifies a node with maximum degrees (i.e., degree centrality) and/or maximum density (i.e., density centrality) in the view-based subgraph of \(v_{i}\).

Definition 6

(Influence-aware Propagation Network) The influence-aware propagation network \(G^{\prime \prime }(v_{i}) = (V^{\prime \prime }(v_{i}), E^{\prime \prime }(v_{i}),\phi , \psi )\) of a node \(v_{i}\) is a directed graph, which consists of several paths. A path starts from \(v_{i}\) to a influential node \(v_{i+l}\) through the shortest distance in the M-th view, which is in the form of \({m_{i}^{M}} \longrightarrow m_{i+1}^{M} \longrightarrow \cdots \longrightarrow m_{i+l}^{M}\). From \({m_{i}^{M}}\) to \(m_{i+l}^{M}\), the feature propagation influence decreases.

3.2 Heterogeneous affinity graph construction

We construct the heterogeneous affinity graph \(G = (V, E, \phi , \psi )\) from multiple views of images based on feature similarity. An edge in our graph indicates the possibility of two nodes having the same label. The graph G consists of \(M+1\) homogeneous affinity graphs, which are related by the connections between the original view and each descriptor view, i.e., there are \(M+2\) edge types in G. For each node \(v_{i}\) and the original feature \(\mathbf {x}_{i}\), its u-th view feature is \(\phi _{u}\left( \mathbf {x}_{i}\right)\). According to these features, we adjust instance pivot subgraph (IPS)  [13] to build the heterogeneous affinity graph G following the steps mentioned below.

Step 1: Feature extraction. In a single-view image dataset, given a node \(v_{i}\), we utilise M different descriptors (e.g., Gabor [1], HOG [3]) to generate multiple views of \(v_{i}\) - Fig. 2(a). This results in \(M+1\) views of \(v_{i}\) and different view-based feature vectors, including the original \(\varvec{x}_{i}\), where the n-th view-based feature vector of \(v_{i}\) is denoted as \(\varvec{x}_{i}^{n}\). We note that, for benchmark multi-view image datasets [57, 58], standard multiple views of images are already available.

Step 2: Neighbourhood construction. In each view, we utilise h-hop kNN to build its neighbourhood-based subgraph. Let \(k_{t}\) denote the k nearest neighbours at the t-th hop, where \(t = 1,2,...,h\). As t increases, the neighbourhood influence towards \(v_{i}\) decreases. Hence, the number of connecting nearest neighbours \(k_{t}\) decreases as well. We add graph edges and their weights along with the neighbourhood discovery. For instance, for an edge \(e_{i,j}\) between \(v_{i}\) and \(v_{j}\), their distance \(d_{i,j}\) is computed using their sparse construction error \(c_{i,j}\) as

$$\begin{aligned} d_{i,j}=\left\| \varvec{x}_{i}-c_{i,j} \varvec{x}_{j}\right\| _{2}^{2}. \end{aligned}$$
(5)

Then, the weight between \(v_{i}\) and \(v_{j}\), i.e. \(w_{i,j}\) is defined as

$$\begin{aligned} w_{i,j}= {\left\{ \begin{array}{ll}1 &{} \text{ if } i=j \\ 1-\left( s_{i j}+s_{j i}\right) / 2 &{} \text{ if } i \ne j\end{array}\right. }, \end{aligned}$$
(6)

where \(s_{ij}\) is the similarity score (i.e., Euclidean distance) between node \(v_{i}\) and \(v_{j}\), and \(s_{ij} = s_{ji}\). Thus, we get the homogeneous affinity graph of each view. To constitute these affinity graphs as a heterogeneous affinity graph \(G = (V, E, \phi , \psi )\) - Fig. 2(b), we connect each node \(v_{i}\) with its corresponding other view-based nodes, where each edge weight is kept 1.

Step 3: Node density calculation. Based on the constructed heterogeneous affinity graph, we define the density for a node \(v_{i}\) in the graph as

$$\begin{aligned} \rho _{v_{i}}=\frac{1}{\vert \mathcal {N}_{k_{1}}(v_{i})\vert } \sum _{v_{j} \in \mathcal {N}_{k_{1}}(v_{i})} \tilde{\varvec{x}_{i}}^{\intercal } \tilde{\varvec{x}}_{j}, \end{aligned}$$
(7)

where \(\mathcal {N}_{k_{1}}(v_{i})\) is the first hop k nearest neighbours of the node \(v_{i}\), and \(\tilde{\varvec{x}_{i}}\), \(\tilde{\varvec{x}_{j}}\) are the \(\ell _{2}\)-normalised feature embeddings of node \(v_{i}\) and \(v_{j}\). According to this formula, the density of \(v_{i}\) is equal to the average of its similarity with its neighbours. Higher density nodes are considered more discriminative and are more influential in identifying cluster centres.

3.3 Local feature propagation encoder

The Local Feature Propagation (LFP) in our technique governs the interaction among the neighbourhood nodes. We aim to learn the propagated feature embedding of node \(v_{i}\) through its neighbourhood, i.e. to find its neighbours with similar features. Here, we conduct feature propagation on each view-based homogeneous affinity graph, then process them with cross-view contrastive learning - Fig. 2(c). General aggregation strategies like mean-pooling and max-pooling cannot identify if nodes are important, i.e., their mutual first neighbours cannot be emphasised. We employ the following two steps to obtain the embedding of each node \(v_{i}\) from the sight of local neighbourhood.

Step 1: Feature propagation. Based on  (2) and discussion in Section 2.2, we incorporate influence of the first neighbours in feature propagation. To explain the concept, we take the n-th view-based node \({m_{i}^{n}}\) as an example. Its feature propagated embedding, which considers feature information and structural information simultaneously by emphasising the importance of mutual first neighbours, is computed as

$$\begin{aligned} \widetilde{\varvec{X}({m_{i}^{n}})}= & {} \mathcal {W}_{1} \varvec{X}({m_{i}^{n}})+\mathcal {W}_{2} \sum _{{m_{j}^{n}} \in \mathcal {N}({m_{i}^{n}})} \frac{1}{d_{i}} \widetilde{\varvec{X}({m_{j}^{n}})}\cdot \gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]} \nonumber \\+ & {} \mathcal {W}_{3} \sum _{{m_{j}^{n}} \in \mathcal {N}({m_{i}^{n}})} {\varvec{X}({m_{j}^{n}})} \cdot \gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]}, \end{aligned}$$
(8)

where \(\gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]}\) is the indicator function for the mutual first neighbour enhancement during feature propagation. It governs the weight of the first neighbour’s influence towards the target node \({m_{i}^{n}}\). In  (8) \(d_{i} \in D\), where D is the degree matrix. Other notations follow from Section 2.2. For conciseness, the text below also avoids repeating explanation of other notations.

Step 2: Cross-view contrastive learning. After having the feature propagated embeddings of nodes in different M views, we feed them to a shared MLP which has a hidden layer to prepare them for the contrastive loss. In this case, we employ the conventional tactic of designating positive and negative samples. In other words, some view-based nodes that are formed from the same originating node form positive node pairs, whilst others form negative node pairs. However, our method aims at multi-view contrastive learning. Therefore, for the propagated n-th view-based node \(\widetilde{{m_{i}^{n}}}\), we define the following contrastive loss for LFP

$$\begin{aligned} \ell (\widetilde{{m_{i}^{n}}}) = -\log \frac{\sum \limits _{k = 0}^{M}\text {exp}(\text {sim}\left( \widetilde{{m_{i}^{n}}}, \widetilde{{m_{i}^{k}}}\right) / \tau )}{\sum \limits _{k=0}^{M}\sum \limits _{v_{b} \in \mathcal {N}(v_{i})}\mathbbm {1}_{[i \ne b]}\text {exp}(\text {sim}\left( \widetilde{{m_{i}^{n}}}, \widetilde{{m_{b}^{k}}}\right) / \tau )}, \end{aligned}$$
(9)

where \(\mathcal {N}(v_{i})\) indicates the nodes in the \(v_{i}\)-oriented subgraph, and \(\mathbbm {1}_{[i \ne b]} \in \{0,1\}\) is an indicator function that equals to 1 if \(i \ne b\). Then, the overall cost objective is given as

$$\begin{aligned} \mathcal {J}=\frac{1}{M\cdot \vert \mathcal {N}(v_{i})\vert } \sum \limits _{n=0}^{M}\ell (\widetilde{{m_{i}^{n}}}). \end{aligned}$$
(10)

Through the cross-view contrastive objective, we optimise the encoder via back-propagation and learn the embeddings \(z_{i}^{\text {LPF}}\) of nodes for \(v_{i}\).

3.4 Influence-aware feature propagation encoder

We also aim to learn the embedding of node \(v_{i}\) under influence-aware feature propagation (IFP) - Fig. 2(d). For the target node \(v_{i}\), different views contributes differently to its embedding. The embedding can not only be affected by the nearest neighbours, but also by the influential nodes in its neighbourhood of various views. Thus, we devise the influence-aware feature propagation encoder at node-level and view-level to hierarchically aggregate underlying information through the shortest paths from the target node \(v_{i}\) towards the influential node \(v_{i+l}\) (excluding the edge between \(v_{i}\) and \(v_{i+l}\)) in different view-based subgraphs (see Def. 6). Take the n-th view as an example, we define a path from \({m_{i}^{n}}\) to its influential node \(m_{i+1}^{n}\) as \(p ({m_{i}^{n}}) = \{{m_{i}^{n}},\cdots ,m_{i+l}^{n}\}\) (see Def. 6).

Each path is processed via a GCN-based encoder to learn feature propagation. As shown Fig. 2(d), for each node (as a target node), its corresponding paths in various views will be the input. Having the path \(p ({m_{i}^{n}}) = \{{m_{i}^{n}},\cdots ,m_{i+l}^{n}\}\), we define the affinity matrix \(\varvec{A}({m_{i}^{n}})\in \mathbb {R}^{\vert p \left( {m_{i}^{n}}\right) \vert \times \vert p ({m_{i}^{n}})\vert }\) with the initial feature matrix denoted as \(\varvec{X}({m_{i}^{n}})\). In the k-th layer of GCN, we update the feature matrix as

$$\begin{aligned} \varvec{X}^{k}\left( {m_{i}^{n}}\right)= & {} \sigma (\alpha \cdot \varvec{X}^{k-1}\left( {m_{i}^{n}}\right) \nonumber \\+ & {} (1-\alpha ) \varvec{D}^{-1}\left( {m_{i}^{n}}\right) \varvec{A}\left( {m_{i}^{n}}\right) \varvec{X}^{k-1}\left( {m_{i}^{n}}\right) \mathcal {W}^{k-1}), \end{aligned}$$
(11)

where \(\varvec{X}^{k-1}\left( {m_{i}^{n}}\right)\) denotes the updated features of the k-1-th GCN layer for all nodes on the path of \(\left( {m_{i}^{n}}\right)\), \(\varvec{D}^{-1}\left( {m_{i}^{n}}\right)\) is the diagonal matrix with \(\varvec{D}_{i, i}\left( {m_{i}^{n}}\right) =\sum _{j} \varvec{A}_{i, j}\left( {m_{i}^{n}}\right)\), \(\sigma\) is the \(\text {ReLU}\) activation, \(\alpha\) is a learnable parameter that balances the importance of the updated features, and \(\mathcal {W}^{k-1}\) is the transformation parameter. Through GCN, we can receive the embedding of \({m_{i}^{n}}\) as \({h_{i}^{n}}\). Next, we employ attention mechanism in view-level to hierarchically aggregate context information from other views to the target node \(v_{i}\). We firstly compute the importance of the n-th view as

$$\begin{aligned} \beta _{n}=\frac{\exp \left( \frac{1}{\vert V \vert } \sum _{v_{i} \in V} \varvec{A}_{\text {IFP}}^{\top } \cdot \tanh \left( \mathcal {W}_{\text {IFP}} {h_{i}^{n}}+\mathcal {B}_{\text {IFP}}\right) \right) }{\sum \limits _{j=1}^{M} \exp \left( \frac{1}{\vert V \vert } \sum _{v_{i} \in V} \varvec{A}_{\text {IFP}}^{\top } \cdot \tanh \left( \mathcal {W}_{\text {IFP}} {h_{i}^{j}}+\mathcal {B}_{\text {IFP}}\right) \right) } \end{aligned}$$
(12)

where \(\varvec{A}_{\text {IFP}}^{\top }\) indicates the transpose of affinity graph \(\varvec{A}({m_{i}^{n}})\), \(\tanh (\cdot )\) is the hyperbolic tangent function, \(\mathcal {W}_{\text {IFP}}\) are learnable parameters for IFP, and \(\mathcal {B}_{\text {IFP}}\) denotes view-level attention. Then, we compute the final embedding as follows

$$\begin{aligned} z_{i}^{\text {IPF}}=\sum _{n=1}^{M} \beta _{n} \cdot {h_{i}^{n}}. \end{aligned}$$
(13)

3.5 Dual-context contrastive graph learning

To maximise the mutual information between each pair of embeddings generated from LFP and IFP, we design a dual-context contrastive loss. To that end, we extend the definition of positive samples - Fig. 2(e). That is, for a node \(v_{i}\), not only its LFP and IFP embeddings (i.e., \(z_{i}^{\text {LFP}}\) and \(z_{i}^{\text {IFP}}\)) are mutually positive, but also the embeddings of its mutual first neighbour \(v_{j}\) (i.e., \(z_{j}^{\text {LFP}}\) and \(z_{j}^{\text {IFP}}\)) would be considered as its positive samples. We denote the positive sample set of \(v_{i}\) as \(\mathbb {P}_{i}\). This aims to contribute to nailing down clustering centres. According to the extension of positive sample definition, we formulate the dual-context contrastive loss function as

$$\begin{aligned} \mathfrak {L}(v_{i})=\mathfrak {L}(z_{i}^{\text {LFP}})+\mathfrak {L}(z_{i}^{\text {IFP}}). \end{aligned}$$
(14)

We describe the computation of \(\mathfrak {L}(z_{i}^{\text {LFP}})\) below. The \(\mathfrak {L}(z_{i}^{\text {IFP}})\) is computed analogously. We let

$$\begin{aligned} \mathfrak {L}(z_{i}^{\text {LFP}}) = -\log \frac{\sum \limits _{z_{j} \in \mathbb {P}_{i}}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{j}\right) / \tau )}{\text {Neg.}}. \end{aligned}$$
(15)

In (15), \(\text {Neg.}\) refers to contrasting against negative samples, which is defined as

$$\begin{aligned} \text {Neg.}= & {} \sum \limits _{k=1}^{N}\mathbbm {1}_{[i \ne k]}\mathbbm {1}_{[z_{k}^{\text {LFP}} \notin \mathbb {P}_{i} ]}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{k}^{\text {LFP}}\right) / \tau )\nonumber \\+ & {} \sum \limits _{a=1}^{N}\mathbbm {1}_{[i \ne a]}\mathbbm {1}_{[z_{a}^{\text {LFP}} \notin \mathbb {P}_{i}]}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{a}^{\text {IFP}}\right) / \tau ), \end{aligned}$$
(16)

where \(z_{a}^{\text {IFP}}\) represents the IFP embedding of other nodes for contrasting, and two indicators separate out the negative samples. The overall contrastive objective of maximising mutual information is then

$$\begin{aligned} \mathcal {J}=\frac{1}{N} \sum _{i=1}^{N}\left[ \theta \cdot \mathfrak {L}(z_{i}^{\text {LFP}})+(1-\theta )\cdot \mathfrak {L}(z_{i}^{\text {IFP}})\right] , \end{aligned}$$
(17)

where hyper-parameter \(\theta\) controls the relative importance of the two embeddings.

4 Evaluation

4.1 Experimental Setup

Datasets: To establish the effectiveness of our technique, we perform bechmarking of our method on popular standard multi-view image datasets.

  1. 1.

    COIL-20 [57] has 1,440 gray-scale images of 20 classes and each class contains 72 images. Each image has been extracted under 3 views, where the first is a 1024-dimensional intensity feature, the second is a 3304-dimensional LBP feature, and the third is a 6750-dimensional Gabor feature.

  2. 2.

    Caltech7 [58] contains 1,474 images of 7 classes. Each image has 6 views of features extracted. These views include 48-dimensional Gabor feature, 40-dimensional WM feature, 254-dimensional CENTRIST feature, 1984 dimensional HOG feature, 512-dimensional GIST feature, and 928-dimensional LBP feature.

  3. 3.

    CASIA-WebFace (CWF) [59] is cleaned to contain 466,169 single-view face images of 10,575 real identities collected from the web. To obtain a relatively large-scale dataset while constraining the image numbers for the available hardware, we select 18 classes of 10,791 images to conduct our experiments. To generate multi-view data, we utilise the 512-dimensional feature vectors extracted by CNN architecture to obtain 3 views (512-dimensional color histogram, 26-dimensional LBP and 512-dimensional Gabor) from each image.

Baseline Methods: For benchmarking, we compare with the following six existing techniques that use a variety of approaches for multi-view image clustering.

  1. 1.

    MIC [60] adopts the same strategy used by best single view (BSV) [61] to fill missing instances and then learns a non-negative low-dimensional consensus representation for all views. K-means is applied to the learned consensus representation for final clustering.

  2. 2.

    DCCAE [62] extends the method of combining the deep encoder with Canonical Correlation Analysis by introducing a deep decoder, which has deep canonically correlated auto-encoders coordinating and extracting graph representations in a pairwise manner.

  3. 3.

    MVGL [27] learns individual graphs and then integrate the multiple learned graphs into a global graph with exactly k components.

  4. 4.

    MCGC [28] learns a consensus graph by minimising the disagreement between different views and constraining the rank of the Laplacian matrix.

  5. 5.

    RHLC-CAGL [31] automatically captures a latent common Laplacian that is shared by all views.

  6. 6.

    HeCo [47] is a self-supervised heterogeneous GNN method, which leverages meta-path to extract two views for contrastive learning.

Implementation Details: In GoMIC, for the single-view image dataset (i.e., CASIA-WebFace), we utilise commonly used feature descriptors, i.e., intensity, LBP and Gabor, to generate multiple views of each image. There are 2 hyperparameters for heterogeneous affinity graph construction in our method, namely, the number of hops \(\mathbbm {h}\), and the number of each node’s nearest neighbours in each hop \(\{k_{i}\}, i = 1, 2, ..., \mathbbm {h}\). \(\mathbbm {h}\) and \(\{k_{i}\}\) are defined according to the used dataset. We set \(\mathbbm {h} = 4, \{k_{i}\} = \{10, 5, 1, 1\}\) for COIL-20, \(\mathbbm {h} = 4, \{k_{i}\} = \{20, 5, 1, 1\}\) for Caltech7, and \(\mathbbm {h} = 4, \{k_{i}\} = \{50, 3, 1, 1\}\) for CASIA-WebFace. Also, while setting up the GCN-based contrastive model, we optimise the learning rate in the range [\(1e-4\), \(5e-3\)]. For the dropout function in encoding, we used the range [0.1, 0.5] with a step size 0.05, and \(\tau\) is tuned in the range [0.5, 0.9] with a step size 0.05, and \(\alpha\) and \(\theta\) are both tuned in the range [0.1, 0.9] with a step size 0.1. Moreover, encoders only conduct aggregation once, i.e., we use 2-layer GCN on LFP and IFP. At the end of GoMIC, the images to be clustered are represented as a graph. Each edge in the graph is associated with a similarity weight in [0, 1]. To generate the clusters, we visit every node and only preserve its neighborhood nodes with the largest weight, i.e., the other neighbor nodes are disconnected from the node. Thus, the clusters get formed in an efficient manner.

Table 1 Comparison of clustering performance on three datasets. NMI, PUR and ACC respectively stand for normalised mutual information, purity and accuracy. Self-supervised graph learning-based methods are green highlighted. The best results in each column are bold faced

4.2 Comparison with the existing state-of-the-art

Table 1 reports the results of several popular multi-view clustering methods and a state-of-the-art self-supervised heterogeneous contrastive graph learning technique, HeCo. The results reveal that multi-view clustering methods which make use of more views usually achieve higher performance. This explains why MIC [60] and DCCAE [62] generally have inferior clustering results. Since MIC relies on the best single view to conduct representation learning, and DCCAE correlates view-based graphs for embeddings pair-by-pair, they are not able to perform as well as other methods. In contrast, methods proposed more recently (MVGL [28], MCGC [28] and RHLC-CAGL [31]) aim to leverage more information from multiple views, which achieves better performance. In the table, HeCo [47] deals with natural heterogeneous networks instead of directly dealing with multi-view images. Nevertheless, it performs reasonably well on this benchmark due to the suitability of heterogeneous graphs to the problem. This is inline with our intuition of exploiting heterogeneous properties of image views for clustering. It is therefore not surprising that the performance of our method, GoMIC, outperforms these powerful baselines. Our approach not only contrasts numerous views but also makes use of two recently developed feature propagation encoding schemes for enhanced contrastive learning.

Table 2 Ablation study results for GoMIC. GoLFP only uses LFP encoder. GoIFP uses only the IFP encoder and GoMIC-n/e does not use the extended definition of positive samples, but uses both LFP and IFP. Difference of these variants to GoMIC highlights the contribution of the components
Fig. 3
figure 3

Parameter analysis on Caltech7 [58]. The best performance is achieved when both \(\alpha\) and \(\theta\) are chosen in the range \(\{0.3,0.6\}\)

4.3 Ablation study and parameter analysis

GoMIC encodes multi-view based graphs with two innovative encoding schemes, namely, LFP and IFP. Also, for better clustering centres, we adjust the contrastive loss function by extending the positive sample definition. To understand the impact of these factors in the overall performance of our technique, we introduce three variants of GoMIC to conduct an ablation study. These three variants are respectively denoted as: (1) GoLFP, which contains only LFP as the encoder; (2) GoIFP, which only contains IFP as the encoder and (3) GoMIC-n/e, which has no extension of the positive sample definition. The results of these variants on all three datasets are summarised in Table 2. In the table, we can observe that GoMIC eventually outperforms all these variants by a considerable margin, establishing the benefits of synergising these proposed components. Furthermore, the performances of GoLFP and GoIFP decrease differently when applied to different datasets. This highlights that, for different cases, LFP and IFP are able to contribute differently to the overall performance. A consistent gain of GoMIC over GoMIC-n/e also ascertains the importance of our positive sample definition extension. From the parameters viewpoint, GoMIC has two major hyper-parameters, \(\alpha\) and \(\theta\). We show the influence of adjusting their values on performance in Fig. 3. The chosen range values are \(\{0.1, 0.3, 0.6, 0.9\}\).

5 Conclusions

In order to understand and exploit relationships within image datasets from each node’s local neighbourhood and influence-aware context, we introduced an innovative multi-view image clustering method called GoMIC. GoMIC takes advantage of the heterogeneous properties of multi-view image data under contrastive graph learning. To extract and exploit more underlying information, we devised two strategies to encode the graphs, Local Feature Propagation (LFP) and Influence-aware Feature Propagation (IFP), to represent each node-based subgraph in two contrasting contexts. Also, we employed two contrastive loss functions, and adjusted them to fit the use of LFP and IFP. The first loss function aims to integrate multiple view-based LFP embeddings, and the second nails down the clustering centres with an extended positive sample definition for contrastive graph learning. Experimental results show that our proposed method consistently outperforms the state-of-the-art methods on multi-view clustering benchmarks. Also, our ablation study demonstrates explicit contribution of each novel aspect in our overall technique. Currently, the framework works with the common assumption of balanced data. In the future, we will extend it to also handle imbalanced data.