Introduction

With the development of Internet technology and the widespread use of social media, more and more people share their lives on the Internet. Users tag images on social media according to their interests and cultural backgrounds, generating a large amount of social image data. A social image is a kind of cross-modal data consisting of a visual image and textual tags. Obviously, the visual images and user tags are heterogeneous in their low-level features but related in their high-level semantics. Integrating the visual and semantic information of social images can improve the performance of social image clustering [1,2,3,4,5,6,7].

Fig. 1

The framework of the KAPC method. For the image modality, KAPC employs a deep neural network to extract image features. For the tag modality, KAPC extracts the structured knowledge of tags from a knowledge graph to construct a semantic similarity matrix, which introduces human prior knowledge and completes missing semantics. A progressive iteration method is then proposed to bridge the heterogeneous gap between the modalities, allowing them to guide each other and yielding progressively refined clustering results for social images

There are two key issues in the task of social image clustering. The first is the heterogeneous gap between visual images and textual tags [8,9,10,11,12,13,14]. Finding the correlation and shared information of multiple modalities is the intuitive way to bridge this gap. A common approach is to find a shared subspace for multiple modalities, mapping the features of different modalities into a common subspace through a learned mapping rule [4,5,6, 10, 11, 15,16,17]. However, traditional subspace learning methods focus only on the low-level features of heterogeneous data and ignore the semantic information contained in the features of each modality, which degrades performance. In addition, the amount of visual information in a social image is significantly larger than the amount of textual information. Mapping the information of the two imbalanced modalities directly into a subspace gives the visual information too large a weight to achieve satisfactory results.

Secondly, the tags provided by users are sparse and incomplete, because users make errors and omissions and exhibit personal tendencies when tagging. In recent years, embedding learning based on knowledge graphs has attracted extensive attention [18,19,20]; such methods model tag relationships extracted from human knowledge as low-dimensional embedding vectors. However, most existing embedding learning methods consider only the relationships between entities and ignore the large amount of background information that exists in the interpretations of tags, which undoubtedly results in the loss of necessary information [21].

In this work, we propose a novel knowledge-aware progressive clustering (KAPC) method, which uses human prior knowledge to guide the clustering process of social images, as illustrated in Fig. 1. First, to cope with the problem of sparse and incomplete tags, we design a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge from the knowledge graph. Specifically, we consider both the entity relations and the background knowledge of tags, and compute two different semantic similarities simultaneously to mine human knowledge. The complemented tags are formalized as a semantic similarity matrix and treated as the representation of the textual information. Secondly, we define an objective function based on information theory to bridge the heterogeneous gap, which involves a mutual information loss and a cross-entropy loss. The former is designed to preserve intra-modal knowledge, while the latter aligns the inter-modal cluster distributions to explore the correlation between visual and textual information. Finally, a progressive iteration method is designed to make the textual and visual information guide each other, based on an efficient “draw-and-merge” process that converges the KAPC objective function to a local optimum. Extensive experiments are conducted on four large-scale real-world social image datasets to verify the superiority of the KAPC approach.

In summary, the main contributions of this work are as follows:

  • We propose a knowledge-aware progressive clustering (KAPC) method, which employs human knowledge to guide the cross-modal clustering of social images.

  • We design a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge in the knowledge graph, which considers the entity relations and background knowledge of tags and computes two different semantic similarities simultaneously to mine human knowledge.

  • We define an objective function based on information theory to achieve intra-modal knowledge retention and inter-modal correlation exploration.

  • A progressive iteration method is designed to make textual and visual information guide each other and converge the objective function to a local optimum.

Related work

A massive number of methods have been proposed for social image clustering. In this section, we briefly review work related to the heterogeneous gap and the tag sparsity problem in social images.

Heterogeneous gap The heterogeneous gap between different modalities is an issue of great concern, and many methods have been proposed to explore it [6, 10, 22, 23]. The most popular approach is subspace learning, which aims to learn shared subspace representations for different modalities to explore the relationship between heterogeneous data. At present, subspace learning approaches have shown their effectiveness on cross-modal learning tasks [6, 10, 23]. [23] introduces a deep model-structure autoencoder that maps input data into a nonlinear subspace while preserving the global and local subspace structure. [6] proposes a consistent and specific subspace learning method that maps multiple views into a subspace to learn the information shared across views. However, traditional subspace learning methods focus only on low-level features and do not consider the relationships between the high-level semantics of each modality, which degrades performance.

Tag sparsity problem To solve the problem that the tags of social images are sparse and incomplete, researchers have proposed several methods, which can be roughly divided into tag completion [24, 25] and tag refinement [2]. Tag completion methods generate tags for unlabeled images by computing visual similarities between images. The purpose of tag refinement is to obtain more accurate, higher-quality tags.

With the rapid development of knowledge graphs, embedding learning based on the knowledge graph is attracting more and more attention [18,19,20]. [18] preserves the structural knowledge and uncertainty information of relation facts in the embedding space, and designs a credibility score to learn the uncertainty information. [19] proposes a label propagation mechanism that uses a recurrent neural network to propagate labels in a characteristic knowledge graph and finally learns a weighted representation vector for each label in the network. However, the above embedding learning methods consider only the relationships between entities in the knowledge graph and ignore the background information, which may contain key relationships and clues about the semantics of tags.

Knowledge-aware progressive clustering method

The proposed KAPC is a cross-modal clustering method based on mutual guidance between the visual and textual modalities. At the beginning of KAPC, we propose a dual-similarity semantic expansion strategy that complements the sparse tags with rich human knowledge, constructing a semantic similarity matrix that extends the tags with the entity relationships and background knowledge of tags in a knowledge graph. Then the visual and textual modalities guide each other to complete the progressive clustering. Specifically, we randomly designate one modality as the guidance modality and the other as the iteration modality. Our goal is to guide the clustering process of the iteration modality with the clustering structure of the guidance modality. To achieve this goal, we design an objective function based on mutual information and cross-entropy, in which the mutual information measures the preservation of the iteration modality's information and the cross-entropy explores the correlation between the clustering structures of the two modalities. By optimizing this objective function, we obtain a clustering structure that captures the correlation between the two modalities while maximally preserving the single-modality information. To fully integrate the information of the two modalities, we exchange the iteration and guidance modalities and progressively obtain better clustering results in each clustering pass.

Problem formulation

In this work, two modalities are collected from the social image dataset, namely the image modality Img and the tag modality Tag. To make the two modalities guide each other, we randomly assign them the roles of iteration modality \(X_{iter}\) and guidance modality \(X_{guid}\). For \(X_{iter}\), we denote its original data, features, and clustering structure by X, Y, and T. The clustering structure of \(X_{guid}\) is denoted by K. The purpose of the KAPC method is to learn the clustering result T of the iteration modality under the constraint of K. In other words, the task of the KAPC method is to find an optimal cluster assignment p(t|x) of X to T under the guidance of K. The notations are displayed in Table 1.

Table 1 Nomenclature

Dual-similarity semantic expansion

Due to the sparsity and incompleteness of tags, it is difficult to learn social image knowledge directly by cross-modal clustering. We propose a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge, which consists of an interpretation similarity and a concept similarity. The interpretation similarity reflects the rich background knowledge of tags in their natural language interpretations. The concept similarity represents the entity relationships in the knowledge graph. The interpretation similarity and concept similarity of words \(S_1\) and \(S_2\) are defined as \(simi_{IS}(S_1,S_2)\) and \(simi_{CS}(S_1,S_2)\), respectively. In this study, we calculate both similarities through WordNet [26].

Fig. 2

The illustration of the construction process of the feature vector. Our aim is to expand the semantics of each word and learn its feature vector through WordNet. First, the interpretation of the word is queried in WordNet and combined into a one-level vector. Next, the interpretation of each word in the one-level vector is queried and accumulated to form the feature vector

To calculate \(simi_{IS}(S_1,S_2)\), we need to construct semantic features for \(S_1\) and \(S_2\) through WordNet. Taking \(S_1\) as an example, we first initialize a semantic space consisting of all the entities in WordNet, whose base element is the number of occurrences of each entity. Then, we query the interpretation of \(S_1\) in WordNet, collect the entities in the paraphrases, and update their occurrence counts in the semantic space to form the feature vector of \(S_1\). To fully explore the background knowledge of \(S_1\), we perform the “query-collect” operation again for each entity in the feature vector of \(S_1\) and update the counts of the newly obtained entities in the semantic space to obtain the final feature vector of \(S_1\). Finally, we calculate the cosine similarity of the feature vectors of \(S_1\) and \(S_2\) to obtain \(simi_{IS}(S_1,S_2)\). Figure 2 shows the construction process of the feature vector.
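To make the construction concrete, the following is a minimal sketch of the two-level “query-collect” expansion and the resulting interpretation similarity, assuming NLTK's WordNet corpus as the knowledge source; the regex tokenizer and the function names are illustrative choices of ours, not the exact implementation.

```python
import re
from collections import Counter

from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus


def gloss_words(word):
    """Collect the words appearing in the WordNet glosses (interpretations) of a word."""
    words = []
    for synset in wn.synsets(word):
        words.extend(re.findall(r"[a-z]+", synset.definition().lower()))
    return words


def semantic_vector(word):
    """Two-level 'query-collect' expansion: count the gloss words of the word
    (one-level vector), then add the gloss words of each first-level word."""
    space = Counter(gloss_words(word))
    for w in list(space):
        space.update(gloss_words(w))
    return space


def simi_is(s1, s2):
    """Interpretation similarity: cosine similarity of the two count vectors."""
    v1, v2 = semantic_vector(s1), semantic_vector(s2)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = sum(c * c for c in v1.values()) ** 0.5
    norm2 = sum(c * c for c in v2.values()) ** 0.5
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```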

Then, we use the node distance between \(S_1\) and \(S_2\) in WordNet to calculate \(simi_{CS}(S_1,S_2)\). Supposing the node distance between words \(S_1\) and \(S_2\) is \(length(S_1,S_2)\), we obtain \(simi_{CS}(S_1,S_2)\):

$$\begin{aligned} simi_{CS}(S_1,S_2) = - \log \left( \frac{\textrm{length}(S_1,S_2)}{2\times \max \limits _{c \in WordNet} dep(c)}\right) \end{aligned}$$
(1)

where \(dep(c)\) is the node depth of entity c in WordNet, and the maximum depth \(\max \limits _{c \in WordNet} dep(c)\) is set to 16.
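A minimal sketch of Eq. (1) with NLTK follows, assuming the node distance is the shortest path length between any synsets of the two words; the clamp that keeps the logarithm finite for identical words is our own numerical guard.

```python
import math

from nltk.corpus import wordnet as wn

MAX_DEPTH = 16  # maximum node depth in WordNet, as set in the text


def simi_cs(s1, s2):
    """Concept similarity of Eq. (1): -log(length / (2 * max depth))."""
    distances = []
    for a in wn.synsets(s1):
        for b in wn.synsets(s2):
            d = a.shortest_path_distance(b)
            if d is not None:
                distances.append(d)
    if not distances:
        return 0.0  # no path between the words in WordNet
    length = max(min(distances), 1)  # clamp so identical words do not yield log(0)
    return -math.log(length / (2 * MAX_DEPTH))
```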

After obtaining the interpretation similarity and the concept similarity, we combine them into a total similarity \(simi(S_1,S_2)\) between two words based on their part of speech. If \(S_1\) and \(S_2\) are nouns, \(simi(S_1,S_2)\) is defined as the average of the two similarities; otherwise, \(simi(S_1,S_2)\) is defined as the interpretation similarity. Supposing the feature words of tag A are {\(a_1,a_2,...,a_n\)} and the feature words of tag B are {\(b_1,b_2,...,b_m\)}, the similarity between A and B is:

$$\begin{aligned} \begin{aligned}&sim(A,B)=\frac{\sum _{i=1}^n simi(a_i,b) + \sum _{i=1}^m simi(a,b_i)}{2} \end{aligned} \end{aligned}$$
(2)

where

\(simi(a_i,b) = \max \limits _{j=1,2,...,m} simi(a_i,b_j)\), \(simi(a,b_i) = \max \limits _{j=1,2,...,n} simi(a_j,b_i)\)

Eq. 2 formalizes the sparse tags of social images as a semantic similarity matrix, which is treated as the representation of the tags in what follows.
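Putting the pieces together, the following sketches the word-level combination rule and Eq. (2); it reuses the simi_is and simi_cs sketches above, and the noun test via WordNet synsets is our reading of “based on the part of speech”.

```python
from nltk.corpus import wordnet as wn


def simi(w1, w2):
    """Combined word similarity: average of both similarities for nouns,
    interpretation similarity otherwise."""
    if wn.synsets(w1, pos=wn.NOUN) and wn.synsets(w2, pos=wn.NOUN):
        return 0.5 * (simi_is(w1, w2) + simi_cs(w1, w2))
    return simi_is(w1, w2)


def tag_similarity(tags_a, tags_b):
    """Eq. (2): sum of best-match similarities in both directions, halved."""
    a_to_b = sum(max(simi(a, b) for b in tags_b) for a in tags_a)
    b_to_a = sum(max(simi(a, b) for a in tags_a) for b in tags_b)
    return (a_to_b + b_to_a) / 2


# Evaluating tag_similarity over all image pairs yields the semantic
# similarity matrix used as the textual representation.
```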

Objective function

The purpose of the KAPC approach is to discover a good clustering of the social image collection. To achieve this goal, we maximize the intra-modal information of \(X_\mathrm{{iter}}\) on the one hand, and align the cluster distributions of \(X_\mathrm{{iter}}\) and \(X_\mathrm{{guid}}\) to explore the inter-modal correlation on the other hand. Specifically, we introduce the mutual information I(T, Y) and the cross-entropy H(T, K) to preserve intra-modal information and align inter-modal cluster distributions. We maximize I(T, Y) so that the clustering distribution T retains the information in the features Y, while minimizing H(T, K) aligns the inter-modal cluster distributions, which allows \(X_\mathrm{{guid}}\) to guide the clustering process of \(X_\mathrm{{iter}}\). Establishing the basic definitions for the objective function and its constraints is a great challenge [27,28,29]. Combining the cross-entropy and the mutual information, we propose the objective function of the KAPC method:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{\max }[p(t|x)] = I(T,Y) - H(T,K) \end{aligned} \end{aligned}$$
(3)

where the term I(T, Y) denotes the mutual information between the clustering partition T and the feature variable Y, which embodies the preservation of the single-modality information. H(T, K) is the cross-entropy of T and K, which embodies the constraint of the guidance modality on the iteration modality. By maximizing the objective function, KAPC finds a clustering structure T that conforms to the knowledge of both modalities, minimizing the difference between the guidance modality and the iteration modality while preserving the single-modality information.
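For concreteness, one way to evaluate Eq. (3) from empirical distributions is sketched below, assuming a discrete joint p(t, y) and hard cluster distributions; the exact estimators used by the method may differ.

```python
import numpy as np


def mutual_information(p_ty):
    """I(T, Y) from a joint distribution given as a |T| x |Y| array."""
    p_t = p_ty.sum(axis=1, keepdims=True)   # marginal p(t)
    p_y = p_ty.sum(axis=0, keepdims=True)   # marginal p(y)
    outer = p_t @ p_y                        # product of the marginals
    mask = p_ty > 0
    return float((p_ty[mask] * np.log(p_ty[mask] / outer[mask])).sum())


def cross_entropy(p_t, q_k):
    """H(T, K) = -sum_t p(t) log q(t): one reading of the alignment term."""
    mask = p_t > 0
    return float(-(p_t[mask] * np.log(q_k[mask] + 1e-12)).sum())


def kapc_objective(p_ty, p_t, q_k):
    """Eq. (3): L[p(t|x)] = I(T, Y) - H(T, K)."""
    return mutual_information(p_ty) - cross_entropy(p_t, q_k)
```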

Progressive iteration

Both the internal optimization and the overall optimization are great challenges [30,31,32,33,34]. We design a progressive iteration method to optimize the objective function of KAPC in two steps. First, it realizes the internal optimization of the iteration modality under the constraints of the guidance modality through a “draw-and-merge” process. Second, \(X_\mathrm{{iter}}\) and \(X_\mathrm{{guid}}\) are exchanged, and T is taken as the K of the next internal optimization of the iteration modality. These two steps are repeated until T no longer changes, which achieves the overall optimization of the objective function. In addition, the KAPC method can use a typical single-modality clustering method, such as Kmeans or sIB [35], to obtain the clustering structure K of the guidance modality during initialization.

Optimization in the iteration modality

To find an optimal cluster structure T of the iteration modality \(X_\mathrm{{iter}}\), we adopt a sequential “draw-and-merge” optimization approach that maintains a clustering partition T with exactly M clusters. The method first randomly divides the data in \(X_\mathrm{{iter}}\) into M clusters to initialize T. In the iterative phase, a data point \(x \in X_\mathrm{{iter}}\) is “drawn” from its current cluster and regarded as a new singleton cluster \(\{x\}\). To ensure that the number of clusters does not change, the singleton cluster must be merged into some cluster t to form \(\tilde{t}\), i.e., \(\{\{x\},t\} \Rightarrow \tilde{t}\). Therefore, we have:

$$\begin{aligned} \begin{aligned} p (\tilde{t} )&= p(x) + p(t) \\ p\left( Y|\tilde{t} \right)&= \frac{p(x)}{p(\tilde{t})} p(Y|x) + \frac{p(t)}{p(\tilde{t})} p(Y|t) \end{aligned} \end{aligned}$$
(4)

To optimize the objective function, its value should increase in each “draw-and-merge” step. The key issue in this process is how to choose the optimal cluster to merge into. The objective function before the “draw-and-merge” step is denoted \(F_\mathrm{{bef}}\) and the partition \(T_\mathrm{{bef}}\). After the singleton cluster \(\{x\}\) is merged into \(\tilde{t}\), the objective function is denoted \(F_\mathrm{{aft}}\) and the partition \(T_\mathrm{{aft}}\). The merger cost \(\textrm{cost}(\tilde{t})\), defined in Eq. 5, represents the reduction of mutual information in the merging process.

$$\begin{aligned} \textrm{cost}(\tilde{t}) =\Delta F = F_\mathrm{{bef}} - F_\mathrm{{aft}} = I(T_\mathrm{{bef}},Y) - I(T_\mathrm{{aft}},Y) - H(T_\mathrm{{bef}},K) + H(T_\mathrm{{aft}},K) \end{aligned}$$
(5)

It can be seen from Eq. 5 that \(H(T_\mathrm{{bef}},K)\) is unchanged in \(\textrm{cost}(\tilde{t})\) no matter which cluster is selected for merging, so we remove it to simplify the calculation. Combining Eq. 4 and Eq. 5, the merging cost can be written as:

$$\begin{aligned} \mathrm{{cost}}(\{x\},t)&= p(x) \sum _{y \in Y} p(y|x) \log {\frac{p(y|x)}{p(y)}} + p(t) \sum _{y \in Y} p(y|t) \log {\frac{p(y|t)}{p(y)}} \\&\quad -\sum _{y \in Y} p(y|x) \log {\frac{p(y|\tilde{t})}{p(y)}} -\sum _{y \in Y} p(t)\, p(y,\tilde{t}) \log {\frac{p(y,\tilde{t})}{p(y)}} + \sum _{i=1}^{\left| K \right| }H(x,k_{center_i}) \\&= p(x)\, D_{KL} [p(Y|x)\parallel p(Y|\tilde{t})] + p(t)\, D_{KL} [p(Y|t)\parallel p(Y|\tilde{t})] + \sum _{i=1}^{\left| K \right| }H(x,k_{center_i}) \\&= [p(x) + p(t)]\cdot JS_{\pi _1 \pi _2}[p(Y|x) \parallel p(Y|t)] + \sum _{i=1}^{\left| K \right| }H(x,k_{center_i}) \end{aligned}$$
(6)

where \(\pi _1 = \frac{p(x)}{p(x)+p(t)}\) and \(\pi _2 = \frac{p(t)}{p(x)+p(t)}\).
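The following sketches the dominant JS-divergence term of the merge cost in Eq. (6); the guidance term \(\sum _i H(x,k_{center_i})\) is omitted here for brevity, and the distribution arguments are assumed to be normalized NumPy arrays.

```python
import numpy as np


def kl(p, q):
    """D_KL(p || q) for discrete distributions; q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())


def merge_cost(p_x, p_t, p_y_given_x, p_y_given_t):
    """[p(x) + p(t)] * JS_{pi1, pi2}[p(Y|x) || p(Y|t)]: the information loss
    incurred by merging singleton {x} into cluster t (Eq. 6 without guidance)."""
    p_merged = p_x + p_t                                  # p(t~) from Eq. (4)
    pi1, pi2 = p_x / p_merged, p_t / p_merged
    p_y_merged = pi1 * p_y_given_x + pi2 * p_y_given_t    # p(Y|t~) from Eq. (4)
    js = pi1 * kl(p_y_given_x, p_y_merged) + pi2 * kl(p_y_given_t, p_y_merged)
    return p_merged * js
```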

Algorithm 1

The KAPC Algorithm

The last two steps of Eq. 6 use the Kullback–Leibler (KL) divergence [36] and the Jensen–Shannon (JS) divergence [36]. The KL divergence measures the degree of difference between two probability distributions. The JS divergence measures the probability that two distributions p(x) and q(x) come from the same underlying distribution.

Since \(JS_{\pi _1 \pi _2}\ge 0\) and \(H(x,K_\mathrm{{center}}) \ge 0\), we have \(\textrm{cost}(\{x\},t) \ge 0\), which means that each “merge” step causes a loss of mutual information; selecting the optimal cluster is therefore equivalent to finding the cluster that minimizes this reduction. The “draw-and-merge” procedure continues until the cluster to which each element x belongs no longer changes, which yields the clustering structure T of the single modality.

Overall optimization

After the “draw-and-merge” process converges, we obtain a clustering structure T of a single modality under the constraint of the guidance modality. To fully integrate the information of the two modalities, we exchange the guidance modality and the iteration modality and start a new “draw-and-merge” process in which T is regarded as the new K. This exchange continues until T no longer changes. Throughout this process, the value of the objective function is non-decreasing. KAPC thus progressively obtains a better T in each iteration, which is why it is called progressive iteration. The detailed steps and flowchart of KAPC are shown in Algorithm 1 and Fig. 3.
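The overall loop can be summarized by the following sketch; initial_clustering and draw_and_merge are placeholders for an off-the-shelf single-modality clusterer (e.g., Kmeans or sIB) and the inner optimization of the previous subsection, and the fixed iteration cap mirrors the convergence behavior reported in the experiments.

```python
def kapc(modality_a, modality_b, m_clusters, max_rounds=15):
    """Progressive iteration: the modalities alternate between the guidance
    and iteration roles until the partition T no longer changes."""
    k = initial_clustering(modality_a, m_clusters)  # placeholder: Kmeans or sIB
    x_iter, x_guid = modality_b, modality_a
    t = None
    for _ in range(max_rounds):
        t_new = draw_and_merge(x_iter, guidance=k, m=m_clusters)  # placeholder
        if t_new == t:                    # T unchanged: overall convergence
            break
        t = t_new
        x_iter, x_guid = x_guid, x_iter   # exchange the two roles
        k = t                             # current partition guides the next pass
    return t
```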

In addition, we discuss the advantages and disadvantages of the proposed KAPC method in Table 2. Here we focus on its limitations. In the dual-similarity semantic expansion phase, the quality of the knowledge graph has a significant impact on our method: on the one hand, semantic expansion is limited when the words of tags are not included in the knowledge graph; on the other hand, the quality of the learned semantic similarity matrix is also affected by the structure of the knowledge graph. In the progressive iteration stage, KAPC requires all modalities to share the same ground truth, i.e., the same clustering division, which is a prerequisite for the modalities to guide each other. Finally, KAPC applies to cross-modal clustering between two modalities; further exploration of the progressive iteration strategy is needed when the number of modalities increases.

Table 2 The advantages and disadvantages of the proposed KAPC method
Fig. 3

The flowchart of proposed KAPC algorithm

Complexity analysis

In this section, we give the complexity analysis of the KAPC method. When calculating the \(cost(\{x\},t)\) of each cluster in step 8, the time taken is O(M|T||Y|), where M is the number of clusters. Since this process is repeated n times (n is a finite constant), the total time complexity is O(nM|T||Y|).

Experiments

Table 3 Statistics of the five datasets with the number of clusters and tags
Table 4 The ACC (%) and NMI (%) of KAPC method compared with the single modality clustering method sIB
Table 5 The ACC (%) and NMI (%) of KAPC method compared with the single modality clustering method Kmeans

Experiments setup

In this work, we conduct extensive experiments on four publicly available and widely used datasets: NUS-WIDE [37], IAPR TC-12 Benchmark [38], MIRFlickr [39] and ESP-Game [40]. Table 3 gives the details of the datasets.

NUS-WIDE [37]. It is an image dataset created by the Laboratory of Media Search at the National University of Singapore, whose data are taken from real-world networks. The dataset includes 269,648 images and their associated tags. To increase the experimental diversity and remove excessive noisy data, we select 10,003 and 2969 images to compose the BigNUS and NUS datasets, respectively. Each dataset contains 6 classes with an average of 7 tags per image.

IAPR TC-12 benchmark [38]. It is a publicly annotated dataset produced by the Cross-Language Evaluation Forum, containing 20,000 images collected from the real world. Each image is accompanied by a short text description. After removing noisy images, we select 3095 images to build the IAPR dataset, which contains 6 classes with an average of 6 tags per image.

MIRFlickr [39]. It contains 25,000 images collected from the Flickr website, with 1386 user-provided tags that appear at least 20 times. It also provides ground-truth annotations for 38 concepts. After de-noising and de-duplication, we select 4920 images from 5 categories to compose the Flickr dataset, with an average of 9 tags per image.

ESP-Game [40]. The data of ESP-Game come from a web image annotation game and contain 20,770 images with corresponding tags. We remove some noisy images and labels, and select the 5 labels covering the most images as the 5 classes of the dataset, which contains 7869 images with an average of 5 labels per image.

We select several state-of-the-art cross-modal clustering approaches as baselines to evaluate the effectiveness of KAPC: Anchor-based Partial Multi-view Clustering (APMC) [3], Highly-economized Scalable Image Clustering (HSIC) [4], Co-regularized Multi-view Spectral Clustering (CRSC) [8], Multi-View Kernel Spectral Clustering (MVKSC) [9], Multi-level Feature Learning for Contrastive Multi-view Clustering (MFLVC) [7], Cluster-based Similarity Partitioning Algorithm (CSPA) [41], HyperGraph Partitioning Algorithm (HGPA) [41], Locally Weighted Graph Partitioning (LWGP) [22], Probability Trajectory Based Graph Partitioning (PTGP) [42], Deep Mutual Information Maximin (DMIM) [16], Consistent and Specific Multi-View Subspace Clustering (CSMSC) [6], Split Multiplicative Multi-View Subspace Clustering (SMMSC) [10], Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit (SSCOMP) [15], Binary Multi-View Clustering (BMVC) [5], and Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering (SDMVC) [11].

We regard Tag and Img as two different views for the multi-view clustering methods. For the ensemble clustering methods, we obtain the base clusterings of Tag and Img with the Kmeans method. Specifically, 50 runs of Kmeans on Img and 50 on Tag compose 100 base clusterings as the input of ensemble clustering. In addition, because the number of classes in the results of LWGP and PTGP may differ from the ground truth, the ACC of LWGP and PTGP cannot be calculated. For algorithms whose results may fluctuate, we run them ten times and report the average to reduce the uncertainty caused by random initialization, together with the standard deviation.

Table 6 The ACC (%) and NMI (%) of KAPC method compared with other state-of-the-art cross-modal clustering methods
Table 7 The ACC (%) and NMI (%) of KAPC method compared with other state-of-the-art cross-modal clustering methods
Table 8 The ACC (%) and NMI (%) of KAPC method compared with other state-of-the-art cross-modal clustering methods

In this study, the Clustering Accuracy (ACC) and Normalized Mutual Information (NMI) are selected to evaluate cluster quality. ACC is defined as follows:

$$\begin{aligned} \begin{aligned} \textrm{ACC} = \frac{1}{n} \sum _{i=1}^{n} \delta \left( l_i, \textrm{map}(t_i) \right) \times 100 \% \end{aligned} \end{aligned}$$
(7)

where \(t_i\) is the cluster assignment of data point \(x_i\), \(l_i\) is the ground truth of \(x_i\), n is the dataset size, and \(\textrm{map}(\cdot)\) maps each cluster to a class. If \(x=y\) then \(\delta (x,y)=1\); otherwise \(\delta (x,y)=0\).

NMI measures the similarity between the clustering result and the ground truth of the data, and to some extent reflects the cohesion of each cluster. It is defined as follows:

$$\begin{aligned} NMI = \frac{\sum _{h,l} n_{h,l} \log \left( \frac{n \cdot n_{h,l}}{n_h n_l}\right) }{\sqrt{\left( \sum _h n_h \log \frac{n_h}{n}\right) \left( \sum _l n_l \log \frac{n_l}{n}\right) }} \end{aligned}$$
(8)

where \(n_h\) is the amount of data assigned to cluster h, \(n_l\) is the amount of data belonging to class l, and \(n_{h,l}\) is the amount of data assigned to cluster h that also belongs to class l. When the value of NMI is 1, the clustering result is exactly the same as the ground truth.
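Both criteria can be computed with standard tooling; the sketch below finds the label mapping map(·) of Eq. (7) with the Hungarian algorithm via SciPy, and uses scikit-learn for Eq. (8).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(labels_true, labels_pred):
    """ACC of Eq. (7): best one-to-one mapping of clusters to classes."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = max(labels_true.max(), labels_pred.max()) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, l in zip(labels_pred, labels_true):
        count[t, l] += 1                  # co-occurrence of cluster t and class l
    rows, cols = linear_sum_assignment(count.max() - count)  # maximize agreement
    return count[rows, cols].sum() / labels_true.size


# NMI of Eq. (8):
# nmi = normalized_mutual_info_score(labels_true, labels_pred)
```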

Comparing with single modality clustering method

In this section, we compare KAPC with the single-modality sIB and Kmeans methods. Tables 4 and 5 show the comparison of the KAPC method with sIB and Kmeans in ACC and NMI. From Tables 4 and 5, we can see that the results of sIB on the visual and textual modalities differ. The Img modality obtains better results on the BigNUS, IAPR and Flickr datasets, while the Tag modality obtains better results on the NUS and ESP datasets. This demonstrates that it is unwise to rely on only one modality for clustering.

By computing the similarity of the two modalities at the semantic level, the KAPC method avoids the heterogeneous gap and achieves significant performance improvements on the five datasets. As illustrated in Tables 4 and 5, KAPC obtains 28.58%, 23.26%, 25.57%, 19.20% and 15.20% improvements in ACC over the best-performing single-modality method on the five datasets, and 20.96%, 14.62%, 21.01%, 67.49% and 8.74% improvements in NMI, which demonstrates the effectiveness of the KAPC method.

Fig. 4

The ACC comparison of Kmeans on the semantic similarity matrix (SSM), BERT, Doc2vec and Word2vec features. The features learned by SSM achieve the best results on all datasets, demonstrating the effectiveness of SSM

Fig. 5

The ACC (a) and NMI (b) of the KAPC method with Tag and Img as the guidance modality on each dataset. The ACC and NMI values for the different guidance modalities are essentially the same on the five datasets, indicating that KAPC is insensitive to the initialization choice of the guidance modality

Comparing with the state-of-the-art cross-modal clustering method

Tables 6, 7, and 8 show the comparison results between the proposed KAPC algorithm and the other cross-modal clustering algorithms in terms of the ACC and NMI criteria. As illustrated in Tables 6, 7, and 8, we take the best performance of all the state-of-the-art comparison methods in each row as the benchmark; KAPC's advantage over this benchmark validates the effectiveness of the proposed method.

Comparing with the different tags extension method

According to the characteristics of social image tags, we propose a semantic similarity matrix to extend tags in Section 3. To evaluate its effectiveness, we choose several NLP models for comparison: Bidirectional Encoder Representations from Transformers (BERT) [20], Doc2vec [43] and Word2vec [44]. We apply the three models to the original tags of the social images to obtain feature vectors, and cluster these feature vectors with Kmeans. The results are shown in Fig. 4. We can see that the semantic similarity matrix, which combines the interpretation similarity and the concept similarity, achieves higher clustering accuracy on all five datasets.

The impact of guidance modality selection

In section “Knowledge-aware progressive clustering method”, we randomly select one of the two modalities as the guidance modality, so it is necessary to explore the impact of this selection on the results of KAPC. When KAPC is initialized, we select Img and Tag in turn as the guidance modality and calculate the ACC and NMI of the resulting clusterings. Figure 5 shows the results on the five datasets for these two cases.

Table 9 The superiority analysis of the proposed KAPC methods
Fig. 6

The value of the objective function increases monotonically with the number of iterations. It rises rapidly in the first three iterations and remains stable in the subsequent iterations, indicating that KAPC converges quickly to a local optimum

Fig. 7

ACC as a function of the regularization parameter \(\lambda \) on four datasets

As illustrated in Fig. 5, the ACC of the KAPC method is essentially equivalent on the five datasets regardless of which modality is used as the guidance modality. Apart from the Flickr dataset, the NMI is also essentially the same on the remaining four datasets. This demonstrates that the KAPC method is insensitive to the random selection of the guidance modality at initialization.

Convergence of KAPC method

The convergence of the KAPC method on the five social image collections is presented in Fig. 6. It can be seen that the value of the objective function increases monotonically in each iteration and tends to stabilize as the number of iterations increases; 15 iterations are enough for convergence on all datasets. Based on the above experimental results, we give a superiority analysis of the proposed KAPC method compared with other methods in Table 9.

The impact of regularization parameter

As can be seen from Eq. 3 in section “Objective function”, the objective function of KAPC consists of two parts: I(T, Y) and H(T, K). In the previous experiments, there were no parameters in the objective function, i.e., the weights of I(T, Y) and H(T, K) were both 1. In this section, to explore the influence of the weights of these two parts on the performance of KAPC, we add a weighting parameter \(\lambda \) to H(T, K), yielding the new objective function \(\mathcal {L}_{\max }[p(t|x)] = I (T, Y) - \lambda \cdot H (T, K)\). We explore the influence of \(\lambda \) on the performance of KAPC on the NUS, IAPR, Flickr and ESP datasets. The results are shown in Fig. 7. We can see that ACC increases gradually with \(\lambda \) on the NUS and Flickr datasets, and decreases with \(\lambda \) on the ESP and IAPR datasets.

Here we analyze this phenomenon theoretically. In the objective function of KAPC, I(T, Y) denotes the retention of intra-modal knowledge, H(T, K) denotes the mining of inter-modal correlation, and \(\lambda \) is the weight of H(T, K). When \(\lambda \) is 0, the model degenerates into a unimodal clustering method that no longer exploits inter-modal correlations, so it cannot benefit from the other modality. Thus, when \(\lambda \) is increased from 0, the performance of the model improves on all datasets.

Next, for most of the datasets (BigNUS, NUS, Flickr), the performance of the model increases as \(\lambda \) increases. This is because the model then favors inter-modal correlation mining, which helps the modalities guide each other toward a reasonable clustering structure.

Finally, for both the ESP and IAPR datasets, the performance of the model decreases as \(\lambda \) increases. This is because the performance difference between the Img and Tag modalities is too large, so the poorer modality interferes with model optimization. As can be seen from Tables 4 and 5, the ACC and NMI of Img are half those of Tag on the ESP dataset, while on the IAPR dataset the ACC and NMI of Img are more than double those of Tag. Therefore, when the performance difference between the two modalities is too large, a larger \(\lambda \) leads to a decrease in the performance of the model.

Conclusion

In this study, we propose a novel knowledge-aware progressive clustering method, which employs human knowledge to guide the cross-modal clustering of social images. We design a dual-similarity semantic expansion strategy to complement the sparse tags with human knowledge and propose a progressive iteration method to bridge the heterogeneous gap. Experimental results on four social image datasets demonstrate the effectiveness of KAPC compared with state-of-the-art methods. In the future, we will focus on two directions. First, we will explore the application of KAPC to more than two modalities, which may be given in advance or arrive continuously over time; when confronted with such complex multi-modal data, KAPC should have more flexible training strategies to mine the correlations among multiple modalities. Second, we will explore finer-grained intra-modal knowledge retention. A large amount of redundant information exists within modalities that is not relevant to the clustering task, and how to remove this redundant information while preserving intra-modal knowledge is a key issue in unsupervised tasks.