Abstract
Social image data are images in social media annotated with tags, where the tags are labeled by users. Integrating the visual and textual information of social images yields more accurate and comprehensive features and improves clustering performance. However, the heterogeneous gap between tags and images makes it difficult to organize social images reasonably. In addition, the tags are often sparse and incomplete due to the personal preferences and cognitive differences of users. To solve these problems, we propose a novel knowledge-aware progressive clustering (KAPC) method, which employs human knowledge to guide the cross-modal clustering of social images. Firstly, we design a dual-similarity semantic expansion strategy to complement the sparse tags with human knowledge, which constructs a more complete semantic similarity matrix for tags through knowledge graphs. Secondly, we define an objective function based on information theory to bridge the heterogeneous gap, which aligns the inter-modal cluster distributions to explore the correlation between visual and textual information. Finally, a progressive iteration method is designed to make the two modalities guide each other and obtain better social image clustering performance. Extensive experiments on four social image datasets verify the effectiveness of the proposed KAPC method.
Introduction
With the development of Internet technology and the widespread use of social media, more and more people are willing to share their lives on the Internet. People tag images on social media according to their interests and cultural backgrounds, which generates a large amount of social image data. A social image is a kind of cross-modal data consisting of a visual image and textual tags. The visual images and user tags are heterogeneous in low-level features but related in high-level semantics. Integrating the visual and semantic information of social images can improve the performance of social image clustering [1,2,3,4,5,6,7].
Fig. 1 The framework of the KAPC method. For the image modality, KAPC employs a deep neural network to extract image features. For the tag modality, KAPC extracts structured knowledge of tags from the knowledge graph to construct a semantic similarity matrix, which introduces human prior knowledge and completes missing semantics. Then a progressive iteration method is proposed to bridge the heterogeneous gap between the modalities. It allows the modalities to guide each other and to obtain more refined clustering results for social images over the iterations.
There are two key issues in the task of social image clustering. The first is the heterogeneous gap between visual images and textual tags [8,9,10,11,12,13,14]. Finding the correlation and shared information of multiple modalities is the intuitive way to bridge the heterogeneous gap. A common approach is to find a shared subspace for multiple modalities, which maps the features of different modalities into a common subspace by defining a mapping rule [4,5,6, 10, 11, 15,16,17]. However, traditional subspace learning methods focus only on the low-level features of heterogeneous data and ignore the semantic information contained in the features of each modality, which degrades their performance. In addition, the amount of visual information in a social image is significantly larger than the amount of textual information. Mapping the two imbalanced modalities directly into one subspace gives the visual information too large a weight, which prevents satisfactory results.
Secondly, the tags provided by users are sparse and incomplete because users may make errors, omit tags, or tag according to personal tendencies. In recent years, embedding learning based on knowledge graphs has attracted extensive attention from researchers [18,19,20]; it models tag relationships extracted from human knowledge as low-dimensional embedding vectors. However, most existing embedding learning methods only consider the relationships between entities and ignore the large amount of background information that exists in the interpretation of tags, which results in the loss of necessary information [21].
In this work, we propose a novel knowledge-aware progressive clustering (KAPC) method, which uses human prior knowledge to guide the clustering process of social images, as illustrated in Fig. 1. First, to cope with the problem of sparse and incomplete tags, we design a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge from the knowledge graph. Specifically, we consider both the entity relations and the background knowledge of tags, and compute two different semantic similarities simultaneously to mine human knowledge. The complemented tags are formalized as a semantic similarity matrix, which serves as the representation of the textual information. Secondly, we define an objective function based on information theory to bridge the heterogeneous gap, which involves a mutual information loss and a cross-entropy loss. The former is designed to preserve intra-modal knowledge, while the latter is employed to align the inter-modal cluster distributions and thereby explore the correlation between visual and textual information. Finally, a progressive iteration method is designed to make the textual and visual information guide each other; it is based on an efficient “draw-and-merge” process that converges the KAPC objective function to a local optimum. Extensive experiments are conducted on four large-scale real-world social image datasets to verify the superiority of the KAPC approach.
In summary, the main contributions of this work are as follows:
- We propose a knowledge-aware progressive clustering (KAPC) method, which employs human knowledge to guide the cross-modal clustering of social images.
- We design a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge from the knowledge graph, which considers both the entity relations and the background knowledge of tags and computes two different semantic similarities simultaneously to mine human knowledge.
- We define an objective function based on information theory to achieve intra-modal knowledge retention and inter-modal correlation exploration.
- A progressive iteration method is designed to make textual and visual information guide each other and converge the objective function to a local optimum.
Related work
A large number of methods have been proposed for social image clustering. In this section, we briefly review the work related to the heterogeneous gap and the tag sparsity problem in social images.
Heterogeneous gap The heterogeneous gap between different modalities is an issue of great concern, and many methods have been proposed to address it [6, 10, 22, 23]. The most popular approach is subspace learning, which learns shared subspace representations for different modalities in order to explore the relationship between heterogeneous data. Subspace learning approaches have proven effective for cross-modal learning tasks [6, 10, 23]. [23] introduces a deep model-structure autoencoder, which maps input data into a nonlinear subspace while preserving the global and local subspace structure. [6] proposes a consistent and specific subspace learning method, which maps multiple views into a subspace to learn the information shared across views. However, traditional subspace learning methods focus only on low-level features and do not consider the relationship between the high-level semantics of each modality, which degrades performance.
Tags sparsity problem To solve the problem that the tags of social images are sparse and incomplete, researchers have proposed several methods, which can be roughly divided into tag completion [24, 25] and tag refinement [2]. Tag completion methods generate tags for unlabeled images by calculating the visual similarity between images. The purpose of tag refinement is to obtain more accurate and higher-quality tags.
With the rapid development of knowledge graphs, embedding learning based on knowledge graphs is attracting more and more attention [18,19,20]. [18] preserves the structural knowledge and uncertainty information of relation facts in the embedding space through the knowledge graph, and designs a credibility score to learn the uncertainty information. [19] proposes a label propagation mechanism, which uses a recurrent neural network to propagate labels in the characteristic knowledge graph and finally learns the weight representation vector of each label in the network. However, the above embedding learning methods only consider the relationships between entities in the knowledge graph and ignore the background information, which may contain key relationships and clues about the semantics of tags.
Knowledge-aware progressive clustering method
The proposed KAPC is a cross-modal clustering method based on mutual guidance between visual and textual modalities. At the beginning of the KAPC, we propose a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge, which constructs a semantic similarity matrix to extend tags by entity relationship and background knowledge of tags in knowledge graph. Then the visual and textual modalities guide each other to complete progressive clustering. Specifically, we randomly regard one modality as the guidance modality and the other modality as the iteration modality. Our goal is to guide the clustering process of iteration modality by the clustering structure of guidance modality. To achieve this goal, we design an objective function based on mutual information and cross-entropy, in which the mutual information measures the preservation of the iteration modality information, and the cross-entropy explores the correlation between the clustering structure of two modalities. By optimizing this objective function, we can obtain the clustering structure that contains correlation of two modalities while maximally preserving the single modality information. To fully integrate the information of two modalities, we exchange the iteration modality and guidance modality and progressively obtain better clustering results in each clustering process.
Problem formulation
In this work, two modalities are collected from the social image dataset, namely the image modality Img and the tag modality Tag. To make these two modalities guide each other, we randomly assign one of them as the iteration modality \(X_{iter}\) and the other as the guidance modality \(X_{guid}\). For \(X_{iter}\), we denote its original data, features and clustering structure as X, Y and T. The clustering structure of \(X_{guid}\) is denoted as K. The purpose of the KAPC method is to learn the clustering result T of the iteration modality under the constraint of K. In other words, the task of the KAPC method is to find an optimal cluster assignment p(t|x) of X to T under the guidance of K. The notations are displayed in Table 1.
Dual-similarity semantic expansion
Because the tags are sparse and incomplete, it is difficult to learn social image knowledge directly by cross-modal clustering. We propose a dual-similarity semantic expansion strategy to complement the sparse tags with rich human knowledge, which consists of interpretation similarity and concept similarity. The interpretation similarity reflects the rich background knowledge of tags in their natural language interpretation. The concept similarity represents the entity relationships in the knowledge graph. The interpretation similarity and concept similarity of words \(S_1\) and \(S_2\) are denoted as \(simi_{IS}(S_1,S_2)\) and \(simi_{CS}(S_1,S_2)\). In this study, we calculate these two similarities through WordNet [26].
Fig. 2 Illustration of the construction process of the feature vector. Our aim is to expand the semantics of each word and learn its feature vector through WordNet. First, the interpretation of the word is queried in WordNet and combined into a one-level vector. Next, the interpretation of each word in the one-level vector is queried and summed to form the feature vector.
To calculate \(simi_{IS}(S_1,S_2)\), we need to construct semantic features for \(S_1\) and \(S_2\) with WordNet. Taking \(S_1\) as an example, we first initialize a semantic space consisting of all the entities in WordNet, whose base element is the number of occurrences of each entity. Then, we query the interpretation of \(S_1\) in WordNet, collect the entities in the paraphrases and update their occurrences in the semantic space to form the feature vector of \(S_1\). To fully explore the background knowledge of \(S_1\), we perform the “query-collection” operation again for each entity in the feature vector of \(S_1\) and add the counts of the newly obtained entities to the semantic space to obtain the final feature vector of \(S_1\). Finally, we calculate the cosine similarity of the feature vectors of \(S_1\) and \(S_2\) to get \(simi_{IS}(S_1,S_2)\). Figure 2 shows the construction process of the feature vector.
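The two rounds of “query-collection” followed by a cosine similarity can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `TOY_GLOSSES` is a hypothetical hand-made glossary standing in for WordNet interpretations, all function names are illustrative, and weighting second-level entities by the count of the first-level entity that produced them is an assumption.

```python
import math
from collections import Counter

# Hypothetical toy glossary standing in for WordNet interpretations:
# each word maps to the entities found in its interpretation.
TOY_GLOSSES = {
    "dog": ["animal", "pet"],
    "cat": ["animal", "pet"],
    "car": ["vehicle", "wheel"],
    "animal": ["organism"],
    "pet": ["animal", "home"],
    "vehicle": ["transport"],
    "wheel": ["circle"],
}


def feature_vector(word, glosses):
    """Count entities over two rounds of query-collection."""
    # Round 1: entities in the interpretation of the word itself.
    first_level = Counter(glosses.get(word, []))
    vector = Counter(first_level)
    # Round 2: entities in the interpretation of each collected entity
    # (weighted by the first-level count, which is an assumption).
    for entity, count in first_level.items():
        for nested in glosses.get(entity, []):
            vector[nested] += count
    return vector


def cosine_similarity(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def interpretation_similarity(s1, s2, glosses=TOY_GLOSSES):
    """Interpretation similarity of two words over the given glossary."""
    return cosine_similarity(feature_vector(s1, glosses),
                             feature_vector(s2, glosses))
```

With a real knowledge graph, the glossary lookup would be replaced by a query for the interpretation of each word; the counting and cosine steps stay the same.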
Then, we collect the node distance of \(S_1\) and \(S_2\) in WordNet to calculate \(simi_{CS}(S_1,S_2)\). Denoting the node distance of words \(S_1\) and \(S_2\) as \(length(S_1,S_2)\), we obtain \(simi_{CS}(S_1,S_2)\):
where dep(c) is the maximum node distance and is set to 16.
After obtaining the interpretation similarity and concept similarity, we combine them to calculate the total similarity \(simi(S_1,S_2)\) between two words based on their part of speech. If \(S_1\) and \(S_2\) are nouns, \(simi(S_1,S_2)\) is defined as the average of the two similarities; otherwise, \(simi(S_1,S_2)\) is defined as the interpretation similarity. Suppose the feature words in tag A are {\(a_1,a_2,...,a_n\)} and the feature words in tag B are {\(b_1,b_2,...,b_m\)}; the similarity between A and B is:
where
\(simi(a_i,b) = \max \limits _{j=1,2,...,m} simi(a_i,b_j)\), \(simi(a,b_i) = \max \limits _{j=1,2,...,n} simi(a_j,b_i)\)
Eq. 2 formalizes the sparse tags of social images as a semantic similarity matrix, which serves as the representation of tags in what follows.
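The part-of-speech rule and the best-match terms defined above can be illustrated with a short Python sketch. The per-word rule (average for noun pairs, interpretation similarity otherwise) follows the text. How the best-match terms are aggregated into one tag-level score is an assumption here: the sketch averages them over all \(n+m\) feature words. All function names are hypothetical.

```python
def word_similarity(sim_is, sim_cs, both_nouns):
    """Average the two similarities for noun pairs, else use sim_is only."""
    return (sim_is + sim_cs) / 2 if both_nouns else sim_is


def tag_similarity(words_a, words_b, simi):
    """Tag-level similarity from best-match word similarities.

    Assumption: the best matches in both directions are averaged over
    all n + m feature words.
    """
    if not words_a or not words_b:
        return 0.0
    # simi(a_i, b): best match of each word of A among the words of B.
    best_a = [max(simi(a, b) for b in words_b) for a in words_a]
    # simi(a, b_i): best match of each word of B among the words of A.
    best_b = [max(simi(a, b) for a in words_a) for b in words_b]
    return (sum(best_a) + sum(best_b)) / (len(words_a) + len(words_b))


def similarity_matrix(tags, simi):
    """Pairwise semantic similarity matrix over a list of tags."""
    return [[tag_similarity(a, b, simi) for b in tags] for a in tags]
```

Any word-level similarity can be passed as `simi`, for example a function that combines an interpretation similarity and a concept similarity through `word_similarity`.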
Objective function
The purpose of the KAPC approach is to discover a good clustering of the social image collection. To achieve this goal, we maximize the intra-modal information of \(X_\mathrm{{iter}}\) on the one hand, and align the cluster distributions of \(X_\mathrm{{iter}}\) and \(X_\mathrm{{guid}}\) to explore the inter-modal correlation on the other hand. Specifically, we introduce the mutual information I(T, Y) and the cross-entropy H(T, K) to preserve intra-modal information and align inter-modal cluster distributions. We maximize I(T, Y) so that the clustering distribution T retains the information in the feature Y, while minimizing H(T, K) aligns the inter-modal cluster distributions, which allows \(X_\mathrm{{guid}}\) to guide the clustering process of \(X_\mathrm{{iter}}\). Defining the objective function and its constraints is a great challenge [27,28,29]. Combining the cross-entropy and the mutual information, we propose the objective function of the KAPC method:
\[\mathcal {L}_{\max }[p(t|x)] = I (T, Y) - H (T, K) \qquad (3)\]
where the term I(T, Y) is the mutual information between the clustering partition T and the feature vector Y, which embodies the preservation of the single modality information. H(T, K) is the cross-entropy of T and K, which embodies the constraint of the guidance modality on the iteration modality. By maximizing the objective function, KAPC finds a clustering structure T that conforms to the knowledge of both modalities: it minimizes the difference between the guidance modality and the iteration modality while preserving the single modality information.
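The two terms of the objective can be evaluated numerically as in the following Python sketch. The mutual information and cross-entropy formulas are the standard ones. How the distributions over T and K are formed (here, two discrete distributions `p_t` and `p_k` over already matched cluster indices) is an assumption, and the function names and the optional weight `lam` are illustrative.

```python
import math


def mutual_information(joint):
    """I(T;Y) for a joint distribution given as joint[t][y]."""
    p_t = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for t, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_t[t] * p_y[y]))
    return mi


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i log q_i, with eps guarding against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q) if pi > 0)


def kapc_objective(joint_ty, p_t, p_k, lam=1.0):
    """I(T;Y) - lam * H(T,K); lam = 1 gives the unweighted objective."""
    return mutual_information(joint_ty) - lam * cross_entropy(p_t, p_k)
```

A partition that keeps clusters informative about the features raises the first term, while a partition whose distribution agrees with the guidance distribution lowers the penalty in the second term.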
Progressive iteration
Both internal optimization and overall optimization are great challenges [30,31,32,33,34]. We design a progressive iteration method to optimize the objective function of KAPC in two steps. Firstly, it realizes the internal optimization of the iteration modality under the constraints of the guidance modality through a process of “draw-and-merge”. Secondly, the \(X_\mathrm{{iter}}\) and \(X_\mathrm{{guid}}\) are exchanged and T is taken as the K of the next iteration modality internal optimization. These two steps are repeated until T no longer changes, which achieves the overall optimization of the objective function. In addition, KAPC method can use a typical single modality clustering method to obtain the clustering structure K of the guidance modality during initialization, such as Kmeans or sIB [35].
Optimization in the iteration modality
To find an optimal cluster structure T of the iteration modality \(X_\mathrm{{iter}}\), we adopt a sequential “draw-and-merge” optimization approach that maintains a clustering partition T with exactly M clusters. The “draw-and-merge” optimization method first randomly divides the data in \(X_\mathrm{{iter}}\) into M clusters to initialize T. Then, in the iterative phase, a sample \(x \in X_\mathrm{{iter}}\) is “drawn” from its current cluster and regarded as a new singleton cluster \(\{x\}\). To ensure that the number of clusters does not change, the singleton cluster must be merged into an existing cluster t to form \(\tilde{t}\), i.e., \(\{\{x\},t\} \Rightarrow \tilde{t}\). Therefore, we can see:
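For reference, in the standard sequential information bottleneck procedure [35], merging a singleton into a cluster updates the cluster distributions as shown in the LaTeX block. This form is consistent with the weights \(\pi _1\) and \(\pi _2\) used in the merging cost, but it is given as an assumed form of the merge rule rather than as the authors' exact statement.

```latex
p(\tilde{t}) = p(x) + p(t), \qquad
p(y \mid \tilde{t}) = \pi_1 \, p(y \mid x) + \pi_2 \, p(y \mid t),
\qquad
\pi_1 = \frac{p(x)}{p(x) + p(t)}, \quad
\pi_2 = \frac{p(t)}{p(x) + p(t)}
```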
To optimize the objective function, its value should increase in each “draw-and-merge” procedure. The key issue in this process is how to choose the optimal cluster. The objective function before the “draw-and-merge” procedure is denoted as \(F_\mathrm{{bef}}\), and T is denoted as \(T_\mathrm{{bef}}\). After the singleton cluster \(\{x\}\) is merged into \(\tilde{t}\), the objective function is denoted as \(F_\mathrm{{aft}}\) and T is denoted as \(T_\mathrm{{aft}}\). The merging cost \(\textrm{cost}(\tilde{t})\) can be defined as Eq. 5, which represents the reduction of the objective value in the merging process.
\[\textrm{cost}(\tilde{t}) = F_\mathrm{{bef}} - F_\mathrm{{aft}} = \left[ I(T_\mathrm{{bef}}, Y) - H(T_\mathrm{{bef}}, K)\right] - \left[ I(T_\mathrm{{aft}}, Y) - H(T_\mathrm{{aft}}, K)\right] \qquad (5)\]
It can be seen from Eq. 5 that \(H(T_\mathrm{{bef}},K)\) is unchanged in \(\mathrm{{cost}}(T)\) no matter which cluster is selected for merging. So we remove it to simplify the calculation. By combining Eq. 4 and Eq. 5, the merging cost can be written as:
where \(\pi _1 = \frac{p(x)}{p(x)+p(t)}\), \(\pi _2 = \frac{p(t)}{p(x)+p(t)}\)
The last two steps of Eq. 6 use the Kullback–Leibler (KL) [36] distance and the Jensen–Shannon (JS) [36] distance. The KL distance measures how much one probability distribution differs from another. The JS distance is a symmetrized, weighted variant of the KL distance that measures how far two distributions p(x) and q(x) are from their weighted mixture; it is zero exactly when the two distributions are identical.
Since \(JS_{\pi _1 \pi _2}\ge 0\) and \(H(x,K_\mathrm{{center}}) \ge 0\), we know \(\textrm{cost}(\{x\},t) \ge 0\), which means that every “merge” causes a loss of mutual information. Selecting the optimal cluster is therefore equivalent to finding the cluster that minimizes the reduction of mutual information. The “draw-and-merge” procedure continues until the cluster assignment of every element x no longer changes, which yields the clustering structure T in the single modality.
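The choice of the cheapest cluster for a drawn sample can be sketched in Python as below. The sketch covers only the mutual-information part of the merging cost, in the style of the sequential information bottleneck [35]: the weighted JS distance between \(p(y|x)\) and \(p(y|t)\), scaled by \(p(x)+p(t)\). The guidance term involving K is deliberately left out, because its exact form depends on Eq. 6. All function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, pi1, pi2):
    """Weighted Jensen-Shannon divergence with weights pi1 + pi2 = 1."""
    mix = [pi1 * a + pi2 * b for a, b in zip(p, q)]
    return pi1 * kl_divergence(p, mix) + pi2 * kl_divergence(q, mix)


def merge_cost(p_x, p_y_given_x, p_t, p_y_given_t):
    """Mutual-information loss of merging singleton {x} into cluster t."""
    total = p_x + p_t
    pi1, pi2 = p_x / total, p_t / total
    return total * js_divergence(p_y_given_x, p_y_given_t, pi1, pi2)


def best_cluster(p_x, p_y_given_x, clusters):
    """Index of the cluster with the lowest merging cost.

    clusters is a list of (p_t, p_y_given_t) pairs.
    """
    costs = [merge_cost(p_x, p_y_given_x, p_t, dist) for p_t, dist in clusters]
    return min(range(len(costs)), key=costs.__getitem__)
```

A full “draw-and-merge” pass would call `best_cluster` for every sample in turn and repeat until no assignment changes.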
Overall optimization
After the “draw-and-merge” process converges, we obtain a clustering structure T of a single modality under the constraint of the guidance modality. To fully integrate the information of the two modalities, we exchange the guidance modality and the iteration modality, and start a new “draw-and-merge” process, in which T is regarded as the new K. This exchange continues until T no longer changes. In this process, the value of the objective function is monotonically non-decreasing. KAPC thus progressively obtains a better T in each iteration, hence the name progressive iteration. The detailed steps and flowchart of KAPC are shown in Algorithm 1 and Fig. 3.
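The alternation between the two modalities can be outlined as a small Python skeleton. It is only a sketch of the control flow: `optimize` is a hypothetical callable standing in for one converged “draw-and-merge” run on the given modality under the given guidance labels, and comparing label lists directly as the stopping test is a simplifying assumption (it presumes that cluster indices are comparable between rounds).

```python
def progressive_iteration(optimize, init_guidance, max_rounds=50):
    """Alternate the iteration and guidance roles until T stops changing.

    optimize(modality_index, guidance_labels) must return the cluster
    labels of the given modality (0 or 1) under the given guidance.
    """
    guidance = list(init_guidance)
    modality = 0
    for _ in range(max_rounds):
        labels = list(optimize(modality, guidance))
        if labels == guidance:
            break  # T equals the previous K: nothing left to refine
        # T becomes the new K and the two modalities swap roles.
        guidance = labels
        modality = 1 - modality
    return guidance
```

The initial guidance labels would come from a single modality clustering of the guidance modality, for example Kmeans or sIB as mentioned above.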
In addition, we discuss the advantages and disadvantages of the proposed KAPC method in Table 2. Here we focus on the limitations of the proposed method. In the dual-similarity semantic expansion phase, the quality of the knowledge graph has a significant impact on our method. On the one hand, semantic extension is limited when the words of tags are not included in the knowledge graph. On the other hand, the quality of the learned semantic similarity matrix is also affected by the structure of the knowledge graph. In the progressive iteration stage, KAPC requires all modalities to have the same ground truth, i.e., the same clustering division, which is a prerequisite for the modalities to guide each other. Finally, KAPC applies to cross-modal clustering between two modalities. Further exploration of the progressive iteration strategy is needed when the number of modalities increases.
Complexity analysis
In this section, we give the complexity analysis of the KAPC method. When calculating the \(cost(\{x\},t)\) of each cluster in step 8, the time taken is O(M|T||Y|), where M is the number of clusters. Since this process is repeated n times (n is a finite constant), the total time complexity is O(nM|T||Y|).
Experiments
Experiments setup
In this work, we conduct extensive experiments on four publicly available and widely used datasets: NUS-WIDE [37], IAPR TC-12 Benchmark [38], MIRFlickr [39] and ESP-Game [40]. Table 3 gives the details of the datasets.
NUS-WIDE [37]. It is an image dataset created by the Laboratory of Media Search at the National University of Singapore, with data taken from real-world networks. The dataset includes 269,648 images and the associated tags. To increase the experimental diversity and remove excessive noisy data, we select 10,003 images and 2969 images to compose the BigNUS dataset and the NUS dataset, respectively. Each dataset contains 6 classes with an average of 7 tags per image.
IAPR TC-12 benchmark [38]. It is a publicly annotated dataset produced by the Cross-Language Evaluation Forum, which contains 20,000 images collected from the real world. Each image is accompanied by a short text description. After removing the noisy images, we select 3095 images to form the IAPR dataset, which contains 6 classes, with 6 labels per image on average.
MIRFlickr [39]. It contains 25,000 images collected from the website Flickr, with 1386 user-provided tags appearing at least 20 times. It also provides ground-truth annotations of 38 concepts. After de-noising and de-duplication, we select 4920 images from 5 categories to compose the Flickr dataset. Each image contains an average of 9 tags.
ESP-Game [40]. The data of ESP-Game come from a web image annotation game and comprise 20,770 images and the corresponding tags. We remove some noisy images and labels, and select the 5 labels covering the most pictures as the 5 classes of the dataset, which contains 7869 images with an average of 5 labels per image.
We select several state-of-the-art cross-modal clustering approaches as the baselines to evaluate the effectiveness of KAPC: Anchor-based Partial Multi-view Clustering (APMC) [3], Highly-economized Scalable Image Clustering (HSIC) [4], Co-regularized Multi-view Spectral Clustering (CRSC) [8], Multi-View Kernel Spectral Clustering (MVKSC) [9], Multi-level Feature Learning for Contrastive Multi-view Clustering (MFLVC) [7], Cluster-based Similarity Partitioning Algorithm (CSPA) [41], HyperGraph Partitioning Algorithm (HGPA) [41], Locally Weighted Graph Partitioning (LWGP) [22], Probability Trajectory Based Graph Partitioning (PTGP) [42], Deep Mutual Information Maximin (DMIM) [16], Consistent and Specific Multi-View Subspace Clustering (CSMSC) [6], Split Multiplicative Multi-View Subspace Clustering (SMMSC) [10], Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit (SSCOMP) [15], Binary Multi-View Clustering (BMVC) [5], and Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering (SDMVC) [11].
We regard Tag and Img as two different views for the multi-view clustering methods. For the ensemble clustering methods, we obtain the basic clustering results of Tag and Img through the Kmeans method. Specifically, the results of 50 Kmeans runs on each of Img and Tag form 100 basic clusterings, which are the input of ensemble clustering. In addition, because the number of classes in the results of LWGP and PTGP may differ from that of the ground truth, the ACC of LWGP and PTGP cannot be calculated. For the algorithms whose results may fluctuate, we run them ten times and report the average value to reduce the uncertainty caused by random initialization, together with the standard deviation.
In this study, the Clustering Accuracy (ACC) and Normalized Mutual Information (NMI) are selected to evaluate cluster quality. ACC is defined as follows:
where \(t_i\) is the cluster assigned to the data \(x_i\), \(l_i\) is the ground truth of the data \(x_i\), and n is the dataset size. If \(x=y\) then \(\delta (x,y)=1\), otherwise \(\delta (x,y)=0\).
NMI is used to measure the similarity between the clustering result and the ground truth of the data, and to some extent reflects the cohesion of each cluster. It is defined as follows:
where \(n_h\) is the amount of data divided into category h, \(n_l\) is the amount of data divided into category l, and \(n_{h,l}\) is the amount of data divided into category h as well as into category l. When the value of NMI is 1, the clustering result is exactly the same as the ground truth.
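Both metrics can be computed from the counts named above, as in the following Python sketch. Two choices in it are assumptions rather than details confirmed by the text: ACC is evaluated under the best one-to-one relabelling of clusters to classes (found by brute force, so only suitable for a handful of clusters), and NMI is normalized by the geometric mean of the two entropies. The function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(clusters, labels):
    """ACC under the best one-to-one mapping of cluster ids to class labels."""
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    # Pad with None so that surplus clusters can stay unmatched.
    label_ids += [None] * max(0, len(cluster_ids) - len(label_ids))
    counts = Counter(zip(clusters, labels))
    best = 0
    for perm in permutations(label_ids, len(cluster_ids)):
        hits = sum(counts[(c, l)] for c, l in zip(cluster_ids, perm))
        best = max(best, hits)
    return best / len(labels)


def normalized_mutual_information(clusters, labels):
    """NMI from the counts n_h, n_l and n_{h,l} (sqrt normalization)."""
    n = len(labels)
    n_h = Counter(clusters)
    n_l = Counter(labels)
    n_hl = Counter(zip(clusters, labels))
    mutual = sum(c * math.log(n * c / (n_h[h] * n_l[l]))
                 for (h, l), c in n_hl.items())
    ent_h = sum(c * math.log(c / n) for c in n_h.values())
    ent_l = sum(c * math.log(c / n) for c in n_l.values())
    denom = math.sqrt(ent_h * ent_l)
    return mutual / denom if denom > 0 else 0.0
```

Because cluster indices are arbitrary, a partition that matches the ground truth up to a renaming of clusters scores 1 on both metrics.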
Comparing with single modality clustering method
In this section, we compare KAPC with a single modality sIB method and Kmeans method. Tables 4 and 5 show the comparison of KAPC method with sIB and Kmeans on ACC and NMI. From Tables 4 and 5, we can see that the results of sIB on visual and textual modalities are different. The Img modality gets better results on BigNUS, IAPR and Flickr datasets while the Tag modality gets better results on NUS and ESP datasets. This demonstrates that it is unwise to only use one modality for clustering.
By calculating the similarity of the two modalities at the semantic level, the KAPC method bridges the heterogeneous gap and achieves significant performance improvements on the five datasets. As shown in Tables 4 and 5, compared with the best-performing single modality method, KAPC obtains improvements of 28.58%, 23.26%, 25.57%, 19.20% and 15.20% in ACC and of 20.96%, 14.62%, 21.01%, 67.49% and 8.74% in NMI on the five datasets, which demonstrates the effectiveness of the KAPC method.
Comparing with the state-of-the-art cross-modal clustering method
Tables 6, 7, and 8 show the comparison results between the proposed KAPC algorithm and the other cross-modal clustering algorithms in terms of the ACC and NMI criteria. For each row of Tables 6, 7, and 8, we take the best result among the state-of-the-art comparison methods as the benchmark; the comparison validates the effectiveness of the proposed KAPC method.
Comparing with the different tags extension method
According to the characteristics of social image tags, we propose a semantic similarity matrix to extend tags in Section 3. To evaluate its effectiveness, we choose several NLP models for comparison: Bidirectional encoder representations from transformers (BERT) [20], Doc2vec [43] and Word2vec [44]. We apply the three models to the original tags of social images to obtain feature vectors, and cluster these feature vectors by Kmeans. The results are shown in Fig. 4. We can see that the semantic similarity matrix, which includes the interpretation similarity and the concept similarity, achieves higher clustering accuracy on the five datasets.
The impact of guidance modality selection
In section “Knowledge-aware progressive clustering method”, we randomly select one of the two modalities as the guidance modality. So it is necessary to explore the impact of the selection of different guidance modality on the results of KAPC. When KAPC is initialized, we select Img and Tag respectively as the guidance modality and calculate the ACC and NMI of their results. Figure 5 shows the results of the method on five datasets in these two cases.
As illustrated in Fig. 5, the ACC of the KAPC method with different modalities as the guidance modality is basically equivalent on the five datasets. Apart from the Flickr dataset, the NMI is basically the same on the remaining four datasets. This demonstrates that the KAPC method is insensitive to the random selection of the guidance modality at initialization.
Convergence of KAPC method
The convergence of the KAPC method on the five social image collections is presented in Fig. 6. It can be seen that the value of the objective function monotonically increases in each iteration and tends to stabilize as the number of iterations increases. We can see that 15 iterations are enough for convergence on all datasets. Based on the above experimental results, we give the superiority analysis of the proposed KAPC method compared to other methods in Table 9.
The impact of regularization parameter
As can be seen from Eq. 3 in section “Objective function”, the objective function of KAPC consists of two parts: I(T, Y) and H(T, K). In the previous experiments, there are no parameters in the objective function, i.e., the weights of I(T, Y) and H(T, K) are both 1. In this section, to explore the influence of the weights of these two parts on the performance of KAPC, we add a weighting parameter \(\lambda \) to H(T, K). This gives the new objective function \(\mathcal {L}_{\max }[p(t|x)] = I (T, Y) - \lambda \cdot H (T, K)\). We explore the influence of the weighting parameter \(\lambda \) on the performance of KAPC on the NUS, IAPR, Flickr and ESP datasets. The results are shown in Fig. 7. We can see that ACC increases gradually as \(\lambda \) increases on the NUS and Flickr datasets, and decreases as \(\lambda \) increases on the ESP and IAPR datasets.
Here we analyze this phenomenon theoretically. In the objective function of KAPC, I(T, Y) denotes the retention of intra-modal knowledge, H(T, K) denotes the mining of inter-modal correlation, and \(\lambda \) is the weight of H(T, K). When \(\lambda \) is 0, the model degenerates into a unimodal clustering method that no longer exploits inter-modal correlations, which results in the model not benefiting from other modalities. Thus when \(\lambda \) is increased from 0, the performance of the model increases on all datasets.
Next, for most of the datasets (BigNUS, NUS, Flickr), the performance of the model increases as \(\lambda \) increases. This is because the model prefers inter-modal correlation mining, which facilitates the modalities to guide each other to learn a reasonable clustering structure.
Finally, for both the ESP and IAPR datasets, the performance of the model decreases as \(\lambda \) increases. This is because the performance difference between the Img modality and the Tag modality is too large, so the poorer modality interferes with model optimization. As can be seen from Tables 4 and 5, the ACC and NMI of Img are half those of Tag on the ESP dataset. On the IAPR dataset, the ACC and NMI of Img are more than double those of Tag. Therefore, when the performance difference between the two modalities is too large, a larger \(\lambda \) leads to a decrease in the performance of the model.
Conclusion
In this study, we propose a novel knowledge-aware progressive clustering method, which employs human knowledge to guide the cross-modal clustering of social images. We design a dual-similarity semantic expansion strategy to complement the sparse tags with human knowledge and propose a progressive iteration method to bridge the heterogeneous gap. Experimental results on four social image datasets demonstrate the effectiveness of KAPC compared with the state-of-the-art methods. In the future, we will focus on two aspects of research. First, we will explore the application of KAPC on multiple modalities, which may be given in advance or arrive continuously over time. When confronted with these complex multi-modal data, KAPC should have more flexible training strategies to mine the correlations among multiple modalities. Second, we will explore finer-grained intra-modal knowledge retention. There exists a large amount of redundant information within modalities that is not relevant to the clustering task. How to remove this redundant information while preserving intra-modal knowledge is a key issue in unsupervised task.
Data availability
The data that support the findings of this study are openly available in NUS-WIDE at https://dl.acm.org/doi/10.1145/1646396.1646452, reference number [37]; IAPR TC-12 Benchmark at https://www.imageclef.org/photodata, reference number [38]; MIRFlickr at https://doi.org/10.1145/1460096.1460104, reference number [39]; ESP-Game at https://dl.acm.org/doi/10.1145/985692.985733, reference number [40].
References
Li Z, Tang J, Mei T (2019) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083
Zhang J, Wu Q, Zhang J, Shen C, Lu J (2018) Kill two birds with one stone: weakly-supervised neural network for image annotation and tag refinement. In: The AAAI Conference on Artificial Intelligence, pp 7550–7557
Guo J, Ye J (2019) Anchors bring ease: an embarrassingly simple approach to partial multi-view clustering. In: The AAAI Conference on Artificial Intelligence, pp 118–125
Zhang Z, Liu L, Qin J, Zhu F, Shen F, Xu Y, Shao L, Tao Shen H (2018) Highly-economized multi-view binary compression for scalable image clustering. In: The European Conference on Computer Vision, pp 731–748
Zhang Z, Liu L, Shen F, Shen HT, Shao L (2019) Binary multi-view clustering. IEEE Trans Pattern Anal Mach Intell 41(7):1774–1782
Luo S, Zhang C, Zhang W, Cao X (2018) Consistent and specific multi-view subspace clustering. In: The AAAI Conference on Artificial Intelligence, pp 3730–3737
Xu J, Tang H, Ren Y, Peng L, Zhu X, He L (2022) Multi-level feature learning for contrastive multi-view clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 16030–16039
Kumar A, Rai P, Daumé III H (2011) Co-regularized multi-view spectral clustering. In: Advances in Neural Information Processing Systems, pp 1413–1421
Houthuys L, Langone R, Suykens JAK (2018) Multi-view kernel spectral clustering. Inform Fusion 44:46–56
Yang Z, Xu Q, Zhang W, Cao X, Huang Q (2019) Split multiplicative multi-view subspace clustering. IEEE Trans Image Process 28(10):5147–5160
Xu J, Ren Y, Tang H, Yang Z, Pan L, Yang Y, Pu X, Yu PS, He L (2022) Self-supervised discriminative feature learning for deep multi-view clustering. IEEE Transactions on Knowledge and Data Engineering, pp 1–12
Yan X, Mao Y, Ye Y, Yu H (2023) Cross-modal clustering with deep correlated information bottleneck method. IEEE Transactions on Neural Networks and Learning Systems Early access, pp 1–15
Yan X, Ye Y, Qiu X, Manic M, Yu H (2020) CMIB: unsupervised image object categorization in multiple visual contexts. IEEE Trans Indus Inf 16(6):3974–3986
Yan X, Mao Y, Ye Y, Yu H, Wang F (2022) Explanation guided cross-modal social image clustering. Inf Sci 593:1–16
You C, Robinson DP, Vidal R (2016) Scalable sparse subspace clustering by orthogonal matching pursuit. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3918–3927
Mao Y, Yan X, Guo Q, Ye Y (2021) Deep mutual information maximin for cross-modal clustering. In: The AAAI Conference on Artificial Intelligence, pp 8893–8901
Yan X, Mao Y, Li M, Ye Y, Yu H (2023) Multitask image clustering via deep information bottleneck. IEEE Transactions on Cybernetics Early access, pp 1–14
Chen X, Chen M, Shi W, Sun Y (2019) Embedding uncertain knowledge graphs. In: The AAAI Conference on Artificial Intelligence, pp 3363–3370
Lee C, Fang W, Yeh C, Wang YF (2018) Multi-label zero-shot learning with structured knowledge graphs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1576–1585
Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Guan N, Song D, Liao L (2019) Knowledge graph embedding with concepts. Knowl Based Syst 164:38–44
Huang D, Wang C-D, Lai J-H (2018) Locally weighted ensemble clustering. IEEE Transactions on Cybernetics, pp 1460–1473
Peng X, Feng J, Xiao S, Yau W, Zhou JT, Yang S (2018) Structured autoencoders for subspace clustering. IEEE Trans Image Process 27(10):5076–5086
Tang J, Shu X, Li Z, Jiang Y, Tian Q (2019) Social anchor-unit graph regularized tensor completion for large-scale image retagging. IEEE Trans Pattern Anal Mach Intell 41(8):2027–2034
Li Z, Tang J (2017) Weakly supervised deep matrix factorization for social image understanding. IEEE Trans Image Process 26(1):276–288
Fellbaum C, Miller GA (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
Kumar PS (2019) Intuitionistic fuzzy solid assignment problems: a software-based approach. Int J Syst Assur Eng Manage 10(4):661–675
Kumar PS (2018) PSK method for solving intuitionistic fuzzy solid transportation problems. Int J Fuzzy Syst Appl (IJFSA) 7(4):62–99
Kumar PS (2016) A simple method for solving type-2 and type-4 fuzzy transportation problems. Int J Fuzzy Logic Intell Syst 16(4):225–237
Kumar PS (2020) Algorithms for solving the optimization problems using fuzzy and intuitionistic fuzzy set. Int J Syst Assur Eng Manage 11(1):189–222
Kumar PS (2022) Computationally simple and efficient method for solving real-life mixed intuitionistic fuzzy 3D assignment problems. Int J Softw Sci Comput Intell (IJSSCI) 14(1):1–42
Kumar PS (2020) Developing a new approach to solve solid assignment problems under intuitionistic fuzzy environment. Int J Fuzzy Syst Appl (IJFSA) 9(1):1–34
Kumar PS (2018) A note on a new approach for solving intuitionistic fuzzy transportation problem of type-2. Int J Logist Syst Manage 29(1):102–129
Kumar PS (2020) Intuitionistic fuzzy zero point method for solving type-2 intuitionistic fuzzy transportation problem. Int J Oper Res 37(3):418–451
Slonim N, Friedman N, Tishby N (2002) Unsupervised document classification using sequential information maximization. In: The International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR. ACM, pp 129–136
Cover TM, Thomas JA (2012) Elements of information theory. Wiley, New York
Chua T-S, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: The ACM International Conference on Image and Video Retrieval, CIVR. ACM, pp 48–1489
Grubinger M, Clough P, Müller H, Deselaers T (2006) The IAPR TC-12 benchmark: a new evaluation resource for visual information systems. In: The International Conference on Language Resources and Evaluation, pp 13–23
Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: The 11th ACM SIGMM International Conference on Multimedia Information Retrieval, MIR. ACM, pp 39–43
von Ahn L, Dabbish L (2004) Labeling images with a computer game. In: The Conference on Human Factors in Computing Systems, CHI, pp 319–326
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Huang D, Lai J-H, Wang C-D (2016) Robust ensemble clustering using probability trajectories. IEEE Transactions on Knowledge and Data Engineering, pp 1312–1326
Lau JH, Baldwin T (2016) An empirical evaluation of doc2vec with practical insights into document embedding generation. In: Rep4NLP@ACL, pp 78–86
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp 3111–3119
Funding
National Natural Science Foundation of China (62102368), Joint Construction Project for Medical Science and Technology of Henan Province (LHGJ20200318), National Natural Science Foundation of China (62206251), Joint Construction Project for Medical Science and Technology of Henan Province (LHGJ20220431).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, M., Dong, Y., Liu, D. et al. Knowledge-aware progressive clustering for social image. Complex Intell. Syst. 10, 2173–2185 (2024). https://doi.org/10.1007/s40747-023-01267-1