Abstract
Many real-world datasets combine different aspects of information, so the data are usually represented as heterogeneous graphs whose nodes and edges have different types. Representation learning on heterogeneous networks is an important topic, because embedding methods can extract useful details from such networks. In this paper, we introduce a new framework for embedding heterogeneous graphs. Our model relies on weighted heterogeneous networks with star structures that take structural and attributive similarity into account, as well as semantic knowledge. The target nodes form the center of the star, and the different attributes of the target nodes form its points. The edge weights are calculated from three aspects: natural language processing of the texts, the relationships between different attributes of the dataset, and the co-occurrence of each attribute pair in the target nodes. We strengthen the similarities between the target nodes by examining the latent connections between the attribute nodes, which we find by considering the approximate shortest paths between the attributes. By applying the side effect of the star components to the central component, the heterogeneous network is reduced to a homogeneous graph with enhanced similarities. We can then embed this homogeneous graph to capture similar target nodes. We evaluate our framework on the clustering task and show that our method is more accurate than previous unsupervised algorithms on real-world datasets.
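The reduction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (`attribute_similarity`, `reduce_star`), the use of \(-\log\) to turn similarities into path lengths, the product rule for combining weights along a path, and the role given to `beta` are all assumptions made for illustration. The sketch also uses exact Dijkstra search, whereas the abstract refers to approximate shortest paths.

```python
import heapq
import math
from collections import defaultdict
from itertools import combinations


def attribute_similarity(attr_edges, source):
    """Best path similarity from `source` to every reachable attribute node.

    Assumption: similarities lie in (0, 1] and are converted to lengths
    with -log(w), so the shortest path maximizes the product of weights.
    """
    adj = defaultdict(list)
    for (a, b), w in attr_edges.items():
        length = -math.log(w)
        adj[a].append((b, length))
        adj[b].append((a, length))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, a = heapq.heappop(heap)
        if d > dist.get(a, math.inf):
            continue  # stale heap entry
        for b, length in adj[a]:
            nd = d + length
            if nd < dist.get(b, math.inf):
                dist[b] = nd
                heapq.heappush(heap, (nd, b))
    return {a: math.exp(-d) for a, d in dist.items()}


def reduce_star(core_edges, ext_edges, attr_edges, beta=1.0):
    """Fold one shell of attribute nodes into the target-node (core) graph.

    core_edges: {(target, target): weight}     existing target-target links
    ext_edges:  {(target, attribute): weight}  links from center to points
    attr_edges: {(attribute, attribute): weight} links among attributes
    beta:       hypothetical impact factor of this attribute type
    """
    attrs_of = defaultdict(dict)  # target -> {attribute: weight}
    for (target, attr), w in ext_edges.items():
        attrs_of[target][attr] = w
    sims = {a: attribute_similarity(attr_edges, a)
            for attrs in attrs_of.values() for a in attrs}

    reduced = defaultdict(float)
    for (u, v), w in core_edges.items():
        reduced[tuple(sorted((u, v)))] += w
    for u, v in combinations(sorted(attrs_of), 2):
        # Strongest indirect connection u -> a ~> b -> v through the shell.
        best = max(
            (wu * sims[a].get(b, 0.0) * wv
             for a, wu in attrs_of[u].items()
             for b, wv in attrs_of[v].items()),
            default=0.0,
        )
        if best > 0.0:
            reduced[(u, v)] += beta * best
    return dict(reduced)


if __name__ == "__main__":
    core = {("u1", "u2"): 0.2}
    ext = {("u1", "sports"): 1.0, ("u2", "football"): 1.0,
           ("u3", "politics"): 1.0}
    attrs = {("sports", "football"): 0.8}
    print(reduce_star(core, ext, attrs, beta=0.5))
```

In this toy input, `u1` and `u2` share no attribute directly, but the link between `sports` and `football` raises their similarity from 0.2 to roughly 0.6, while `u3` stays disconnected. The resulting weighted homogeneous graph could then be passed to any standard graph embedding method.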
Notes
developer.twitter.com.
Abbreviations
- \(\mathcal {N}\): Target set
- \(\mathcal {A}_i\): Information set
- \(\mathcal {M}\): Main attribute set
- \(\mathcal {R}\): Relational attribute set
- \(\mathcal {T}\): Textual attribute set
- C: Clustered set
- \(t_j\): Text object
- \(\textbf{t}_j\): Word vector of \(t_j\)
- \(\mathbf {t_j^e}\): Embedded vector of \(t_j\)
- \(\overrightarrow{\textsf {BERT}}(.)\): BERT embedding function
- \(\textsf {TF}(.)\): Rank weighted density function
- \(m_j\): Number of elements of \(t_j\)
- \(\textrm{t}_i^j\): i-th word of vector \(\textbf{t}_j\)
- \(\textrm{x}_{ih}^j\): h-th element of \(\overrightarrow{\textsf {BERT}}(\textrm{t}_i^j)\)
- \(\mathcal {B}_h^j\): h-th element of \(\mathbf {t_j^e}\)
- \(\mathbb {D}\): Feature space size
- \(\mathbb {f}(.)\): Term frequency in target set
- \(\mathbb {H}(.)\): Term frequency in feature space
- \(\mathbb {L}(.)\): Text length
- \(G =(\mathcal {V}, \mathcal {E}, \mathcal {W})\): Star heterogeneous graph
- \(G_c=(\mathcal {V}_c, \mathcal {E}_c, \mathcal {W})\): Core graph
- \(G_s^i=(\mathcal {V}^{i}_s, \mathcal {E}^{i}_s,\mathcal {W})\): \(\mathcal {M}_i\) shell graph
- \(\overline{G}_{c}=(\mathcal {V}_c, \mathcal {E}_c, \overline{\mathcal {W}})\): Homogeneous core graph
- \(V_\mathcal {N}\): Vertex of target set
- \(V_\mathcal {M}\): Vertex of main attribute set
- \(E_\mathcal {I}\): Internal link set
- \(E_\mathcal {O}\): External link set
- \(d_{xy}\): Euclidean distance of x, y
- \(w_\mathcal {R}(.,.)\): Relational weight
- \(w_\mathcal {T}(.,.)\): Textual weight
- \(w_\mathcal {J}(.,.)\): Joint presence weight
- w(., .): Total weight
- \(p({{\mathcal {M}}_i})\): \({\mathcal {M}}_i\)-path
- \(\rho \langle .,p({{\mathcal {M}}_i}),.\rangle \): \({\mathcal {M}}_i\) auxiliary path
- \(\rho _i\langle .,.\rangle \): \({\mathcal {M}}_i\) shortest path
- \(\mathcal {R}(G)\): Remapped graph
- \(\pi (.)\): Remapped function
- \(W(.), \overline{W}(.)\): Path weight function
- \(\mathcal {H}\): Spanner graph
- \(P_i=\{c_i, \sigma _i, \kappa _i\}\): Truncation parameters
- \(\alpha =\{\alpha _\mathcal {R}, \alpha _\mathcal {T}, \alpha _\mathcal {J}\}\): Weighting coefficients
- \(\beta _i\): Main attribute impact factor
- \(\theta _i\): Scaling parameter
- \(\mu _i\): Spanner parameter
- \(\mathbb {N}\): Normalization operator
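The list above names three partial weights (\(w_\mathcal {R}\), \(w_\mathcal {T}\), \(w_\mathcal {J}\)), the weighting coefficients \(\alpha\), and a normalization operator \(\mathbb {N}\), but this section does not state how they are combined. One plausible reading, offered only as a sketch, is a weighted sum of normalized partial weights. Both the weighted-sum form and the scale-by-maximum normalization below are assumptions, and the function names are hypothetical.

```python
def normalize(weights):
    """Scale edge weights so the largest becomes 1 (assumed form of N)."""
    top = max(weights.values(), default=0.0)
    if top <= 0.0:
        return dict(weights)
    return {edge: w / top for edge, w in weights.items()}


def total_weight(w_rel, w_text, w_joint, alpha=(1 / 3, 1 / 3, 1 / 3)):
    """Combine relational, textual and joint-presence weights per edge.

    Assumed rule: w = a_R * N(w_R) + a_T * N(w_T) + a_J * N(w_J),
    where an edge missing from one partial weight contributes 0 there.
    """
    a_r, a_t, a_j = alpha
    parts = [(a_r, normalize(w_rel)),
             (a_t, normalize(w_text)),
             (a_j, normalize(w_joint))]
    edges = set().union(*(p for _, p in parts))
    return {e: sum(a * p.get(e, 0.0) for a, p in parts) for e in edges}


if __name__ == "__main__":
    w_rel = {("a", "b"): 2.0, ("a", "c"): 1.0}
    w_text = {("a", "b"): 0.5}
    w_joint = {("a", "c"): 4.0}
    print(total_weight(w_rel, w_text, w_joint, alpha=(0.5, 0.25, 0.25)))
```

With these toy inputs, edge `("a", "b")` receives a total weight of 0.75 and edge `("a", "c")` receives 0.5. Normalizing each partial weight first keeps any single aspect from dominating purely because of its scale.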
About this article
Cite this article
Baharifard, F., Motaghed, V. Similarity enhancement of heterogeneous networks by weighted incorporation of information. Knowl Inf Syst 66, 3133–3156 (2024). https://doi.org/10.1007/s10115-023-02050-x
DOI: https://doi.org/10.1007/s10115-023-02050-x