Advertisement

Data Mining and Knowledge Discovery

, Volume 32, Issue 6, pp 1735–1767 | Cite as

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

  • Chenguang Wang
  • Yangqiu Song
  • Haoran Li
  • Ming Zhang
  • Jiawei Han
Article
  • 408 Downloads

Abstract

Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial to human. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.

Keywords

Heterogeneous information network Similarity Text categorization 

Notes

Acknowledgements

Chenguang Wang, Haoran Li, and Ming Zhang gratefully acknowledge the support by the National Natural Science Foundation of China (NSFC Grant Nos. 61772039, 91646202 and 61472006). Yangqiu Song was supported by China 973 Fundamental R&D Program (No. 2014CB340304) and the Early Career Scheme (ECS, No.  26206717) from Research Grants Council in Hong Kong. Jiawei Han was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and Grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied by these agencies. We also thank the conference versions’ and journal version’s anonymous reviewers for their valuable comments and suggestions that help improve the quality of this manuscript.

References

  1. Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. In: Mining text data. Springer, pp 163–222Google Scholar
  2. Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Proceedings of the IEEE annual symposium on foundations of computer science (FOCS), pp 475–486Google Scholar
  3. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Zachary I (2007) A nucleus for a web of open data. Springer, DbpediaGoogle Scholar
  4. Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2670–2676Google Scholar
  5. Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 59–68Google Scholar
  6. Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on Freebase from question-answer pairs. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1533–1544Google Scholar
  7. Berg C, Christensen JP, Ressel P (1984) Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100 of graduate texts in mathematics, 1st edn. Springer, BerlinCrossRefGoogle Scholar
  8. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res (JMLR) 3:993–1022zbMATHGoogle Scholar
  9. Bollacker KD, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1247–1250Google Scholar
  10. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, CambridgeCrossRefGoogle Scholar
  11. Budanitsky A, Hirst G (2006) Evaluating wordnet-based measures of lexical semantic relatedness. Comput Linguist 32(1):13–47CrossRefGoogle Scholar
  12. Cai H, Zheng VW, Chang KC (2018) A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Trans Knowl Data Eng (TKDE)Google Scholar
  13. Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the international joint conference on artificial intelligence (IJCAI) workshop on information integration, pp 73–78Google Scholar
  14. Cui P, Wang X, Pei J, Wenwu Z (2017) A survey on network embedding. CoRR. arXiv:1711.08752
  15. Do Q, Roth D, Sammons M, Tu Y, Vydiswaran VGV (2009) Robust, light-weight approaches to compute lexical similarity. In: Computer science research and technical reports. University of Illinois, pp 94–94Google Scholar
  16. Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, Strohmann T, Sun S, Zhang W (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 601–610Google Scholar
  17. Dong Y, Chawla NV, Swami A (2017) metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 135–144Google Scholar
  18. Etzioni O, Cafarella M, Downey D (2004) Webscale information extraction in knowitall (preliminary results). In: Proceedings of the international world wide web conference (WWW), pp 100–110Google Scholar
  19. Fu T-Y, Lee W-C, Lei Z (2017) Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1797–1806Google Scholar
  20. Gabrilovich E, Markovitch S (2005) Feature generation for text categorization using world knowledge. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1048–1053Google Scholar
  21. Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1606–1611Google Scholar
  22. Ganesan P, Garcia-Molina H, Widom J (2003) Exploiting hierarchical domain structure to compute similarity. ACM Trans Inf Syst (TOIS) 21(1):64–93CrossRefGoogle Scholar
  23. Goyal P, Ferrara E (2018) Graph embedding techniques, applications, and performance: a survey. Knowl Based Syst 151:78–94CrossRefGoogle Scholar
  24. Han J, Sun Y, Yan X, Yu PS (2010) Mining knowledge from databases: an information network analysis approach. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1251–1252Google Scholar
  25. Hassan S, Mihalcea R, Banea C (2007) Random walk term weighting for improved text classification. In: Proceedings of the international conference on semantic computing (ICSC), pp 242–249Google Scholar
  26. He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Proceedings of the neural information processing systems (NIPS), pp 507–514Google Scholar
  27. Hotho A, Staab S, Stumme G (2003) Ontologies improve text document clustering. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 541–544Google Scholar
  28. Hu J, Fang L, Cao Y, Zeng H-J, Li H, Yang Q, Chen Z (2008) Enhancing text clustering by leveraging Wikipedia semantics. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (SIGIR), pp 179–186Google Scholar
  29. Hu X, Sun N, Zhang C, Chua T-S (2009a) Exploiting internal and external semantics for the clustering of short texts using world knowledge. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 919–928Google Scholar
  30. Hu X, Zhang X, Lu C, Park EK, Zhou X (2009b) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 389–396Google Scholar
  31. Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the New Zealand computer science research student conference (NZCSRSC), pp 49–56Google Scholar
  32. Huang Z, Mamoulis N (2017) Heterogeneous information network embedding for meta path based proximity. CoRR. arXiv:1701.05291
  33. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Springer, BerlinGoogle Scholar
  34. Kim J, Rousseau F, Vazirgiannis M (2015) Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 775–780Google Scholar
  35. Kong X, Yu PS, Ding Y, Wild DJ (2012) Meta path-based collective classification in heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1567–1571Google Scholar
  36. Kong X, Zhang J, Yu PS (2013) Inferring anchor links across multiple heterogeneous social networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 179–188Google Scholar
  37. Lakkaraju P, Gauch S, Speretta M (2008) Document similarity based on concept tree distance. In: Proceedings of the nineteenth ACM conference on hypertext and hypermedia, pp 127–132Google Scholar
  38. Lang K (1995) Newsweeder: learning to filter netnews. In: Proceedings of the international conference on machine learning (ICML), pp 331–339CrossRefGoogle Scholar
  39. Lao N, Cohen WW (2010) Relational retrieval using a combination of path-constrained random walks. Mach Learn 81(1):53–67MathSciNetCrossRefGoogle Scholar
  40. Lao N, Mitchell T, Cohen WW (2011) Random walk inference and learning in a large scale knowledge base. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 529–539Google Scholar
  41. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res (JMLR) 5:361–397Google Scholar
  42. Li Y, Wang C, Han F, Han J, Roth D, Yan X (2013) Mining evidences for named entity disambiguation. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1070–1078Google Scholar
  43. Lu Q, Getoor L (2003) Link-based classification. In: Proceedings of the international conference on machine learning (ICML), pp 496–503Google Scholar
  44. Luss R, d’Aspremont A (2008) Support vector machine classification with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 953–960Google Scholar
  45. McCallum A, Nigam K, et al (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the association for the advancement of artificial intelligence (AAAI) workshop, vol 752, pp 41–48Google Scholar
  46. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the neural information processing systems (NIPS), pp 3111–3119Google Scholar
  47. Mitchell TM, Cohen WW, Hruschka ER Jr, Talukdar PP, Betteridge J, Carlson A, Mishra BD, Gardner M, Kisiel B, Krishnamurthy J, Lao N, Mazaitis K, Mohamed T, Nakashole N, Platanios EA, Ritter A, Samadi M, Settles B, Wang RC, Wijaya DT, Gupta A, Chen X, Saparov A, Greaves M, Welling J (2015) Never-ending learning. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2302–2310Google Scholar
  48. Mooney RJ (2007) Learning for semantic parsing. In: Proceedings of the international conference on computational linguistics and intelligent text processing (CICLing), pp 311–324CrossRefGoogle Scholar
  49. Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127–152MathSciNetCrossRefGoogle Scholar
  50. Paul C, Rettinger A, Mogadala A, Knoblock CA, Pedro S (2016) Efficient graph-based document similarity. In: International conference on the semantic web. latest advances and new domains, pp 334–349Google Scholar
  51. Ponzetto SP, Strube M (2007) Deriving a large-scale taxonomy from Wikipedia. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 1440–1445Google Scholar
  52. Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the SIGNLL conference on computational natural language learning (CoNLL), pp 147–155Google Scholar
  53. Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: Annual meeting of the association for computational linguistics (ACL), pp 1702–1712Google Scholar
  54. Sahami M (1998) Using machine learning to improve information access. PhD thesis, Stanford UniversityGoogle Scholar
  55. Schuhmacher M, Ponzetto SP (2014) Knowledge-based graph document modeling. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 543–552Google Scholar
  56. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47MathSciNetCrossRefGoogle Scholar
  57. Shang J, Qu M, Liu J, Kaplan LM, Han J, Peng J (2016) Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. CoRR. arXiv:610.09769
  58. Siolas G, d’Alché-Buc F (2000) Support vector machines based on a semantic kernel for text categorization. In: Proceedings of international joint conference on neural networks (IJCNN), pp 205–209Google Scholar
  59. Song Y, Roth D (2015) Unsupervised sparse vector densification for short text similarity. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL-HLT), pp 1275–1280Google Scholar
  60. Song Y, Pan S, Liu S, Zhou MX, Qian W (2009) Topic and keyword re-ranking for LDA-based topic modeling. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 1757–1760Google Scholar
  61. Song Y, Wang H, Wang Z, Li H, Chen W (2011) Short text conceptualization using a probabilistic knowledgebase. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2330–2336Google Scholar
  62. Song Y, Wang S, Wang H (2015) Open domain short text conceptualization: a generative + descriptive modeling approach. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3820–3826Google Scholar
  63. Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res (JMLR) 3:583–617MathSciNetzbMATHGoogle Scholar
  64. Strehl A, Ghosh J, Mooney R (2000) Impact of similarity measures on web-page clustering. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 58–64Google Scholar
  65. Suchanek FM, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the international world wide web conference (WWW), pp 697–706Google Scholar
  66. Sun Y, Han J, Yan X, Yu PS, Tianyi W (2011) Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow (PVLDB) 4(11):992–1003Google Scholar
  67. Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2012) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1348–1356Google Scholar
  68. Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2013) Pathselclus: integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Trans Knowl Discov Data (TKDD) 7(3):11:1–11:23Google Scholar
  69. Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1165–1174Google Scholar
  70. Traverso I, Vidal M-E, Kämpgen B, Sure-Vetter Y (2016) GADES: a graph-based semantic similarity measure. In: International conference on semantic systems (SEMANTiCS), pp 101–104Google Scholar
  71. Tu K, Cui P, Wang X, Wang F, Zhu W (2018) Structural deep embedding for hyper-networks. In: Proceedings of the association for the advancement of artificial intelligence (AAAI)Google Scholar
  72. Wan X, Peng Y (2005) The earth mover’s distance as a semantic measure for document similarity. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 301–302Google Scholar
  73. Wang C, Duan N, Zhou M, Zhang M (2013) Paraphrasing adaptation for web search ranking. In: Annual meeting of the association for computational linguistics (ACL), pp 41–46Google Scholar
  74. Wang C, Song Y, El-Kishky A, Roth D, Zhang M, Han J (2015a) Incorporating world knowledge to document clustering via heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1215–1224Google Scholar
  75. Wang C, Song Y, Li H, Zhang M, Han J (2015b) Knowsim: a document similarity measure on structured heterogeneous information networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 506–513Google Scholar
  76. Wang C, Song Y, Roth D, Wang C, Han J, Ji H, Zhang M (2015c) Constrained information-theoretic tripartite graph clustering to identify semantically similar relations. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3882–3889Google Scholar
  77. Wang C, Song Y, Li H, Zhang M, Han J (2016a) Text classification with heterogeneous information network kernels. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2130–2136Google Scholar
  78. Wang C, Song Y, Roth D, Zhang M, Han J (2016) World knowledge as indirect supervision for document clustering. ACM Trans Knowl Discov Data (TKDD) 11(2):13:1–13:36Google Scholar
  79. Wang C, Song Y, Li H, Sun Y, Zhang M, Han J (2017) Distant meta-path similarities for text-based heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1629–1638Google Scholar
  80. Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 713–721Google Scholar
  81. Wang P, Hu J, Zeng H-J, Chen L, Chen Z (2007) Improving text classification by using encyclopedia knowledge. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 332–341Google Scholar
  82. Wang W, Do DB, Lin X (2005) Term graph model for text classification. In: Proceedings of the international conference on advanced data mining and applications (ADMA), pp 19–30CrossRefGoogle Scholar
  83. Ying Y, Campbell C, Girolami M (2009) Analysis of SVM with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 2205–2213Google Scholar
  84. Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. In: Saul LK, Weiss Y, Bottou L (eds) Proceedings of the neural information processing systems (NIPS), pp 1601–1608Google Scholar
  85. Zhang J, Kong X, Yu PS (2013) Predicting social links for new users across aligned heterogeneous social networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 1289–1294Google Scholar
  86. Zhang J, Kong X, Yu PS (2014) Transferring heterogeneous links across location-based social networks. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 303–312Google Scholar

Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. 1.Amazon AIEast Palo AltoUSA
  2. 2.Department of CSEHKUSTClear Water BayHong Kong
  3. 3.School of EECSPeking UniversityBeijingChina
  4. 4.Department of CSUIUCUrbanaUSA

Personalised recommendations