Data Mining and Knowledge Discovery, Volume 29, Issue 4, pp 976–998

Mining strong relevance between heterogeneous entities from unstructured biomedical data

  • Ming Ji
  • Qi He
  • Jiawei Han
  • Scott Spangler


Huge volumes of biomedical text discussing different biomedical entities are generated every day. Hidden in these unstructured data are strong relevance relationships between those entities, which are critical for many applications, including building knowledge bases for the biomedical domain and semantic search among biomedical entities. In this paper, we study the problem of discovering strong relevance between heterogeneous typed biomedical entities from massive biomedical text data. We first build an entity correlation graph from the data, in which the collection of paths linking two heterogeneous entities offers rich semantic context for their relationship, especially those paths that follow the patterns of the top-\(k\) selected meta paths inferred from the data. Guided by these meta paths, we design a novel relevance measure, named \({\mathsf {EntityRel}}\), to compute the strong relevance between two heterogeneous entities. Our intuition is that two entities of heterogeneous types are strongly relevant if they have strong direct links or if they are closely linked to other strongly relevant heterogeneous entities along paths following the selected patterns. We provide experimental results on mining strong relevance between drugs and diseases. More than 20 million MEDLINE abstracts and 5 types of biological entities (Drug, Disease, Compound, Target, MeSH) are used to construct the entity correlation graph. A prototype drug search engine for disease queries is implemented. Extensive comparisons are made against multiple state-of-the-art methods on the task of Drug–Disease relevance discovery.
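
To make the idea of scoring along meta paths concrete, the following is a minimal sketch and not the \({\mathsf {EntityRel}}\) measure itself, whose definition is not given in the abstract. It assumes that edges of the entity correlation graph are weighted by co-mention counts, and it scores a Drug–Disease pair with a plain path-constrained random walk along one meta path. The function names `build_graph` and `meta_path_score`, the placeholder entities (`d1`, `t1`, `s1`, `s2`) and the toy documents are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(docs):
    """Build an undirected co-mention graph.

    docs: list of documents, each a list of (entity_type, name) tuples.
    Edge weight = number of documents mentioning both entities
    (an assumption for illustration, not the paper's construction).
    """
    weights = defaultdict(lambda: defaultdict(int))
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def meta_path_score(graph, source, target, meta_path):
    """Probability that a random walk starting at `source`, forced at each
    step to move to a neighbor of the next type in `meta_path`, ends at
    `target`. Transition probabilities are proportional to edge weights."""
    if source[0] != meta_path[0]:
        return 0.0
    frontier = {source: 1.0}
    for next_type in meta_path[1:]:
        new_frontier = defaultdict(float)
        for node, prob in frontier.items():
            # Keep only neighbors whose type matches the meta path step.
            nbrs = {n: w for n, w in graph.get(node, {}).items()
                    if n[0] == next_type}
            total = sum(nbrs.values())
            for n, w in nbrs.items():
                new_frontier[n] += prob * w / total
        frontier = new_frontier
    return frontier.get(target, 0.0)


# Made-up toy corpus: each inner list is the set of entities in one abstract.
toy_docs = [
    [("Drug", "d1"), ("Target", "t1"), ("Disease", "s1")],
    [("Drug", "d1"), ("Target", "t1")],
    [("Target", "t1"), ("Disease", "s2")],
]
graph = build_graph(toy_docs)

direct = meta_path_score(graph, ("Drug", "d1"), ("Disease", "s2"),
                         ("Drug", "Disease"))
via_target = meta_path_score(graph, ("Drug", "d1"), ("Disease", "s2"),
                             ("Drug", "Target", "Disease"))
print(direct, via_target)
```

In this toy corpus, `d1` and `s2` are never mentioned together, so the direct Drug–Disease score is zero, while the Drug–Target–Disease meta path still connects them through the shared target `t1`. This is the kind of indirect evidence the abstract's intuition refers to.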


Biomedical text data, Heterogeneous, Meta path, Relevance, Context-aware



Research was sponsored in part by the Army Research Lab, under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1017362, IIS-1320617, IIS-1354329, HDTRA1-10-1-0120, and NIH Big Data to Knowledge (BD2K) (U54).



Copyright information

© The Author(s) 2015

Authors and Affiliations

  1. University of Illinois at Urbana-Champaign, Urbana, USA
  2. LinkedIn Inc., Mountain View, USA
  3. IBM Almaden Research Center, San Jose, USA
