Towards Creating a New Triple Store for Literature-Based Discovery

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2020)

Abstract

Literature-based discovery (LBD) is a field of research that aims to discover new knowledge by mining scientific literature. LBD systems commonly rely on knowledge bases. SemMedDB, created with the SemRep information extraction system, is the database most frequently used in LBD. However, new applications of LBD are emerging that go beyond the scope of SemMedDB. In this work, we propose new discovery patterns in the domain of Natural Products that are not covered by existing databases and tools. Our goal is thus to create a new, extended knowledge base that addresses the limitations of SemMedDB. Our proposed contribution is three-fold: 1) we add types of entities and relations that are of interest for LBD but are not covered by SemMedDB; 2) we plan to leverage the full texts of scientific publications instead of titles and abstracts only; 3) we envisage using the RDF model for our database, in accordance with Semantic Web standards. To create the new database, we plan to build a distantly supervised entity and relation extraction system based on a deep neural network architecture. We describe the methods and tools we plan to employ.
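
As a rough illustration of two ideas named in the abstract, distant supervision and RDF-style triples, the following Python sketch labels a sentence with any seed triple whose subject and object both appear in it, then prints the result in N-Triples syntax. Everything here is an assumption made for illustration: the `http://example.org/lbd/` namespace, the `contains` and `affects` predicates, the toy seed triples, and the string-matching heuristic are not taken from the paper, which plans a neural extraction system rather than this simple matcher.

```python
from typing import List, NamedTuple

# Hypothetical namespace; the paper does not specify any URIs.
BASE = "http://example.org/lbd/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as an N-Triples line with IRIs in all positions."""
    return (
        f"<{BASE}{triple.subject}> "
        f"<{BASE}{triple.predicate}> "
        f"<{BASE}{triple.obj}> ."
    )


def distant_labels(sentence: str, known: List[Triple]) -> List[Triple]:
    """Distant supervision heuristic: a sentence that mentions both the
    subject and the object of a known triple is labeled with that triple."""
    text = sentence.lower()
    return [t for t in known if t.subject in text and t.obj in text]


# Toy seed knowledge base, invented for illustration only.
known = [
    Triple("turmeric", "contains", "curcumin"),
    Triple("curcumin", "affects", "inflammation"),
]

sentence = "Curcumin, isolated from turmeric, was studied in rats."
labels = distant_labels(sentence, known)
lines = [to_ntriples(t) for t in labels]
print("\n".join(lines))
```

Only the first seed triple matches here, because the sentence never mentions inflammation. In practice such heuristic labels are noisy (co-occurrence does not guarantee that the sentence expresses the relation), which is one reason a learned model would be trained on them rather than using them directly.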

This work is funded by a grant (9710.3.01.5.0001.08) from Health@N, ZHAW.

Notes

  1. https://arxiv.org/
  2. https://www.biorxiv.org/
  3. https://www.medrxiv.org
  4. https://www.drugbank.ca/
  5. https://www.w3.org/standards/semanticweb/
  6. https://spacy.io/
  7. See https://allenai.github.io/scispacy/

Author information

Correspondence to Anna Koroleva or Manuel Gil.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Koroleva, A., Anisimova, M., Gil, M. (2020). Towards Creating a New Triple Store for Literature-Based Discovery. In: Lu, W., Zhu, K.Q. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol 12237. Springer, Cham. https://doi.org/10.1007/978-3-030-60470-7_5

  • DOI: https://doi.org/10.1007/978-3-030-60470-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60469-1

  • Online ISBN: 978-3-030-60470-7

  • eBook Packages: Computer Science, Computer Science (R0)
