Abstract
Literature-based discovery (LBD) is a research field that aims to discover new knowledge by mining the scientific literature. LBD systems commonly rely on knowledge bases. SemMedDB, created with the SemRep information extraction system, is the database most frequently used in LBD. However, new applications of LBD are emerging that go beyond the scope of SemMedDB. In this work, we propose new discovery patterns in the domain of Natural Products that are not covered by existing databases and tools. Our goal is therefore to create a new, extended knowledge base that addresses the limitations of SemMedDB. Our proposed contribution is three-fold: 1) we add types of entities and relations that are of interest for LBD but are not covered by SemMedDB; 2) we plan to leverage the full texts of scientific publications instead of titles and abstracts only; 3) we envisage using the RDF model for our database, in accordance with Semantic Web standards. To create the new database, we plan to build a distantly supervised entity and relation extraction system based on a deep learning (neural network) architecture. We describe the methods and tools we plan to employ.
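As a rough illustration of the two ideas named in the abstract, distant supervision and RDF-style triples, the following minimal sketch labels a sentence with any seed relation whose two entities both occur in it and serializes the result as an N-Triples line. It is not the system described in the paper: the namespace `http://example.org/lbd/`, the seed fact, the relation name `isolated_from`, and the helper names `distant_labels` and `to_ntriples` are all assumptions made for this example, and simple substring matching stands in for a real entity recognizer.

```python
from typing import NamedTuple
from urllib.parse import quote


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace and seed fact, invented for illustration only.
BASE = "http://example.org/lbd/"
KNOWN_FACTS = {("curcumin", "Curcuma longa"): "isolated_from"}


def distant_labels(sentence, facts=KNOWN_FACTS):
    """Return every seed triple whose subject and object both occur in the sentence.

    This is the basic distant-supervision heuristic: a sentence mentioning
    both entities of a known fact is assumed to express that fact.
    """
    text = sentence.lower()
    return [
        Triple(subj, rel, obj)
        for (subj, obj), rel in facts.items()
        if subj.lower() in text and obj.lower() in text
    ]


def to_ntriples(triple):
    """Serialize one triple as an N-Triples line, percent-encoding local names."""
    def iri(name):
        return "<" + BASE + quote(name) + ">"

    return f"{iri(triple.subject)} {iri(triple.predicate)} {iri(triple.obj)} ."


if __name__ == "__main__":
    sentence = "Curcumin was extracted from the rhizome of Curcuma longa."
    for t in distant_labels(sentence):
        print(to_ntriples(t))
```

The heuristic is deliberately noisy: a sentence can mention both entities without expressing the relation, which is why distantly labeled data is normally used to train a statistical extractor rather than taken as ground truth.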
This work is funded by a grant (9710.3.01.5.0001.08) from Health@N, ZHAW.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Koroleva, A., Anisimova, M., Gil, M. (2020). Towards Creating a New Triple Store for Literature-Based Discovery. In: Lu, W., Zhu, K.Q. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol 12237. Springer, Cham. https://doi.org/10.1007/978-3-030-60470-7_5
DOI: https://doi.org/10.1007/978-3-030-60470-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60469-1
Online ISBN: 978-3-030-60470-7
eBook Packages: Computer Science, Computer Science (R0)