Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project
Scientific reproducibility is key to the advancement of science, because researchers can build on sound and validated results to design new research studies. However, recent studies in biomedical research have highlighted key challenges in scientific reproducibility: in a survey of more than 1500 participants, more than 70% of researchers were not able to reproduce results from other groups, and 50% were not able to reproduce their own experiments. Provenance metadata is a key component of scientific reproducibility. As part of the Provenance for Clinical and Health Research (ProvCaRe) project, we have: (1) identified and modeled important provenance terms associated with a biomedical research study in the S3 model (formalized in the ProvCaRe ontology); (2) developed a new natural language processing (NLP) workflow to identify and extract provenance metadata from published articles describing biomedical research studies; and (3) developed the ProvCaRe knowledge repository to enable users to query and explore the provenance of research studies using the S3 model. A key challenge in this project is the automated classification of the provenance metadata extracted by the NLP workflow according to the S3 model, and its subsequent querying in the ProvCaRe knowledge repository. In this paper, we describe the development and comparative evaluation of deep learning techniques for multi-class classification of structured provenance metadata extracted from biomedical literature, using 12 different categories of provenance terms represented in the S3 model. We describe the application of the Long Short-Term Memory (LSTM) network, which had the highest classification accuracy (86%) in our evaluation, to classify more than 48 million provenance triples in the ProvCaRe knowledge repository (available at: https://provcare.case.edu/).
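To make the classification task concrete, the following is a minimal sketch of how provenance triples might be prepared as input for a sequence classifier such as an LSTM, and how classification accuracy could be computed. It is an illustration under stated assumptions, not the ProvCaRe implementation: the category labels are placeholders (the 12 S3 categories are not named in this abstract), the example triples are invented, and the helper names `triple_to_tokens`, `build_vocab`, `encode`, and `accuracy` are hypothetical.

```python
from collections import Counter

# Placeholder labels only: the 12 S3 categories are not listed in this abstract.
CATEGORIES = [f"category_{i}" for i in range(12)]


def triple_to_tokens(triple):
    """Flatten a (subject, predicate, object) triple into lowercase word tokens."""
    return " ".join(triple).lower().split()


def build_vocab(triples):
    """Map each token to an integer id; 0 is reserved for padding, 1 for unknown."""
    counts = Counter(tok for t in triples for tok in triple_to_tokens(t))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}


def encode(triple, vocab, max_len=8):
    """Convert a triple to a fixed-length id sequence, truncating or zero-padding."""
    ids = [vocab.get(tok, 1) for tok in triple_to_tokens(triple)][:max_len]
    return ids + [0] * (max_len - len(ids))


def accuracy(predicted, gold):
    """Fraction of triples whose predicted category index matches the gold index."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented example triples, for illustration only.
triples = [
    ("study", "enrolled", "120 participants"),
    ("data", "analyzed with", "SAS"),
]
vocab = build_vocab(triples)
encoded = [encode(t, vocab) for t in triples]
```

The fixed-length integer sequences produced by `encode` are the kind of input an embedding layer followed by a recurrent layer would typically consume, with one output unit per category.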
Keywords: Scientific reproducibility; Semantic provenance; Provenance for Clinical and Health Research; Provenance triple classification; Deep learning
This work is supported in part by the NIH-NIBIB Big Data to Knowledge (BD2K) 1U01EB020955 grant, NSF grant #1636850, and the NIH-NHLBI R24HL114473 grant.