Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project

Valdez, Joshua; Kim, Matthew; Rueschman, Michael; Redline, Susan; Sahoo, Satya S.

doi:10.1007/978-3-319-98379-0_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11017)

Included in the following conference series:

International Provenance and Annotation Workshop

Abstract

Scientific reproducibility is key to the advancement of science, as researchers can build on sound and validated results to design new research studies. However, recent studies in biomedical research have highlighted key challenges in scientific reproducibility: in a survey of more than 1500 participants, more than 70% of researchers were not able to reproduce results from other groups and 50% were not able to reproduce their own experiments. Provenance metadata is a key component of scientific reproducibility, and as part of the Provenance for Clinical and Health Research (ProvCaRe) project we have: (1) identified and modeled important provenance terms associated with a biomedical research study in the S3 model (formalized in the ProvCaRe ontology); (2) developed a new natural language processing (NLP) workflow to identify and extract provenance metadata from published articles describing biomedical research studies; and (3) developed the ProvCaRe knowledge repository to enable users to query and explore the provenance of research studies using the S3 model. A key challenge in this project is the automated classification of the provenance metadata extracted by the NLP workflow according to the S3 model, and its subsequent querying in the ProvCaRe knowledge repository. In this paper, we describe the development and comparative evaluation of deep learning techniques for multi-class classification of structured provenance metadata extracted from biomedical literature, using 12 different categories of provenance terms represented in the S3 model. We describe the application of the Long Short-Term Memory (LSTM) network, which has the highest classification accuracy (86%) in our evaluation, to classify more than 48 million provenance triples in the ProvCaRe knowledge repository (available at: https://provcare.case.edu/).
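
As a rough illustration of the multi-class setup described in the abstract, the sketch below shows how extracted triples might be turned into fixed-length integer sequences (the usual input form for a sequence classifier such as an LSTM) and how classification accuracy over the 12 categories can be computed. It is not the ProvCaRe implementation: the tokenization, the reserved token IDs, the `max_len` value, and all function names are assumptions made for illustration, and the categories appear only as integer indices because the abstract does not name them.

```python
from collections import Counter

NUM_CLASSES = 12   # the abstract states 12 provenance categories (names not given there)
PAD, UNK = 0, 1    # reserved token IDs for padding and unseen tokens (assumption)


def tokenize(triple):
    """Flatten a (subject, predicate, object) triple into lowercase word tokens."""
    return [tok for part in triple for tok in part.lower().split()]


def build_vocab(triples):
    """Map every token seen in the triples to an integer ID, in sorted order."""
    counts = Counter(tok for triple in triples for tok in tokenize(triple))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok in sorted(counts):
        vocab[tok] = len(vocab)
    return vocab


def encode(triple, vocab, max_len=8):
    """Convert a triple to exactly max_len token IDs, truncating or padding."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(triple)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def accuracy(predicted, gold):
    """Fraction of triples whose predicted category index matches the gold one."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    if any(not 0 <= label < NUM_CLASSES for label in list(predicted) + list(gold)):
        raise ValueError("category indices must lie in [0, NUM_CLASSES)")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Hypothetical triples, invented for this example only.
    triples = [
        ("participants", "underwent", "overnight polysomnography"),
        ("blood pressure", "was measured with", "sphygmomanometer"),
    ]
    vocab = build_vocab(triples)
    for triple in triples:
        print(encode(triple, vocab))
    print(accuracy([3, 7, 7, 11], [3, 7, 0, 11]))
```

The encoded sequences would then be fed to whichever classifier is under comparison; the accuracy function corresponds to the kind of figure the abstract reports, computed on held-out labelled triples.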

Acknowledgement

This work is supported in part by the NIH-NIBIB Big Data to Knowledge (BD2K) 1U01EB020955 grant, NSF grant #1636850, and the NIH-NHLBI R24HL114473 grant.

Author information

Authors and Affiliations

Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA
Joshua Valdez & Satya S. Sahoo
Department of Medicine, Brigham and Women’s Hospital, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, USA
Matthew Kim, Michael Rueschman & Susan Redline

Corresponding author

Correspondence to Satya S. Sahoo.

Editor information

Editors and Affiliations

Paris Dauphine University, Paris, France
Khalid Belhajjame
SRI International, Menlo Park, CA, USA
Ashish Gehani
University of Luxembourg, Belvaux, Luxembourg
Pinar Alper

About this paper

Cite this paper

Valdez, J., Kim, M., Rueschman, M., Redline, S., Sahoo, S.S. (2018). Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_3

DOI: https://doi.org/10.1007/978-3-319-98379-0_3
Published: 06 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98378-3
Online ISBN: 978-3-319-98379-0
eBook Packages: Computer Science, Computer Science (R0)
