Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2018)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11017))

Abstract

Scientific reproducibility is key to the advancement of science because researchers can build on sound, validated results to design new research studies. However, recent studies in biomedical research have highlighted key challenges in scientific reproducibility: in a survey of more than 1500 participants, more than 70% of researchers were unable to reproduce results from other groups and 50% were unable to reproduce their own experiments. Provenance metadata is a key component of scientific reproducibility. As part of the Provenance for Clinical and Health Research (ProvCaRe) project, we have: (1) identified and modeled important provenance terms associated with a biomedical research study in the S3 model (formalized in the ProvCaRe ontology); (2) developed a new natural language processing (NLP) workflow to identify and extract provenance metadata from published articles describing biomedical research studies; and (3) developed the ProvCaRe knowledge repository to enable users to query and explore the provenance of research studies using the S3 model. A key remaining challenge in this project is the automated classification of the provenance metadata extracted by the NLP workflow according to the S3 model, and its subsequent querying in the ProvCaRe knowledge repository. In this paper, we describe the development and comparative evaluation of deep learning techniques for multi-class classification of structured provenance metadata extracted from biomedical literature, using 12 different categories of provenance terms represented in the S3 model. We describe the application of the Long Short-Term Memory (LSTM) network, which achieved the highest classification accuracy (86%) in our evaluation, to classify more than 48 million provenance triples in the ProvCaRe knowledge repository (available at: https://provcare.case.edu/).
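As an illustration only, the following minimal sketch shows one way a provenance triple could be prepared as input for a sequence classifier such as an LSTM. It is not the ProvCaRe implementation: the `Triple` class, the whitespace tokenizer, the vocabulary scheme, and the example triple are all hypothetical, and only the number of target categories (12) is taken from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List

NUM_CLASSES = 12  # the abstract reports 12 categories of provenance terms
PAD_ID, UNK_ID = 0, 1  # hypothetical reserved ids for padding and unknown tokens


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one extracted provenance triple."""
    subject: str
    predicate: str
    obj: str


def tokenize(triple: Triple) -> List[str]:
    """Flatten a triple into lowercase whitespace-separated tokens (an assumption)."""
    return f"{triple.subject} {triple.predicate} {triple.obj}".lower().split()


def build_vocab(triples: List[Triple]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the training triples."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for triple in triples:
        for token in tokenize(triple):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(triple: Triple, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(token, UNK_ID) for token in tokenize(triple)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def one_hot(label: int, num_classes: int = NUM_CLASSES) -> List[int]:
    """Encode a category index as a one-hot target vector."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in [0, {num_classes})")
    return [1 if i == label else 0 for i in range(num_classes)]
```

Fixed-length integer sequences paired with one-hot targets are a common input format for recurrent classifiers; the embedding layer, network architecture, and training setup used in the paper are not described in the abstract and are deliberately left out of this sketch.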


Acknowledgement

This work is supported in part by the NIH-NIBIB Big Data to Knowledge (BD2K) 1U01EB020955 grant, NSF grant #1636850, and the NIH-NHLBI R24HL114473 grant.

Corresponding author

Correspondence to Satya S. Sahoo.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Valdez, J., Kim, M., Rueschman, M., Redline, S., Sahoo, S.S. (2018). Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_3

  • DOI: https://doi.org/10.1007/978-3-319-98379-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98378-3

  • Online ISBN: 978-3-319-98379-0

  • eBook Packages: Computer Science, Computer Science (R0)
