Provenance and credibility in scientific data repositories

Abstract

Despite a long history of rich theoretical work on provenance, empirical research regarding users’ interactions with and judgments based upon provenance information in archives with scientific data is extremely limited. This article focuses on the relationship between provenance and credibility (i.e., trustworthiness and expertise) for scientists. Toward this end, the authors conducted semi-structured interviews with seventeen proteomics researchers who interact with data from ProteomeCommons.org, a large online repository. To analyze the resulting interview data, the authors apply Brian Hilligoss and Soo Young Rieh’s empirically tested theoretical framework for user credibility assessment. Findings from this study suggest that together with other information provided in ProteomeCommons.org and subjects’ own experiences and prior knowledge, provenance allows users to determine the credibility of datasets. Implications of this study stress the importance of the archival perspective of provenance and archival bond for aiding scientists in their credibility assessments of data housed in scientific data repositories.

Notes

  1. The actual number of users is likely higher, since users only need to register to upload datasets, not to download them.

References

  1. Bazeley P (2007) Qualitative data analysis with NVivo. Sage, Los Angeles

  2. Bearman DA, Lytle RH (1985) The power of the principle of provenance. Archivaria 21:14–27. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/11231. Accessed 28 July 2011

  3. Bertino E, Dai C, Kantarcioglu M (2009) The challenge of assuring data trustworthiness. In: Proceedings of the 14th International Conference on Database Systems for Advanced Applications. doi:10.1007/978-3-642-00887-0_2

  4. Bose R, Frew J (2005) Lineage retrieval for scientific data processing: a survey. ACM Comput Surv 37(1):1–28. doi:10.1145/1057977.1057978

  5. Bowers S, McPhillips T, Ludäscher B, Cohen S, Davidson SB (2006) A model for user-oriented data provenance in pipelined scientific workflows. In: Proceedings of the International Provenance and Annotation Workshop. http://repository.upenn.edu/cis_papers/290/. Accessed 28 July 2011

  6. Bowker GC (2005) Memory practices in the sciences. Inside technology. MIT Press, Cambridge, MA

  7. Brothman B (1991) Orders of value: probing the theoretical terms of archival practice. Archivaria 32:78–100. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/11761. Accessed 28 July 2011

  8. Buneman P, Khanna S, Tan WC (2001) Why and where: a characterization of data provenance. In: Proceedings of the 8th International Conference on Database Theory. http://portal.acm.org/citation.cfm?id=656274. Accessed 28 July 2011

  9. Buneman P, Chapman A, Cheney J (2006) Provenance management in curated databases. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1142473.1142534

  10. Caplan P (2009) Understanding PREMIS. www.loc.gov/standards/premis/understanding-premis.pdf. Accessed 28 July 2011

  11. Cook T (1993) The concept of the archival fonds: theory, description, and provenance in the post-custodial era. Archivaria 35:24–37. http://journals.sfu.ca/archivar/index.php/archivaria/article/view/11882/12835. Accessed 28 July 2011

  12. Cook T (2001) Archival science and postmodernism: new formulations for old concepts. Arch Sci 1:3–24. doi:10.1007/BF02435636

  13. Corti L (2007) Re-using archived qualitative data—where, how and why? Arch Sci 7:37–54. doi:10.1007/s10502-006-9038-y

  14. Dai C, Lin D, Bertino E, Kantarcioglu M (2008) An approach to evaluate data trustworthiness based on data provenance. In: Jonker W, Petković M (eds) Secure data management. Lecture notes in computer science 5159:82–89. doi:10.1007/978-3-540-85259-9_6

  15. Duranti L (1997) The archival bond. Arch Mus Inform 11:213–218. doi:10.1023/A:1009025127463

  16. Duranti L (2001) The impact of digital technology on archival science. Arch Sci 1:39–55. doi:10.1007/BF02435638

  17. Greenwood M, Goble C, Stevens R, Zhao J, Addis M, Marvin D, Moreau L et al (2003) Proceedings of the UK e-Science All Hands Meeting. doi:10.1.1.10.3526

  18. Heinis T, Alonso G (2008) Efficient lineage tracking for scientific workflows. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1376616.1376716

  19. Hilligoss B, Rieh SY (2008) Developing a unifying framework of credibility assessment: construct, heuristics, and interaction in context. Inform Process Manag 44(4):1467–1484. doi:10.1016/j.ipm.2007.10.001

  20. Lauriault T, Craig B, Taylor D, Pulsifer P (2007) Today’s data are part of tomorrow’s research: archival issues in the sciences. Archivaria 64:123–179. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/13156. Accessed 28 July 2011

  21. PREMIS Editorial Committee (2008) PREMIS data dictionary for preservation metadata version 2.0. Library of Congress, Washington, DC. http://www.loc.gov/standards/premis/v2/premis-2-0.pdf. Accessed 28 July 2011

  22. Rieh SY (2002) Judgment of information quality and cognitive authority in the Web. J Am Soc Inform Sci Technol 53(2):145. doi:10.1002/asi.10017.abs

  23. Rodriguez H, Andrews P, Kinsinger C (2010) Share the (Proteomics) data. Bio-IT World (September–October 2010). http://www.bio-itworld.com/2010/issues/sept-oct/proteomics.html. Accessed 28 July 2011

  24. Shankar K (2007) Order from chaos: the poetics and pragmatics of scientific recordkeeping. J Am Soc Inform Sci Technol 58(10):1457–1466. doi:10.1002/asi.20625

  25. Simmhan YL, Plale B, Gannon D (2005) A survey of data provenance in e-science. ACM SIGMOD Rec 34(3):31. doi:10.1145/1084805.1084812

  26. Smit E (2011) Abelard and Héloise: why data and publications belong together. D-Lib Mag 17(1/2). doi:10.1045/january2011-smit

  27. Society of American Archivists (2004) Describing archives: a content standard. Society of American Archivists, Chicago, IL

  28. Taylor CF, Paton NW, Lilley KS et al (2007) The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25:887–893. doi:10.1038/nbt1329

  29. Van House NA (2002) Digital libraries and practices of trust: networked biodiversity information. Soc Epistemol 16(1):99–114. doi:10.1080/02691720210132833

  30. Vardigan M, Whiteman C (2007) ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Arch Sci 7:73–87. doi:10.1007/s10502-006-9037-z

  31. Zimmerman AS (2008) New knowledge from old data: the role of standards in the sharing and reuse of ecological data. Sci Technol Human Values 33(5):631–652. doi:10.1177/0162243907306704

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 090362. The authors would like to thank Ann Zimmerman and Margaret Hedstrom for their guidance on the development of this project; Elizabeth Yakel and members of the Archives Research Group for their feedback on earlier drafts of this paper; and Philip Andrews and the staff of ProteomeCommons.org for their help and support.

Author information

Correspondence to Kathleen Fear.

Appendices

Appendix 1

  • (1) For this first question, I’ll ask you to refer to one of the documents I sent—example.pdf. When researchers write a paper and make the data accessible, they often provide a citation with some basic information about the dataset. This example shows one way of acknowledging a dataset the author used. In this case, it’s a longish description, and they’ve included information like the data producers’ names, experimental parameters and the hash code you could use to find the dataset in Tranche. But this is just one example of how to do it. What kinds of information would you include if you were providing a citation for a dataset? Is that information readily available? Is it typically available if you are using data generated outside of your own lab?

  • (2) When you read a paper and there is a data citation, what information do you typically look for? Is there information you would like to see that is not typically included in citations?

  • (3) Imagine you’ve read a paper that cites a dataset that would be of interest in your own research, and you’re considering downloading it and using it yourself. Is the information in the citation sufficient for you to decide whether to use it? If not, what other information would you need? Where would you find that information?

  • (4) When you contribute data to ProteomeCommons, do you include all the information that you would include in a data citation in a paper? Do you provide more information, or less? What kinds of information do you include, and what do you leave out? Do you include all the information that someone else would need to cite your dataset appropriately?

  • (5) Does this information help you gauge the trustworthiness of a dataset? If not, what other information would you need?

Appendix 2

Below, you’ll see a list of different kinds of provenance information. Imagine you have found a dataset that you know has content that is interesting to you, and you’re deciding whether to use it or not.

 

Element name | Description
Date stamp | The date on which the work described was initiated; given in the standard “YYYY-MM-DD” format (with hyphens).
Responsible person or institutional role | The (stable) primary contact person for this dataset; this could be the experimenter, laboratory head, line manager, etc.
Data transformation techniques | Includes algorithms used, preparation or processing techniques, normalization techniques.
Analysis tools | Includes software name and version, initial input parameters.
Data generation | Includes location of raw data, databases queried or specifications of equipment and conditions under which data were produced.

On a scale of 1–5, with 1 being “Not at all confident” and 5 being “Completely confident,” how confident are you making a decision whether or not to use the data based only on the information in front of you?

[Figure a: response scale from 1 (“Not at all confident”) to 5 (“Completely confident”)]

What other information would you need to make you completely confident?

Is there any information on this list that you feel is not important or could be left out?

Appendix 3: Demographic survey

Please tell us about yourself.

I. Background

(1) Are you a: (Please shade the appropriate circle)

 Graduate student

 Post-doctoral fellow/researcher

 Faculty member (Please specify your rank)

 Lab technician

(2) How long have you been at your current institution?

 Less than 1 year

 1–5 years

 5–10 years

 10 years or more

II. Your Experience Using ProteomeCommons

(1) How heavy or light a user of ProteomeCommons are you? The scale below ranges from ‘Light’ to ‘Heavy.’ Mark the point on the scale which best matches your activity level.

[Figure b: response scale ranging from ‘Light’ to ‘Heavy’]

(2) Why did you begin using ProteomeCommons?

 [open ended]

(3) Have you or a project you’ve worked on ever contributed data to ProteomeCommons?

 No

 Yes

(4) When you contribute data, do you upload it yourself, or does someone else do it?

 I upload data myself.

 Another colleague is responsible for uploading data to ProteomeCommons

(5) Have you ever used data from ProteomeCommons in your own research?

 I have used an entire dataset from ProteomeCommons.

 I have used part of a dataset from ProteomeCommons.

 I have never used data from ProteomeCommons.

(6) Have you ever downloaded data from ProteomeCommons?

 No

 Yes

 Another colleague is responsible for downloading data from ProteomeCommons

About this article

Cite this article

Fear, K., Donaldson, D.R. Provenance and credibility in scientific data repositories. Arch Sci 12, 319–339 (2012). https://doi.org/10.1007/s10502-012-9172-7

Keywords

  • Provenance
  • Credibility
  • Scientific data
  • Metadata