Despite a long history of rich theoretical work on provenance, empirical research regarding users’ interactions with and judgments based upon provenance information in archives with scientific data is extremely limited. This article focuses on the relationship between provenance and credibility (i.e., trustworthiness and expertise) for scientists. Toward this end, the authors conducted semi-structured interviews with seventeen proteomics researchers who interact with data from ProteomeCommons.org, a large online repository. To analyze the resulting interview data, the authors apply Brian Hilligoss and Soo Young Rieh’s empirically tested theoretical framework for user credibility assessment. Findings from this study suggest that together with other information provided in ProteomeCommons.org and subjects’ own experiences and prior knowledge, provenance allows users to determine the credibility of datasets. Implications of this study stress the importance of the archival perspective of provenance and archival bond for aiding scientists in their credibility assessments of data housed in scientific data repositories.
The actual number of users is likely higher, since users only need to register to upload datasets, not to download them.
Bazeley P (2007) Qualitative data analysis with NVivo. Sage, Los Angeles
Bearman DA, Lytle RH (1985) The power of the principle of provenance. Archivaria 21:14–27. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/11231. Accessed 28 July 2011
Bertino E, Dai C, Kantarcioglu M (2009) The challenge of assuring data trustworthiness. In: Proceedings of the 14th International Conference on Database Systems for Advanced Applications. doi:10.1007/978-3-642-00887-0_2
Bose R, Frew J (2005) Lineage retrieval for scientific data processing: a survey. ACM Comput Surv 37(1):1–28. doi:10.1145/1057977.1057978
Bowers S, McPhillips T, Ludäscher B, Cohen S, Davidson SB (2006) A model for user-oriented data provenance in pipelined scientific workflows. In: Proceedings of the International Provenance and Annotation Workshop. http://repository.upenn.edu/cis_papers/290/. Accessed 28 July 2011
Bowker GC (2005) Memory practices in the sciences. Inside technology. MIT Press, Cambridge, MA
Brothman B (1991) Orders of value: probing the theoretical terms of archival practice. Archivaria 32:78–100. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/11761. Accessed 28 July 2011
Buneman P, Khanna S, Tan WC (2001) Why and where: a characterization of data provenance. In: Proceedings of the 8th International Conference on Database Theory. http://portal.acm.org/citation.cfm?id=656274. Accessed 28 July 2011
Buneman P, Chapman A, Cheney J (2006) Provenance management in curated databases. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1142473.1142534
Caplan P (2009) Understanding PREMIS. www.loc.gov/standards/premis/understanding-premis.pdf. Accessed 28 July 2011
Cook T (1993) The concept of the archival fonds: theory, description, and provenance in the post-custodial era. Archivaria 35:24–37. http://journals.sfu.ca/archivar/index.php/archivaria/article/view/11882/12835. Accessed 28 July 2011
Cook T (2001) Archival science and postmodernism: new formulations for old concepts. Arch Sci 1:3–24. doi:10.1007/BF02435636
Corti L (2007) Re-using archived qualitative data—where, how and why? Arch Sci 7:37–54. doi:10.1007/s10502-006-9038-y
Dai C, Lin D, Bertino E, Kantarcioglu M (2008) An approach to evaluate data trustworthiness based on data provenance. In: Jonker W, Petković M (eds) Secure data management. Lecture notes in computer science 5159:82–89. doi:10.1007/978-3-540-85259-9_6
Duranti L (1997) The archival bond. Arch Mus Inform 11:213–218. doi:10.1023/A:1009025127463
Duranti L (2001) The impact of digital technology on archival science. Arch Sci 1:39–55. doi:10.1007/BF02435638
Greenwood M, Goble C, Stevens R, Zhao J, Addis M, Marvin D, Moreau L et al (2003) Proceedings of the UK e-Science All Hands Meeting. doi:10.1.1.10.3526
Heinis T, Alonso G (2008) Efficient lineage tracking for scientific workflows. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1376616.1376716
Hilligoss B, Rieh SY (2008) Developing a unifying framework of credibility assessment: construct, heuristics, and interaction in context. Inform Process Manag 44(4):1467–1484. doi:10.1016/j.ipm.2007.10.001
Lauriault T, Craig B, Taylor D, Pulsifer P (2007) Today’s data are part of tomorrow’s research: archival issues in the sciences. Archivaria 64:123–179. http://journals.sfu.ca/archivar/index.php/archivaria/article/viewArticle/13156. Accessed 28 July 2011
PREMIS Editorial Committee (2008) PREMIS data dictionary for preservation metadata version 2.0. Library of Congress, Washington, DC. http://www.loc.gov/standards/premis/v2/premis-2-0.pdf. Accessed 28 July 2011
Rieh SY (2002) Judgment of information quality and cognitive authority in the Web. J Am Soc Inform Sci Technol 53(2):145. doi:10.1002/asi.10017
Rodriguez H, Andrews P, Kinsinger C (2010) Share the (Proteomics) data. Bio-IT World, (September–October 2010). http://www.bio-itworld.com/2010/issues/sept-oct/proteomics.html. Accessed 28 July 2011
Shankar K (2007) Order from chaos: the poetics and pragmatics of scientific recordkeeping. J Am Soc Inform Sci Technol 58(10):1457–1466. doi:10.1002/asi.20625
Simmhan YL, Plale B, Gannon D (2005) A survey of data provenance in e-science. ACM SIGMOD Rec 34(3):31. doi:10.1145/1084805.1084812
Smit E (2011) Abelard and Héloise: why data and publications belong together. D-Lib Mag 17(1/2). doi:10.1045/january2011-smit
Society of American Archivists (2004) Describing archives: a content standard. Society of American Archivists, Chicago, IL
Taylor CF, Paton NW, Lilley KS et al (2007) The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25:887–893. doi:10.1038/nbt1329
Van House NA (2002) Digital libraries and practices of trust: networked biodiversity information. Soc Epistemol 16(1):99–114. doi:10.1080/02691720210132833
Vardigan M, Whiteman C (2007) ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Arch Sci 7:73–87. doi:10.1007/s10502-006-9037-z
Zimmerman AS (2008) New knowledge from old data: the role of standards in the sharing and reuse of ecological data. Sci Technol Human Values 33(5):631–652. doi:10.1177/0162243907306704
This material is based upon work supported by the National Science Foundation under Grant No. 090362. The authors would like to thank Ann Zimmerman and Margaret Hedstrom for their guidance on the development of this project; Elizabeth Yakel and members of the Archives Research Group for their feedback on earlier drafts of this paper; and Philip Andrews and the staff of ProteomeCommons.org for their help and support.
(1) For this first question, I’ll ask you to refer to one of the documents I sent—example.pdf. When researchers write a paper and make the data accessible, they often provide a citation with some basic information about the dataset. This example shows one way of acknowledging a dataset the author used. In this case, it’s a longish description, and they’ve included information such as the data producers’ names, experimental parameters, and the hash code you could use to find the dataset in Tranche. But this is just one example of how to do it. What kinds of information would you include if you were providing a citation for a dataset? Is that information readily available? Is it typically available if you are using data generated outside of your own lab?
(2) When you read a paper and there is a data citation, what information do you typically look for? Is there information you would like to see that is not typically included in citations?
(3) Imagine you’ve read a paper that cites a dataset that would be of interest in your own research, and you’re considering downloading it and using it yourself. Is the information in the citation sufficient for you to decide whether to use it? If not, what other information would you need? Where would you find that information?
(4) When you contribute data to ProteomeCommons, do you include all the information that you would include in a data citation in a paper? Do you provide more information, or less? What kinds of information do you include, and what do you leave out? Do you include all the information that someone else would need to cite your dataset appropriately?
(5) Does this information help you gauge the trustworthiness of a dataset? If not, what other information would you need?
Below, you’ll see a list of different kinds of provenance information. Imagine you have found a dataset that you know has content that is interesting to you, and you’re deciding whether to use it or not.
Date stamp: The date on which the work described was initiated; given in the standard “YYYY-MM-DD” format (with hyphens).

Responsible person or institutional role: The (stable) primary contact person for this dataset; this could be the experimenter, laboratory head, line manager, etc.

Data transformation techniques: Includes algorithms used, preparation or processing techniques, and normalization techniques.

Analysis tools: Includes software name and version, and initial input parameters.

Data generation: Includes location of raw data, databases queried, or specifications of equipment and conditions under which data were produced.
On a scale of 1–5, with 1 being “Not at all confident” and 5 being “Completely confident,” how confident are you in deciding whether or not to use the data based only on the information in front of you?
What other information would you need to make you completely confident?
Is there any information on this list that you feel is not important or could be left out?
Appendix 3: Demographic survey
Please tell us about yourself.
(1) Are you a: (Please shade the appropriate circle)
Faculty member (please specify your rank):
(2) How long have you been at your current institution?
Less than 1 year
10 years or more
II. Your Experience Using ProteomeCommons
(1) How heavy or light a user of ProteomeCommons are you? The scale below ranges from “Light” to “Heavy.” Mark the point on the scale that best matches your activity level.
(2) Why did you begin using ProteomeCommons?
(3) Have you or a project you’ve worked on ever contributed data to ProteomeCommons?
(4) When you contribute data, do you upload it yourself, or does someone else do it?
I upload data myself.
Another colleague is responsible for uploading data to ProteomeCommons.
(5) Have you ever used data from ProteomeCommons in your own research?
I have used an entire dataset from ProteomeCommons.
I have used part of a dataset from ProteomeCommons.
I have never used data from ProteomeCommons.
(6) Have you ever downloaded data from ProteomeCommons?
Another colleague is responsible for downloading data from ProteomeCommons.
Cite this article
Fear, K., Donaldson, D.R. Provenance and credibility in scientific data repositories. Arch Sci 12, 319–339 (2012). https://doi.org/10.1007/s10502-012-9172-7