Reliable Granular References to Changing Linked Data
Abstract
Nanopublications are a concept for representing Linked Data in a granular and provenance-aware manner, and they have been successfully applied to a number of scientific datasets. In previous work, we demonstrated how to establish reliable and verifiable identifiers for nanopublications and sets thereof. Further adoption of these techniques, however, was probably hindered by the fact that nanopublications can lead to an explosion in the number of triples, caused by auxiliary information about the structure of each nanopublication and by repetitive provenance and metadata. We demonstrate here that this significant overhead disappears once we take the version history of nanopublication datasets into account, calculate incremental updates, and allow users to deal with the specific subsets they need. We show that the total size and overhead of evolving scientific datasets are reduced, and that typical subsets that researchers use for their analyses can be referenced and retrieved efficiently with optimized precision, persistence, and reliability.
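The idea of incremental updates over a version history can be illustrated with a minimal sketch. It assumes that each dataset version is simply a set of immutable nanopublication identifiers, so that a nanopublication unchanged between versions keeps the same identifier. The identifiers (`np:a`, ...) and the function names `incremental_update` and `storage_cost` are hypothetical placeholders chosen for illustration, not part of any nanopublication library or of the method described here.

```python
def incremental_update(previous, current):
    """Return (added, removed) nanopublication identifiers between two versions."""
    return current - previous, previous - current


def storage_cost(versions):
    """Compare storing every version in full with storing each nanopublication once.

    Returns (full, unique): the summed size of all versions, and the number of
    distinct nanopublications across the whole version history.
    """
    full = sum(len(v) for v in versions)
    unique = len(set().union(*versions))
    return full, unique


# Three hypothetical versions of a small dataset (placeholder identifiers).
v1 = {"np:a", "np:b", "np:c"}
v2 = {"np:a", "np:b", "np:d"}
v3 = {"np:a", "np:d", "np:e"}

added, removed = incremental_update(v1, v2)
print(sorted(added), sorted(removed))
print(storage_cost([v1, v2, v3]))
```

In this toy setting, the more nanopublications consecutive versions share, the smaller the number of distinct nanopublications becomes relative to the summed size of all versions, which is the effect the abstract refers to when it says the overhead disappears once version history is taken into account.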
Acknowledgments
We would like to thank Javier D. Fernández for valuable input and discussions on RDF versioning. L.I. Furlong and E. Centeno received support from ISCIII-FEDER (PI13/00082, CP10/00524, CPII16/00026), the EU H2020 Programme 2014-2020 under grant agreements no. 634143 (MedBioinformatics) and no. 676559 (Elixir-Excelerate).