Abstract
Software Heritage is the largest public archive of software source code and associated development history, as captured by modern version control systems. As of July 2023, it has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects. In this chapter, we describe the Software Heritage ecosystem, focusing on research and open science use cases.
On the one hand, Software Heritage supports empirical research on software by materializing the development history of public code in a single Merkle directed acyclic graph. This giant graph of source code artifacts (files, directories, and commits) can be used, and has been used, to study repository forks, open source contributors, vulnerability propagation, software provenance tracking, source code indexing, and more.
On the other hand, Software Heritage ensures the availability and guarantees the integrity of the source code of software artifacts used in any field that relies on software to conduct experiments, which contributes to making research reproducible. The source code used in scientific experiments can be archived (e.g., via integration with open-access repositories), referenced using persistent identifiers that allow downstream integrity checks, and linked to/from other scholarly digital artifacts.
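To illustrate how an identifier can allow downstream integrity checks, here is a minimal Python sketch. It assumes a SWHID-style content identifier of the form `swh:1:cnt:<hash>`, where the hash is the Git blob SHA-1 of the file bytes; the abstract itself does not spell out the identifier scheme, and the function names below are hypothetical.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob holding `data`."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_identifier(data: bytes) -> str:
    """Build a content identifier (assumed SWHID-style prefix) from the bytes alone."""
    return "swh:1:cnt:" + git_blob_sha1(data)


def verify(data: bytes, identifier: str) -> bool:
    """Check that `data` is exactly the content that `identifier` refers to."""
    return content_identifier(data) == identifier


if __name__ == "__main__":
    source = b"print('hello')\n"
    ident = content_identifier(source)
    print(ident)
    print(verify(source, ident))                # unchanged bytes pass the check
    print(verify(source + b"# edited", ident))  # any modification fails it
```

Because the identifier is computed from the content rather than assigned by a registry, anyone holding a copy of the file can recompute it and detect tampering or corruption without contacting the archive.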
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Di Cosmo, R., Zacchiroli, S. (2023). The Software Heritage Open Science Ecosystem. In: Mens, T., De Roover, C., Cleve, A. (eds) Software Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-36060-2_2
DOI: https://doi.org/10.1007/978-3-031-36060-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36059-6
Online ISBN: 978-3-031-36060-2
eBook Packages: Computer Science, Computer Science (R0)