The Software Heritage license dataset (2022 edition)

Empirical Software Engineering

Abstract

Context:

When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories.

Objective:

To compile a dataset containing as many documents as possible that contain the text of software licenses, or references to the license terms. Once compiled, characterize the dataset so that it can be used for further research, or for practical purposes related to license analysis.

Method:

Retrieve from Software Heritage, the largest publicly available archive of FOSS source code, all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents will be characterized in various ways, using automated and manual analyses.

Results:

The dataset consists of 6.9 million unique license files. Additional metadata about the shipped license files is also provided (file length measures, MIME type, SPDX license as detected by ScanCode, and oldest appearance), making the dataset ready to use in various contexts. The results of a manual analysis of 8102 documents are also included, providing a ground truth for further analysis. The dataset is released as open data, as an archive file containing all deduplicated license files plus several portable CSV files with metadata that reference files via cryptographic checksums.

Conclusions:

Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. The dataset can be used for empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, and historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools that detect licenses in source code.
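
The selection and deduplication steps summarized in the Method and Results paragraphs can be illustrated with a minimal sketch. It is not the authors' pipeline: the filename pattern below is an illustrative assumption (the actual selection criteria are in the SQL query shipped with the replication package), and the function names looks_like_license_name and deduplicate are hypothetical. The sketch keeps only files whose names look like license files and indexes each distinct content once, under its SHA1 checksum, together with every filename it was seen under.

```python
import hashlib
import re

# Illustrative pattern only (an assumption, not the authors' criteria):
# names such as LICENSE, license.txt, Licence.md, COPYING, NOTICE.
LICENSE_NAME = re.compile(
    r"^(licen[cs]e|copying|copyright|notice)(\.[a-z0-9]+)?$",
    re.IGNORECASE,
)


def looks_like_license_name(filename: str) -> bool:
    """Return True if the bare filename matches the illustrative pattern."""
    return LICENSE_NAME.match(filename) is not None


def deduplicate(files):
    """Index license-like files by the SHA1 checksum of their content.

    `files` is an iterable of (filename, content_bytes) pairs.
    Returns a dict: checksum -> (content, set of filenames seen).
    """
    index = {}
    for name, content in files:
        if not looks_like_license_name(name):
            continue
        key = hashlib.sha1(content).hexdigest()
        entry = index.setdefault(key, (content, set()))
        entry[1].add(name)
    return index


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = [
        ("LICENSE", b"MIT"),
        ("license.txt", b"MIT"),
        ("main.c", b"int main;"),
        ("COPYING", b"GPL"),
    ]
    for checksum, (_, names) in deduplicate(sample).items():
        print(checksum, sorted(names))
```

Keying the index on a checksum is what lets metadata tables refer to a document without repeating its content, which matches the way the CSV files are described as referencing files.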

Notes

  1. License Inclusion Principles: https://github.com/spdx/license-list-XML/blob/main/DOCS/license-inclusion-principles.md

  2. The version of the dataset discussed in this paper is available at https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/; other versions of the dataset (both past versions and future ones) are available starting from https://annex.softwareheritage.org/public/dataset/license-blobs/

  3. Software Heritage is an archival project established in 2015 with the stated goal to collect, preserve forever, and make publicly available the entire body of software, in the preferred form for making modifications to it. A detailed description of the project is out of scope for this paper; we therefore refer the interested reader to previous publications about the project by Di Cosmo and Zacchiroli (2017) and Abramatic et al. (2018), its homepage at https://www.softwareheritage.org, and the archive status page at https://archive.softwareheritage.org (accessed 2022-10-20), where one can find an up-to-date view of the software origins that are periodically crawled to populate the archive.

  4. OSI (Open Source Initiative): https://opensource.org

  5. OSI Approved licenses: https://opensource.org/licenses-draft (accessed on 2022-10-30)

  6. SPDX license list: https://spdx.org/licenses/ (accessed on 2022-10-30)

  7. ScanCode LicenseDB: https://scancode-licensedb.aboutcode.org/ (accessed on 2022-10-30)

  8. https://annex.softwareheritage.org/public/dataset/license-blobs/2019-03-21/ (accessed 2022-11-10)

  9. https://annex.softwareheritage.org/public/dataset/license-blobs/2021-03-23/ (accessed 2022-11-10)

  10. https://annex.softwareheritage.org/public/dataset/license-blobs/2022-04-25/ (accessed 2022-11-10)

  11. All dataset versions are available starting from https://annex.softwareheritage.org/public/dataset/license-blobs/

  12. See the “How to apply the Apache License to your work” part of the Apache 2.0 license for an example of a license reference: https://www.apache.org/licenses/LICENSE-2.0 (accessed 2022-11-10).

  13. If the document was found under several different filenames, as can happen, it will appear in the index once for each different filename.

  14. Version used: ScanCode 31.2.1.

  15. Details about the JSON schema: https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/output-format.html (accessed 2022-11-09)

  16. The complete SQL query is available as part of the dataset replication package Gonzalez-Barahona et al. (2023), in the replication-package.tar.gz file.

  17. https://scancode-toolkit.readthedocs.io/ (accessed 2022-11-09)

  18. https://docs.softwareheritage.org/devel/swh-graph/api.html#leaves (accessed 2022-11-09)

  19. SWHID swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064

  20. SWHID swh:1:cnt:cdc98c898b1d257ddb4752ee7a1c85ed3ddf5673

  21. SWHID swh:1:cnt:2e26bf237427aaa56f99846acb1aeb94198119e9

  22. SWHID swh:1:cnt:606a3bce98a4ade7d80c2761b8458d79438a3c6f

  23. SWHID swh:1:cnt:78ec4db8002adeae4fcbfa5f56b3c1e51bfaf8c5

  24. https://en.wikipedia.org/wiki/Nessus_Attack_Scripting_Language

  25. SWHID swh:1:cnt:c7f43dd49cbedb819fc247b3bfe5ae45841738dc

  26. SWHID swh:1:cnt:9ea952f4a37478f17f2a2aafb45ced7a4df67de2

  27. SWHID swh:1:cnt:aa3157cb23f7de5d062ab5d0bf0ffb44bb719df9

  28. SWHID swh:1:cnt:509b6082ee6debe85c005d80f047668d70dd1cb8

  29. SWHID swh:1:cnt:f961852cee6ee9e9a0b8a25af5d090ddb6abe6a8

  30. SWHID swh:1:cnt:711ded4ae27c43ba18a71ad05e9466a268e4387a

  31. SWHID swh:1:cnt:46ae7b2bee342168dc48d6ca7fa1753b98e525d8

  32. SWHID swh:1:cnt:62319023a68b04f23ea30931bb1a7c1a3e741fba

  33. SWHID swh:1:cnt:eb9ed7bfc458af9796b59426d54d0f97a199078f

  34. SWHID swh:1:cnt:b864764d9fc4d55eb09e123e42ede11519556d18

  35. SWHID swh:1:cnt:9bffa2d5a63151c8c9bf3d68e9f9445558273612

  36. SWHID swh:1:cnt:c53a6c27009183d8304d26a213b1321bdfc0cb8d

  37. SWHID swh:1:cnt:41a6fc531459dde48d1752f24eae007047361709

  38. SWHID swh:1:cnt:4e5eebfdbebefe990e309ecbdd83842035d3852c

  39. SWHID swh:1:cnt:105961e3702324fadaa808457338a984101d6028

  40. SWHID swh:1:cnt:f3932de6d7f19b26afaa7bc8502c800476c2f0a5

  41. SWHID swh:1:cnt:fed8329964dd68adcd3dc98dd405950e53614282

  42. SWHID swh:1:cnt:60ff9a40c14915b25d265f2bdfb508274b6782fe

  43. SWHID swh:1:cnt:ace0bbb7fe0a8677ef5ae001b5da076b2aa666a5

  44. SWHID swh:1:cnt:9392142a987ee04c3f0d303a58b19df818df86b3

  45. SWHID swh:1:cnt:eb531dc6990ca433ccde3100633780ad55aed22b

  46. licen and licens are Python modules for dealing with the Document Collection.

  47. path_from_filename is a function that returns the path of a document in the collection, given its name (SHA1).

  48. For a complete, ready-to-run program, see the file truth/random_forest.py in the dataset.

  49. Software Heritage archive changelog page: https://docs.softwareheritage.org/devel/archive-changelog.html (accessed 2022-11-10)

  50. https://spdx.org/licenses/ (accessed 2022-11-10)

  51. Debian Copyright Review Tools: https://wiki.debian.org/CopyrightReviewTools
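
Many of the notes above point to individual documents by SWHID, for example swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064. As a minimal sketch, assuming the usual Software Heritage convention that a content ("cnt") identifier embeds the Git blob hash of the file (SHA1 over the header "blob <length>\0" followed by the file bytes), such an identifier can be computed and parsed as shown below. The function names content_swhid and swhid_digest are hypothetical and used only for illustration. Note that this hash is not the same value as a plain SHA1 of the content, which is what note 47 gives as the document name in the collection.

```python
import hashlib
import re

# Version 1 content identifiers: "swh:1:cnt:" followed by 40 hex digits.
SWHID_CNT = re.compile(r"^swh:1:cnt:([0-9a-f]{40})$")


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file content, assuming the Git blob hash scheme."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def swhid_digest(swhid: str) -> str:
    """Extract the 40-character hex digest from a content SWHID."""
    match = SWHID_CNT.match(swhid)
    if match is None:
        raise ValueError(f"not a version 1 content SWHID: {swhid!r}")
    return match.group(1)


if __name__ == "__main__":
    # The empty file hashes to the well-known Git empty-blob identifier.
    print(content_swhid(b""))
    print(swhid_digest("swh:1:cnt:36406a1eee032e80a284d3ed9f5176bba67be064"))
```

Recomputing the identifier from a downloaded file is a simple way to check that the bytes obtained are the ones a note refers to.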

References

  • Abramatic JF, Di Cosmo R, Zacchiroli S (2018) Building the universal archive of source code. Communications of the ACM 61(10):29–31

  • Allançon T, Pietri A, Zacchiroli S (2021) The software heritage filesystem (swhfs): Integrating source code archival with development. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2021, Madrid, Spain, May 25-28, 2021, pages 45–48. IEEE

  • Bird S (2006) NLTK: the natural language toolkit. In Nicoletta Calzolari, Claire Cardie, and Pierre Isabelle, editors, ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics

  • Boldi P, Pietri A, Vigna S, Zacchiroli S (2020) Ultra-large-scale repository analysis via graph compression. In SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2020

  • Caneill M, Germán DM, Zacchiroli S (2017) The debsources dataset: Two decades of free and open source software. Empirical Software Engineering 22:1405–1437

  • ClearlyDefined (2023) ClearlyDefined, 2023. https://clearlydefined.io. Accessed 2023-05-08

  • Collet Y (2022) RFC 8878 - Zstandard compression and the “application/zstd” media type, 2021. Accessed 2022-01-24

  • Di Cosmo R, Gruenpeter M, Zacchiroli S (2018) Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA

  • Di Cosmo R, Zacchiroli S (2017) Software Heritage: Why and how to preserve software source code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017

  • Di Penta M, German DM, Guéhéneuc YG, Antoniol G (2010) An exploratory study of the evolution of software licensing. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 145–154, New York, NY, USA, 2010. Association for Computing Machinery

  • Dyer R, Nguyen HA, Rajan H, Nguyen TN (2015) Boa: Ultra-large-scale software repository and source-code mining. ACM Trans. Softw Eng Methodol 25(1):7:1–7:34

  • Flint SW, Chauhan J, Dyer R (2021) Escaping the time pit: Pitfalls and guidelines for using time-based git data. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021, Madrid, Spain, May 17-19, 2021, pages 85–96. IEEE, 2021

  • Gandhi RA, Germonprez M, Link GJP (2018) Open data standards for open source software risk management routines: An examination of SPDX. In Forte A, Prilla M, Vivacqua AS, Müller C, and Lionel P. Robert Jr., editors, Proceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP 2018, Sanibel Island, FL, USA, January 07 - 10, pages 219–229. ACM, 2018

  • German DM, Di Penta M, Davies J (2010) Understanding and auditing the licensing of open source software distributions. In 2010 IEEE 18th International Conference on Program Comprehension, pages 84–93

  • German DM, González-Barahona JM (2009) An empirical study of the reuse of software licensed under the GNU General Public License. In Boldyreff C, Crowston K, Lundell B, and Wasserman AI, editors, Open Source Ecosystems: Diverse Communities Interacting, pages 185–198, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg

  • German DM, Hassan AE (2009) License integration patterns: Addressing license mismatches in component-based development. In 2009 IEEE 31st International Conference on Software Engineering, pages 188–198

  • Germán DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In Pecheur C, Andrews J, and Di Nitto E, editors, ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, pages 437–446. ACM, 2010

  • Germán DM, Di Penta M (2012) A method for open source license compliance of java applications. IEEE Softw 29(3):58–63

  • GitHub. Licensee (2023). https://licensee.github.io/licensee/. Accessed 2023-05-08

  • Gobeille R (2008) The fossology project. In Hassan AE, Lanza M, and Godfrey MW, editors, Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Proceedings, pages 47–50. ACM

  • Gomulkiewicz RW (2009) Open source license proliferation: Helpful diversity or hopeless confusion. Wash. UJL & Pol’y 30:261

  • Gonzalez-Barahona JM, Montes-Leon S, Robles G, Zacchiroli S (2023) The Software Heritage License Dataset (2022 Edition). https://doi.org/10.5281/zenodo.8200352

  • Gousios G, Spinellis D (2012) Ghtorrent: Github’s data from a firehose. In Lanza M, Di Penta M, and Xie T, editors, 9th IEEE Working Conference of Mining Software Repositories, MSR, pages 12–21. IEEE Computer Society, 2012

  • Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81

  • Libraries.io. Libraries.io (2023). https://libraries.io. Accessed 2023-05-08

  • Lindberg V (2008) Intellectual property and open source: a practical guide to protecting. O’Reilly Media, Inc., 2008

  • Ma Y, Dey T, Bogart C, Amreen S, Valiev M, Tutko A, Kennard D, Zaretzki R, Mockus A (2021) World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir Softw Eng 26(2):22

  • Manabe Y, German DM, Inoue K (2014) Analyzing the relationship between the license of packages and their files in free and open source software. In Corral L, Sillitti A, Succi G, Vlasenko J, and Wasserman AI, editors, Open Source Software: Mobile Open Source Technologies, pages 51–60, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg

  • Manabe Y, Hayase Y, Inoue K (2010) Evolutional analysis of licenses in FOSS. In Andrea Capiluppi, Anthony Cleve, and Naouel Moha, editors, Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), Antwerp, Belgium, September 20-21, 2010, pages 83–87. ACM, 2010

  • Maryka T, Germán DM, Poo-Caamaño G (2015) On the variability of the BSD and MIT licenses. In Ernesto Damiani, Fulvio Frati, Dirk Riehle, and Anthony I. Wasserman, editors, Open Source Systems: Adoption and Impact - 11th IFIP WG 2.13 International Conference, OSS 2015, Florence, Italy, May 16-17, 2015, Proceedings, volume 451 of IFIP Advances in Information and Communication Technology, pages 146–156. Springer, 2015

  • McKinney W et al (2011) Pandas: a foundational python library for data analysis and statistics. Python for high performance and scientific computing 14(9):1–9

  • nexB ScanCode (2022) https://www.aboutcode.org/projects/scancode.html. Accessed 2022-01-25

  • nexB. ScanCode LicenseDB (2022). https://scancode-licensedb.aboutcode.org/. Accessed 2022-01-26

  • Ombredanne P (2020) Free and open source software license compliance: Tools for software composition analysis. Computer 53(10):105–109

  • Open Source Initiative (2022) Machine readable OSI license information, 2022. https://github.com/OpenSourceOrg/licenses/. Accessed 2022-01-26

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  • Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119

  • Pietri A, Spinellis D, Zacchiroli S (2019) The Software Heritage graph dataset: public software development under one roof. In Storey MAD, Adams B, and Haiduc S, editors, Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada., pages 138–142. IEEE / ACM

  • Rosen L (2005) Open source licensing, volume 692. Prentice Hall

  • Rousseau G, Di Cosmo R, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25(4):2930–2959

  • Shafranovich Y (2005) RFC 4180 - common format and MIME type for comma-separated values (CSV) files, 2005. Accessed 2022-01-24

  • SPDX Workgroup (2020) Software package data exchange licence list, 2019. https://spdx.org/license-list, retrieved 30 March 2020

  • Srinivasa-Desikan B (2018) Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018

  • Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L Rev 2:191

  • The CodeMeta Project (2023) The CodeMeta Project, 2023. https://codemeta.github.io/. Accessed 2023-05-08

  • The Open Group (2018) file: determine file type, 2018. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/file.html. Accessed 2022-01-25

  • Vendome C, Bavota G, Di Penta M, Vásquez ML, Germán DM, Poshyvanyk D (2017) License usage and changes: a large-scale study on GitHub. Empir Softw Eng 22(3):1537–1577

  • Vendome C, Linares-Vásquez M, Bavota G, Di Penta M, German DM, Poshyvanyk D (2015) When and why developers adopt and change software licenses. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) pages 31–40

  • Vendome C, Vásquez ML, Bavota G, Di Penta M, Germán DM, Poshyvanyk D (2017) Machine learning-based detection of open source license exceptions. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 118–129. IEEE / ACM, 2017

  • Xu S, Gao Y, Fan L, Liu Z, Liu Y, Ji H (2023) Lidetector: License incompatibility detection for open source software. ACM Trans. Softw Eng Methodol 32(1)

  • Zacchiroli S (2022) A large-scale dataset of (open source) license text variants. In The 2022 Mining Software Repositories Conference (MSR 2022), pages 757–761. ACM, 2022

  • Zhang D, Luo P, Tang W, Zhou M (2021) Osldetector: Identifying open-source libraries through binary analysis. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE ’20, pages 1312–1315, New York, NY, USA, 2021. Association for Computing Machinery

Acknowledgements

This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org. The authors would like to thank Valentin Lorentz from the Software Heritage engineering team for his help in releasing the new version of the license dataset documented in this paper and streamlining the dataset publication process.

Author information

Corresponding author

Correspondence to Stefano Zacchiroli.

Ethics declarations

Conflicts of interest

The authors declared that they have no conflict of interest.

Additional information

Communicated by: Nicole Novielli, Shane McIntosh, David Lo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gonzalez-Barahona, J.M., Montes-Leon, S., Robles, G. et al. The software heritage license dataset (2022 edition). Empir Software Eng 28, 147 (2023). https://doi.org/10.1007/s10664-023-10377-w

  • DOI: https://doi.org/10.1007/s10664-023-10377-w
